Optimizing Storage for Version Control

Thierry Onkelinx

Introduction

This vignette focuses on what git2rdata does to make storing dataframes under version control more efficient and convenient. All details on the actual file format are described in vignette("plain_text", package = "git2rdata"). Hence we will not discuss the optimize and na arguments to the write_vc() function.

We will not illustrate the efficiency of write_vc() and read_vc() since that is covered in vignette("efficiency", package = "git2rdata").

Setup

# Create a directory in tempdir
root <- tempfile(pattern = "git2r-")
dir.create(root)
# Create dummy data
set.seed(20190222)
x <- data.frame(
  x = sample(LETTERS),
  y = factor(
    sample(c("a", "b", NA), 26, replace = TRUE),
    levels = c("a", "b", "c")
  ),
  z = c(NA, 1:25),
  abc = c(rnorm(25), NA),
  def = sample(c(TRUE, FALSE, NA), 26, replace = TRUE),
  timestamp = seq(
    as.POSIXct("2018-01-01"),
    as.POSIXct("2019-01-01"),
    length = 26
  ),
  stringsAsFactors = FALSE
)
str(x)
#> 'data.frame':    26 obs. of  6 variables:
#>  $ x        : chr  "V" "U" "Z" "W" ...
#>  $ y        : Factor w/ 3 levels "a","b","c": 1 2 NA NA 1 NA 2 1 NA 1 ...
#>  $ z        : int  NA 1 2 3 4 5 6 7 8 9 ...
#>  $ abc      : num  -0.382 -0.42 -0.917 0.387 -0.992 ...
#>  $ def      : logi  TRUE FALSE NA FALSE NA NA ...
#>  $ timestamp: POSIXct, format: "2018-01-01 00:00:00" "2018-01-15 14:24:00" ...

Assumptions

A critical assumption made by git2rdata is that all information is contained within the dataframe itself. Each row is an observation, each column is a variable and only the variables are named. This implies that two observations switching place does not alter the information content. Nor does switching two variables.
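
The assumption above can be sketched in base R. This is only an illustration, not part of git2rdata: `canonical()` is a hypothetical helper that puts rows and columns in a fixed order so that two dataframes can be compared on information content alone.

```r
# Hypothetical helper (not a git2rdata function): order the columns by name
# and the rows by all columns, then reset the row names.
canonical <- function(x) {
  x <- x[, sort(names(x)), drop = FALSE]
  x <- x[do.call(order, unname(as.list(x))), , drop = FALSE]
  rownames(x) <- NULL
  x
}

original <- data.frame(A = 1:3, B = c(10, 11, 12))
# Same observations and variables, but rows and columns switched places.
shuffled <- original[c(3, 1, 2), c("B", "A")]

identical(original, shuffled)                        # FALSE: layout differs
identical(canonical(original), canonical(shuffled))  # TRUE: same information
```

The two objects differ as stored, yet they carry the same information once a fixed order is imposed.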

Version control systems like git, subversion or mercurial focus on accurately keeping track of any change in the files. Two observations switching place in a plain text file is a change, although the information content[1] doesn’t change. Therefore git2rdata helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content.

Sorting Observations

Version control systems often track changes in plain text files using row-based differences. In layman’s terms, they record only which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated in the minimal example below.

Original version

A,B
1,10
2,11
3,12

Altered version. The row containing 1,10 was moved to the last line. The row containing 3,12 was changed to 3,0.

A,B
2,11
3,0
1,10

Diff between original and altered version. Notice that the diff contains two deletions and two insertions.

A,B
-1,10
2,11
-3,12
+3,0
+1,10

Ensuring that the observations are always sorted in the same way thus helps to minimize the diff. The sorted version of the same altered version looks like the example below.

A,B
1,10
2,11
3,0

Diff between the original and the sorted altered version. Notice that every change in the diff now corresponds to an actual change in the information content. Another benefit is that changes are easily spotted in the diff. A deletion without an insertion on the next line is a removed observation. An insertion without a preceding deletion is a new observation. A deletion followed by an insertion is an updated observation.

A,B
1,10
2,11
-3,12
+3,0
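
The size of the two diffs above can be checked with a small base R sketch. It is an illustration under stated assumptions, not how git computes diffs internally: `lcs_length()` and `diff_size()` are hypothetical helpers that count deleted and inserted lines from the longest common subsequence of lines.

```r
# Length of the longest common subsequence of two character vectors
# (classic dynamic programming table).
lcs_length <- function(a, b) {
  m <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      m[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        m[i, j] + 1L
      } else {
        max(m[i, j + 1L], m[i + 1L, j])
      }
    }
  }
  m[length(a) + 1L, length(b) + 1L]
}

# Lines outside the common subsequence are either deleted or inserted.
diff_size <- function(old, new) {
  common <- lcs_length(old, new)
  c(deleted = length(old) - common, inserted = length(new) - common)
}

original <- c("A,B", "1,10", "2,11", "3,12")
altered <- c("A,B", "2,11", "3,0", "1,10")
sorted <- c("A,B", "1,10", "2,11", "3,0")

diff_size(original, altered)  # 2 deleted, 2 inserted
diff_size(original, sorted)   # 1 deleted, 1 inserted
```

Sorting halves the diff in this toy case, and the remaining deletion and insertion correspond to the single updated observation.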

This is where the sorting argument comes into play. If this argument is not provided when a file is written for the first time, it will yield a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change in hash).

library(git2rdata)
write_vc(x, file = "row_order", root = root)
#> Warning in meta.data.frame(x, optimize = optimize, na = na, sorting = sorting): No sorting applied.
#> Sorting is strongly recommended in combination with version control.
#> 03d6faf2209eb466cc12c9bf6d274a3ee2c7f0db f621ac5ac48f6c7c7671c71538470c31a03ef9df 
#>                          "row_order.tsv"                          "row_order.yml"
write_vc(x[sample(nrow(x)), ], file = "row_order", root = root)
#> Warning in meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, : No sorting applied.
#> Sorting is strongly recommended in combination with version control.
#> bc07dd2ba60e9d5fd2bc536a56d1e47d35b857ad d3215ab5aea7e42c88276df1f2b3d6058f7833d6 
#>                          "row_order.tsv"                          "row_order.yml"

The sorting argument should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, adding sorting to an existing file results in an error, because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which git2rdata tries to avoid as much as possible.

From this moment on we will store the output of write_vc() in an object to minimize the output.

fn <- write_vc(x, "row_order", root, sorting = "y")
#> Warning in meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, : Sorting on 'y' results in ties.
#> Add extra sorting variables to ensure small diffs.
#> Error: The data was not overwritten because of the issues below.
#> See vignette('version_control', package = 'git2rdata') for more information.
#> 
#> - The sorting variables changed.
#>     - Sorting for the new data: 'y'.
#>     - Sorting for the old data: .

Using strict = FALSE turns such errors into warnings and allows the file to be updated. Notice that we get a new warning: the variable we used for sorting resulted in ties, thus the order of the observations is not guaranteed to be stable. The solution is to use more or different variables. We’ll need strict = FALSE again to override the change in sorting variables.

fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE)
#> Warning in meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, : Sorting on 'y' results in ties.
#> Add extra sorting variables to ensure small diffs.
#> Warning in write_vc.character(x, "row_order", root, sorting = "y", strict = FALSE): - The sorting variables changed.
#>     - Sorting for the new data: 'y'.
#>     - Sorting for the old data: .
fn <- write_vc(x, "row_order", root, sorting = c("y", "x"), strict = FALSE)
#> Warning in write_vc.character(x, "row_order", root, sorting = c("y", "x"), : - The sorting variables changed.
#>     - Sorting for the new data: 'y', 'x'.
#>     - Sorting for the old data: 'y'.

Once the sorting is defined we may omit the sorting argument when writing new versions. The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found.

print_file <- function(file, root, n = -1) {
  fn <- file.path(root, file)
  data <- readLines(fn, n = n)
  cat(data, sep = "\n")
}
print_file("row_order.yml", root, 7)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting:
#>   - 'y'
#>   - x
fn <- write_vc(x[sample(nrow(x)), ], "row_order", root)
fn <- write_vc(x[sample(nrow(x)), ], "row_order", root, sorting = c("y", "x"))
fn <- write_vc(x[sample(nrow(x), replace = TRUE), ], "row_order", root)
#> Warning in meta.data.frame(x, optimize = optimize, na = na, sorting = sorting, : Sorting on 'y', 'x' results in ties.
#> Add extra sorting variables to ensure small diffs.
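
The idea behind the tie check can be sketched in base R. This is an assumption about the concept, not the package's internal code: `has_ties()` is a hypothetical helper that reports whether the sorting variables fail to identify each observation uniquely.

```r
# Hypothetical helper: TRUE when two or more rows share the same values
# for all sorting variables, so their relative order is not determined.
has_ties <- function(df, sorting) {
  anyDuplicated(df[sorting]) > 0
}

obs <- data.frame(
  y = c("a", "b", "a", "b"),
  x = c("Q", "R", "S", "T")
)

has_ties(obs, "y")          # TRUE: 'y' alone leaves ties
has_ties(obs, c("y", "x"))  # FALSE: 'y' and 'x' together are unique
```

Sampling rows with replacement, as in the last call above, duplicates whole observations, so no choice of sorting variables can remove the ties.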

Sorting Variables

The order of the variables (columns) has an even bigger impact on a row-based diff. Let’s revisit our minimal example. Suppose that we swap A and B from our original example. The new data looks like this.

B,A
10,1
11,2
12,3

The resulting diff is maximal because every single row was updated. Yet none of the information was changed. Hence, it is crucial to maintain column order when storing dataframes as plain text files under version control. This is illustrated on a more realistic data set in vignette("efficiency", package = "git2rdata").

-A,B
+B,A
-1,10
+10,1
-2,11
+11,2
-3,12
+12,3

git2rdata tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of write_vc() will undo this randomization, resulting in an unchanged file.

write_vc(x, "column_order", root, sorting = c("x", "abc"))
#> 23caf26a42333b87baf034d5732d5565d19bfd01 03ff8e4f8edd0e0ef2ca2775def0a3ac78d85e58 
#>                       "column_order.tsv"                       "column_order.yml"
print_file("column_order.tsv", root, n = 5)
#> x    y   z   abc def timestamp
#> A    1   18  0.572192852110693   0   1537467120
#> B    2   14  -1.64221062655002   0   1532421360
#> C    NA  5   0.0228713954429028  NA  1521068400
#> D    2   20  -0.683183900695259  NA  1539990000
write_vc(x[sample(nrow(x)), sample(ncol(x))], "column_order", root)
#> 23caf26a42333b87baf034d5732d5565d19bfd01 03ff8e4f8edd0e0ef2ca2775def0a3ac78d85e58 
#>                       "column_order.tsv"                       "column_order.yml"
print_file("column_order.tsv", root, n = 5)
#> x    y   z   abc def timestamp
#> A    1   18  0.572192852110693   0   1537467120
#> B    2   14  -1.64221062655002   0   1532421360
#> C    NA  5   0.0228713954429028  NA  1521068400
#> D    2   20  -0.683183900695259  NA  1539990000
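
The principle of reusing a stored column order can be sketched in base R. This is a simplified illustration, not the package's implementation: `stored_order` is a hypothetical stand-in for the column order that git2rdata keeps in the metadata file.

```r
first <- data.frame(A = 1:3, B = c(10, 11, 12))
# Stand-in for the column order recorded when the file was first written.
stored_order <- names(first)

# A later version with the same content but shuffled rows and columns.
later <- first[c(2, 3, 1), c("B", "A")]

# Restore the row order (sorting on A) and the stored column order.
restored <- later[order(later$A), stored_order]
rownames(restored) <- NULL

identical(restored, first)  # TRUE
```

Because both the sorting and the column order are fixed, the shuffled version maps back to exactly the same table.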

Optimized Handling of Factors

vignette("plain_text", package = "git2rdata") and vignette("efficiency", package = "git2rdata") illustrate how factors can be stored more efficiently by storing their indices in the data file and the indices and labels in the metadata. Here we take this a step further: what happens if new data arrives and an extra factor level is required?

# stringsAsFactors = TRUE makes color a factor on any R version
old <- data.frame(color = c("red", "blue"), stringsAsFactors = TRUE)
write_vc(old, "factor", root, sorting = "color")
#> c7281c65549ea8f2569c8b6d7f932d84d4f99641 8bf29eeb13475347e15e3a86730b5a3df69f801f 
#>                             "factor.tsv"                             "factor.yml"
print_file("factor.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: 03c3898451e17cf436da59dd0e712606ea63a838
#>   data_hash: c7281c65549ea8f2569c8b6d7f932d84d4f99641
#> color:
#>   class: factor
#>   labels:
#>   - blue
#>   - red
#>   index:
#>   - 1
#>   - 2
#>   ordered: no

Let’s add an observation with a new factor level. If we store the updated dataframe in a new file, we see that the indices are different. The factor level "blue" remains unchanged, but "red" becomes the third level and gets index 3 instead of index 2. This could lead to a large diff whereas the potential semantics (and thus the information content) are not changed.

# stringsAsFactors = TRUE makes color a factor on any R version
updated <- data.frame(
  color = c("red", "green", "blue"), stringsAsFactors = TRUE
)
write_vc(updated, "factor2", root, sorting = "color")
#> c0c8bf91feb6acbee4246ad290fcadd76f622e9c 468f555c19bdfa25fe90cdebb1a64c9bb7ebc0b1 
#>                            "factor2.tsv"                            "factor2.yml"
print_file("factor2.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: f2cc274714fef0b55e17ae432e99b73e5c880e2d
#>   data_hash: c0c8bf91feb6acbee4246ad290fcadd76f622e9c
#> color:
#>   class: factor
#>   labels:
#>   - blue
#>   - green
#>   - red
#>   index:
#>   - 1
#>   - 2
#>   - 3
#>   ordered: no

When we try to overwrite the original data with the updated version, we get an error because there is a change in factor levels and/or indices. In this specific case, we decided that the change is OK and force the writing by setting strict = FALSE. Notice that the original labels ("blue" and "red") keep their index, while the new level ("green") gets the first available index number.

write_vc(updated, "factor", root)
#> Error: The data was not overwritten because of the issues below.
#> See vignette('version_control', package = 'git2rdata') for more information.
#> 
#> - New factor labels for 'color'.
#> - New indices for 'color'.
fn <- write_vc(updated, "factor", root, strict = FALSE)
#> Warning in write_vc.character(updated, "factor", root, strict = FALSE): - New factor labels for 'color'.
#> - New indices for 'color'.
print_file("factor.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: e0ed4c773b2179346042fef6f8c22c42c22a7c00
#>   data_hash: e75427336dfc55af5623b5769cf3e0a53c64f78e
#> color:
#>   class: factor
#>   labels:
#>   - blue
#>   - green
#>   - red
#>   index:
#>   - 1
#>   - 3
#>   - 2
#>   ordered: no
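
The index bookkeeping shown above can be mimicked with a short base R sketch. It is not the package's code: `update_index()` is a hypothetical helper, and it assumes that "first available" means one more than the largest index in use, which matches the output shown here.

```r
# Hypothetical helper. old_index is a named integer vector (names are the
# labels); new_labels holds the levels of the new factor in their order.
update_index <- function(old_index, new_labels) {
  # Existing labels keep their index.
  kept <- old_index[intersect(new_labels, names(old_index))]
  # New labels get the next free index (assumption: max index in use + 1).
  added <- setdiff(new_labels, names(old_index))
  start <- if (length(old_index) > 0) max(old_index) else 0L
  new <- setNames(start + seq_along(added), added)
  c(kept, new)[new_labels]
}

old_index <- c(blue = 1L, red = 2L)
with_green <- update_index(old_index, c("blue", "green", "red"))
with_green  # blue = 1, green = 3, red = 2

# Dropping "blue" and reordering the levels keeps the remaining indices.
update_index(with_green, c("red", "green"))  # red = 2, green = 3
```

Since existing rows keep their stored index, only rows with the new level show up in the diff of the data file.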

The next example removes the "blue" level and switches the order of the remaining levels. Notice that again the existing indices are retained. The order of the labels and indices reflects their new ordering.

deleted <- data.frame(
  color = factor(c("red", "green"), levels = c("red", "green")))
write_vc(deleted, "factor", root, sorting = "color", strict = FALSE)
#> Warning in write_vc.character(deleted, "factor", root, sorting = "color", : - New factor labels for 'color'.
#> - New indices for 'color'.
#> bfc2a5648af37831b3ba755e3b9a736ff6f0c2eb 938a6a59a09cff4a4f9c825addc11201a00e6952 
#>                             "factor.tsv"                             "factor.yml"
print_file("factor.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: 3cadfe4021fe5e2990d0bb057100c608e3b602fa
#>   data_hash: bfc2a5648af37831b3ba755e3b9a736ff6f0c2eb
#> color:
#>   class: factor
#>   labels:
#>   - red
#>   - green
#>   index:
#>   - 2
#>   - 3
#>   ordered: no

Changing a factor to an ordered factor or vice versa will also keep existing level indices.

ordered <- data.frame(
  color = factor(c("red", "green"), levels = c("red", "green"), ordered = TRUE))
write_vc(ordered, "factor", root, sorting = "color", strict = FALSE)
#> Warning in write_vc.character(ordered, "factor", root, sorting = "color", : -
#> 'color' changes from nominal to ordinal.
#> bfc2a5648af37831b3ba755e3b9a736ff6f0c2eb 4a576396de241f19dbe02134171589f9f0ee57be 
#>                             "factor.tsv"                             "factor.yml"
print_file("factor.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: 57ff604596058d60e97fbb9c93ee6869f32c1850
#>   data_hash: bfc2a5648af37831b3ba755e3b9a736ff6f0c2eb
#> color:
#>   class: factor
#>   labels:
#>   - red
#>   - green
#>   index:
#>   - 2
#>   - 3
#>   ordered: yes

Relabelling a Factor

The example below stores a dataframe, relabels the factor levels and stores it again using write_vc(). Notice that both the labels and the indices are updated. This creates a large diff, whereas updating only the labels would be sufficient.

write_vc(old, "write_vc", root, sorting = "color")
#> c7281c65549ea8f2569c8b6d7f932d84d4f99641 8bf29eeb13475347e15e3a86730b5a3df69f801f 
#>                           "write_vc.tsv"                           "write_vc.yml"
print_file("write_vc.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: 03c3898451e17cf436da59dd0e712606ea63a838
#>   data_hash: c7281c65549ea8f2569c8b6d7f932d84d4f99641
#> color:
#>   class: factor
#>   labels:
#>   - blue
#>   - red
#>   index:
#>   - 1
#>   - 2
#>   ordered: no
relabeled <- old
# translate the color names to Dutch
levels(relabeled$color) <- c("blauw", "rood")
write_vc(relabeled, "write_vc", root, strict = FALSE)
#> Warning in write_vc.character(relabeled, "write_vc", root, strict = FALSE): - New factor labels for 'color'.
#> - New indices for 'color'.
#> bf17088d58f065022f58624c5d2f2b2a627c87c5 7af11e0cb209c7e2b4c7b7d25fd736da20538db5 
#>                           "write_vc.tsv"                           "write_vc.yml"
print_file("write_vc.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: f6730454185caeb173c6883ce56200c376975567
#>   data_hash: bf17088d58f065022f58624c5d2f2b2a627c87c5
#> color:
#>   class: factor
#>   labels:
#>   - blauw
#>   - rood
#>   index:
#>   - 3
#>   - 4
#>   ordered: no

Therefore we created relabel(), which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. change accepts two formats: a list or a dataframe. Each name in the list must match the name of a factor variable in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires the variables factor, old and new, with one row for each change in label.

write_vc(old, "relabel", root, sorting = "color")
#> c7281c65549ea8f2569c8b6d7f932d84d4f99641 8bf29eeb13475347e15e3a86730b5a3df69f801f 
#>                            "relabel.tsv"                            "relabel.yml"
relabel("relabel", root, change = list(color = c(red = "rood", blue = "blauw")))
print_file("relabel.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: bb25c6cc455f6d8e52b7daeb176adf83d8c5b0f9
#>   data_hash: c7281c65549ea8f2569c8b6d7f932d84d4f99641
#> color:
#>   class: factor
#>   labels:
#>   - blauw
#>   - rood
#>   index:
#>   - 1
#>   - 2
#>   ordered: no
relabel("relabel", root, 
        change = data.frame(factor = "color", old = "blauw", new = "blue"))
print_file("relabel.yml", root)
#> ..generic:
#>   git2rdata: '0.1'
#>   optimize: yes
#>   NA string: NA
#>   sorting: color
#>   hash: a4050f89a749abce203ae6e1fe6b41483d385c2d
#>   data_hash: c7281c65549ea8f2569c8b6d7f932d84d4f99641
#> color:
#>   class: factor
#>   labels:
#>   - blue
#>   - rood
#>   index:
#>   - 1
#>   - 2
#>   ordered: no
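
What a label-only change amounts to can be sketched on a plain list in base R. This is a conceptual illustration, not the code of relabel(): `meta` is a hypothetical in-memory stand-in for the factor entry in the metadata file, and `relabel_meta()` is a hypothetical helper.

```r
# Hypothetical stand-in for the 'color' entry of the metadata.
meta <- list(labels = c("blue", "red"), index = c(1L, 2L))

# Hypothetical helper: replace labels, leave the indices untouched.
# change is a named vector: names are old labels, values are new labels.
relabel_meta <- function(meta, change) {
  hit <- match(names(change), meta$labels)
  stopifnot(!anyNA(hit))  # every old label must exist
  meta$labels[hit] <- unname(change)
  meta
}

dutch <- relabel_meta(meta, c(red = "rood", blue = "blauw"))
dutch$labels  # "blauw" "rood"
dutch$index   # 1 2, unchanged
```

Because the indices stay the same, the data file holding those indices does not change at all, which is consistent with the unchanged data_hash in the output above.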

A caveat: relabel() only makes sense when the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Therefore relabel() will only handle the optimized storage.
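
The difference between the two storage modes can be made concrete with a base R sketch. It only mimics the idea: `optimized` and `verbose` are hypothetical stand-ins for what each mode would write to the data file for a single factor column.

```r
color <- factor(c("red", "blue", "red", "red"))
optimized <- as.integer(color)  # indices, as in optimized storage
verbose <- as.character(color)  # labels, as in verbose storage

# Relabel "red" to "rood" without touching the underlying codes.
levels(color)[levels(color) == "red"] <- "rood"

sum(as.integer(color) != optimized)  # 0 rows change in optimized storage
sum(as.character(color) != verbose)  # 3 rows change in verbose storage
```

With stored indices the relabelling touches only the metadata, whereas with stored labels every row carrying the old label is rewritten.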


  1. sensu git2rdata