Maintaining variable classes
R offers several ways to store dataframes as plain text files. Base R has `write.table()` and its companions such as `write.csv()`. Other options include `data.table::fwrite()`, `readr::write_delim()`, `readr::write_csv()` and `readr::write_tsv()`. Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between `character` and `factor` is lost in translation. Before R 4.0.0, `read.table()` converted all strings to factors by default; since R 4.0.0 it keeps them as character unless `stringsAsFactors = TRUE` is set. `readr::read_csv()` keeps all strings as character by default. The factor levels are also lost. These functions determine factor levels based on the levels observed in the plain text file. Hence factor levels without observations disappear. The order of the factor levels is likewise determined by the levels available in the plain text file, and can differ from the original order.
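The loss of information described above can be reproduced with base R alone. The following is a minimal sketch; the dataframe `df` and its column names are made up for illustration.

```r
# Hypothetical example data: a factor with a custom level order and an
# unused level ("medium"), plus a plain character column
df <- data.frame(
  id = 1:3,
  size = factor(c("small", "large", "small"),
                levels = c("small", "medium", "large")),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Read the file back, explicitly asking for strings as factors
back <- read.csv(path, stringsAsFactors = TRUE)

levels(df$size)    # "small" "medium" "large"
levels(back$size)  # "large" "small": unused level gone, order changed
class(df$label)    # "character"
class(back$label)  # "factor": the character column became a factor
```

Reading with `stringsAsFactors = FALSE` instead would keep `label` as character, but would then turn `size` into a character vector as well.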
The `write_vc()` and `read_vc()` functions from `git2rdata` keep track of the class of each variable and, in case of a factor, also of the factor levels and their order. Hence this function pair preserves the information content of the dataframe. The `vc` suffix stands for version control, as these functions reach their full potential in combination with a version control system.

Efficiency in terms of storage and time

### Optimizing file storage
Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. By default, `write_vc()` minimizes the file size as much as possible prior to writing. Since we use a tab-delimited file format, we can omit quotes around character variables. This saves 2 bytes per row for each character variable. Quotes are added automatically in the exceptional cases where they are needed, e.g. to store a string that contains tab or newline characters. In such cases, quotes are only used in the row-variable combinations where the exception occurs.
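The 2 bytes per row per character variable can be checked with a small base R experiment. This sketch uses `write.table()` rather than `write_vc()`, so it only illustrates the principle; the example dataframe is hypothetical. Note that `write.table()` with `quote = TRUE` also quotes the column names, which adds 2 bytes per column for the header line.

```r
# Hypothetical example data with two character variables
df <- data.frame(
  name = c("alpha", "beta", "gamma"),
  city = c("Ghent", "Brussels", "Leuven"),
  stringsAsFactors = FALSE
)

quoted <- tempfile(fileext = ".tsv")
unquoted <- tempfile(fileext = ".tsv")

# Same tab-delimited content, with and without quotes
write.table(df, quoted, sep = "\t", row.names = FALSE, quote = TRUE)
write.table(df, unquoted, sep = "\t", row.names = FALSE, quote = FALSE)

# Two quote characters per cell, for every data row plus the header row
saved <- file.size(quoted) - file.size(unquoted)
saved                          # 16
2 * ncol(df) * (nrow(df) + 1)  # 16
```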
Since we store the class of each variable, we can reduce the file size further by applying the following rules:
- `logical` is written as 0 (FALSE), 1 (TRUE) or NA to the data.
- `factor` is stored as its indices in the data. The index and labels of levels and their order are stored in the metadata.
- `POSIXct` is written as a numeric to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.
- `Date` is written as an integer to the data. The class and the origin are stored in the metadata.
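The idea behind these rules can be sketched with base R conversions. This is an illustration of the encoding only, not the internal code of `git2rdata`; all object names are made up for the example, and the origin `"1970-01-01"` is assumed here because it is what base R uses.

```r
# logical -> 0 / 1 / NA
flag <- c(TRUE, FALSE, NA)
as.integer(flag)                       # 1 0 NA

# factor -> indices; labels and their order would go to the metadata
size <- factor(c("small", "large", "small"),
               levels = c("small", "medium", "large"))
size_index <- as.integer(size)         # 1 3 1
size_labels <- levels(size)            # "small" "medium" "large"

# POSIXct -> seconds since the origin, in UTC
stamp <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")
stamp_number <- as.numeric(stamp)      # 1577880000

# Date -> days since the origin
day <- as.Date("2020-01-01")
day_number <- as.integer(day)          # 18262

# Reversing the conversion with the stored class information
size_back <- factor(size_index, levels = seq_along(size_labels),
                    labels = size_labels)
stamp_back <- as.POSIXct(stamp_number, origin = "1970-01-01", tz = "UTC")
day_back <- as.Date(day_number, origin = "1970-01-01")
```

Because the labels, their order and the origin are kept separately, the numbers in the data are enough to rebuild the original variables, including the unused level `"medium"`.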
Storing `factor`, `POSIXct` and `Date` variables as numbers rather than as their printed values makes the file less readable for users. The user can turn off this optimization when readability is more important than file size.
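As a rough sketch of how this looks in practice, the example below writes the same dataframe once with the default optimization and once with `optimize = FALSE`, then reads both back. It assumes the `git2rdata` package is installed and uses a temporary directory instead of a git repository as `root`; the dataframe and file names are hypothetical.

```r
library(git2rdata)

# Hypothetical example data: a factor with an unused level and a Date
df <- data.frame(
  id = 1:3,
  size = factor(c("small", "large", "small"),
                levels = c("small", "medium", "large")),
  day = as.Date("2020-01-01") + 0:2,
  stringsAsFactors = FALSE
)

# A plain directory serves as root for this sketch
root <- tempfile("git2rdata-")
dir.create(root)

# Default: optimized, compact storage
write_vc(df, file = "compact", root = root, sorting = "id")
# Optimization turned off: more readable data file
write_vc(df, file = "verbose", root = root, sorting = "id",
         optimize = FALSE)

compact <- read_vc("compact", root = root)
verbose <- read_vc("verbose", root = root)

# Both variants should return the original classes and factor levels
levels(compact$size)
levels(verbose$size)
class(verbose$day)
```

Opening the files listed by `list.files(root)` in a text editor shows the difference in readability between the two variants.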