Title: | Store and Retrieve Data.frames in a Git Repository |
Version: | 0.5.0 |
Description: | The git2rdata package is an R package for writing and reading dataframes as plain text files. A metadata file stores important information. 1) Storing metadata makes it possible to maintain the classes of variables. By default, git2rdata optimizes the data for file storage. The optimization is most effective on data containing factors, but it makes the data less human readable. Users can turn it off when they prefer a human readable format over smaller files. Details on the implementation are available in vignette("plain_text", package = "git2rdata"). 2) Storing metadata also allows smaller row based diffs between two consecutive commits. This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in vignette("version_control", package = "git2rdata"). Although we envisioned git2rdata with a git workflow in mind, you can use it in combination with other version control systems like subversion or mercurial. 3) git2rdata is a useful tool in a reproducible and traceable workflow. vignette("workflow", package = "git2rdata") gives a toy example. 4) vignette("efficiency", package = "git2rdata") provides some insight into the efficiency of file storage, git repository size and speed for writing and reading. |
License: | GPL-3 |
URL: | https://ropensci.github.io/git2rdata/, https://github.com/ropensci/git2rdata/, https://doi.org/10.5281/zenodo.1485309 |
BugReports: | https://github.com/ropensci/git2rdata/issues |
Depends: | R (≥ 4.1.0) |
Imports: | assertthat, git2r (≥ 0.23.0), methods, yaml |
Suggests: | ggplot2, jsonlite, knitr, microbenchmark, rmarkdown, testthat |
VignetteBuilder: | knitr |
Config/checklist/communities: | inbo |
Config/checklist/keywords: | git; version control; plain text data |
Encoding: | UTF-8 |
Language: | en-GB |
RoxygenNote: | 7.3.2 |
Collate: | 'clean_data_path.R' 'data_package.R' 'datahash.R' 'display_metadata.R' 'git2rdata_package.R' 'write_vc.R' 'is_git2rdata.R' 'is_git2rmeta.R' 'list_data.R' 'meta.R' 'print.R' 'prune.R' 'read_vc.R' 'recent_commit.R' 'reexport.R' 'relabel.R' 'rename_variable.R' 'update_metadata.R' 'upgrade_data.R' 'utils.R' 'verify_vc.R' |
NeedsCompilation: | no |
Packaged: | 2025-01-24 16:18:33 UTC; root |
Author: | Thierry Onkelinx |
Maintainer: | Thierry Onkelinx <thierry.onkelinx@inbo.be> |
Repository: | CRAN |
Date/Publication: | 2025-01-24 16:30:02 UTC |
git2rdata: Store and Retrieve Data.frames in a Git Repository
Description
The git2rdata package is an R package for writing and reading dataframes as plain text files. A metadata file stores important information.
1) Storing metadata makes it possible to maintain the classes of variables. By default, git2rdata optimizes the data for file storage. The optimization is most effective on data containing factors, but it makes the data less human readable. Users can turn it off when they prefer a human readable format over smaller files. Details on the implementation are available in vignette("plain_text", package = "git2rdata").
2) Storing metadata also allows smaller row based diffs between two consecutive commits. This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in vignette("version_control", package = "git2rdata"). Although we envisioned git2rdata with a git workflow in mind, you can use it in combination with other version control systems like subversion or mercurial.
3) git2rdata is a useful tool in a reproducible and traceable workflow. vignette("workflow", package = "git2rdata") gives a toy example.
4) vignette("efficiency", package = "git2rdata") provides some insight into the efficiency of file storage, git repository size and speed for writing and reading.
Author(s)
Maintainer: Thierry Onkelinx thierry.onkelinx@inbo.be (ORCID) (Research Institute for Nature and Forest (INBO))
Other contributors:
Floris Vanderhaeghe floris.vanderhaeghe@inbo.be (ORCID) (Research Institute for Nature and Forest (INBO)) [contributor]
Peter Desmet peter.desmet@inbo.be (ORCID) (Research Institute for Nature and Forest (INBO)) [contributor]
Els Lommelen els.lommelen@inbo.be (ORCID) (Research Institute for Nature and Forest (INBO)) [contributor]
Research Institute for Nature and Forest (INBO) info@inbo.be [copyright holder, funder]
See Also
Useful links:
Report bugs at https://github.com/ropensci/git2rdata/issues
Re-exported Function From git2r
Description
See commit in git2r.
See Also
Other version_control: pull(), push(), recent_commit(), repository(), status()
Create a Data Package for a directory of CSV files
Description
Create a datapackage.json file for a directory of CSV files. The function will look for all .csv files in the directory and its subdirectories. It will then create a datapackage.json file with the metadata of each CSV file.
Usage
data_package(path = ".")
Arguments
path |
the directory in which to create the |
See Also
Other storage: display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Display metadata for a git2rdata object
Description
Display metadata for a git2rdata object
Usage
display_metadata(x, minimal = FALSE)
Arguments
x |
a |
minimal |
logical, if |
See Also
Other storage: data_package(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Check Whether a Git2rdata Object is Valid.
Description
A valid git2rdata object has valid metadata.
Usage
is_git2rdata(file, root = ".", message = c("none", "warning", "error"))
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
message |
a single value indicating the type of messages on top of the
logical value. |
Value
A logical value. TRUE in case of a valid git2rdata object. Otherwise FALSE.
See Also
Other internal: is_git2rmeta(), meta(), print.git2rdata(), summary.git2rdata(), upgrade_data()
Examples
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# store a file
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
# check the stored file
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
# Remove the metadata from the existing git2rdata object. Then it stops
# being a git2rdata object.
junk <- file.remove(file.path(root, "iris.yml"))
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
# recreate the file and remove the data and keep the metadata. It stops being
# a git2rdata object, but the metadata remains valid.
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
junk <- file.remove(file.path(root, "iris.tsv"))
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
Check Whether a Git2rdata Object Has Valid Metadata.
Description
Valid metadata is a file with .yml extension. It has a top level item ..generic. This item contains git2rdata (the version number), hash (a hash on the metadata) and data_hash (a hash on the data file). The version number must be the current version.
Usage
is_git2rmeta(file, root = ".", message = c("none", "warning", "error"))
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
message |
a single value indicating the type of messages on top of the
logical value. |
Value
A logical value. TRUE in case of a valid metadata file. Otherwise FALSE.
See Also
Other internal: is_git2rdata(), meta(), print.git2rdata(), summary.git2rdata(), upgrade_data()
Examples
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# store a file
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
# check the stored file
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
# Remove the metadata from the existing git2rdata object. Then it stops
# being a git2rdata object.
junk <- file.remove(file.path(root, "iris.yml"))
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
# recreate the file and remove the data and keep the metadata. It stops being
# a git2rdata object, but the metadata remains valid.
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
junk <- file.remove(file.path(root, "iris.tsv"))
is_git2rmeta("iris", root)
is_git2rdata("iris", root)
List Available Git2rdata Files Containing Data
Description
The function returns the names of all valid git2rdata objects. This implies .tsv files with a matching valid metadata file (.yml). Invalid metadata files result in a warning. The function ignores valid metadata files without matching raw data (.tsv).
Usage
list_data(root = ".", path = ".", recursive = TRUE)
Arguments
root |
the |
path |
relative |
recursive |
logical. Should the listing recurse into directories? |
Value
A character vector of git2rdata object names, including their relative path.
See Also
Other storage: data_package(), display_metadata(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Examples
## on file system
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# store a dataframe as git2rdata object. Capture the result to minimise
# screen output
junk <- write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
# write a standard tab-separated file (non git2rdata object)
write.table(iris, file = file.path(root, "standard.tsv"), sep = "\t")
# write a YAML file
yml <- list(
  authors = list(
    "Research Institute for Nature and Forest" = list(
      href = "https://www.inbo.be/en")))
yaml::write_yaml(yml, file = file.path(root, "_pkgdown.yml"))
# list the git2rdata objects
list_data(root)
# list the files
list.files(root, recursive = TRUE)
# remove all .tsv files from valid git2rdata objects
rm_data(root, path = ".")
# check the removal of the .tsv file
list.files(root, recursive = TRUE)
list_data(root)
# remove dangling git2rdata metadata files
prune_meta(root, path = ".")
# check the removal of the metadata
list.files(root, recursive = TRUE)
list_data(root)
## on git repo
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# store a dataframe
write_vc(iris[1:6, ], "iris", repo, sorting = "Sepal.Length", stage = TRUE)
# check that the dataframe is stored
status(repo)
list_data(repo)
# commit the current version and check the git repo
commit(repo, "add iris data", session = TRUE)
status(repo)
# remove the data files from the repo
rm_data(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
# remove dangling metadata
prune_meta(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
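The pairing rule from the Description (a .tsv file counts only when a matching .yml file exists) can be sketched in base R. This is a minimal illustration, not the implementation of list_data(): the helper name paired_objects() is hypothetical, and unlike list_data() it does not check whether the metadata is valid.

```r
# Hypothetical helper (not part of git2rdata): names of objects that have
# both a .tsv and a .yml file. It performs no validation of the metadata.
paired_objects <- function(root, recursive = TRUE) {
  tsv <- list.files(root, pattern = "\\.tsv$", recursive = recursive)
  yml <- list.files(root, pattern = "\\.yml$", recursive = recursive)
  # strip the extensions and keep the names present in both sets
  intersect(sub("\\.tsv$", "", tsv), sub("\\.yml$", "", yml))
}

# Empty placeholder files that mimic the layout used in the examples above.
demo_root <- tempfile("pairing-")
dir.create(demo_root)
invisible(file.create(
  file.path(demo_root, c("iris.tsv", "iris.yml", "standard.tsv", "_pkgdown.yml"))
))
paired <- paired_objects(demo_root)
paired
```

Only "iris" has both files, so "standard.tsv" and "_pkgdown.yml" are left out of the result.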
Optimize an Object for Storage as Plain Text and Add Metadata
Description
Prepares a vector for storage. When relevant, meta() optimizes the object for storage by changing the format to one which needs fewer characters. The metadata stored in the meta attribute contains all required information to back-transform the optimized format into the original format.
In case of a data.frame, meta() applies itself to each of the columns. The meta attribute becomes a named list containing the metadata for each column plus an additional ..generic element. ..generic is a reserved name for the metadata and not allowed as column name in a data.frame.
write_vc() uses this function to prepare a dataframe for storage. Existing metadata is passed through the optional old argument. This argument is intended for internal use.
Usage
meta(x, ..., digits)
## S3 method for class 'character'
meta(x, na = "NA", optimize = TRUE, ...)
## S3 method for class 'factor'
meta(x, optimize = TRUE, na = "NA", index, strict = TRUE, ...)
## S3 method for class 'logical'
meta(x, optimize = TRUE, ...)
## S3 method for class 'POSIXct'
meta(x, optimize = TRUE, ...)
## S3 method for class 'Date'
meta(x, optimize = TRUE, ...)
## S3 method for class 'data.frame'
meta(
  x,
  optimize = TRUE,
  na = "NA",
  sorting,
  strict = TRUE,
  split_by = character(0),
  ...,
  digits
)
Arguments
x |
the vector. |
... |
further arguments to the methods. |
digits |
The number of significant digits of the smallest absolute
value.
The function applies the rounding automatically.
Only relevant for numeric variables.
Either a single positive integer or a named vector where the names link to
the variables in the |
na |
the string to use for missing values in the data. |
optimize |
If |
index |
An optional named vector with existing factor indices.
The names must match the existing factor levels.
Unmatched levels from |
strict |
What to do when the metadata changes. |
sorting |
an optional vector of column names defining which columns to
use for sorting |
split_by |
An optional vector of variable names to split the text files.
This creates a separate file for every combination.
We prepend these variables to the vector of |
Value
the optimized vector x with meta attribute.
Note
The default order of factor levels depends on the current locale. See Comparison for more details on that. The same code on a different locale might result in a different sorting. meta() ignores, with a warning, any change in the order of factor levels. Add strict = FALSE to enforce the new order of factor levels.
See Also
Other internal: is_git2rdata(), is_git2rmeta(), print.git2rdata(), summary.git2rdata(), upgrade_data()
Examples
meta(c(NA, "'NA'", '"NA"', "abc\tdef", "abc\ndef"))
meta(1:3)
meta(seq(1, 3, length = 4), digits = 6)
meta(factor(c("b", NA, "NA"), levels = c("NA", "b", "c")))
meta(factor(c("b", NA, "a"), levels = c("a", "b", "c")), optimize = FALSE)
meta(factor(c("b", NA, "a"), levels = c("a", "b", "c"), ordered = TRUE))
meta(
  factor(c("b", NA, "a"), levels = c("a", "b", "c"), ordered = TRUE),
  optimize = FALSE
)
meta(c(FALSE, NA, TRUE))
meta(c(FALSE, NA, TRUE), optimize = FALSE)
meta(complex(real = c(1, NA, 2), imaginary = c(3, NA, -1)))
meta(as.POSIXct("2019-02-01 10:59:59", tz = "CET"))
meta(as.POSIXct("2019-02-01 10:59:59", tz = "CET"), optimize = FALSE)
meta(as.Date("2019-02-01"))
meta(as.Date("2019-02-01"), optimize = FALSE)
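The idea behind optimizing a factor (store compact indices, keep the labels in the metadata, and back-transform on reading) can be illustrated with base R alone. This is only a sketch of the principle and assumes nothing about how meta() lays out its meta attribute; the variable names are illustrative.

```r
# A factor with three levels (one unused) and a missing value.
x <- factor(c("b", NA, "a"), levels = c("a", "b", "c"))

# Compact form: integer indices plus a lookup table of labels.
index <- as.integer(x)
labels <- levels(x)

# Back-transformation from the indices and the lookup table.
restored <- factor(labels[index], levels = labels)
identical(restored, x)
```

The round trip preserves the missing value and the unused level, because the lookup table carries all levels rather than only the observed ones.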
Print method for git2rdata objects.
Description
Prints the data and the description of the columns when available.
Usage
## S3 method for class 'git2rdata'
print(x, ...)
Arguments
x |
a |
... |
additional arguments passed to |
See Also
Other internal: is_git2rdata(), is_git2rmeta(), meta(), summary.git2rdata(), upgrade_data()
Prune Metadata Files
Description
Removes all valid metadata (.yml files) from the path when they don't have accompanying data (.tsv file). Invalid metadata triggers a warning without removing the metadata file.
Use this function with caution since it will remove all valid metadata files without asking for confirmation. We strongly recommend using this function on files under version control. See vignette("workflow", package = "git2rdata") for some examples on how to use this.
Usage
prune_meta(root = ".", path = NULL, recursive = TRUE, ...)
## S3 method for class 'git_repository'
prune_meta(root, path = NULL, recursive = TRUE, ..., stage = FALSE)
Arguments
root |
The root of a project. Can be a file path or a |
path |
the directory in which to clean all the data files. The directory
is relative to |
recursive |
remove files in subdirectories too. |
... |
parameters used in some methods |
stage |
stage the changes after removing the files. Defaults to |
Value
Invisibly returns a vector of removed file names. The paths are relative to root.
See Also
Other storage: data_package(), display_metadata(), list_data(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Examples
## on file system
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# store a dataframe as git2rdata object. Capture the result to minimise
# screen output
junk <- write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
# write a standard tab-separated file (non git2rdata object)
write.table(iris, file = file.path(root, "standard.tsv"), sep = "\t")
# write a YAML file
yml <- list(
  authors = list(
    "Research Institute for Nature and Forest" = list(
      href = "https://www.inbo.be/en")))
yaml::write_yaml(yml, file = file.path(root, "_pkgdown.yml"))
# list the git2rdata objects
list_data(root)
# list the files
list.files(root, recursive = TRUE)
# remove all .tsv files from valid git2rdata objects
rm_data(root, path = ".")
# check the removal of the .tsv file
list.files(root, recursive = TRUE)
list_data(root)
# remove dangling git2rdata metadata files
prune_meta(root, path = ".")
# check the removal of the metadata
list.files(root, recursive = TRUE)
list_data(root)
## on git repo
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# store a dataframe
write_vc(iris[1:6, ], "iris", repo, sorting = "Sepal.Length", stage = TRUE)
# check that the dataframe is stored
status(repo)
list_data(repo)
# commit the current version and check the git repo
commit(repo, "add iris data", session = TRUE)
status(repo)
# remove the data files from the repo
rm_data(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
# remove dangling metadata
prune_meta(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
Re-exported Function From git2r
Description
See pull in git2r.
See Also
Other version_control: commit(), push(), recent_commit(), repository(), status()
Re-exported Function From git2r
Description
See push in git2r.
See Also
Other version_control: commit(), pull(), recent_commit(), repository(), status()
Read a Git2rdata Object from Disk
Description
read_vc() handles git2rdata objects stored by write_vc(). It reads and verifies the metadata file (.yml). Then it reads and verifies the raw data. The last step is back-transforming any transformation done by meta() to return the data.frame as stored by write_vc().
read_vc() is an S3 generic on root which currently handles "character" (a path) and "git_repository" (from git2r). S3 methods for other version control systems could be added.
Usage
read_vc(file, root = ".")
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
Value
The data.frame with the file names and hashes as attributes. It has the additional class "git2rdata" to support extra methods to display the descriptions.
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Examples
## on file system
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# write a dataframe to the directory
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)
# check that a data file (.tsv) and a metadata file (.yml) exist.
list.files(root, recursive = TRUE)
# read the git2rdata object from the directory
read_vc("iris", root)
# store a new version with different observations but the same metadata
write_vc(iris[1:5, ], "iris", root)
list.files(root, recursive = TRUE)
# Removing a column requires new metadata.
# Add strict = FALSE to override the existing metadata.
write_vc(
  iris[1:6, -2], "iris", root, sorting = "Sepal.Length", strict = FALSE
)
list.files(root, recursive = TRUE)
# storing the original version again requires another update of the metadata
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Width", strict = FALSE)
list.files(root, recursive = TRUE)
# optimize = FALSE stores the data in a more verbose format. This requires
# larger files.
write_vc(
  iris[1:6, ], "iris2", root, sorting = "Sepal.Width", optimize = FALSE
)
list.files(root, recursive = TRUE)
## on git repo using a git2r::git-repository
# initialise a git repo using the git2r package
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# store a dataframe in git repo.
write_vc(iris[1:6, ], file = "iris", root = repo, sorting = "Sepal.Length")
# This git2rdata object is not staged by default.
status(repo)
# read a dataframe from a git repo
read_vc("iris", repo)
# store a new version in the git repo and stage it in one go
write_vc(iris[1:5, ], "iris", repo, stage = TRUE)
status(repo)
# store a verbose version in a different git2rdata object
write_vc(
  iris[1:6, ], "iris2", repo, sorting = "Sepal.Width", optimize = FALSE
)
status(repo)
Retrieve the Most Recent File Change
Description
Retrieve the most recent commit that added or updated a file or git2rdata object. This does not imply that the file still exists at the current HEAD as it ignores the deletion of files.
Use this information to document the current version of a file or git2rdata object in an analysis. Since it refers to the most recent change of this file, it remains unchanged by committing changes to other files. You can also use it to track if data got updated, requiring an analysis to be rerun. See vignette("workflow", package = "git2rdata").
Usage
recent_commit(file, root, data = FALSE)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
data |
does |
Value
a data.frame with commit, author and when for the most recent commit that adds or updates the file.
See Also
Other version_control: commit(), pull(), push(), repository(), status()
Examples
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# write and commit a first dataframe
# store the output of write_vc() to minimise screen output
junk <- write_vc(
  iris[1:6, ], "iris", repo, sorting = "Sepal.Length", stage = TRUE,
  digits = 6
)
commit(repo, "important analysis", session = TRUE)
list.files(repo_path)
Sys.sleep(1.1) # required because git doesn't handle subsecond timings
# write and commit a second dataframe
junk <- write_vc(
  iris[7:12, ], "iris2", repo, sorting = "Sepal.Length", stage = TRUE,
  digits = 6
)
commit(repo, "important analysis", session = TRUE)
list.files(repo_path)
Sys.sleep(1.1) # required because git doesn't handle subsecond timings
# write and commit a new version of the first dataframe
junk <- write_vc(iris[7:12, ], "iris", repo, stage = TRUE)
list.files(repo_path)
commit(repo, "important analysis", session = TRUE)
# find out in which commit a file was last changed
# "iris.tsv" was last updated in the third commit
recent_commit("iris.tsv", repo)
# "iris.yml" was last updated in the first commit
recent_commit("iris.yml", repo)
# "iris2.yml" was last updated in the second commit
recent_commit("iris2.yml", repo)
# the git2rdata object "iris" was last updated in the third commit
recent_commit("iris", repo, data = TRUE)
# remove a dataframe and commit it to see what happens with deleted files
file.remove(file.path(repo_path, "iris.tsv"))
prune_meta(repo, ".")
commit(repo, message = "remove iris", all = TRUE, session = TRUE)
list.files(repo_path)
# still points to the third commit as this is the latest commit in which the
# data was present
recent_commit("iris", repo, data = TRUE)
Relabel Factor Levels by Updating the Metadata
Description
Imagine the situation where we have a dataframe with a factor variable and we have stored it with write_vc(optimize = TRUE). The raw data file contains the factor indices and the metadata contains the link between the factor index and the corresponding label. See vignette("version_control", package = "git2rdata"). In such a case, relabelling a factor can be fast and lightweight by updating the metadata.
Usage
relabel(file, root = ".", change)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
change |
either a |
Value
invisible NULL.
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), rename_variable(), rm_data(), update_metadata(), verify_vc(), write_vc()
Examples
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# Create a dataframe and store it as an optimized git2rdata object.
# Note that write_vc() uses optimization by default.
# Stage and commit the git2rdata object.
ds <- data.frame(
  a = c("a1", "a2"),
  b = c("b2", "b1"),
  stringsAsFactors = TRUE
)
junk <- write_vc(ds, "relabel", repo, sorting = "b", stage = TRUE)
cm <- commit(repo, "initial commit")
# check that the workspace is clean
status(repo)
# Define new labels as a list and apply them to the git2rdata object.
new_labels <- list(
  a = list(a2 = "a3")
)
relabel("relabel", repo, new_labels)
# check the changes
read_vc("relabel", repo)
# relabel() changed the metadata, not the raw data
status(repo)
git2r::add(repo, "relabel.*")
cm <- commit(repo, "relabel using a list")
# Define new labels as a dataframe and apply them to the git2rdata object
change <- data.frame(
  factor = c("a", "a", "b"),
  old = c("a3", "a1", "b2"),
  new = c("c2", "c1", "b3"),
  stringsAsFactors = TRUE
)
relabel("relabel", repo, change)
# check the changes
read_vc("relabel", repo)
# relabel() changed the metadata, not the raw data
status(repo)
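Why relabelling leaves the raw data untouched can be seen with a plain factor in base R. The sketch below is an analogy under the assumption that the stored indices behave like the integer codes of a factor; it does not call relabel() and does not touch any files.

```r
# A factor comparable to column "a" in the example above.
a <- factor(c("a1", "a2"))
index_before <- as.integer(a)

# Relabel "a2" to "a3" by editing only the lookup table of labels.
levels(a)[levels(a) == "a2"] <- "a3"
index_after <- as.integer(a)

# The indices (the analogue of the raw data) are unchanged.
identical(index_before, index_after)
as.character(a)
```

Only the label table changes, which is why a relabel shows up as a small diff in the metadata rather than a rewrite of every affected row.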
Rename a Variable
Description
The raw data file contains a header with the variable names. The metadata lists the variable names and their type. Changing a variable name and overwriting the git2rdata object will result in an error, because it will look like removing an existing variable and adding a new one. Overwriting the object with strict = FALSE potentially changes the order of the variables, leading to a large diff.
Usage
rename_variable(file, change, root = ".", ...)
## S3 method for class 'character'
rename_variable(file, change, root = ".", ...)
## Default S3 method:
rename_variable(file, change, root, ...)
## S3 method for class 'git_repository'
rename_variable(file, change, root, ..., stage = FALSE, force = FALSE)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
change |
A named vector with the old names as values and the new names as names. |
root |
The root of a project. Can be a file path or a |
... |
parameters used in some methods |
stage |
Logical value indicating whether to stage the changes after
writing the data. Defaults to |
force |
Add ignored files. Default is FALSE. |
Details
This function solves this by only updating the raw data header and the metadata.
Value
invisible NULL.
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rm_data(), update_metadata(), verify_vc(), write_vc()
Examples
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# Create a dataframe and store it as an optimized git2rdata object.
# Note that write_vc() uses optimization by default.
# Stage and commit the git2rdata object.
ds <- data.frame(
  a = c("a1", "a2"),
  b = c("b2", "b1"),
  stringsAsFactors = TRUE
)
junk <- write_vc(ds, "rename", repo, sorting = "b", stage = TRUE)
cm <- commit(repo, "initial commit")
# check that the workspace is clean
status(repo)
# Define change.
change <- c(new_name = "a")
rename_variable(file = "rename", change = change, root = repo)
# check the changes
read_vc("rename", repo)
status(repo)
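The convention for change (new names as the names of the vector, old names as its values) is easy to get backwards. The base R sketch below applies such a vector to an in-memory data.frame; it only illustrates the convention and is not how rename_variable() updates the files on disk.

```r
ds <- data.frame(a = c("a1", "a2"), b = c("b2", "b1"))

# names of the vector are the new names, values are the old names
change <- c(new_name = "a")

# locate the old names and fail early when one of them is missing
position <- match(change, names(ds))
stopifnot(!anyNA(position))

# rename in place, which keeps the column order intact
names(ds)[position] <- names(change)
names(ds)
```

Renaming in place keeps the columns in their original order, which matches the motivation given in the Description for avoiding a large diff.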
Re-exported Function From git2r
Description
See repository in git2r.
See Also
Other version_control: commit(), pull(), push(), recent_commit(), status()
Remove Data Files From Git2rdata Objects
Description
Remove the data (.tsv) file from all valid git2rdata objects at the path. The metadata remains untouched. A warning lists any git2rdata object with invalid metadata. The function keeps any .tsv file with invalid metadata or from non-git2rdata objects.
Use this function with caution since it will remove all valid data files without asking for confirmation. We strongly recommend using this function on files under version control. See vignette("workflow", package = "git2rdata") for some examples on how to use this.
Usage
rm_data(root = ".", path = NULL, recursive = TRUE, ...)
## S3 method for class 'git_repository'
rm_data(
  root,
  path = NULL,
  recursive = TRUE,
  ...,
  stage = FALSE,
  type = c("unmodified", "modified", "ignored", "all")
)
Arguments
root |
The root of a project. Can be a file path or a |
path |
the directory in which to clean all the data files. The directory
is relative to |
recursive |
remove files in subdirectories too. |
... |
parameters used in some methods |
stage |
stage the changes after removing the files. Defaults to FALSE. |
type |
Defines the classes of files to remove. |
Value
Invisibly returns a vector of removed file names. The paths are relative to root.
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), update_metadata(), verify_vc(), write_vc()
Examples
## on file system
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# store a dataframe as git2rdata object. Capture the result to minimise
# screen output
junk <- write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Length")
# write a standard tab-separated file (non git2rdata object)
write.table(iris, file = file.path(root, "standard.tsv"), sep = "\t")
# write a YAML file
yml <- list(
  authors = list(
    "Research Institute for Nature and Forest" = list(
      href = "https://www.inbo.be/en")))
yaml::write_yaml(yml, file = file.path(root, "_pkgdown.yml"))
# list the git2rdata objects
list_data(root)
# list the files
list.files(root, recursive = TRUE)
# remove all .tsv files from valid git2rdata objects
rm_data(root, path = ".")
# check the removal of the .tsv file
list.files(root, recursive = TRUE)
list_data(root)
# remove dangling git2rdata metadata files
prune_meta(root, path = ".")
# check the removal of the metadata
list.files(root, recursive = TRUE)
list_data(root)
## on git repo
# initialise a git repo using git2r
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# store a dataframe
write_vc(iris[1:6, ], "iris", repo, sorting = "Sepal.Length", stage = TRUE)
# check that the dataframe is stored
status(repo)
list_data(repo)
# commit the current version and check the git repo
commit(repo, "add iris data", session = TRUE)
status(repo)
# remove the data files from the repo
rm_data(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
# remove dangling metadata
prune_meta(repo, path = ".")
# check the removal
list_data(repo)
status(repo)
Re-exported Function From git2r
Description
See status in git2r.
See Also
Other version_control: commit(), pull(), push(), recent_commit(), repository()
Summary method for git2rdata objects.
Description
Prints the summary of the data and the description of the columns when available.
Usage
## S3 method for class 'git2rdata'
summary(object, ...)
Arguments
object |
a |
... |
additional arguments passed to |
See Also
Other internal: is_git2rdata(), is_git2rmeta(), meta(), print.git2rdata(), upgrade_data()
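The summary method above has no example in this page. The following is a minimal sketch of how it might be used, assuming that read_vc() returns an object that inherits from the git2rdata class. The directory and the object name "iris" are illustrative only.

```r
library(git2rdata)

# create a temporary directory and store a small dataframe in it
root <- tempfile("git2rdata-")
dir.create(root)
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)

# read the object back; the result is assumed to carry the git2rdata class
x <- read_vc("iris", root)
class(x)

# dispatches to summary.git2rdata() when x inherits from git2rdata
summary(x)
```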
Update the description of a git2rdata object
Description
Updates the description of the fields, the table name, the title,
and the description of a git2rdata object.
All arguments are optional.
Setting an argument to NA or an empty string removes the corresponding
field from the metadata.
Usage
update_metadata(
  file,
  root = ".",
  field_description,
  name,
  title,
  description,
  ...
)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
field_description |
a named character vector with the new descriptions for the fields. The names of the vector must match the variable names. |
name |
a character string with the new table name of the object. |
title |
a character string with the new title of the object. |
description |
a character string with the new description of the object. |
... |
parameters used in some methods |
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), verify_vc(), write_vc()
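No example accompanies update_metadata() in this page. The sketch below shows one plausible call based on the Usage and Arguments above. The title, description and field descriptions are made-up illustration values, and the YAML file is inspected with base R rather than with a package helper.

```r
library(git2rdata)

# create a temporary directory and store a small dataframe in it
root <- tempfile("git2rdata-")
dir.create(root)
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)

# add a table name, a title, a description and two field descriptions
# (all values are hypothetical examples)
update_metadata(
  file = "iris", root = root,
  field_description = c(
    Sepal.Length = "Length of the sepal in cm",
    Species = "Species of the iris flower"
  ),
  name = "iris",
  title = "Iris subset",
  description = "First six rows of the iris dataset"
)

# inspect the metadata file as plain text
meta_file <- file.path(root, "iris.yml")
writeLines(readLines(meta_file))

# the object remains readable after the update
updated <- read_vc("iris", root)
```

Per the Description, passing NA or an empty string for one of these arguments in a later call should remove that field again.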
Upgrade Files to the New Version
Description
Updates the data written by older versions to the current data format
standard. Works both on a single file and (recursively) on a path. The
".yml" file must contain a "..generic" element. upgrade_data() ignores
all other files.
Usage
upgrade_data(file, root = ".", verbose, ..., path)
## S3 method for class 'git_repository'
upgrade_data(
  file,
  root = ".",
  verbose = TRUE,
  ...,
  path,
  stage = FALSE,
  force = FALSE
)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
verbose |
display a message with the update status. Defaults to |
... |
parameters used in some methods |
path |
specify |
stage |
Logical value indicating whether to stage the changes after
writing the data. Defaults to |
force |
Add ignored files. Default is FALSE. |
Value
the git2rdata object names.
See Also
Other internal: is_git2rdata(), is_git2rmeta(), meta(), print.git2rdata(), summary.git2rdata()
Examples
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# write dataframes to the root
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)
write_vc(
  iris[5:10, ], file = "subdir/iris", root = root, sorting = "Sepal.Length",
  digits = 6
)
# upgrade a single git2rdata object
upgrade_data(file = "iris", root = root)
# use path = "." to upgrade all git2rdata objects under root
upgrade_data(path = ".", root = root)
Read a file and verify the presence of variables
Description
Reads the file with read_vc().
Then verifies that every variable listed in variables is present in the
data.frame.
Usage
verify_vc(file, root, variables)
Arguments
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
variables |
a character vector with variable names. |
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), write_vc()
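This page gives no example for verify_vc(). A minimal sketch follows, assuming that the function returns the data when all variables are present and signals an error otherwise. The variable name "Petal.Colour" is deliberately absent from the data to trigger the failure case.

```r
library(git2rdata)

# create a temporary directory and store a small dataframe in it
root <- tempfile("git2rdata-")
dir.create(root)
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)

# all requested variables exist, so the data is read as usual
checked <- verify_vc(
  file = "iris", root = root, variables = c("Sepal.Length", "Species")
)

# "Petal.Colour" is not a column of iris; capture the resulting error
result <- tryCatch(
  verify_vc(file = "iris", root = root, variables = "Petal.Colour"),
  error = function(e) e
)
inherits(result, "error")
```

Such a check can be useful at the start of an analysis script, where a missing column should stop the run early rather than fail further down.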
Store a Data.Frame as a Git2rdata Object on Disk
Description
A git2rdata object consists of two files.
The ".tsv" file contains the raw data as a plain text tab separated file.
The ".yml" file contains the metadata on the columns in plain text YAML format.
See vignette("plain_text", package = "git2rdata") for more details on the
implementation.
Usage
write_vc(
  x,
  file,
  root = ".",
  sorting,
  strict = TRUE,
  optimize = TRUE,
  na = "NA",
  ...,
  split_by
)
## S3 method for class 'character'
write_vc(
  x,
  file,
  root = ".",
  sorting,
  strict = TRUE,
  optimize = TRUE,
  na = "NA",
  ...,
  append = FALSE,
  split_by = character(0),
  digits
)
## S3 method for class 'git_repository'
write_vc(
  x,
  file,
  root,
  sorting,
  strict = TRUE,
  optimize = TRUE,
  na = "NA",
  ...,
  stage = FALSE,
  force = FALSE
)
Arguments
x |
the |
file |
the name of the git2rdata object. Git2rdata objects cannot
have dots in their name. The name may include a relative path. |
root |
The root of a project. Can be a file path or a |
sorting |
an optional vector of column names defining which columns to
use for sorting |
strict |
What to do when the metadata changes. |
optimize |
If |
na |
the string to use for missing values in the data. |
... |
parameters used in some methods |
split_by |
An optional vector of variable names used to split the text files.
This creates a separate file for every combination.
We prepend these variables to the vector of |
append |
logical. Only relevant if |
digits |
The number of significant digits of the smallest absolute
value.
The function applies the rounding automatically.
Only relevant for numeric variables.
Either a single positive integer or a named vector where the names link to
the variables in the |
stage |
Logical value indicating whether to stage the changes after
writing the data. Defaults to |
force |
Add ignored files. Default is FALSE. |
Value
a named vector with the file paths relative to root. The names
contain the hashes of the files.
Note
..generic is a reserved name for the metadata and is a forbidden
column name in a data.frame.
See Also
Other storage: data_package(), display_metadata(), list_data(), prune_meta(), read_vc(), relabel(), rename_variable(), rm_data(), update_metadata(), verify_vc()
Examples
## on file system
# create a directory
root <- tempfile("git2rdata-")
dir.create(root)
# write a dataframe to the directory
write_vc(
  iris[1:6, ], file = "iris", root = root, sorting = "Sepal.Length",
  digits = 6
)
# check that a data file (.tsv) and a metadata file (.yml) exist.
list.files(root, recursive = TRUE)
# read the git2rdata object from the directory
read_vc("iris", root)
# store a new version with different observations but the same metadata
write_vc(iris[1:5, ], "iris", root)
list.files(root, recursive = TRUE)
# Removing a column requires new metadata.
# Add strict = FALSE to override the existing metadata.
write_vc(
  iris[1:6, -2], "iris", root, sorting = "Sepal.Length", strict = FALSE
)
list.files(root, recursive = TRUE)
# storing the original version again requires another update of the metadata
write_vc(iris[1:6, ], "iris", root, sorting = "Sepal.Width", strict = FALSE)
list.files(root, recursive = TRUE)
# optimize = FALSE stores the data in a more verbose, human readable format.
# This requires larger files.
write_vc(
  iris[1:6, ], "iris2", root, sorting = "Sepal.Width", optimize = FALSE
)
list.files(root, recursive = TRUE)
## on git repo using a git2r::git-repository
# initialise a git repo using the git2r package
repo_path <- tempfile("git2rdata-repo-")
dir.create(repo_path)
repo <- git2r::init(repo_path)
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")
# store a dataframe in git repo.
write_vc(iris[1:6, ], file = "iris", root = repo, sorting = "Sepal.Length")
# This git2rdata object is not staged by default.
status(repo)
# read a dataframe from a git repo
read_vc("iris", repo)
# store a new version in the git repo and stage it in one go
write_vc(iris[1:5, ], "iris", repo, stage = TRUE)
status(repo)
# store a verbose version in a different git2rdata object
write_vc(
  iris[1:6, ], "iris2", repo, sorting = "Sepal.Width", optimize = FALSE
)
status(repo)
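The examples above do not cover the split_by argument. The following sketch shows how it might be used, assuming that split_by accepts a column name and that read_vc() reassembles the parts into one dataframe. The object name "iris_split" and the selected rows are illustrative only, and the exact file layout on disk is left for list.files() to show.

```r
library(git2rdata)

# create a temporary directory
root <- tempfile("git2rdata-")
dir.create(root)

# take three rows of each species so that every split contains data
x <- iris[c(1:3, 51:53, 101:103), ]

# store the data, writing a separate text file per value of Species
write_vc(
  x, file = "iris_split", root = root, sorting = "Sepal.Length",
  split_by = "Species", digits = 6
)

# inspect which files the call produced
list.files(root, recursive = TRUE)

# read the object back as a single dataframe
y <- read_vc("iris_split", root)
nrow(y)
```

Splitting can keep individual files small when one grouping variable partitions a large table, so that a change in one group only touches that group's file.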