The W3C’s CSV on the Web Working Group produced a series of recommendations for working with tabular data on the web.
The csvwr library implements parts of this standard in R. The overall goal of the project is to support reading and writing of annotated CSV tables, in order to ensure consistent processing and reduce the amount of manual work needed to parse and prepare data before it can be used in analysis.
Practically speaking, you annotate a csv file by providing an accompanying json document containing the metadata. We benefit from annotating tables with csvw because:
This package includes some example csv and json for you to explore. Here is the csv file:
library(csvwr)
compsci_csv <- csvwr_example("computer-scientists.csv")
cat(readLines(compsci_csv),sep="\n")
#> Name,Date Of Birth
#> Barbara Liskov,1939-11-07
#> Evelyn Boyd Granville,1924-05-01
#> Ada Lovelace,1815-12-10
Here is the annotation:
compsci_json <- csvwr_example("computer-scientists.json")
cat(readLines(compsci_json),sep="\n")
#> {
#> "@context": "http://www.w3.org/ns/csvw",
#> "tables": [{
#> "url": "computer-scientists.csv",
#> "tableSchema": {
#> "columns": [{
#> "name": "name",
#> "titles": "Name",
#> "datatype": "string",
#> "propertyUrl": "foaf:name"
#> }, {
#> "name": "dob",
#> "titles": "Date Of Birth",
#> "datatype": "date",
#> "propertyUrl": "schema:birthDate"
#> }],
#> "aboutUrl": "http://example.org/computer-scientsts/{#name}",
#> "primaryKey": "name"
#> }
#> }]
#> }
You can get up and running quickly using the read_csvw_dataframe function:
d <- read_csvw_dataframe(compsci_csv, compsci_json)
This parses the csvw metadata and uses it to parse and interpret the csv file. The result is a data frame whose columns are named and typed according to the table's schema. Here you can see that the "Date Of Birth" field has been given a syntactically valid variable name and has been parsed into a date vector automatically:
str(d)
#> tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:3] "Barbara Liskov" "Evelyn Boyd Granville" "Ada Lovelace"
#> $ dob : Date[1:3], format: "1939-11-07" "1924-05-01" ...
knitr::kable(d)
| name | dob |
|---|---|
| Barbara Liskov | 1939-11-07 |
| Evelyn Boyd Granville | 1924-05-01 |
| Ada Lovelace | 1815-12-10 |
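For comparison, here is a rough sketch of the manual steps that the annotation replaces, using only base R. The csv content is copied inline from the example above; the new column names are typed in by hand rather than read from the metadata, which is exactly the kind of work the schema is meant to save.

```r
# Inline copy of the example csv shown above
csv_text <- "Name,Date Of Birth
Barbara Liskov,1939-11-07
Evelyn Boyd Granville,1924-05-01
Ada Lovelace,1815-12-10"

# Read every column as text, keeping the original headers untouched
manual <- read.csv(text = csv_text, check.names = FALSE,
                   colClasses = "character")

# Rename the columns by hand to match the "name" properties in the schema
names(manual) <- c("name", "dob")

# Convert the ISO 8601 strings into a Date vector
manual$dob <- as.Date(manual$dob, format = "%Y-%m-%d")

str(manual)
```

Each of these steps has to be repeated, and kept in sync, for every table you read without an annotation.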
This function assumes that you’re only interested in one table, and that you don’t want to work with the csvw metadata itself. The metadata vocabulary allows us to describe groups of tables (useful for lookup tables). You can get the table and annotations using the read_csvw function:
csvw <- read_csvw(compsci_csv, compsci_json)
This returns a nested list, which broadly follows the structure of the csvw metadata.
As you can see from the json above, the tables element provides a list of tables, and the annotation for each table provides a url (this can be used to locate csv files from the json metadata alone) and a tableSchema.
Within the tableSchema we have annotations for each column. We parse these into a data frame (instead of a list of lists, which is how the jsonlite library would ordinarily interpret a json array of objects). This is much more idiomatic for manipulation in R.
csvw$tables[[1]]$tableSchema$columns
#> # A tibble: 2 × 5
#> name titles datatype propertyUrl required
#> <chr> <chr> <list> <chr> <lgl>
#> 1 name Name <chr [1]> foaf:name FALSE
#> 2 dob Date Of Birth <chr [1]> schema:birthDate FALSE
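To illustrate why a data frame is the more convenient shape, the sketch below contrasts the two representations in base R. The `columns_list` object is a hand-written stand-in for a json array of column objects, and the conversion shown is only an illustration, not necessarily how csvwr performs it.

```r
# Stand-in for a json array of column objects held as a list of lists
columns_list <- list(
  list(name = "name", titles = "Name", datatype = "string"),
  list(name = "dob", titles = "Date Of Birth", datatype = "date")
)

# With a list of lists, a simple lookup needs explicit iteration
date_cols_from_list <- vapply(
  Filter(function(col) col$datatype == "date", columns_list),
  function(col) col$name,
  character(1)
)

# Bind the same annotations into a data frame, one row per column
columns_df <- do.call(
  rbind,
  lapply(columns_list, as.data.frame, stringsAsFactors = FALSE)
)

# With a data frame, the same lookup is ordinary vectorised subsetting
date_cols_from_df <- columns_df$name[columns_df$datatype == "date"]
```

Both lookups find the same column, but the data frame version reads like any other R data manipulation.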
We also introduce another element to each table, named dataframe. This provides the result of parsing the csv table using the schema provided in the json:
csvw$tables[[1]]$dataframe
#> # A tibble: 3 × 2
#> name dob
#> <chr> <date>
#> 1 Barbara Liskov 1939-11-07
#> 2 Evelyn Boyd Granville 1924-05-01
#> 3 Ada Lovelace 1815-12-10
The function read_csvw_dataframe is just a convenience wrapper for calling read_csvw and extracting this data frame.
You can of course write json metadata by hand, but if you already have your table as a data frame in R then you can use it to get a head start.
If you provide the derive_table_schema function with a data frame it will prepare some table annotations.
d <- data.frame(x=c("a","b","c"), y=1:3)
(s <- derive_table_schema(d))
#> $columns
#> name titles datatype
#> 1 x x string
#> 2 y y integer
Notice that the column names, titles, and datatypes have been derived from the data frame.
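To give a feel for what such a derivation involves, here is a minimal base R sketch. The helper `sketch_schema` and its class-to-datatype mapping are hypothetical and written only for illustration; they are not csvwr's implementation, and `derive_table_schema` may handle more types or handle them differently.

```r
# Hypothetical helper: derive column annotations from a data frame
sketch_schema <- function(df) {
  # Assumed mapping from R classes to csvw datatype names
  type_map <- c(character = "string", integer = "integer",
                numeric = "number", logical = "boolean", Date = "date")

  # Use the first class of each column to look up a datatype
  classes <- vapply(df, function(col) class(col)[1], character(1))
  datatype <- unname(type_map[classes])

  # Fall back to "string" for any class not covered by the mapping
  datatype[is.na(datatype)] <- "string"

  list(columns = data.frame(name = names(df), titles = names(df),
                            datatype = datatype, stringsAsFactors = FALSE))
}

d <- data.frame(x = c("a", "b", "c"), y = 1:3, stringsAsFactors = FALSE)
sketch <- sketch_schema(d)
sketch
```

For this small data frame the sketch arrives at the same names, titles, and datatypes as the output shown above.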
You can of course refine the schema further if you wish (e.g. to declare further constraints on the datatypes or to add uri templates).
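As a sketch of what such a refinement might look like, the example below edits the schema as an ordinary R list. The object `s` is rebuilt by hand here as a stand-in for the derived schema printed above, and the URI template is a made-up example. The property names `required`, `primaryKey`, and `aboutUrl` are the ones that appear in the example annotation and column listing earlier; how csvwr's other functions treat them is not shown here.

```r
# Stand-in for the schema returned by derive_table_schema above
s <- list(columns = data.frame(
  name = c("x", "y"),
  titles = c("x", "y"),
  datatype = c("string", "integer"),
  stringsAsFactors = FALSE
))

# Declare that column x must always have a value
s$columns$required <- c(TRUE, FALSE)

# Identify rows by column x and give each row a URI (hypothetical template)
s$primaryKey <- "x"
s$aboutUrl <- "http://example.org/item/{x}"

str(s)
```

Because the schema is plain R data, the usual list and data frame tools are all that is needed to adjust it.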
You can then pass this to create_metadata to build up a complete annotation.
First we build the table description. This requires that we provide a URL for where the csv can be found.
For example, if we save the data frame to a local file:
write.csv(d, "table.csv", row.names=FALSE)
Then we can create a table description using the filename as the URL and the schema we created earlier:
tb <- list(url="table.csv", tableSchema=s)
Relative URLs like this make sense when the json metadata and csv table are to be found in the same place. You may instead want to unambiguously locate the file with an absolute URL like https://raw.githubusercontent.com/Robsteranium/csvwr/master/inst/extdata/computer-scientists.csv which will work even if the metadata and table are held in different locations.
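The difference can be made concrete with a small sketch of how a table URL might be resolved against the location of the metadata. The helper `resolve_table_url` and the example.org addresses are hypothetical, and the logic is deliberately simplified (it ignores `..` segments, query strings, and other cases that full URL resolution has to cover); it is not a description of what csvwr does internally.

```r
# Hypothetical helper: resolve a table url against the metadata's location
resolve_table_url <- function(table_url, metadata_url) {
  # A url that starts with a scheme such as https:// is already absolute
  is_absolute <- grepl("^[a-z][a-z0-9+.-]*://", table_url, ignore.case = TRUE)
  if (is_absolute) {
    return(table_url)
  }

  # Otherwise drop the metadata file name and append the relative url
  base <- sub("[^/]*$", "", metadata_url)
  paste0(base, table_url)
}

metadata_url <- "https://example.org/data/metadata.json"

relative_result <- resolve_table_url("table.csv", metadata_url)
absolute_result <- resolve_table_url("https://example.org/other/table.csv",
                                     metadata_url)
```

The relative URL only finds the table when it sits beside the metadata, whereas the absolute URL is returned unchanged wherever the metadata lives.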
Now we can build our complete annotation:
(m <- create_metadata(tables=list(tb)))
#> $`@context`
#> [1] "http://www.w3.org/ns/csvw"
#>
#> $tables
#> $tables[[1]]
#> $tables[[1]]$url
#> [1] "table.csv"
#>
#> $tables[[1]]$tableSchema
#> $tables[[1]]$tableSchema$columns
#> name titles datatype
#> 1 x x string
#> 2 y y integer
This can then be serialised to JSON:
j <- jsonlite::toJSON(m)
jsonlite::prettify(j)
#> {
#> "@context": [
#> "http://www.w3.org/ns/csvw"
#> ],
#> "tables": [
#> {
#> "url": [
#> "table.csv"
#> ],
#> "tableSchema": {
#> "columns": [
#> {
#> "name": "x",
#> "titles": "x",
#> "datatype": "string"
#> },
#> {
#> "name": "y",
#> "titles": "y",
#> "datatype": "integer"
#> }
#> ]
#> }
#> }
#> ]
#> }
#>
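In the output above, length-one values such as "@context" and "url" are written as one-element arrays, whereas the example annotation earlier uses plain strings. This is how jsonlite serialises length-one vectors by default. If plain strings are preferred, a possible approach is the `auto_unbox` argument of `toJSON`. In the sketch below, `m` is a hand-built stand-in with the same shape as the metadata printed above, so that the example runs on its own.

```r
library(jsonlite)

# Stand-in with the same shape as the metadata list printed above
m <- list(
  `@context` = "http://www.w3.org/ns/csvw",
  tables = list(list(
    url = "table.csv",
    tableSchema = list(columns = data.frame(
      name = c("x", "y"),
      titles = c("x", "y"),
      datatype = c("string", "integer"),
      stringsAsFactors = FALSE
    ))
  ))
)

# auto_unbox = TRUE writes length-one vectors as scalars, not arrays
j <- toJSON(m, auto_unbox = TRUE, pretty = TRUE)
cat(j)
```

The tables element is still written as an array because it is an unnamed list, so a group of several tables serialises in the same way.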
This JSON may then be written to disk:
cat(j, file="metadata.json")