From dataset to RDF

Long-form triples are tidy datasets with an explicit row (observation) identifier among the columns.

       
JSON         object      property     value
spreadsheet  row id      column name  cell
data.frame   key         variable     measurement
data.frame   key         attribute    value
RDF          subject     predicate    object

Table source: rdflib

library(dataset)

Let’s take a small subset of iris_dataset, which is the semantically enriched version of the base R iris dataset. Limiting the dataset to the top 2 rows, we have exactly 2 x 5 = 10 data cells.

head(iris_dataset, 2)
#> Anderson E (1935). "Iris Dataset [subset]."
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> iris:o1          5.1         3.5          1.4         0.2  setosa
#> iris:o2          4.9         3.0          1.4         0.2  setosa
#> Further metadata: describe(x)
xsd_convert(head(iris_dataset, 2))
#> Anderson E (1935). "Iris Dataset [subset]."
#>                Sepal.Length         Sepal.Width        Petal.Length
#> iris:o1 "5.1"^^<xs:decimal> "3.5"^^<xs:decimal> "1.4"^^<xs:decimal>
#> iris:o2 "4.9"^^<xs:decimal>   "3"^^<xs:decimal> "1.4"^^<xs:decimal>
#>                 Petal.Width               Species
#> iris:o1 "0.2"^^<xs:decimal> "setosa"^^<xs:string>
#> iris:o2 "0.2"^^<xs:decimal> "setosa"^^<xs:string>
#> Further metadata: describe(x)
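To make the typed-literal notation concrete, the following base R sketch builds strings of the same shape as the printed output. The helper as_typed_literal() is hypothetical and not part of the dataset package; it only assumes that numeric values map to xs:decimal and everything else to xs:string, as the output above suggests.

```r
# Hypothetical helper, not the implementation of xsd_convert():
# wrap each value in quotes and append an assumed datatype tag.
as_typed_literal <- function(x) {
  type <- if (is.numeric(x)) "xs:decimal" else "xs:string"
  paste0('"', as.character(x), '"^^<', type, ">")
}

as_typed_literal(c(5.1, 3))          # numeric measurements
as_typed_literal(factor("setosa"))   # factors are treated as strings
```

A real converter would need to distinguish more types (integers, dates, logical values); this sketch deliberately covers only the two types that occur in the iris example.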

Let us arrange this into subject-predicate-object triples.

iris_triples <- dataset_to_triples(xsd_convert(head(iris_dataset,2)))
iris_triples
#>          s            p                     o
#> 1  iris:o1 Sepal.Length   "5.1"^^<xs:decimal>
#> 2  iris:o2 Sepal.Length   "4.9"^^<xs:decimal>
#> 3  iris:o1  Sepal.Width   "3.5"^^<xs:decimal>
#> 4  iris:o2  Sepal.Width     "3"^^<xs:decimal>
#> 5  iris:o1 Petal.Length   "1.4"^^<xs:decimal>
#> 6  iris:o2 Petal.Length   "1.4"^^<xs:decimal>
#> 7  iris:o1  Petal.Width   "0.2"^^<xs:decimal>
#> 8  iris:o2  Petal.Width   "0.2"^^<xs:decimal>
#> 9  iris:o1      Species "setosa"^^<xs:string>
#> 10 iris:o2      Species "setosa"^^<xs:string>

We receive 2 x 5 = 10 rows, each with an identifier. The identifiers are taken from row.names(), so we have exactly 5 statements about the first observation (iris:o1) and 5 statements about the second (iris:o2). Each statement simply states one observed value.
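The same reshaping and the row count can be checked with base R alone. This is a sketch on the base R iris data, not the implementation of dataset_to_triples(); the iris:o identifier scheme is assumed here to mirror the row names shown above.

```r
# Base R sketch: one triple per data cell.
x <- head(iris, 2)                                   # 2 rows x 5 columns
row.names(x) <- paste0("iris:o", seq_len(nrow(x)))   # assumed identifier scheme

triples <- data.frame(
  s = rep(row.names(x), times = ncol(x)),            # subject: row identifier
  p = rep(names(x), each = nrow(x)),                 # predicate: column name
  o = unlist(lapply(x, as.character), use.names = FALSE),  # object: cell value
  stringsAsFactors = FALSE
)

nrow(triples) == nrow(x) * ncol(x)   # one row per data cell
table(triples$s)                     # number of statements per observation
```

The columns are stacked one after another, so the subjects alternate between the two observations while each predicate is repeated once per row of the original table.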

iris_triples$p <- paste0("iris:", iris_triples$p)
iris_triples
#>          s                 p                     o
#> 1  iris:o1 iris:Sepal.Length   "5.1"^^<xs:decimal>
#> 2  iris:o2 iris:Sepal.Length   "4.9"^^<xs:decimal>
#> 3  iris:o1  iris:Sepal.Width   "3.5"^^<xs:decimal>
#> 4  iris:o2  iris:Sepal.Width     "3"^^<xs:decimal>
#> 5  iris:o1 iris:Petal.Length   "1.4"^^<xs:decimal>
#> 6  iris:o2 iris:Petal.Length   "1.4"^^<xs:decimal>
#> 7  iris:o1  iris:Petal.Width   "0.2"^^<xs:decimal>
#> 8  iris:o2  iris:Petal.Width   "0.2"^^<xs:decimal>
#> 9  iris:o1      iris:Species "setosa"^^<xs:string>
#> 10 iris:o2      iris:Species "setosa"^^<xs:string>
row.names(head(iris_dataset,2))
#> [1] "iris:o1" "iris:o2"
vignette_temp_file <- file.path(tempdir(), "example_ttl.ttl")
dataset_ttl_write(dataset_to_triples(iris_triples), 
                  file_path = vignette_temp_file)

We see a standard metadata file expressed in the Turtle language. A # -- Observations ------ comment separates the prefix definitions from the actual statements about the dataset.

# Only first 23 lines are read and printed:
readLines(vignette_temp_file, n = 23)
#>  [1] "@prefix  owl:        <http://www.w3.org/2002/07/owl#> ."             
#>  [2] "@prefix  qb:         <http://purl.org/linked-data/cube#> ."          
#>  [3] "@prefix  rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
#>  [4] "@prefix  rdfs:       <http://www.w3.org/2000/01/rdf-schema#> ."      
#>  [5] "@prefix  xsd:        <http://www.w3.org/2001/XMLSchema#> ."          
#>  [6] ""                                                                    
#>  [7] "# -- Observations -----------------------------------------"         
#>  [8] ""                                                                    
#>  [9] "1 a qb:Observation ;"                                                
#> [10] "   s   iris:o1 ;"                                                    
#> [11] "   p   iris:Sepal.Length ;"                                          
#> [12] "   o   \"5.1\"^^<xs:decimal> ;"                                      
#> [13] "   ."                                                                
#> [14] "2 a qb:Observation ;"                                                
#> [15] "   s   iris:o2 ;"                                                    
#> [16] "   p   iris:Sepal.Length ;"                                          
#> [17] "   o   \"4.9\"^^<xs:decimal> ;"                                      
#> [18] "   ."                                                                
#> [19] "3 a qb:Observation ;"                                                
#> [20] "   s   iris:o1 ;"                                                    
#> [21] "   p   iris:Sepal.Width ;"                                           
#> [22] "   o   \"3.5\"^^<xs:decimal> ;"                                      
#> [23] "   ."

If we tried to parse this file with a Turtle reader, we would get an error message, because not all statements are well-defined: for example, the statements use the iris: prefix, which is not declared among the prefixes.

The prefix

The Turtle prefix statements define the abbreviations of the following namespaces:

readLines(vignette_temp_file, n = 5)
#> [1] "@prefix  owl:        <http://www.w3.org/2002/07/owl#> ."             
#> [2] "@prefix  qb:         <http://purl.org/linked-data/cube#> ."          
#> [3] "@prefix  rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
#> [4] "@prefix  rdfs:       <http://www.w3.org/2000/01/rdf-schema#> ."      
#> [5] "@prefix  xsd:        <http://www.w3.org/2001/XMLSchema#> ."

The prefix makes the Turtle (ttl) file future-proof: before explaining the semantics of the data, it contains all the definitions that are needed to understand the explanation. It is a dictionary; every element of the vocabulary that is needed to explain the iris dataset should be here. This means that we must define the iris prefix, too.

These definitions can be found in the data("dataset_namespace") dataset. We only need to add the definitions that are unique to our own dataset, in this case the definitions of the variables of the iris dataset, i.e., the iris namespace:

data("dataset_namespace")
unique(get_prefix(row.names(head(iris_dataset,2))))
#> [1] "iris:"
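The dictionary role of the prefix block can be sketched in base R. The function expand_curie() below is hypothetical and not part of the dataset package; it looks up the prefix of an abbreviated name in a small namespace table and fails when the prefix is missing, which is the situation of the iris: prefix in the first file.

```r
# A two-entry namespace table with the same columns as dataset_namespace.
ns <- data.frame(
  prefix = c("qb:", "xsd:"),
  uri = c("<http://purl.org/linked-data/cube#>",
          "<http://www.w3.org/2001/XMLSchema#>"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand a single prefixed name to its full form.
expand_curie <- function(curie, ns) {
  prefix <- sub("^([^:]*:).*$", "\\1", curie)    # text up to the first colon
  local  <- sub("^[^:]*:", "", curie)            # text after the first colon
  base   <- ns$uri[match(prefix, ns$prefix)]
  if (is.na(base)) stop("Undefined prefix: ", prefix)
  paste0(sub(">$", "", base), local, ">")
}

expand_curie("qb:Observation", ns)

# The iris: prefix is not in the table, so the lookup fails.
tryCatch(expand_curie("iris:o1", ns), error = function(e) conditionMessage(e))
```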

The dataset_namespace data file contains some often used vocabularies and their prefixes. Let us select owl:, rdf:, rdfs:, qb:, and xsd:, and add iris: as <www.example.com/iris#> (the example.com domain is reserved by IANA for documentation and tutorial examples).

used_prefixes <- which(dataset_namespace$prefix %in% c(
  "owl:", "rdf:", "rdfs:", "qb:", "xsd:")
  )

vignette_namespace <- rbind(
  dataset_namespace[used_prefixes, ], 
       data.frame (prefix = "iris:", 
                   uri = '<www.example.com/iris#>')
) 

vignette_namespace
#>    prefix                                           uri
#> 6    owl:              <http://www.w3.org/2002/07/owl#>
#> 7     qb:           <http://purl.org/linked-data/cube#>
#> 8    rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#> 9   rdfs:       <http://www.w3.org/2000/01/rdf-schema#>
#> 20   xsd:           <http://www.w3.org/2001/XMLSchema#>
#> 1   iris:                       <www.example.com/iris#>
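A namespace table of this shape translates almost mechanically into Turtle prefix declarations. The following base R sketch shows the idea; the column width in the format string is chosen to resemble the file shown earlier and is an assumption, not the code used by dataset_ttl_write().

```r
# A shortened namespace table with the same two columns.
ns <- data.frame(
  prefix = c("owl:", "iris:"),
  uri = c("<http://www.w3.org/2002/07/owl#>", "<www.example.com/iris#>"),
  stringsAsFactors = FALSE
)

# One "@prefix <abbreviation> <namespace> ." line per row;
# %-11s left-aligns the prefix in an 11-character field.
prefix_lines <- sprintf("@prefix  %-11s %s .", ns$prefix, ns$uri)
writeLines(prefix_lines)
```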

Let us overwrite the earlier ttl file, this time passing the namespace table that declares the iris: prefix used by the variables and observations:

dataset_ttl_write(
  iris_triples, 
  ttl_namespace = vignette_namespace,
  file_path = vignette_temp_file, 
  overwrite = TRUE)
readLines(vignette_temp_file, n = 23)
#>  [1] "@prefix  owl:        <http://www.w3.org/2002/07/owl#> ."             
#>  [2] "@prefix  qb:         <http://purl.org/linked-data/cube#> ."          
#>  [3] "@prefix  rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
#>  [4] "@prefix  rdfs:       <http://www.w3.org/2000/01/rdf-schema#> ."      
#>  [5] "@prefix  xsd:        <http://www.w3.org/2001/XMLSchema#> ."          
#>  [6] "@prefix  iris:       <www.example.com/iris#> ."                      
#>  [7] ""                                                                    
#>  [8] "# -- Observations -----------------------------------------"         
#>  [9] ""                                                                    
#> [10] "iris:o1 a qb:Observation ;"                                          
#> [11] "   iris:Sepal.Length   \"5.1\"^^<xs:decimal> ;"                      
#> [12] "   iris:Sepal.Width   \"3.5\"^^<xs:decimal> ;"                       
#> [13] "   iris:Petal.Length   \"1.4\"^^<xs:decimal> ;"                      
#> [14] "   iris:Petal.Width   \"0.2\"^^<xs:decimal> ;"                       
#> [15] "   iris:Species   \"setosa\"^^<xs:string> ;"                         
#> [16] "   ."                                                                
#> [17] "iris:o2 a qb:Observation ;"                                          
#> [18] "   iris:Sepal.Length   \"4.9\"^^<xs:decimal> ;"                      
#> [19] "   iris:Sepal.Width   \"3\"^^<xs:decimal> ;"                         
#> [20] "   iris:Petal.Length   \"1.4\"^^<xs:decimal> ;"                      
#> [21] "   iris:Petal.Width   \"0.2\"^^<xs:decimal> ;"                       
#> [22] "   iris:Species   \"setosa\"^^<xs:string> ;"                         
#> [23] "   ."
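The file groups all statements about one observation into a single block. A base R sketch of this grouping follows; to_turtle_block() is a hypothetical helper, not the package's writer. The literals use the xsd: prefix declared in the file header, which is an assumption of this sketch and differs from the <xs:decimal> notation printed above.

```r
triples <- data.frame(
  s = c("iris:o1", "iris:o1", "iris:o2", "iris:o2"),
  p = c("iris:Sepal.Length", "iris:Species",
        "iris:Sepal.Length", "iris:Species"),
  o = c('"5.1"^^xsd:decimal', '"setosa"^^xsd:string',
        '"4.9"^^xsd:decimal', '"setosa"^^xsd:string'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one block per subject, one line per predicate-object pair.
to_turtle_block <- function(subject, df) {
  statements <- paste0("   ", df$p, "   ", df$o, " ;")
  c(paste0(subject, " a qb:Observation ;"), statements, "   .")
}

by_subject <- split(triples, triples$s)   # one data frame per observation
ttl_lines <- unlist(
  Map(to_turtle_block, names(by_subject), by_subject),
  use.names = FALSE
)
writeLines(ttl_lines)
```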

Working with rdflib

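RDFLib is a pure Python package for working with RDF; it offers serialisation parsers, store implementations, a graph interface, and a SPARQL query and update implementation. The R rdflib package [1] provides a similarly friendly interface in R; it is built on the redland R package, which binds the Redland C libraries, rather than on the Python package.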

In this section we show how to work further with our future-proof datasets. We parse the ttl file created with the dataset package into a triplestore:

require(rdflib)
example_rdf <- rdf_parse(vignette_temp_file, format = "turtle")
example_rdf
#> Total of 12 triples, stored in hashes
#> -------------------------------
#> <file:///www.example.com/iris#o2> <file:///www.example.com/iris#Species> "setosa"^^<xs:string> .
#> <file:///www.example.com/iris#o2> <file:///www.example.com/iris#Petal.Length> "1.4"^^<xs:decimal> .
#> <file:///www.example.com/iris#o1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#Observation> .
#> <file:///www.example.com/iris#o2> <file:///www.example.com/iris#Petal.Width> "0.2"^^<xs:decimal> .
#> <file:///www.example.com/iris#o2> <file:///www.example.com/iris#Sepal.Length> "4.9"^^<xs:decimal> .
#> <file:///www.example.com/iris#o1> <file:///www.example.com/iris#Sepal.Length> "5.1"^^<xs:decimal> .
#> <file:///www.example.com/iris#o1> <file:///www.example.com/iris#Petal.Width> "0.2"^^<xs:decimal> .
#> <file:///www.example.com/iris#o1> <file:///www.example.com/iris#Petal.Length> "1.4"^^<xs:decimal> .
#> <file:///www.example.com/iris#o2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#Observation> .
#> <file:///www.example.com/iris#o1> <file:///www.example.com/iris#Sepal.Width> "3.5"^^<xs:decimal> .
#> 
#> ... with 2 more triples

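Note that the iris: terms appear with a file:/// address: the namespace was declared as <www.example.com/iris#>, without a scheme such as http://, so the parser treats it as a relative IRI and resolves it. We can now define a simple SPARQL query on the data: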

sparql <-
'PREFIX iris: <www.example.com/iris#> 
 SELECT ?observation ?value
 WHERE { ?observation iris:Sepal.Length ?value . }'

rdf_query(example_rdf, sparql)
#> # A tibble: 2 × 2
#>   observation                     value
#>   <chr>                           <dbl>
#> 1 file:///www.example.com/iris#o2   4.9
#> 2 file:///www.example.com/iris#o1   5.1
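For readers more familiar with data frames than with SPARQL, the triple pattern in this query corresponds to a filter on the predicate column of a long-form table. A base R sketch on a small, hand-written table of triples (not the parsed graph) illustrates the analogy:

```r
triples <- data.frame(
  s = c("iris:o1", "iris:o2", "iris:o1", "iris:o2"),
  p = c("iris:Sepal.Length", "iris:Sepal.Length",
        "iris:Species", "iris:Species"),
  o = c("5.1", "4.9", "setosa", "setosa"),
  stringsAsFactors = FALSE
)

# ?observation iris:Sepal.Length ?value
# keeps the rows with that predicate and returns subject and object.
matched <- triples[triples$p == "iris:Sepal.Length", c("s", "o")]
names(matched) <- c("observation", "value")
matched$value <- as.numeric(matched$value)
matched
```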

Convert, for example, to JSON-LD format…:

temp_jsonld_file <- file.path(tempdir(), "example_jsonld.json")
rdf_serialize(rdf=example_rdf, doc = temp_jsonld_file, format = "jsonld")

… and read in the first 12 lines:

readLines(temp_jsonld_file, 12)
#>  [1] "{"                                                                 
#>  [2] "  \"@graph\": ["                                                   
#>  [3] "    {"                                                             
#>  [4] "      \"@id\": \"file:///www.example.com/iris#o1\","               
#>  [5] "      \"@type\": \"http://purl.org/linked-data/cube#Observation\","
#>  [6] "      \"file:///www.example.com/iris#Petal.Length\": {"            
#>  [7] "        \"@type\": \"xs:decimal\","                                
#>  [8] "        \"@value\": \"1.4\""                                       
#>  [9] "      },"                                                          
#> [10] "      \"file:///www.example.com/iris#Petal.Width\": {"             
#> [11] "        \"@type\": \"xs:decimal\","                                
#> [12] "        \"@value\": \"0.2\""

  1. Carl Boettiger: A tidyverse lover’s intro to RDF