In the world of data science, RDF is a bit of an ugly duckling. Like XML and Java, only without the massive-adoption-that-refuses-to-die part. In fact RDF is most frequently expressed in XML, and many RDF tools are written in Java, which helps give RDF the aesthetics of steampunk: a technology for some futuristic Semantic Web[1] in a toolset that feels about as lightweight and modern as an iron dreadnought.
But don’t let these appearances deceive you. RDF really is cool. If you’ve ever gotten carried away using tidyr::gather to make everything into one long table, you may have noticed you can just about always get things down to about three columns, as we see with an obligatory mtcars data example for tidyr::gather:
library(rdflib)
library(dplyr)
library(tidyr)
library(tibble)
library(jsonld)
car_triples <-
  mtcars %>%
  rownames_to_column("Model") %>%
  gather(attribute, measurement, -Model)
If you like long tables like this, RDF is for you. This layout isn’t “Tidy Data,” where rows are observations and columns are variables, but it is damn useful sometimes. This format is very liquid, easy to reshape into other structures; so much so that tidyr::gather was originally known as melt in the reshape2 package. It’s also a good way to get started thinking about RDF.
Looking at this table closely, we see that each row is reduced to the most elementary statement you can make from the data. A row no longer tells you the measurements (observations) of all attributes (variables) of a given car model (key); instead, you get just one fact per row: Mazda RX4 gets an mpg measurement of 21.0. In RDF-world, we think of these three-part statements as something very special, which we call triples. RDF is all about these triples.
The first column came from the row names, in this case the Model of car. This serves as a key to index the data.frame, i.e. the subject being described. The next column is the variable (also called attribute or property) being measured (that is, the column names, other than the key column(s), from the tidy data), called the property or predicate in RDF-speak (slash grammar-school jargon). The third column is the actual value measured, or the object of the predicate. Call it key-property-value or subject-predicate-object, these are our triples. We can represent just about any data in this fully elementary manner.
| RDF         | subject | predicate   | object      |
| JSON        | object  | property    | value       |
| spreadsheet | row id  | column name | cell        |
| data.frame  | key     | variable    | measurement |
| data.frame  | key     | attribute   | value       |
Table 1 summarizes the many different names associated with triples. The first naming convention is the terminology typically associated with RDF. The second set are terms typically associated with JSON data, while the remaining are all examples in tabular or relational data structures.
Using row names as our subject was intuitive but actually a bit sloppy. tidyverse lovers know that tidyverse doesn’t like rownames: they aren’t tidy and have a way of causing trouble. Of course, we made rownames into a proper column to use gather, but we could have taken this one step further. In true tidyverse fashion, this rownames-column is really just one more variable we can observe, one more attribute of the thing we were describing: say, thing A (Car A) has a car_model_name of Mazda RX4, and thing A also has an mpg of 21. We can reach this greater level of abstraction by keeping the Model as just another variable and using the row ids themselves as the key (i.e. the subject) of our triple:
car_triples <-
  mtcars %>%
  rownames_to_column("Model") %>%
  rowid_to_column("subject") %>%
  gather(predicate, object, -subject)
This is identical to a gather of all columns, where we have just made the original row ids an explicit column for reference (the diligent reader will recognize that we would need this information to reverse the operation and spread the data back into its wide form; without it, our transformation is lossy and irreversible). Our subject column now consists only of simple numeric ids, while we have gained an additional triple for every row in the original data, which states the Model of each id number (e.g. 1 is Model Mazda RX4). Okay, now you’re probably thinking: “wait a minute, 1 is not a very unique or specific key, surely that will cause trouble,” and you’d be right. For instance, if we performed the same transformation on the iris data, we get triples in the exact same format, ready to bind_rows:
iris_triples <-
  iris %>%
  rowid_to_column("subject") %>%
  gather(key = predicate, value = object, -subject)
but in the iris data, 1 corresponds to the first individual Iris flower in the measurement data, and not a Mazda RX4. If we don’t want to get confused, we’re going to need to make sure our identifiers are unique: not just kind of unique, but unique world-wide. And what else is unique world-wide? Yup, you guessed it, we are going to use URLs for our subject identifiers, just like the world wide web. Think of this as a clever out-sourcing to the whole internet domain registry service. Here, we’ll imagine registering each of these example datasets with a separate base URL, so instead of a vague 1 to identify the first observation in the iris example data, we’ll use the URL http://example.com/iris#1, which we can now distinguish from http://example.com/mtcars#1 (and if you’re way ahead of me, yes, we’ll have more to say about URI vs URL and the use of blank nodes in just a minute). For example:
iris_triples <-
  iris %>%
  rowid_to_column("subject") %>%
  mutate(subject = paste0("http://example.com/", "iris#", subject)) %>%
  gather(key = predicate, value = object, -subject)
A slightly more subtle version of the same problem can arise with our predicates. Different tables may use the same attribute (i.e. originally, a column name of a variable) for different things: the attribute labeled cyl means “number of cylinders” in the mtcars data.frame, but could mean something very different in different data. Luckily we’ve already seen how to make names unique in RDF: turn them into URLs.
iris_triples <-
  iris %>%
  rowid_to_column("subject") %>%
  mutate(subject = paste0("http://example.com/", "iris#", subject)) %>%
  gather(key = predicate, value = object, -subject) %>%
  mutate(predicate = paste0("http://example.com/", "iris#", predicate))
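The two mutate() steps above can be bundled into one reusable function. What follows is a minimal sketch in base R, not anything rdflib provides: df_to_triples() is a hypothetical helper name, and the example.com base URLs are the same made-up ones used above. It turns every cell of a data.frame into one subject-predicate-object row, coercing objects to character just as gather does.

```r
# Hypothetical helper (not part of rdflib): reshape a data.frame into
# subject-predicate-object triples whose subjects and predicates are URLs.
df_to_triples <- function(df, base_url) {
  # One subject URL per row, e.g. http://example.com/iris#1
  ids <- paste0(base_url, seq_len(nrow(df)))
  # One block of triples per column; objects are coerced to character
  triples <- lapply(names(df), function(col) {
    data.frame(
      subject   = ids,
      predicate = paste0(base_url, col),
      object    = as.character(df[[col]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, triples)
}

iris_triples <- df_to_triples(iris, "http://example.com/iris#")

# Keep the car model as an ordinary column before reshaping
cars <- cbind(Model = rownames(mtcars), mtcars)
mtcars_triples <- df_to_triples(cars, "http://example.com/mtcars#")

# Stack both datasets into one long table of triples
all_triples <- rbind(iris_triples, mtcars_triples)
head(all_triples, 3)
```

Because subjects and predicates both carry their dataset's base URL, the two sets of triples can be stacked without row 1 of iris colliding with row 1 of mtcars.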
At this point the motivation for the name “Linked Data” is probably becoming painfully obvious.
One more column to go! But wait a minute, the object column is different, isn’t it? These measurements don’t suffer from the same ambiguity: after all, there is no confusion if a car has 4 cylinders and an iris has 4 mm long sepals. However, a new issue has arisen in the data type (e.g. string, boolean, double, integer, dateTime, etc). A close look reveals our object column is encoded as a character and not numeric. How’d that happen? tidyr::gather has coerced the whole column into character strings because some of the values, that is, the Species names in iris and the Model names in mtcars, are text strings (and it couldn’t exactly coerce them into integers). Perhaps this isn’t a big deal: we can often guess the type of an object just by how it looks (so-called duck typing, because if it quacks like a duck…). Still, being explicit about data types is a Good Thing, so fortunately there’s an explicit way to address this too … oh no … not … yes … more URLs!
Luckily we don’t have to make up example.com URLs this time, because there’s already a well-established list of data types widely used across the internet. They were originally developed for use in XML (I warned you) Schemas and are listed in the W3C RDF Datatypes. As the standard shows, familiar types such as string, double, boolean, integer, etc. are made explicit using the XML Schema URL, http://www.w3.org/2001/XMLSchema#, followed by the type; so an integer would be http://www.w3.org/2001/XMLSchema#integer, a character string http://www.w3.org/2001/XMLSchema#string, etc.
Because this case is a little different, the URL is attached directly after the object value, which is set off by quotes, using the symbol ^^ (I dunno, but I think of two duck feet), such that 5.1 becomes "5.1"^^http://www.w3.org/2001/XMLSchema#double. Wow.[2] Most of the time we won’t have to worry about the type, because, if it quacks…
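To make the ^^ convention concrete, here is a small base R sketch that attaches an XML Schema datatype URL to a value. Both xsd_type() and typed_literal() are hypothetical helpers written for this illustration, and the mapping from R types to XSD types is an assumption chosen to match the example in the text; it is not necessarily the mapping any RDF library uses.

```r
xsd <- "http://www.w3.org/2001/XMLSchema#"

# Assumed mapping from R types to XSD type names (illustrative only)
xsd_type <- function(x) {
  if (is.logical(x)) {
    "boolean"
  } else if (is.integer(x)) {
    "integer"
  } else if (is.numeric(x)) {
    "double"
  } else {
    "string"
  }
}

# Quote the value and append ^^ plus the datatype URL
typed_literal <- function(x) {
  # XSD booleans are written in lower case ("true"/"false")
  lexical <- if (is.logical(x)) tolower(as.character(x)) else as.character(x)
  paste0('"', lexical, '"^^', xsd, xsd_type(x))
}

typed_literal(5.1)
typed_literal(4L)
typed_literal("setosa")
typed_literal(TRUE)
```

In line-based serializations such as N-Quads, the datatype URL is additionally wrapped in angle brackets; the sketch follows the simplified form used in the text.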
rdflib
So far, we have explored the concept of triples using familiar data.frame structures, but haven’t yet introduced any rdflib functions. Though we’ve been thinking of RDF data in this explicitly tabular three-column structure, that is really just one potentially convenient representation. Just as the same tabular data can be represented in a data.frame, written to disk as a .csv file, or stored in a database (like MySQL or PostgreSQL), so it is for RDF, to an even greater degree. We have separate abstractions for the information itself and for how it is represented.
To take advantage of this abstraction, rdflib introduces an rdf class object. Depending on how this is initialized, it could utilize storage in memory (the default), on disk, or potentially in an array of different databases (including relational databases like PostgreSQL and RDF-specific ones like Virtuoso, depending on how the underlying redland library is compiled, a topic beyond our scope here). Here, we simply initialize an rdf object using the default in-memory storage:
rdf <- rdf()
To add triples to this rdf object (often called an RDF Model or RDF Graph), we use the function rdf_add, which takes a subject, predicate, and object as arguments, as we have just discussed. A datatype URI can be inferred from the R type used for the object (e.g. numeric, integer, logical, character, etc.):
base <- paste0("http://example.com/", "iris#")

rdf %>%
  rdf_add(subject = paste0(base, "obs1"),
          predicate = paste0(base, "Sepal.Length"),
          object = 5.1)

rdf
The result is displayed as a triple of the kind discussed above. This is technically an example of the nquads notation we will see later. Note the inferred datatype URI.
This gather thing started well, but now our data is looking pretty ugly, not to mention cumbersome. You have some idea why RDF hasn’t taken data science by storm, and we haven’t even looked at how ugly this gets when you write it in the RDF/XML serialization yet! On the upside, we’ve introduced most of the essential concepts that will let us start to work with data as triples. Before we proceed further, we’ll take a quick look at some of the options for expressing triples in different ways, and also introduce some of the different serializations (ways of representing in text) frequently used to express these triples.
Long URL strings are one of the most obvious ways that what started off looking like a concise, minimal statement got ugly and cumbersome. Borrowing from the notion of Namespaces in XML, most RDF tools permit custom prefixes to be declared and swapped in for longer URLs. A prefix is typically a short string[3] followed by a : that is used in place of the shared root URL. For instance, we might write iris:Sepal.Length and iris:Sepal.Width, where the prefix iris: is defined to mean http://example.com/iris# in our example above.
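Prefix expansion is just string substitution, which a few lines of base R can illustrate. This is a sketch under stated assumptions: expand_prefix() is a hypothetical helper (RDF tools do this for you internally), and the prefix table holds the made-up iris: prefix from the text.

```r
# Prefix table: short name -> shared root URL (example values from the text)
prefixes <- c(iris = "http://example.com/iris#",
              xsd  = "http://www.w3.org/2001/XMLSchema#")

# Hypothetical helper: swap a declared prefix for its full root URL.
# Names with an undeclared prefix (or no colon at all) are left untouched.
expand_prefix <- function(x, prefixes) {
  prefix <- sub(":.*$", "", x)      # text before the first colon
  local  <- sub("^[^:]*:", "", x)   # text after the first colon
  known  <- grepl(":", x, fixed = TRUE) & prefix %in% names(prefixes)
  x[known] <- paste0(prefixes[prefix[known]], local[known])
  x
}

expand_prefix(c("iris:Sepal.Length", "iris:Sepal.Width"), prefixes)
```

Full URLs such as http://example.com/iris#1 pass through unchanged, because http is not one of the declared prefixes.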
While I’ve referred to these things as URLs (uniform resource locators, aka web addresses), technically they can be a broader class of things known as URIs (uniform resource identifiers). In addition to including anything that is a URL, URIs include things which are not URLs, like urn:isbn:0-486-27557-4 or urn:uuid:aac06f69-7ec8-403d-ad84-baa549133dce, which are URNs (uniform resource names): identifiers drawn from some naming scheme (e.g. book ISBN numbers, or UUIDs). Neither of these is a URL, but both nonetheless enjoy the same globally unique property.
Sometimes we do not need a globally unique identifier; we just want a way to refer to a node (e.g. a subject, and sometimes an object) uniquely in our document. This is the role of a blank node (do follow the link for a better overview). These are frequently denoted with the prefix _:, e.g. we could have replaced the sample IDs with _:1, _:2 instead of URLs such as http://example.com/iris#1 etc. Note that RDF operations need not preserve the actual string pattern in a blank ID name: it means the exact same thing if we replace all the _:1s with _:b1 and all the _:2s with _:b2, etc.
In rdflib we can get a blank node by passing an empty string, or a character string that is not a URI, as the subject. Here we also use a URI that isn’t a URL as the predicate:
rdf <- rdf()
rdf %>% rdf_add("",
                "iris:Sepal.Length",
                object = 5.1)
rdf
Note that we get a blank node, _: with a randomly generated string.
nquads, rdfxml, turtle, and nquads
So far we have relied primarily on a three-column tabular format to represent our triples. We have also seen the default print format used for rdf objects, known as N-Quads, which displays a bare, space-separated triple, possibly with a datatype URI attached to the object. Each line ends with a dot. With nothing between the object and the dot, the triple belongs to the local triplestore (aka RDF graph or RDF Model); technically, a fourth term could appear there: another URI giving a unique global address for the triplestore in question.
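The layout of a single N-Quads line can be sketched with plain string pasting. The nquad_line() function below is a hypothetical helper for illustration only: it assumes the object is always a literal, does no escaping of special characters, and is no substitute for a real serializer.

```r
# Hypothetical helper: format one statement as an N-Quads line.
# Assumes `object` is a literal value; no character escaping is attempted.
nquad_line <- function(subject, predicate, object,
                       datatype = NULL, graph = NULL) {
  # Blank nodes (_:name) are written bare; URIs go in angle brackets
  term <- function(x) if (startsWith(x, "_:")) x else paste0("<", x, ">")
  obj <- paste0('"', object, '"')
  if (!is.null(datatype)) obj <- paste0(obj, "^^<", datatype, ">")
  parts <- c(term(subject), paste0("<", predicate, ">"), obj)
  # Optional fourth term naming the graph the triple belongs to
  if (!is.null(graph)) parts <- c(parts, term(graph))
  paste(paste(parts, collapse = " "), ".")
}

nquad_line("http://example.com/iris#obs1",
           "http://example.com/iris#Sepal.Length",
           "5.1",
           datatype = "http://www.w3.org/2001/XMLSchema#double")
```

Passing a graph argument adds the fourth term before the closing dot, which is what distinguishes a quad from a plain triple.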
We can serialize any rdf object out to a file in this format with the rdf_serialize() function, e.g.
rdf_serialize(rdf, "rdf.nq", format = "nquads")
Just as each of these formats can be serialized with rdf_serialize(), each can be read by rdflib using the function rdf_parse():
doc <- system.file("extdata/example.rdf", package = "redland")
rdf <- rdf_parse(doc, format = "rdfxml")
rdf
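Serializing and parsing are inverse operations, which a quick round trip can show. This is a minimal sketch assuming rdflib is installed; the object names g, g2 and tmp are placeholders chosen so as not to overwrite the rdf object used in the rest of this walkthrough, and the triple is the made-up iris example from earlier.

```r
library(rdflib)

# Build a small graph holding a single triple
g <- rdf()
rdf_add(g,
        subject   = "http://example.com/iris#obs1",
        predicate = "http://example.com/iris#Sepal.Length",
        object    = 5.1)

# Write it to a temporary N-Quads file, then read that file back in
tmp <- tempfile(fileext = ".nq")
rdf_serialize(g, tmp, format = "nquads")
g2 <- rdf_parse(tmp, format = "nquads")

# Inspect the file contents and the re-parsed graph
readLines(tmp)
g2
```

The file on disk is plain text with one statement per line, so it can also be inspected with any text editor.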
N-Quads are convenient in that each triple is displayed on a unique line, and the format supports the blank node and datatype URIs in the manner we have just discussed. Other formats are not so concise. Rather than print to file, we can simply change the default print format used by rdflib to explore the textual layout of the other serializations. Here is one of the most common classical serializations, RDF/XML, which expresses triples in an XML-based schema:
options(rdf_print_format = "rdfxml")
rdf
Just looking at this is probably enough to explain why so many alternative serializations were created. Another popular format, turtle, looks more like nquads:
options(rdf_print_format = "turtle")
rdf
Here, blank nodes are denoted by []. turtle groups the three predicates (creator, description, title) under a single subject, separating them with ; and indenting them to show that they are all properties of the same subject.
While formats such as nquads and turtle provide a much cleaner syntax than RDF/XML, they also introduce a custom format rather than building on a familiar standard (like XML) for which users already have a well-developed set of tools and intuition. After more than a decade of such challenges (the RDF specification started in 1997, including the HTML-embedded serialization RDFa in 2004), a more user-friendly specification has emerged in the form of JSON-LD (the 1.0 W3C specification was released in 2014, the 1.1 specification in February 2018). JSON-LD uses the familiar object notation of JSON (which is rapidly replacing XML as the ubiquitous data exchange format), and will be more familiar to many readers than the specialized RDF formats or even XML. Here is our rdf data in the JSON-LD serialization:
options(rdf_print_format = "jsonld")
rdf
In this serialization, our subject corresponds to “the thing in the curly braces” (i.e. the JSON “object”), which is identified by the special @id property (omitting @id corresponds to a blank node). The predicate-object pairs in the triple are then just JSON key-value pairs within the curly braces of the given object. We can make this format look even more natural by stripping out the URLs. While it is possible to use prefixes in place of URLs, it is more natural to pull them out entirely, e.g. by declaring a default vocabulary in the JSON-LD “Context”, like so:
rdf_serialize(rdf, "example.json", "jsonld") %>%
jsonld_compact(context = '{"@vocab": "http://purl.org/dc/elements/1.1/"}')
The context of a JSON-LD file can also define datatypes, use multiple namespaces, and permit different names in the JSON keys from that found in the URLs. While a complete introduction to JSON-LD is beyond our scope, this representation essentially provides a way to map intuitive JSON structures into precise RDF triples.
So far we have considered examples where the data could be represented in tabular form. We frequently encounter data that cannot be easily represented in such a format. For instance, consider the JSON data in this example:
ex <- system.file("extdata/person.json", package = "rdflib")
cat(readLines(ex), sep = "\n")
#jsonld_compact(ex, "{}")
This JSON object for a Person has another JSON object nested inside (a PostalAddress). Yet if we look at this data as nquads, we see the familiar flat triple structure:
options(rdf_print_format = "nquads")
rdf <- rdf_parse(ex, "jsonld")
rdf
So what has happened? Note that our address has been given the blank node id _:b0, which serves both as the object in the address line of the Person and as the subject of all the properties belonging to the PostalAddress. In JSON-LD, this structure is referred to as being ‘flattened’:
jsonld_flatten(ex, context = "https://schema.org/")
Note that our JSON-LD structure now starts with an object called @graph. Unlike our opening examples, this data is not tabular in nature, but rather is formatted as a nested graph. Such nesting is very natural in JSON, where objects can be arranged in a tree-like structure with a single outer-most set of {} indicating a root object. A graph is just a more generic form of a tree structure, where we are agnostic to the root. (We could in fact use the @reverse property on address to create a root PostalAddress that contains the Person.) In this way, the notion of data as a graph offers a powerful generalization to the notion of tabular data. The @graph above consists of two separate objects: a PostalAddress (with id of _:b0) and a Person (with an ORCID id). This layout acts much like a foreign key in a relational database, or like a list-column in tidyverse (e.g. see tidyr::nest()). rdflib uses this flattened representation when serializing JSON-LD objects. Note that JSON-LD provides a rich set of utilities to go back and forth between flattened and nested layouts using jsonld_frame. For instance, we can recover the original structure just by specifying a frame that indicates which type we want as the root:
jsonld_flatten(ex) %>%
jsonld_frame('{"@type": "https://schema.org//Person"}') %>%
jsonld_compact(context = "https://schema.org/")
(Recall that compacting just replaces URIs and any type declarations with short names given by the context.) This is somewhat analogous to join operations in relational data, or nesting and un-nesting functions in tidyr. However, when working with RDF, the beautiful thing is that the differences between these two representations (nested or flattened) are purely aesthetic. Both representations have precisely the same semantic meaning, and are thus precisely the same thing in RDF world. We will never have to orchestrate a join on a foreign key before we can perform desired operations like select and filter on the data. We don’t have to think about how our data is organized, because it is always in the same molten triple format, whatever it is, and however nested it might be.
Just as we saw gather could provide a relatively generic way of transforming a data.frame into RDF triples, JSON-LD defines a relatively simple convention for getting nested data (e.g. lists) into RDF triples. This convention simply treats JSON {} objects as subjects (often assigning blank node ids, as we saw with row ids), and key-value pairs (or in R-speak, list names and values) as predicates and objects, respectively. Any raw JSON file can be treated as JSON-LD, ideally by specifying an appropriate context, which serves to map terms into URIs as we saw with data.frames. JSON-LD is then already a valid RDF format that we can parse with rdflib.
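The convention itself is simple enough to mimic by hand. The following base R sketch walks a nested list and emits one triple per name-value pair, giving each list its own blank node id. It is only an illustration of the idea: list_to_triples() is a hypothetical helper, it assumes every leaf value has length one, it skips the context (so predicates stay as bare names rather than URIs), and the person data is made up.

```r
# Hypothetical helper: flatten a nested, named list into triples.
# Each (sub)list becomes a subject with its own blank node id.
list_to_triples <- function(x) {
  counter <- 0
  rows <- list()
  walk <- function(node) {
    id <- paste0("_:b", counter)
    counter <<- counter + 1
    for (key in names(node)) {
      value <- node[[key]]
      if (is.list(value)) {
        # A nested object: its blank node id becomes our object
        child <- walk(value)
        rows[[length(rows) + 1]] <<- c(id, key, child)
      } else {
        rows[[length(rows) + 1]] <<- c(id, key, as.character(value))
      }
    }
    id
  }
  walk(x)
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("subject", "predicate", "object")
  out
}

# Made-up example data: a person with a nested address
person <- list(
  name = "Jane Doe",
  address = list(streetAddress = "1 Main St",
                 addressLocality = "Springfield")
)

person_triples <- list_to_triples(person)
person_triples
```

The nested address ends up as the object of one triple and the subject of two others, which is exactly the flattened shape seen in the nquads output earlier.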
For instance, here is a simple function for coercing list objects into RDF with a specified context:
as_rdf.list <- function(x, context = "https://schema.org/"){
  # A length-one wrapper list (e.g. a bibentry) is unwrapped first
  if(length(x) == 1) x <- x[[1]]
  x[["@context"]] <- context
  json <- jsonlite::toJSON(x, pretty = TRUE, auto_unbox = TRUE, force = TRUE)
  rdflib::rdf_parse(json, "jsonld")
}
Here we set a default context (https://schema.org/), and map a few R terms to corresponding schema terms:
context <- list("https://schema.org/",
                list(schema = "https://schema.org//",
                     given = "givenName",
                     family = "familyName",
                     title = "name",
                     year = "datePublished",
                     note = "softwareVersion",
                     comment = "identifier",
                     role = "https://www.loc.gov/marc/relators/relaterm.html"))
We can now apply our function on arbitrary R list objects, such as the bibentry class object returned by the citation() function:
options(rdf_print_format = "nquads") # go back to the default
R <- citation("rdflib")
rdf <- as_rdf.list(R, context)
rdf
So far, we have spent a lot of words describing how to transform data into RDF, and not much actually doing anything cool with said data.
Still working on writing this section
#source(system.file("examples/as_rdf.R", package="rdflib"))
source(system.file("examples/tidy_schema.R", package="rdflib"))
## Testing: Digest some data.frames into RDF and extract back
cars <- mtcars %>% rownames_to_column("Model")
x1 <- as_rdf(iris, NULL, "iris:")
x2 <- as_rdf(cars, NULL, "mtcars:")
rdf <- c(x1, x2)
sparql <- 'SELECT ?Species ?Sepal_Length ?Sepal_Width ?Petal_Length ?Petal_Width
WHERE {
?s <iris:Species> ?Species .
?s <iris:Sepal.Width> ?Sepal_Width .
?s <iris:Sepal.Length> ?Sepal_Length .
?s <iris:Petal.Length> ?Petal_Length .
?s <iris:Petal.Width> ?Petal_Width
}'
iris2 <- rdf_query(rdf, sparql)
We can automatically create a SPARQL query that returns “tidy data”. Tidy data has predicates as columns, objects as values, and subjects as rows.
sparql <- tidy_schema("Species", "Sepal.Length", "Sepal.Width", prefix = "iris")
rdf_query(rdf, sparql)
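To see what “automatically create” might involve, here is a rough base R sketch that assembles a query of the same shape as the hand-written one above. It is not the tidy_schema() function sourced from the package examples, whose implementation is not shown here; build_tidy_query() is a hypothetical stand-in that assumes every column shares one prefix and one subject variable ?s.

```r
# Hypothetical stand-in for a query builder: one SELECT variable and one
# triple pattern per requested column, all sharing the subject ?s.
build_tidy_query <- function(columns, prefix) {
  # SPARQL variable names cannot contain dots, so swap them for underscores
  vars <- paste0("?", gsub(".", "_", columns, fixed = TRUE))
  patterns <- paste0("  ?s <", prefix, ":", columns, "> ", vars)
  paste0(
    "SELECT ", paste(vars, collapse = " "), "\n",
    "WHERE {\n",
    paste(patterns, collapse = " .\n"), "\n",
    "}"
  )
}

query <- build_tidy_query(c("Species", "Sepal.Length"), prefix = "iris")
cat(query)
```

Each requested column becomes one result column, and the shared ?s ties the values of a row to the same subject, which is what makes the result tidy.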
[1] “The semantic web is the future of the internet and always will be.” -Peter Norvig, Director of Research at Google↩︎
[2] Couldn’t we just have used another column? Perhaps, but then it wouldn’t be a triple. More to the point, the datatype modifies the object alone, not the predicate or subject.↩︎
[3] Technically I believe it should be an NCName, defined by the regexp [\i-[:]][\c-[:]]*. Essentially, this says it cannot include symbol characters like :, @, $, %, &, /, +, ,, ;, whitespace characters or parentheses of any kind. Furthermore, an NCName cannot begin with a number, dot or minus character, although they can appear later in an NCName.↩︎