JSON (JavaScript Object Notation) is a lightweight and flexible data format that is easy for humans to read and for machines to parse. JSON has become a common format used in public web APIs, in NoSQL databases as a document format, and in relational databases as a column type.
The tidyjson package provides a grammar for turning JSON data into tidy data.frames that are easy to work with in dplyr, ggplot2 and other modeling and analytics packages.
There are already several libraries for working with JSON data in R, such as rjson, RJSONIO and jsonlite. Using these libraries, you can transform JSON into a nested R list. However, working with nested lists using base R functionality is difficult.
The jsonlite package goes further by automatically creating a nested R data.frame. This is easier to work with than a list, but has two main limitations. First, the resulting data.frame isn't tidy, so it can still be difficult to work with. Second, the structure of the data.frame may vary as the JSON sample changes, which can happen any time you change the database query or API call that generated the data.
The tidyjson package takes an alternate approach to structuring JSON data into tidy data.frames. Similar to tidyr, tidyjson builds a grammar for manipulating JSON into a tidy table structure. Tidyjson is based on the following principles:

- Leverage other libraries for efficiently parsing JSON (jsonlite)
- Integrate with pipelines built on dplyr and the magrittr %>% operator
- Turn arbitrarily complex and nested JSON into tidy data.frames

A simple example of how tidyjson works is as follows:
library(tidyjson) # this library
library(dplyr) # for %>% and other dplyr functions
# Define a simple JSON array of people
people <- '
[
{
"name": "bob",
"age": 32
},
{
"name": "susan",
"age": 54
}
]'
# Structure the data
people %>% # %>% is the magrittr pipeline operator
gather_array %>% # gather (stack) the array by index
spread_values( # spread (widen) values into new columns
name = jstring("name"), # value of "name" becomes a character column
age = jnumber("age") # value of "age" becomes a numeric column
)
#> document.id array.index name age
#> 1 1 1 bob 32
#> 2 1 2 susan 54
In such a simple example, we can use fromJSON in the jsonlite package to do this much faster:
library(jsonlite)
jsonlite::fromJSON(people, simplifyDataFrame = TRUE)
#> name age
#> 1 bob 32
#> 2 susan 54
However, if the structure of the JSON data changed, so would the columns output by fromJSON. So even in this simple example there is value in the explicit structure defined in the tidyjson pipeline above.
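To see this fragility concretely (a small sketch, not one of the package examples), compare what fromJSON returns when the same record arrives as a one-element array versus a bare object:
jsonlite::fromJSON('[{"name": "bob", "age": 32}]') # a data.frame
jsonlite::fromJSON('{"name": "bob", "age": 32}')   # now a named list
The tidyjson pipeline, by contrast, always yields a data.frame with exactly the columns named in spread_values().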
The tidyjson package really shines in a more complex example. Consider the following JSON, which describes three purchases of five items made by two individuals:
purch_json <- '
[
{
"name": "bob",
"purchases": [
{
"date": "2014/09/13",
"items": [
{"name": "shoes", "price": 187},
{"name": "belt", "price": 35}
]
}
]
},
{
"name": "susan",
"purchases": [
{
"date": "2014/10/01",
"items": [
{"name": "dress", "price": 58},
{"name": "bag", "price": 118}
]
},
{
"date": "2015/01/03",
"items": [
{"name": "shoes", "price": 115}
]
}
]
}
]'
Suppose we want to find out how much each person has spent. Using jsonlite, we can parse the JSON:
library(jsonlite)
# Parse the JSON into a data.frame
purch_df <- jsonlite::fromJSON(purch_json, simplifyDataFrame = TRUE)
# Examine results
purch_df
#> name purchases
#> 1 bob 2014/09/13, shoes, belt, 187, 35
#> 2 susan 2014/10/01, 2015/01/03, dress, bag, 58, 118, shoes, 115
This looks deceptively simple: on inspection with str() we see that the resulting data structure is actually a complex nested data.frame:
str(purch_df)
#> 'data.frame': 2 obs. of 2 variables:
#> $ name : chr "bob" "susan"
#> $ purchases:List of 2
#> ..$ :'data.frame': 1 obs. of 2 variables:
#> .. ..$ date : chr "2014/09/13"
#> .. ..$ items:List of 1
#> .. .. ..$ :'data.frame': 2 obs. of 2 variables:
#> .. .. .. ..$ name : chr "shoes" "belt"
#> .. .. .. ..$ price: int 187 35
#> ..$ :'data.frame': 2 obs. of 2 variables:
#> .. ..$ date : chr "2014/10/01" "2015/01/03"
#> .. ..$ items:List of 2
#> .. .. ..$ :'data.frame': 2 obs. of 2 variables:
#> .. .. .. ..$ name : chr "dress" "bag"
#> .. .. .. ..$ price: int 58 118
#> .. .. ..$ :'data.frame': 1 obs. of 2 variables:
#> .. .. .. ..$ name : chr "shoes"
#> .. .. .. ..$ price: int 115
This is difficult to work with, and we end up writing code like this:
items <- lapply(purch_df$purchases, `[[`, "items")
prices <- lapply(items, lapply, `[[`, "price")
vapply(lapply(prices, unlist), sum, integer(1))
#> [1] 222 291
Reasoning about code like this is nearly impossible, and further, the relational structure of the data is lost (we no longer have the name of the user).
We can instead try to use dplyr and its do() function to get at the data in the nested data.frames, but this is equally challenging and confusing:
purch_df %>% group_by(name) %>% do({
.$purchases[[1]] %>% rowwise %>% do({
.$items[, "price", drop = FALSE]
})
}) %>% summarize(price = sum(price))
#> Source: local data frame [2 x 2]
#>
#> name price
#> 1 bob 222
#> 2 susan 291
Using tidyjson, we can build a pipeline to turn this JSON into a tidy data.frame where each row corresponds to a purchased item:
purch_items <- purch_json %>%
gather_array %>% # stack the users
spread_values(person = jstring("name")) %>% # extract the user name
enter_object("purchases") %>% gather_array %>% # stack the purchases
spread_values(purchase.date = jstring("date")) %>% # extract the purchase date
enter_object("items") %>% gather_array %>% # stack the items
spread_values( # extract item name and price
item.name = jstring("name"),
item.price = jnumber("price")
) %>%
select(person, purchase.date, item.name, item.price) # select only what is needed
The resulting data.frame is exactly what we want:
purch_items
#> person purchase.date item.name item.price
#> 1 bob 2014/09/13 shoes 187
#> 2 bob 2014/09/13 belt 35
#> 3 susan 2014/10/01 dress 58
#> 4 susan 2014/10/01 bag 118
#> 5 susan 2015/01/03 shoes 115
And we can easily continue the pipeline in dplyr to compute derived data:
purch_items %>% group_by(person) %>% summarize(spend = sum(item.price))
#> Source: local data frame [2 x 2]
#>
#> person spend
#> 1 bob 222
#> 2 susan 291
Creating a tbl_json object
The first step in using tidyjson is to convert your JSON into a tbl_json object. Almost every function in tidyjson accepts either a tbl_json object or a character vector of JSON data as its first argument, and returns a tbl_json object for downstream use. To facilitate integration with dplyr, tbl_json inherits from dplyr::tbl.
The easiest way to construct a tbl_json object is directly from a character string:
# Using a single character string
x <- '{"key": "value"}' %>% as.tbl_json
x
#> document.id
#> 1 1
attr(x, "JSON")
#> [[1]]
#> [[1]]$key
#> [1] "value"
Behind the scenes, as.tbl_json parses the JSON string and creates a data.frame with one column, document.id, which keeps track of the character vector position (index) where the JSON data came from. In addition, each tbl_json object has a “JSON” attribute that contains a list of parsed JSON data with one element per row of the data.frame.
Often you will have many lines of JSON data that you want to work with, in which case you can convert a character vector directly to obtain a tbl_json object with the same number of rows:
# Using a vector of JSON strings
y <- c('{"key1": "value1"}', '{"key2": "value2"}') %>% as.tbl_json
y
#> document.id
#> 1 1
#> 2 2
This creates a two-row tbl_json object, where each row corresponds to an index of the character vector. We can see the underlying parsed JSON:
attr(y, "JSON")
#> [[1]]
#> [[1]]$key1
#> [1] "value1"
#>
#>
#> [[2]]
#> [[2]]$key2
#> [1] "value2"
If your JSON data is already embedded in a data.frame, then you will need to call as.tbl_json directly in order to specify which column contains the JSON data. Note that the JSON in the data.frame should be character data, not a factor. Use stringsAsFactors = FALSE when constructing the data.frame to avoid turning the JSON into a factor.
df <- data.frame(
x = 1:2,
JSON = c('{"key1": "value1"}', '{"key2": "value2"}'),
stringsAsFactors = FALSE
)
z <- df %>% as.tbl_json(json.column = "JSON")
z
#> x
#> 1 1
#> 2 2
attr(z, "JSON")
#> [[1]]
#> [[1]]$key1
#> [1] "value1"
#>
#>
#> [[2]]
#> [[2]]$key2
#> [1] "value2"
The tidyjson package comes with several JSON example datasets:

- commits: commit data for the dplyr repo from the GitHub API
- issues: issue data for the dplyr repo from the GitHub API
- worldbank: World Bank funded projects from jsonstudio
- companies: startup company data from jsonstudio

Each dataset has some example tidyjson queries in help(commits), help(issues), help(worldbank) and help(companies).
The rest of tidyjson consists of various verbs that operate on tbl_json objects and return tbl_json objects. They are meant to be used in a pipeline with the %>% operator.
Note that these verbs all operate on both the underlying data.frame and the JSON, iteratively moving data from the JSON into the data.frame. Any modifications of the underlying data.frame outside of these operations may produce unintended consequences where the data.frame and JSON become out of sync.
The following table provides a reference of how each verb is used and what (if any) effect it has on the data.frame rows and columns and on the associated JSON.
| Verb | Use | Row Effect | Column Effect | JSON Effect |
|---|---|---|---|---|
| json_types() | Identify JSON structure | none | type column | none |
| gather_array() | Stack JSON array | Repeats rows | index column | enter array |
| gather_keys() | Stack a {“key”: value} object | Repeats rows | key column | enter object |
| spread_values() | Create new columns from values | none | N value columns | none |
| append_values_X() | Append all values of a type | none | column of type X | none |
| enter_object() | Dive into an object “key” | Keeps rows with key | none | enter object |
| json_lengths() | Identify JSON length | none | length column | none |
json_types()
One of the first steps you will want to take is to investigate the structure of your JSON data. The function json_types() inspects the JSON associated with each row of the data.frame, and adds a new column (type by default) that identifies the type according to the JSON standard.
c('{"a": 1}', '[1, 2]', '"a"', '1', 'true', 'null') %>% json_types
#> document.id type
#> 1 1 object
#> 2 2 array
#> 3 3 string
#> 4 4 number
#> 5 5 logical
#> 6 6 null
This is particularly useful for inspecting your JSON data types, and can be added after gather_array() (or gather_keys()) to inspect the types of the elements (or values) in arrays (or objects).
gather_array()
Arrays are sometimes vectors (fixed or varying length integer, character or logical vectors). But they often also contain lists of other objects (like a list of purchases for a user). The function gather_array() takes JSON arrays and duplicates the rows in the data.frame to correspond to the indices of the array, and puts the elements of the array into the JSON attribute. This is equivalent to “stacking” the array in the data.frame, and lets you continue to manipulate the remaining JSON in the elements of the array.
'[1, "a", {"k": "v"}]' %>% gather_array %>% json_types
#> document.id array.index type
#> 1 1 1 number
#> 2 1 2 string
#> 3 1 3 object
This allows you to enter into an array and begin processing its elements with other tidyjson functions. It retains the array.index in case the relative position of elements in the array is useful information.
gather_keys()
Similar to gather_array(), gather_keys() takes JSON objects and duplicates the rows in the data.frame to correspond to the keys of the object, and puts the values of the object into the JSON attribute.
'{"name": "bob", "age": 32}' %>% gather_keys %>% json_types
#> document.id key type
#> 1 1 name string
#> 2 1 age number
This allows you to enter into the keys of an object just as gather_array() lets you enter the elements of an array.
spread_values()
Adding new columns to your data.frame is accomplished with spread_values(), which lets you dive into (potentially nested) JSON objects and extract specific values. spread_values() takes jstring(), jnumber() or jlogical() function calls as arguments in order to specify the type of the data that should be captured at each desired key location. These values can be of varying types at varying depths, e.g.:
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
spread_values(
first.name = jstring("name", "first"),
age = jnumber("age")
)
#> document.id first.name age
#> 1 1 bob 32
append_values_X()
The append_values_X() functions let you take the remaining JSON and add it as a column X (for X in “string”, “number”, “logical”), insofar as it is of the JSON type specified. For example:
'{"first": "bob", "last": "jones"}' %>%
gather_keys() %>%
append_values_string()
#> document.id key string
#> 1 1 first bob
#> 2 1 last jones
Any values that do not conform to the type specified will be NA in the resulting column. This includes other scalar types (e.g., numbers or logicals if you are using append_values_string()) and also any rows where the JSON is still an object or an array.
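As a small sketch of the behavior described above (hypothetical data, not one of the package examples), values that are still objects or arrays come back as NA from append_values_string():
'{"a": "text", "b": {"k": "v"}, "c": [1, 2]}' %>%
  gather_keys() %>%        # one row per key: a, b, c
  append_values_string()   # "a" yields "text"; "b" and "c" are not strings, so NA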
enter_object()
For complex JSON structures, you will often need to navigate into nested objects in order to continue structuring your data. The function enter_object() lets you dive into a specific object key in the JSON attribute, so that all further tidyjson calls happen inside that object (all other JSON data outside the object is discarded). If the object doesn't exist for a given row / index, then that data.frame row will be discarded.
c('{"name": "bob", "children": ["sally", "george"]}', '{"name": "anne"}') %>%
spread_values(parent.name = jstring("name")) %>%
enter_object("children") %>%
gather_array %>%
append_values_string("children")
#> document.id parent.name array.index children
#> 1 1 bob 1 sally
#> 2 1 bob 2 george
This is useful when you want to limit your data to just information found in a specific key.
json_lengths()
When investigating JSON data it can be helpful to identify the lengths of the JSON objects or arrays, especially when they are 'ragged' across documents:
c('[1, 2, 3]', '{"k1": 1, "k2": 2}', '1') %>% json_lengths
#> document.id length
#> 1 1 3
#> 2 2 2
#> 3 3 1
When beginning to work with JSON data, you often don't have easy access to a schema describing what is in the JSON. One of the benefits of document-oriented data structures is that they let developers create data without having to worry about defining the schema explicitly.
Thus, the first step is to understand the structure of the JSON. Begin by visually inspecting a single record with jsonlite::prettify().
'{"key": "value", "array": [1, 2, 3]}' %>% prettify
#> {
#> "key": "value",
#> "array": [
#> 1,
#> 2,
#> 3
#> ]
#> }
#>
However, for complex data or large JSON structures this can be tedious. Instead, use gather_keys(), json_types() and json_lengths() to summarize the data:
'{"key": "value", "array": [1, 2, 3]}' %>%
gather_keys %>% json_types %>% json_lengths
#> document.id key type length
#> 1 1 key string 1
#> 2 1 array array 3
You can repeat this as you move through the JSON data, using enter_object() to summarize nested structures as well.
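For instance (a sketch reusing the purch_json data from the earlier example), the same summary pattern one level down looks like this:
purch_json %>%
  gather_array %>%                # stack the users
  enter_object("purchases") %>%   # dive into each user's purchases
  gather_array %>%                # stack the purchases
  gather_keys %>% json_types %>% json_lengths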
Once you have an understanding of how you'd like the data to be assembled, begin creating your tidyjson pipeline. Use enter_object() and gather_array() to navigate the JSON and stack any arrays, and use spread_values() to get at (potentially nested) key-value pairs along the way.
Before entering any objects, make sure you first use spread_values() to capture any top level identifiers you might need for analytics, summarization or relational uses downstream. If an identifier doesn't exist, then you can always fall back on the document.id column generated by as.tbl_json.
If you encounter data where information is encoded in both keys and values, then consider using gather_keys() and append_values_X(), where X is the type of JSON scalar data you expect in the values.
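For example (a hypothetical sketch where yearly totals are encoded as keys), gather_keys() recovers the years while append_values_number() recovers the totals:
'{"2014": 100, "2015": 200}' %>%
  gather_keys %>%                  # the year is encoded in the key
  append_values_number("total")    # the amount is the numeric value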
Note that there are often situations where multiple arrays or objects of differing types exist at the same level of the JSON hierarchy. In this case, you need to use enter_object() to enter each of them in separate pipelines to create separate data.frames that can then be joined relationally.
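As a sketch of this pattern (hypothetical data with parallel "emails" and "phones" arrays), each array gets its own pipeline, and the results join on the shared document.id:
contact <- '{"name": "bob", "emails": ["bob@example.com"], "phones": ["555-1234"]}'
emails <- contact %>%
  spread_values(name = jstring("name")) %>%
  enter_object("emails") %>% gather_array %>%
  append_values_string("email")
phones <- contact %>%
  enter_object("phones") %>% gather_array %>%
  append_values_string("phone")
# join relationally on document.id (dplyr will suffix the duplicated array.index columns)
dplyr::left_join(emails, phones, by = "document.id")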
Finally, don't forget that once you are done with your JSON tidying, you can use dplyr to continue manipulating the resulting data. dplyr::filter, dplyr::arrange and dplyr::mutate can be used, and will preserve the JSON attribute for further tidyjson manipulation. The same is true for the [ operator. Other dplyr functions will destroy the JSON attribute, so you will no longer be able to manipulate the JSON data.
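As a small sketch (hypothetical data), filtering mid-pipeline keeps the data.frame and the JSON attribute in sync, so tidyjson verbs still work afterwards:
c('{"name": "bob", "children": ["sally"]}', '{"name": "anne"}') %>%
  spread_values(name = jstring("name")) %>%
  filter(name == "bob") %>%     # dplyr::filter preserves the JSON attribute
  enter_object("children") %>%  # so we can keep tidying the remaining JSON
  gather_array %>%
  append_values_string("child")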
Included in the tidyjson package is a 500 record sample, worldbank, which contains a subset of the JSON data describing World Bank funded projects from jsonstudio.
First, let's take a look at a single record. We can use jsonlite::prettify to make the JSON easy to read:
library(jsonlite)
worldbank[1] %>% prettify
#> {
#> "_id": {
#> "$oid": "52b213b38594d8a2be17c780"
#> },
#> "boardapprovaldate": "2013-11-12T00:00:00Z",
#> "closingdate": "2018-07-07T00:00:00Z",
#> "countryshortname": "Ethiopia",
#> "majorsector_percent": [
#> {
#> "Name": "Education",
#> "Percent": 46
#> },
#> {
#> "Name": "Education",
#> "Percent": 26
#> },
#> {
#> "Name": "Public Administration, Law, and Justice",
#> "Percent": 16
#> },
#> {
#> "Name": "Education",
#> "Percent": 12
#> }
#> ],
#> "project_name": "Ethiopia General Education Quality Improvement Project II",
#> "regionname": "Africa",
#> "totalamt": 130000000
#> }
#>
An interesting object is “majorsector_percent”, which appears to capture the distribution of each project by sector. We also have several funding amounts, such as “totalamt”, which indicate how much money went into each project.
Let's grab the “totalamt”, and then gather the array of sectors and their percent allocations.
amts <- worldbank %>%
spread_values(
total = jnumber("totalamt")
) %>%
enter_object("majorsector_percent") %>% gather_array %>%
spread_values(
sector = jstring("Name"),
pct = jnumber("Percent")
) %>%
mutate(total.m = total / 10^6) %>%
select(document.id, sector, total.m, pct) %>%
tbl_df
amts
#> Source: local data frame [1,405 x 4]
#>
#> document.id sector total.m pct
#> 1 1 Education 130 46
#> 2 1 Education 130 26
#> 3 1 Public Administration, Law, and Justice 130 16
#> 4 1 Education 130 12
#> .. ... ... ... ...
Let's check that the “pct” column really adds up to 100 by project:
amts %>%
group_by(document.id) %>%
summarize(pct.total = sum(pct)) %>%
group_by(pct.total) %>%
tally
#> Source: local data frame [1 x 2]
#>
#> pct.total n
#> 1 100 500
It appears to always add up to 100. Let's also check the distribution of the total amounts.
summary(amts$total.m)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 0.00 30.00 81.55 100.00 1308.00
Many are 0, the mean is around $82m and the max is over $1.3bn.
Let's now aggregate by the sector and compute, on a dollar weighted basis, where the money is going by sector:
amts %>%
group_by(sector) %>%
summarize(
spend.portion = sum(total.m * pct / 100)
) %>%
ungroup %>%
mutate(spend.dist = spend.portion / sum(spend.portion)) %>%
arrange(desc(spend.dist))
#> Source: local data frame [10 x 3]
#>
#> sector spend.portion spend.dist
#> 1 Public Administration, Law, and Justice 8698.143 0.25477320
#> 2 Transportation 5878.931 0.17219698
#> 3 Health and other social services 4312.514 0.12631581
#> 4 Energy and mining 3360.977 0.09844478
#> .. ... ... ...
Also included in the tidyjson package is a 1,000 record sample, companies, which contains a subset of the JSON data describing startups from jsonstudio.
Instead of using jsonlite::prettify, let's quickly summarize the keys using tidyjson and visualize the results:
library(ggplot2)
key_stats <- companies %>%
gather_keys %>% json_types %>% group_by(key, type) %>% tally
key_stats
#> Source: local data frame [64 x 3]
#> Groups: key
#>
#> key type n
#> 1 _id object 1000
#> 2 acquisition object 149
#> 3 acquisition null 851
#> 4 acquisitions array 1000
#> .. ... ... ...
ggplot(key_stats, aes(key, n, fill = type)) +
geom_bar(stat = "identity", position = "stack") +
coord_flip()
Suppose we are interested in exploring the funding round data. Let's examine its structure:
companies %>%
enter_object("funding_rounds") %>%
gather_array %>%
gather_keys %>% json_types %>% group_by(key, type) %>% tally %>%
ggplot(aes(key, n, fill = type)) +
geom_bar(stat = "identity", position = "stack") +
coord_flip()
Now, referencing the above visualizations, we can structure some of the data for analysis:
rounds <- companies %>%
spread_values(
id = jstring("_id", "$oid"),
name = jstring("name"),
category = jstring("category_code")
) %>%
enter_object("funding_rounds") %>%
gather_array %>%
spread_values(
round = jstring("round_code"),
raised = jnumber("raised_amount")
)
rounds %>% glimpse
#> Observations: 696
#> Variables:
#> $ document.id (int) 1, 1, 1, 1, 1, 5, 5, 8, 8, 8, 14, 15, 16, 19, 27, ...
#> $ id (chr) "52cdef7e4bab8bd67529bd0c", "52cdef7e4bab8bd67529b...
#> $ name (chr) "OutSmart Power Systems", "OutSmart Power Systems"...
#> $ category (chr) "hardware", "hardware", "hardware", "hardware", "h...
#> $ array.index (int) 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 1, 1, 1, 1, 1, 2, 3,...
#> $ round (chr) "seed", "partial", "debt_round", "b", "a", "a", "d...
#> $ raised (dbl) 2000000, 200000, 2606500, 1500000, 3600000, 360000...
Now we can summarize by category and round how much is raised on average by round:
rounds %>%
filter(
!is.na(raised),
round %in% c('a', 'b', 'c'),
category %in% c('enterprise', 'software', 'web')
) %>%
group_by(category, round) %>%
summarize(raised = mean(raised)) %>%
ggplot(aes(round, raised / 10^6, fill = round)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(y = "Raised (m)") +
facet_grid(. ~ category)
This package is still a work in progress; significant additional features are being contemplated for future versions.