The goal of the package {dm} and its dm class is to facilitate working with multiple related tables.
An object of the dm class contains the data in the tables, and metadata about the tables, such as the table and column names and the primary and foreign keys.
This package augments {dplyr}/{dbplyr} workflows.
In addition, a battery of utilities is provided that helps with creating a tidy data model.
This package follows several of the “tidyverse” rules:
- dm objects are immutable (your data will never be overwritten in place)
- functions that operate on dm objects are pipeable (i.e., they return new dm objects)
The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a great collaboration!
We will now demonstrate some of the features of {dm}.
Let’s first have a brief look at how to create a dm-class object.

Creating dm objects
The {nycflights13} package offers a nice example of interconnected tables. The most straightforward way of squeezing those tables into a dm object is:
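One way to build such an object can be sketched as follows, assuming that as_dm() accepts a named list of data frames. The variable name flights_dm matches the later examples, but the call that produced the printed output may have differed.

```r
library(dm)

# Assumption: as_dm() coerces a named list of data frames into a dm object.
# The list names become the table names inside the dm.
flights_dm <- as_dm(list(
  airlines = nycflights13::airlines,
  airports = nycflights13::airports,
  flights  = nycflights13::flights,
  planes   = nycflights13::planes,
  weather  = nycflights13::weather
))

flights_dm
```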
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
This fairly verbose output shows the data and metadata of a dm object. The various components can be accessed with functions of the type dm_get_...(), e.g.:
dm_get_src(flights_dm)
#> src: <environment: R_GlobalEnv>
#> tbls: airlines, airlines_filtered, airports_filtered, child_table, d1, d2,
#> d3, d4, data_1, data_2, data_3, dm_joined, mtcars_tibble, parent_table
dm_get_all_pks(flights_dm)
#> # A tibble: 0 x 2
#> # … with 2 variables: table <chr>, pk_col <keys>
dm_get_all_fks(flights_dm)
#> # A tibble: 0 x 3
#> # … with 3 variables: child_table <chr>, child_fk_cols <keys>,
#> # parent_table <chr>
Notably, the function dm_get_tables() returns a named list containing the individual tables.
As you can see in the output above, no keys have been set so far. We will use dm_add_pk() and dm_add_fk() to add primary keys (pk) and foreign keys (fk):
flights_dm_with_one_key <-
  flights_dm %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_fk(flights, carrier, airlines)
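What these two key declarations mean for the data can be checked by hand. The sketch below uses plain {dplyr} on the {nycflights13} tables and is independent of {dm}; the names pk_ok, orphans and fk_ok are illustrative only.

```r
library(dplyr)
library(nycflights13)

# A primary key column has no missing values and no duplicated values.
pk_ok <- !anyNA(airlines$carrier) && anyDuplicated(airlines$carrier) == 0

# A foreign key column only holds values that also exist in the parent table.
# anti_join() keeps the child rows that have no match in the parent.
orphans <- anti_join(flights, airlines, by = "carrier")
fk_ok <- nrow(orphans) == 0

pk_ok
fk_ok
```

An empty orphans table means that every flight refers to a carrier listed in airlines.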
After you set the keys and establish relations, you can create a graphical representation of your data model with dm_draw().
The dm_nycflights13() function provides a shortcut: the dm object returned by this function contains all tables (by default a reduced version of table flights), defines all primary and foreign keys, and even assigns colors to the different types of tables. We will be using the dm object created by this function from now on.

Filtering a dm object
The idea of a filter on a dm object:
- You filter one or more of the dm’s tables, just like with normal dplyr::filter() calls.
- The filter conditions are stored in the dm object.
- When a table is accessed via dm_apply_filters_to_tbl(), a sequence of semi_join() calls is performed to retrieve the requested table with only those values in the key columns which correspond to the remaining values in the filtered tables.
The function dm_apply_filters() essentially calls dm_apply_filters_to_tbl() for each table of the dm and creates a new dm object from the result.
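The semi-join cascade can be sketched with plain {dplyr} on the {nycflights13} tables. This illustrates the idea only and is not a description of how {dm} is implemented internally. The object names are hypothetical, and the sketch assumes that flights is linked to airports through origin and faa, to airlines through carrier, and to planes through tailnum.

```r
library(dplyr)
library(nycflights13)

# Step 1: filter individual tables, as with ordinary dplyr::filter() calls.
airports_kept <- filter(airports, name == "John F Kennedy Intl")
airlines_kept <- filter(airlines, name == "US Airways Inc.")

# Step 2: keep only flights whose key values survive in both filtered tables.
# semi_join() filters rows without adding columns or duplicating rows.
flights_kept <- flights %>%
  semi_join(airports_kept, by = c("origin" = "faa")) %>%
  semi_join(airlines_kept, by = "carrier")

# Step 3: propagate the restriction to a table linked to flights.
planes_kept <- semi_join(planes, flights_kept, by = "tailnum")

nrow(flights_kept)
nrow(planes_kept)
```

Because semi_join() is a filtering join, flights_kept has exactly the columns of flights, which is why a cascade of such joins can return each table in its original shape.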
Currently, filtering only works if the graph induced by the fk relations is cycle free, which is the default for dm_nycflights13():
flights_dm_acyclic <- dm_nycflights13()
Let’s set two filters:
us_flights_from_jfk_prepared <-
  flights_dm_acyclic %>%
  dm_filter(airports, name == "John F Kennedy Intl") %>%
  dm_filter(airlines, name == "US Airways Inc.")
us_flights_from_jfk_prepared
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> airlines: name == "US Airways Inc."
#> airports: name == "John F Kennedy Intl"
With dm_apply_filters(), you can apply all filters and therefore update all tables in the dm, so that they contain only the rows that are relevant to the filters. The effect of the filters can be seen when counting the rows:
us_flights_from_jfk <- dm_apply_filters(us_flights_from_jfk_prepared)
us_flights_from_jfk %>%
  dm_get_tables() %>%
  map_int(nrow)
#> airlines airports flights planes weather
#> 1 1 95 67 26115
Alternatively, you can pull a single table out of the dm to answer the question of which planes were used to service the US Airways flights that departed from JFK airport:
#> # A tibble: 67 x 9
#> tailnum year type manufacturer model engines seats speed engine
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
#> 1 N102UW 1998 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 2 N107US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 3 N110UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 4 N111US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 5 N112US 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 6 N113UW 1999 Fixed win… AIRBUS INDUS… A320-… 2 182 NA Turbo…
#> 7 N126UW 2009 Fixed win… AIRBUS A320-… 2 182 NA Turbo…
#> 8 N152UW 2013 Fixed win… AIRBUS A321-… 2 199 NA Turbo…
#> 9 N154UW 2013 Fixed win… AIRBUS A321-… 2 199 NA Turbo…
#> 10 N167US 2001 Fixed win… AIRBUS INDUS… A321-… 2 199 NA Turbo…
#> # … with 57 more rows
Each of the planes in the result set above was a part of at least one US Airways flight departing from JFK. Do they have any common characteristics?
#> # A tibble: 6 x 2
#> model n
#> <chr> <int>
#> 1 A319-112 16
#> 2 A320-214 7
#> 3 A320-232 11
#> 4 A321-211 8
#> 5 A321-231 24
#> 6 ERJ 190-100 IGW 1
For comparison, let’s look at the equivalent manual query in {dplyr} syntax:
flights %>%
  left_join(airports, by = c("origin" = "faa")) %>%
  filter(name == "John F Kennedy Intl") %>%
  left_join(airlines, by = "carrier") %>%
  filter(name.y == "US Airways Inc.") %>%
  semi_join(planes, ., by = "tailnum") %>%
  count(model)
The {dm} code is leaner because the foreign key relations are encoded in the object.
Note that if you access a table via tbl.dm(), $.dm() or [[.dm(), filter conditions set for other tables are ignored.
The dm_join_to_tbl() function joins two immediately related tables in a data model. The definition of the primary and foreign key constraints is used to define the relationship.
#> # A tibble: 11,227 x 20
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 13 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, name <chr>
The same operation using {dplyr} syntax looks like this:
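A sketch of such a join, assuming the two tables involved are flights and airlines linked by carrier, as the trailing name column in the output above suggests. It runs on the full {nycflights13} flights table, so the row count differs from the reduced table shown above; the name flights_with_airline is illustrative.

```r
library(dplyr)
library(nycflights13)

# Without a dm object, the join column has to be spelled out by hand.
flights_with_airline <- left_join(flights, airlines, by = "carrier")

dim(flights_with_airline)
```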
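Omitting the by argument makes {dplyr} guess the join columns from the column names that the two tables share and report its choice in a message.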

Copy a dm object to a DB or learn from it
dm objects can be transferred from one src to another. The relevant verb is copy_dm_to(), which will copy both data and key constraints.
src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src: sqlite 3.30.1 [:memory:]
#> tbls:
flights_dm_with_keys_remote <- copy_dm_to(src_sqlite, flights_dm_with_keys)
As a result, the tables are transferred to the target data source, and all keys will be contained in the returned data model.
src_sqlite
#> src: sqlite 3.30.1 [:memory:]
#> tbls: airlines, airports, flights, planes, sqlite_stat1, sqlite_stat4,
#> weather
flights_dm_with_keys_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.30.1 [:memory:]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 4
In the opposite direction, dm objects can also be “learned” from a DB, including the key constraints, by utilizing the DB’s meta-information tables. Unfortunately, this currently only works for MSSQL and Postgres, so we cannot show the results here just yet.
- dm objects and basic operations on them, such as handling key constraints: the “Class ‘dm’ and basic operations” article
- Visualizing dm objects: the “Visualizing ‘dm’ objects” article