The goal of the {dm} package and the dm
class that comes with it, is to make your life easier when you are dealing with data from several different tables.
Let’s take a look at the dm
class.
dm
The dm
class consists of a collection of tables and metadata about the tables, such as
All tables in a dm
must be obtained from the same data source; csv files and spreadsheets would need to be imported to data frames in R.
dm
objectsThe relevant functions for creating dm
objects are:
dm()
as_dm()
new_dm()
There are currently three options available for creating a dm
object. To illustrate these options, we will now create the same dm
in three different ways. We can use the tables from the well-known {nycflights13} package.
dm
Start with an empty dm
object that has been created with dm()
or new_dm()
, and add tables to that object:
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables:
#> Columns: 0
#> Primary keys: 0
#> Foreign keys: 0
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
Turn a named list of tables into a dm
with as_dm()
:
as_dm(list(airlines = airlines,
airports = airports,
flights = flights,
planes = planes,
weather = weather))
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
src
into a dm
Squeeze all (or a subset of) tables belonging to a src
object into a dm
using dm()
:
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
Here we are making use of the fact that the function dm(src, table_names = NULL)
includes all available tables on a source in the dm
object, should the table_names
-argument be left NULL
. This means that you can use this, for example, on a postgres database that you access via src_postgres()
(with the appropriate arguments dbname
, host
, port
, …), to produce a dm
object with all the tables on the database.
Another way of creating a dm
object is calling new_dm()
on a list of tbls
, e.g., the tables included in the well-known iris
dataset:
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `iris1`, `iris2`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 0
We can get the list of tables with dm_get_tables()
and the src
object with dm_get_src()
.
In order to pull a specific table from a dm
, use:
#> # A tibble: 1,458 x 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_…
#> 2 06A Moton Field Municipa… 32.5 -85.7 264 -6 A America/Chic…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_…
#> 6 0A9 Elizabethton Municip… 36.4 -82.2 1593 -5 A America/New_…
#> 7 0G6 Williams County Airp… 41.5 -84.5 730 -5 A America/New_…
#> 8 0G7 Finger Lakes Regiona… 42.9 -76.8 492 -5 A America/New_…
#> 9 0P2 Shoestring Aviation … 39.8 -76.6 1000 -5 U America/New_…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_…
#> # … with 1,448 more rows
But how can we use {dm}-functions to manage the primary keys of the tables in a dm
object?
dm
objectsSome useful functions for managing primary key settings are:
dm_add_pk()
dm_has_pk()
dm_get_pk()
dm_rm_pk()
dm_enum_pk_candidates()
dm_get_all_pks()
Currently dm
objects only support one-column primary keys. If your tables have unique compound keys, adding a surrogate key column might be helpful. If you created a dm
object according to the examples in “Examples of dm
objects”, your object does not yet have any primary keys set. So let’s add one.
dm_add_pk()
has an option to check if the column of the table given by the user is a unique key; for performance reasons, the check will not be executed unless requested. Since the iris
dataset does not have any unique one-column keys, we will use the nycflights13
tables, i.e. flights_dm
from above.
dm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 0
The dm
now has a primary key. Let’s check:
Get the name of the column that is marked as primary key of the table:
Remove a primary key:
If you still need to get to know your data better, and it is already available in the form of a dm
object, you can use the dm_enum_pk_candidates()
function in order to get information about which columns of the table are unique keys:
#> # A tibble: 8 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 faa TRUE ""
#> 2 lon TRUE ""
#> 3 alt FALSE "has duplicate values: 0, 1, 3, 4, 5, … (>= 7 total)"
#> 4 dst FALSE "has duplicate values: A, N, U"
#> 5 lat FALSE "has duplicate values: 38.88944, 40.63975"
#> 6 name FALSE "has duplicate values: All Airports, Capital City Airp…
#> 7 tz FALSE "has duplicate values: -10, -9, -8, -7, -6, … (>= 7 to…
#> 8 tzone FALSE "has duplicate values: America/Anchorage, America/Chic…
The flights
table does not have any one-column primary key candidates:
#> # A tibble: 1 x 2
#> candidate n
#> <lgl> <int>
#> 1 FALSE 19
To get an overview over all tables with primary keys, use dm_get_all_pks()
:
#> # A tibble: 3 x 2
#> table pk_col
#> <chr> <keys>
#> 1 airlines carrier
#> 2 airports faa
#> 3 planes tailnum
Here we used the prepared dm
object dm_nycflights13(cycle = TRUE)
as an example. This object already has all keys pre-set.
Useful functions for managing foreign key relations include:
dm_add_fk()
dm_has_fk()
dm_get_fk()
dm_rm_fk()
dm_enum_fk_candidates()
dm_get_all_fks()
Now it gets (even more) interesting: we want to define relations between different tables. With the dm_add_fk()
function you can define which column of which table points to another table’s column.
This is done by choosing a foreign key from one table that will point to a primary key of another table. The primary key of the referred table must be set with dm_add_pk()
. dm_add_fk()
will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 1
This will throw an error:
flights_dm %>% dm_add_fk(flights, origin, airports)
#> Error: ref_table `airports` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.
Let’s create a dm
object with a foreign key relation to work with later on:
What if we tried to add another foreign key relation from flights
to airports
to the object? Column dest
might work, since it also contains airport codes:
flights_dm_with_fk %>% dm_add_fk(flights, dest, airports, check = TRUE)
#> Error: Column `dest` of table `flights` contains values (see above) that are not present in column `faa` of table `airports`.
As you can see, behind the scenes, checks are executed automatically (unless check = FALSE
) by the functions of dm
to prevent steps that would result in inconsistent representations.
Use dm_has_fk()
for checking if a foreign key exists that is pointing from one table to another:
flights_dm_with_fk %>% dm_has_fk(flights, planes)
#> [1] FALSE
flights_dm_with_fk %>% dm_has_fk(flights, airports)
#> [1] TRUE
If you want to access the name of the column which acts as a foreign key of one table to another table’s column, use dm_get_fk()
:
flights_dm_with_fk %>% dm_get_fk(flights, planes)
#> <list_of<character>[0]>
flights_dm_with_fk %>% dm_get_fk(flights, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "origin"
Remove foreign key relations with dm_rm_fk()
(parameter column = NULL
means that all relations will be removed):
flights_dm_with_fk %>%
dm_rm_fk(table = flights, column = dest, ref_table = airports) %>%
dm_get_fk(flights, airports)
#> Error: (`dest`) is not a foreign key of table `flights` into table `airports`. Foreign key columns are: (`origin`).
flights_dm_with_fk %>%
dm_rm_fk(flights, origin, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
flights_dm_with_fk %>%
dm_rm_fk(flights, NULL, airports) %>%
dm_get_fk(flights, airports)
#> <list_of<character>[0]>
Since the primary keys are defined in the dm
object, you do not need to provide the referenced column name of ref_table
. This is always the primary key column of the table.
Another function for getting to know your data better (cf. dm_enum_pk_candidates()
in “Primary keys of dm
objects”) is dm_enum_fk_candidates()
. Use it to get an overview over foreign key candidates that point from one table to another:
#> # A tibble: 15 x 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin TRUE ""
#> 2 day FALSE "Can't join on 'value' x 'value' because of incomp…
#> 3 dewp FALSE "Can't join on 'value' x 'value' because of incomp…
#> 4 hour FALSE "Can't join on 'value' x 'value' because of incomp…
#> 5 humid FALSE "Can't join on 'value' x 'value' because of incomp…
#> 6 month FALSE "Can't join on 'value' x 'value' because of incomp…
#> 7 precip FALSE "Can't join on 'value' x 'value' because of incomp…
#> 8 pressure FALSE "Can't join on 'value' x 'value' because of incomp…
#> 9 temp FALSE "Can't join on 'value' x 'value' because of incomp…
#> 10 time_hour FALSE "cannot join a POSIXct object with an object that …
#> 11 visib FALSE "Can't join on 'value' x 'value' because of incomp…
#> 12 wind_dir FALSE "Can't join on 'value' x 'value' because of incomp…
#> 13 wind_gust FALSE "Can't join on 'value' x 'value' because of incomp…
#> 14 wind_speed FALSE "Can't join on 'value' x 'value' because of incomp…
#> 15 year FALSE "Can't join on 'value' x 'value' because of incomp…
Get an overview of all foreign key relations withdm_get_all_fks()
:
#> # A tibble: 4 x 3
#> child_table child_fk_cols parent_table
#> <chr> <keys> <chr>
#> 1 flights carrier airlines
#> 2 flights dest airports
#> 3 flights origin airports
#> 4 flights tailnum planes