Defining a ‘dm’ object for your data

2020-03-04

This document describes how you can get your data into a dm object.

The example used here ships with dm, and you can load it by running dm_nycflights13(). We nevertheless rebuild it from scratch here, step by step.

The five tables that we are working with (airlines, airports, flights, planes, and weather) contain information about all flights that departed from the airports of New York to destinations in the United States in 2013. They are available through the nycflights13 package.

Once we’ve loaded nycflights13, the aforementioned tables are all in our work environment, ready to be accessed.

library(dm)
library(nycflights13)

Adding Tables

First, we will tell dm which tables we want to work with and how they are connected. For that, we can use dm() or as_dm(), which accepts lists of objects. If you use list(), you must name the objects explicitly (e.g., list("airlines" = airlines, "flights" = flights)). Here we use tibble::lst(), which names the components of the list automatically.

flights_dm_no_keys <- tibble::lst(airlines, airports, flights, planes, weather) %>%
  as_dm()
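
As a minimal sketch of the alternative mentioned above, the same object can be built from a plain list(), provided each element is named explicitly. The name flights_dm_from_list is made up for this illustration.

```r
library(dm)
library(nycflights13)

# With base list(), every table has to be named by hand;
# tibble::lst() derives these names automatically.
flights_dm_from_list <- as_dm(list(
  airlines = airlines,
  airports = airports,
  flights = flights,
  planes = planes,
  weather = weather
))

# The table names of the dm are taken from the list names.
names(flights_dm_from_list)
```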

Defining Keys

Even though you now have a new dm object that contains all your data, some key details are still missing that determine how your five tables are connected (the foreign keys), and which column(s) uniquely identify the observations (the primary keys).

Primary Keys

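dm offers dm_enum_pk_candidates() to identify primary keys and dm_add_pk() to add them. Applied to the planes table, dm_enum_pk_candidates() reports tailnum as the only candidate, because every other column contains duplicate values.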
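
A call along the following lines would produce such a candidate listing for planes. It is a sketch that rebuilds the key-less dm so that it runs on its own; the choice of planes is inferred from the column names in the output.

```r
library(dm)
library(nycflights13)

# Rebuild the dm without keys so this chunk is self-contained.
flights_dm_no_keys <- as_dm(tibble::lst(airlines, airports, flights, planes, weather))

# Test each column of `planes` for uniqueness; only columns without
# duplicate values qualify as primary key candidates.
dm_enum_pk_candidates(flights_dm_no_keys, planes)
```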

## # A tibble: 9 x 3
##   columns     candidate why                                                
##   <keys>      <lgl>     <chr>                                              
## 1 tailnum     TRUE      ""                                                 
## 2 engine      FALSE     "has duplicate values: 4 Cycle, Reciprocating, Tur…
## 3 engines     FALSE     "has duplicate values: 1, 2, 3, 4"                 
## 4 manufactur… FALSE     "has duplicate values: AIRBUS, AIRBUS INDUSTRIE, A…
## 5 model       FALSE     "has duplicate values: 717-200, 737-301, 737-3G7, …
## 6 seats       FALSE     "has duplicate values: 2, 4, 5, 6, 7, … (>= 7 tota…
## 7 speed       FALSE     "has duplicate values: 90, 105, 162, 432, NA"      
## 8 type        FALSE     "has duplicate values: Fixed wing multi engine, Fi…
## 9 year        FALSE     "has duplicate values: 1959, 1963, 1975, 1976, 197…

Now, add the primary keys that you have identified:
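
One way to do this is to chain one dm_add_pk() call per table. The sketch below assumes the three keys shown in the listings of this document (carrier, faa, tailnum); the object name flights_dm_only_pks is an assumption chosen for illustration.

```r
library(dm)
library(nycflights13)

# Rebuild the dm without keys so this chunk is self-contained.
flights_dm_no_keys <- as_dm(tibble::lst(airlines, airports, flights, planes, weather))

# Each call registers one primary key: first the table, then the key column.
flights_dm_only_pks <- flights_dm_no_keys %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum)

# Printing the dm shows the updated key counts.
flights_dm_only_pks
```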

## ── Table source ───────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ───────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 0

To review the primary keys after setting them, call dm_get_all_pks().
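
As a short sketch, the review step could look as follows. The dm is rebuilt inside the chunk so that it runs independently, and the name pk_overview is hypothetical.

```r
library(dm)
library(nycflights13)

# Build the dm and register the three primary keys in one pipeline.
pk_overview <- as_dm(tibble::lst(airlines, airports, flights, planes, weather)) %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum) %>%
  dm_get_all_pks()

# One row per table that has a primary key.
pk_overview
```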

## # A tibble: 3 x 2
##   table    pk_col 
##   <chr>    <keys> 
## 1 airlines carrier
## 2 airports faa    
## 3 planes   tailnum

Foreign Keys
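
Before adding foreign keys, it can help to check which columns of a table qualify. The listing that follows appears to come from dm_enum_fk_candidates(), called with flights as the referencing table and airlines as the referenced table. The exact call is inferred from the output rather than taken from the source, so treat this as a sketch.

```r
library(dm)
library(nycflights13)

# The referenced table needs a primary key before candidates can be checked.
flights_dm_airlines_pk <- as_dm(tibble::lst(airlines, airports, flights, planes, weather)) %>%
  dm_add_pk(airlines, carrier)

# Test every column of `flights` against the primary key of `airlines`.
dm_enum_fk_candidates(flights_dm_airlines_pk, flights, airlines)
```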

## # A tibble: 19 x 3
##    columns      candidate why                                              
##    <keys>       <lgl>     <chr>                                            
##  1 carrier      TRUE      ""                                               
##  2 tailnum      FALSE     "334264 entries (99.3%) of `flights$tailnum` not…
##  3 dest         FALSE     "336776 entries (100%) of `flights$dest` not in …
##  4 origin       FALSE     "336776 entries (100%) of `flights$origin` not i…
##  5 air_time     FALSE     "Can't join on 'value' x 'value' because of inco…
##  6 arr_delay    FALSE     "Can't join on 'value' x 'value' because of inco…
##  7 arr_time     FALSE     "Can't join on 'value' x 'value' because of inco…
##  8 day          FALSE     "Can't join on 'value' x 'value' because of inco…
##  9 dep_delay    FALSE     "Can't join on 'value' x 'value' because of inco…
## 10 dep_time     FALSE     "Can't join on 'value' x 'value' because of inco…
## 11 distance     FALSE     "Can't join on 'value' x 'value' because of inco…
## 12 flight       FALSE     "Can't join on 'value' x 'value' because of inco…
## 13 hour         FALSE     "Can't join on 'value' x 'value' because of inco…
## 14 minute       FALSE     "Can't join on 'value' x 'value' because of inco…
## 15 month        FALSE     "Can't join on 'value' x 'value' because of inco…
## 16 sched_arr_t… FALSE     "Can't join on 'value' x 'value' because of inco…
## 17 sched_dep_t… FALSE     "Can't join on 'value' x 'value' because of inco…
## 18 time_hour    FALSE     "cannot join a POSIXct object with an object tha…
## 19 year         FALSE     "Can't join on 'value' x 'value' because of inco…

To define how your tables are related, use dm_add_fk() to add foreign keys. First, name the two tables that you wish to connect through the table and ref_table arguments of dm_add_fk().

Then indicate in column which column of table refers to ref_table’s primary key, which you’ve defined above. Use check = FALSE to omit consistency checks.
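
The sketch below adds three foreign keys from flights to the tables that already have primary keys. Which three relations the original used is an assumption here (tailnum to planes, carrier to airlines, origin to airports), as is the object name flights_dm_all_keys. Arguments are passed by position in the order table, column, ref_table.

```r
library(dm)
library(nycflights13)

# Rebuild the dm with its primary keys so this chunk is self-contained.
flights_dm_only_pks <- as_dm(tibble::lst(airlines, airports, flights, planes, weather)) %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum)

flights_dm_all_keys <- flights_dm_only_pks %>%
  # `flights` contains tail numbers that are missing from `planes`,
  # so the consistency check is skipped explicitly for this relation.
  dm_add_fk(flights, tailnum, planes, check = FALSE) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, origin, airports)

# Printing the dm shows the updated key counts.
flights_dm_all_keys
```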

## ── Table source ───────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ───────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 3

Retrieving Keys

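To retrieve your keys later on, use dm_get_all_pks() for the primary keys and dm_get_all_fks() for the foreign keys, or dm_get_fk() for the foreign key between one specific pair of tables. The listing below shows the primary keys again: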

## # A tibble: 3 x 2
##   table    pk_col 
##   <chr>    <keys> 
## 1 airlines carrier
## 2 airports faa    
## 3 planes   tailnum
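
Foreign keys can be listed in the same way with dm_get_all_fks(). The following is a sketch that rebuilds the dm with an assumed set of three foreign keys; the name fk_overview is hypothetical, and no output is shown because its exact layout depends on the installed dm version.

```r
library(dm)
library(nycflights13)

# Build the dm, then register primary keys and (assumed) foreign keys.
fk_overview <- as_dm(tibble::lst(airlines, airports, flights, planes, weather)) %>%
  dm_add_pk(airlines, carrier) %>%
  dm_add_pk(airports, faa) %>%
  dm_add_pk(planes, tailnum) %>%
  dm_add_fk(flights, tailnum, planes, check = FALSE) %>%
  dm_add_fk(flights, carrier, airlines) %>%
  dm_add_fk(flights, origin, airports) %>%
  dm_get_all_fks()

# One row per foreign key relation.
fk_overview
```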

Voilà, here’s your dm object that you can work with:

## ── Table source ───────────────────────────────────────────────────────────
## src:  <environment: R_GlobalEnv>
## ── Metadata ───────────────────────────────────────────────────────────────
## Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
## Columns: 53
## Primary keys: 3
## Foreign keys: 3