Multiple, linked tables are a common concept in database management. Since many R users have a background in other disciplines, we present six important terms in relational data modeling to jump-start working with {dm}.
1) Data frames and tables
A data frame is a fundamental data structure in R. If you imagine it visually, the result is a typical table structure. That’s why working with data from spreadsheets is so convenient and the users of the the popular {dplyr} package for data wrangling mainly rely on data frames.
The disadvantage of data frames and flat file systems like spreadsheets is that they can result in bloated tables that repeat the same values multiple times. In the worst case scenario, you may have a data frame with many rows and columns but with only a single value different in each row.
This calls for a better data organization by utilizing the resemblance between data frames and database tables, which consist of columns and rows, too. The elements are just called differently:
Data Frame
Table
Column
Attribute
Row
Tuple
Therefore, the separation into multiple tables is a first step that helps data quality. But without an associated data model you don’t take full advantage. For example, joining is more complicated than it should be. This is illustrated above.
With {dm} you can have the best of both worlds: Manage your data as linked tables, then flatten multiple tables into one for your analysis with {dplyr} on an as-needed basis.
2) Model
A data model shows the structure between multiple tables that can be linked together. The nycflights13 relations can be transferred into the following graphical representation:
The flights table is linked to three other tables: airlines, planes and airports. By using directed arrows, the visualization explicitly shows the connection between different columns/attributes. For example: The column carrier in flights can be joined with the column carrier from the airlines table. Further Reading: The {dm} methods for visualizing data models.
The links between the tables are established through primary keys and foreign keys.
3) Primary Keys
In a relational data model, every table needs to have one column/attribute that uniquely identifies a row. This column is called a primary key (abbreviated with pk). The primary key column has unique values and can’t contain NA or NULL values. If no such column exists, it is common practice to create a synthetic column of numeric or globally unique identifiers (surrogate key).
In the airlines table of nycflights13 the column carrier is the primary key.
Further Reading: The {dm} package offers several function for dealing with primary keys.
4) Foreign Keys
The counterpart of a primary key in one table is the foreign key in another table. In order to join two tables, the primary key of the first table needs to be available in the second table, too. This second column is called the foreign key (abbreviated with fk).
For example, if you want to link the airlines table in the nycflights13 data to the flights table, the primary key in the airlines table is carrier which is present as foreign key carrier in the flights table.
Further Reading: The {dm} functions for working with foreign keys.
5) Normalization
One main goal is to keep the data organization as clean and simple as possible by avoiding redundant data entries. Normalization is the technical term that describes this central design principle of a relational data model: splitting data into multiple tables. A normalized data schema consists of several relations (tables) that are linked with attributes (columns) with primary and foreign keys.
For example, if you want to change the name of one airport in nycflights13, you have to change only a single data entry. Sometimes, this principle is called “single point of truth”.
dm is built upon relational data models but it is not a database itself. Databases are systems for data management and many of them are constructed as relational databases, e.g. SQLite, MySQL, MSSQL, Postgres. As you can guess from the names of the databases SQL, the structured querying language plays an important role: It was invented for the purpose of querying relational databases.
Therefore, {dm} can copy data from and to databases, and works transparently with both in-memory data and with relational database systems.