The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.
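In practice, this is done by labeling each record in the data with three additional fields: checksum (a hash of the record), from_ts (the time at which the record was introduced to the data set), and until_ts (the time at which the record was removed from the data set; missing while the record is still current).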
This strategy of time versioning is often called “type 2” history.
Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
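As a rough illustration of this bookkeeping, the base R sketch below folds successive snapshots into a history table carrying checksum, from_ts and until_ts columns. It is not how SCDB is implemented: apply_snapshot is a hypothetical helper, the pasted-string checksum stands in for a real hash, and timestamps are kept as plain character strings.

```r
# Hypothetical helper (not part of SCDB): fold a snapshot into a type 2 history.
# `history` holds the snapshot's columns plus checksum, from_ts and until_ts.
apply_snapshot <- function(history, snapshot, timestamp) {
  # Stand-in for a real hash: paste all fields of each record together
  snapshot$checksum <- do.call(paste, c(unname(as.list(snapshot)), sep = "|"))

  # Records without an until_ts are the currently active ones
  active <- is.na(history$until_ts)

  # Deactivate active records that are absent from the new snapshot
  gone <- active & !(history$checksum %in% snapshot$checksum)
  history$until_ts[gone] <- timestamp

  # Insert snapshot records that are not already active
  new <- snapshot[!(snapshot$checksum %in% history$checksum[active]), , drop = FALSE]
  if (nrow(new) > 0) {
    new$from_ts <- timestamp
    new$until_ts <- NA_character_
    history <- rbind(history, new)
    rownames(history) <- NULL
  }
  history
}

# An empty history table with the expected columns
history <- data.frame(car = character(), hp = numeric(), checksum = character(),
                      from_ts = character(), until_ts = character())

day1 <- data.frame(car = c("Mazda RX4", "Datsun 710"), hp = c(110, 93))
day3 <- data.frame(car = c("Mazda RX4", "Datsun 710"), hp = c(55, 93))

history <- apply_snapshot(history, day1, "2020-01-01 11:00:00")
history <- apply_snapshot(history, day3, "2020-01-03 10:00:00")
print(history[, c("car", "hp", "from_ts", "until_ts")])
```

In this sketch the changed record is closed by setting its until_ts, the new version is inserted with a fresh from_ts, and the unchanged record is left alone. Because only active records are compared, a record that returns to an earlier value is inserted again as a new row.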
The SCDB package provides the function update_snapshot to handle the insertion and deactivation of records using this strategy.
It further includes several functions that improve the quality of life when working with database data.
A simple example is shown below:
library(SCDB)
# First we connect to our database.
# If this is, e.g., a PostgreSQL database already running on the machine, connection
# can be done after the configuration of a .pgpass file.
# For this example, we use an on-disk SQLite db to showcase.
conn <- get_connection(drv = RSQLite::SQLite())
# NOTE: Had the PostgreSQL DB been configured, we would not need to pass any args
# to get_connection()
# Our example data is mtcars with rownames converted to a column and only the hp
# column of mtcars kept
example_data <- dplyr::transmute(mtcars, car = rownames(mtcars), hp)
# If the data does not already live on the remote, we must transfer it
example_data <- dplyr::copy_to(conn, example_data, overwrite = TRUE)
# In this example, we imagine that on day 1, in this case 2020-01-01 11:00:00, the data
# known to us is the first 3 records of mtcars
data <- head(example_data, 3)
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-01 11:00:00"))
# We can access our data using the `get_table` function
print(get_table(conn, "mtcars"))
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.41.2 []
#>   car              hp
#>   <chr>         <dbl>
#> 1 Datsun 710       93
#> 2 Mazda RX4       110
#> 3 Mazda RX4 Wag   110
# And we can see the time-keeping if we set `include_slice_info = TRUE`
print(get_table(conn, "mtcars", include_slice_info = TRUE))
#> # Source:   SQL [3 x 5]
#> # Database: sqlite 3.41.2 []
#>   car              hp checksum                         from_ts          until_ts
#>   <chr>         <dbl> <chr>                            <chr>            <chr>
#> 1 Datsun 710       93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 11:0… <NA>
#> 2 Mazda RX4       110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 11:0… <NA>
#> 3 Mazda RX4 Wag   110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 11:0… <NA>
# Let's say that the next day, our data set is now the first 5 records of our example data
data <- head(example_data, 5)
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-02 12:00:00"))
# We again use the `get_table` function and see the latest available data
print(get_table(conn, "mtcars"))
#> # Source:   SQL [5 x 2]
#> # Database: sqlite 3.41.2 []
#>   car                  hp
#>   <chr>             <dbl>
#> 1 Datsun 710           93
#> 2 Mazda RX4           110
#> 3 Mazda RX4 Wag       110
#> 4 Hornet 4 Drive      110
#> 5 Hornet Sportabout   175
# And we can see the time-keeping if we set `include_slice_info = TRUE`
print(get_table(conn, "mtcars", include_slice_info = TRUE))
#> # Source:   SQL [5 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … <NA>
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>
# Since our data is time-versioned, we can recover the data from the day before
print(get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00"))
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.41.2 []
#>   car              hp
#>   <chr>         <dbl>
#> 1 Datsun 710       93
#> 2 Mazda RX4       110
#> 3 Mazda RX4 Wag   110
# On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |>
  dplyr::mutate(hp = ifelse(dplyr::row_number() == 1, hp / 2, hp))
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-03 10:00:00"))
# We can again access our data using the `get_table` function and see the currently
# available data (with the changed hp value for Mazda RX4)
print(get_table(conn, "mtcars"))
#> # Source:   SQL [5 x 2]
#> # Database: sqlite 3.41.2 []
#>   car                  hp
#>   <chr>             <dbl>
#> 1 Datsun 710           93
#> 2 Mazda RX4 Wag       110
#> 3 Hornet 4 Drive      110
#> 4 Hornet Sportabout   175
#> 5 Mazda RX4            55
# When `slice_ts` is set to `NULL`, the full history of the table is returned
print(get_table(conn, "mtcars", slice_ts = NULL))
#> # Source:   table<`mtcars`> [6 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>
#> 6 Mazda RX4            55 1232f78f7befb3a765b91176eaacdbb0 2020-01-03 … <NA>
# Setting include_slice_info = TRUE also returns checksum, from_ts and until_ts.
# This is most useful when viewing data from a specific point in time
print(get_table(conn, "mtcars", slice_ts = "2020-01-03 06:30:00",
                include_slice_info = TRUE))
#> # Source:   SQL [5 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>
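
The point-in-time lookups above can be read as a filter on from_ts and until_ts. The base R sketch below reproduces that logic on a hand-written copy of the history table, with checksums omitted and timestamps taken from the update_snapshot calls. slice_history is a hypothetical helper, and treating until_ts as an exclusive bound is an assumption rather than documented SCDB behaviour.

```r
# Hand-written copy of the full history shown above (checksums omitted)
history <- data.frame(
  car = c("Datsun 710", "Mazda RX4", "Mazda RX4 Wag",
          "Hornet 4 Drive", "Hornet Sportabout", "Mazda RX4"),
  hp = c(93, 110, 110, 110, 175, 55),
  from_ts = c("2020-01-01 11:00:00", "2020-01-01 11:00:00", "2020-01-01 11:00:00",
              "2020-01-02 12:00:00", "2020-01-02 12:00:00", "2020-01-03 10:00:00"),
  until_ts = c(NA, "2020-01-03 10:00:00", NA, NA, NA, NA)
)

# Hypothetical helper (not part of SCDB): records that were active at `slice_ts`
slice_history <- function(history, slice_ts) {
  ts <- as.POSIXct(slice_ts, tz = "UTC")
  from <- as.POSIXct(history$from_ts, tz = "UTC")
  until <- as.POSIXct(history$until_ts, tz = "UTC")
  # Active means: introduced at or before slice_ts and not yet removed.
  # Assumption: a record stops being active exactly at its until_ts.
  keep <- from <= ts & (is.na(until) | ts < until)
  history[keep, , drop = FALSE]
}

print(slice_history(history, "2020-01-01 11:00:00"))
print(slice_history(history, "2020-01-03 06:30:00"))
```

The two calls select the same records as the corresponding get_table calls above: three records for the first timestamp and five for the second, with Mazda RX4 still at 110 hp because its replacement only arrives at 10:00:00.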