Basic principles

suppressPackageStartupMessages(library(SCDB))
options("SCDB.log_path" = tempdir())

The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.

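In practice, this is done by labeling each record in the data with three additional fields:

checksum: a hash of the information stored in the record
from_ts: the time at which the record was introduced to the data set
until_ts: the time at which the record was removed from the data set (missing while the record is still current)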

This strategy of time versioning is often called “type 2” history.
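
To make the idea concrete, the following is a minimal base-R sketch of a type 2 history table and of how the state at a given time can be recovered from it. Each row carries the time it became valid (`from_ts`) and the time it stopped being valid (`until_ts`, `NA` while still current). The table contents and the helper `slice_at()` are hypothetical and not part of SCDB; treating `from_ts` as inclusive and `until_ts` as exclusive is an assumption made for this illustration.

```r
# Hypothetical type 2 history: one row per version of a record.
history <- data.frame(
  car      = c("Mazda RX4", "Datsun 710", "Mazda RX4"),
  hp       = c(110, 93, 55),
  from_ts  = as.POSIXct(c("2020-01-01 11:00:00", "2020-01-01 11:00:00",
                          "2020-01-03 10:00:00"), tz = "UTC"),
  until_ts = as.POSIXct(c("2020-01-03 10:00:00", NA, NA), tz = "UTC")
)

# Illustrative helper (not an SCDB function): keep the rows valid at time `ts`.
# Assumes from_ts is inclusive and until_ts is exclusive.
slice_at <- function(history, ts) {
  ts <- as.POSIXct(ts, tz = "UTC")
  valid <- history$from_ts <= ts &
    (is.na(history$until_ts) | ts < history$until_ts)
  history[valid, c("car", "hp")]
}

day_2 <- slice_at(history, "2020-01-02 00:00:00")  # before the change
day_3 <- slice_at(history, "2020-01-03 12:00:00")  # after the change
print(day_2)
print(day_3)
```

Both slices contain two cars; only the `hp` value recorded for Mazda RX4 differs between them, because the old version was closed and a new version was opened at the time of the change.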

Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
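
As a small illustration of such a reintroduction, the sketch below builds a hypothetical address history in base R for one person who moves away and later moves back. The names, addresses and dates are invented for the example; only the column names `from_ts` and `until_ts` follow the package's convention.

```r
# Hypothetical address history for one person who moves away and back again.
addresses <- data.frame(
  name     = c("Alice", "Alice", "Alice"),
  address  = c("1 Main St", "9 Oak Ave", "1 Main St"),
  from_ts  = as.Date(c("2020-01-01", "2021-06-01", "2022-03-01")),
  until_ts = as.Date(c("2021-06-01", "2022-03-01", NA))
)

# Rows 1 and 3 carry identical content but cover different periods,
# so both are kept as separate versions of the record.
content <- addresses[c("name", "address")]
reintroduced <- duplicated(content) | duplicated(content, fromLast = TRUE)
print(addresses[reintroduced, ])
```

The two flagged rows have the same content, so a content-based checksum would also be the same for both; what distinguishes them is their validity period.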

The SCDB package provides the function update_snapshot to handle the insertion and deactivation of records using this strategy. It also includes several functions that make working with database data more convenient.
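
The insertion and deactivation logic can be sketched in a few lines of base R. This is a simplified, in-memory illustration under stated assumptions, not SCDB's implementation: `update_history()` is a hypothetical helper, it works on a local data frame rather than a database table, and it compares records by pasting their columns together where the package stores a checksum.

```r
# Illustrative only: a simplified, in-memory version of the update logic.
# `update_history()` is a hypothetical helper, not an SCDB function.
update_history <- function(history, snapshot, timestamp) {
  timestamp <- as.POSIXct(timestamp, tz = "UTC")
  key <- function(df) paste(df$car, df$hp, sep = "|")  # stand-in for a checksum

  active <- is.na(history$until_ts)

  # Deactivate active records that are absent from the new snapshot.
  gone <- active & !(key(history) %in% key(snapshot))
  history$until_ts[gone] <- timestamp

  # Insert snapshot records that were not already active.
  new <- snapshot[!(key(snapshot) %in% key(history)[active]), , drop = FALSE]
  if (nrow(new) > 0) {
    new$from_ts <- timestamp
    new$until_ts <- as.POSIXct(NA, tz = "UTC")
    history <- rbind(history, new)
  }
  history
}

# Start from an empty history with the expected columns.
empty <- data.frame(
  car = character(), hp = numeric(),
  from_ts = .POSIXct(numeric(), tz = "UTC"),
  until_ts = .POSIXct(numeric(), tz = "UTC")
)

day_1 <- data.frame(car = c("Mazda RX4", "Datsun 710"), hp = c(110, 93))
day_2 <- data.frame(car = c("Mazda RX4", "Datsun 710"), hp = c(55, 93))

history <- update_history(empty, day_1, "2020-01-01 11:00:00")
history <- update_history(history, day_2, "2020-01-02 12:00:00")
print(history)
```

After the second update the history holds three rows: the unchanged Datsun 710 record, the original Mazda RX4 record with `until_ts` set to the second timestamp, and a new active Mazda RX4 record. Unchanged records are neither duplicated nor touched.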

A simple example is shown below:

# First we connect to our database.
# If this is, e.g., a PostgreSQL database already running on the machine, the connection
# can be made once a .pgpass file has been configured.
# For this example, we use an on-disk SQLite db to showcase.
conn <- get_connection(drv = RSQLite::SQLite())
# NOTE: Had the PostgreSQL DB been configured, we would not need to pass any args
# to get_connection()
# Our example data is mtcars with the row names converted to a column (car), keeping
# only the hp column of mtcars
example_data <- dplyr::transmute(mtcars, car = rownames(mtcars), hp)
# If the data does not already live on the remote, we must transfer it
example_data <- dplyr::copy_to(conn, example_data, overwrite = TRUE)
# In this example, we imagine that on day 1 (2020-01-01 11:00:00), the data known to us
# are the first 3 records of mtcars
data <- head(example_data, 3)
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-01 11:00:00"))
# We can access our data using the `get_table` function
print(get_table(conn, "mtcars"))
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.41.2 []
#>   car              hp
#>   <chr>         <dbl>
#> 1 Datsun 710       93
#> 2 Mazda RX4       110
#> 3 Mazda RX4 Wag   110

# And we can see the time-keeping if we set `include_slice_info = TRUE`
print(get_table(conn, "mtcars", include_slice_info = TRUE))
#> # Source:   SQL [3 x 5]
#> # Database: sqlite 3.41.2 []
#>   car              hp checksum                         from_ts          until_ts
#>   <chr>         <dbl> <chr>                            <chr>            <chr>   
#> 1 Datsun 710       93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 11:0… <NA>    
#> 2 Mazda RX4       110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 11:0… <NA>    
#> 3 Mazda RX4 Wag   110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 11:0… <NA>
# Let's say that the next day, our data set is now the first 5 of our example data
data <- head(example_data, 5)
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-02 12:00:00"))

# We again use the `get_table` function and see the latest available data
print(get_table(conn, "mtcars"))
#> # Source:   SQL [5 x 2]
#> # Database: sqlite 3.41.2 []
#>   car                  hp
#>   <chr>             <dbl>
#> 1 Datsun 710           93
#> 2 Mazda RX4           110
#> 3 Mazda RX4 Wag       110
#> 4 Hornet 4 Drive      110
#> 5 Hornet Sportabout   175

# And we can see the time-keeping if we set `include_slice_info = TRUE`
print(get_table(conn, "mtcars", include_slice_info = TRUE))
#> # Source:   SQL [5 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>   
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>    
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … <NA>    
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>    
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>    
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>

# Since our data is time-versioned, we can recover the data from the day before
print(get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00"))
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.41.2 []
#>   car              hp
#>   <chr>         <dbl>
#> 1 Datsun 710       93
#> 2 Mazda RX4       110
#> 3 Mazda RX4 Wag   110
# On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |>
  dplyr::mutate(hp = ifelse(dplyr::row_number() == 1, hp / 2, hp))
# We then store these data in the database using update_snapshot
update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-03 10:00:00"))
# We can again access our data using the `get_table` function and see the currently
# available data (with the changed hp value for Mazda RX4)
print(get_table(conn, "mtcars"))
#> # Source:   SQL [5 x 2]
#> # Database: sqlite 3.41.2 []
#>   car                  hp
#>   <chr>             <dbl>
#> 1 Datsun 710           93
#> 2 Mazda RX4 Wag       110
#> 3 Hornet 4 Drive      110
#> 4 Hornet Sportabout   175
#> 5 Mazda RX4            55


# When `slice_ts` is set to `NULL`, the full history of the table is returned
print(get_table(conn, "mtcars", slice_ts = NULL))
#> # Source:   table<`mtcars`> [6 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>   
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>    
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>    
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>    
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>    
#> 6 Mazda RX4            55 1232f78f7befb3a765b91176eaacdbb0 2020-01-03 … <NA>

# Setting include_slice_info = TRUE also returns checksum, from_ts and until_ts.
# This is most useful when viewing data from a specific point in time
print(get_table(conn, "mtcars", slice_ts = "2020-01-03 06:30:00",
                include_slice_info = TRUE))
#> # Source:   SQL [5 x 5]
#> # Database: sqlite 3.41.2 []
#>   car                  hp checksum                         from_ts      until_ts
#>   <chr>             <dbl> <chr>                            <chr>        <chr>   
#> 1 Datsun 710           93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>    
#> 2 Mazda RX4           110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag       110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>    
#> 4 Hornet 4 Drive      110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>    
#> 5 Hornet Sportabout   175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>