The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Incremental refresh of star databases

Introduction

Once we have star databases built with the data available at the moment, periodically we may obtain additional data, with the same structure as the initial data but from a later time or from another place. Sometimes the new data also contains instances which are already included in the star database to operate on them.

Suppose we need to jointly analyse all the data available at the moment: We must include them in the star database. One possibility is to integrate all available data into a flat table and build the star database again. Another possibility is to use the incremental refresh mechanism described in this document.

This document shows by means of an example the possibilities offered by the package in this context. First, the starting data sets are presented. The next section shows how to generate the refresh structures. The following section shows how to perform the incremental refresh. Finally, we present how to make changes to the transformation functions and add these changes for future refresh operations. Finish with the conclusions.

Starting data sets

The starting data set is the content in the variable mrs_db, obtained in the vignette titled Obtaining and transforming flat tables, vignette("v05-flat-table-op"). It contains the constellation, formed by two star databases. Next we get their names.

library(rolap)

mrs_db |> 
  get_fact_names()
#> [1] "mrs_age"   "mrs_cause"

The code to generate the constellation from the initial data is available in the vignette. Below is a graphic representation of the tables that make it up.

db_dm <- mrs_db |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

From the original data source (the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System dataset), suppose we obtain a set of data that we want to integrate with the previous data to analyse them together.

We have stored it in the package, in a file with the same format as the original file which only contains a small portion of the original data. We have made sure that there is data that was already included in the data set considered to obtain the content of the mrs_db constellation and also new data. It is accessed below.

file <-
  system.file(
    "extdata/mrs",
    "mrs_122_us_cities_1962_2016_new.csv",
    package = "rolap"
  )

mrs_ft_new <-
  read_flat_table_file(name = 'mrs new', file)

Using the read_flat_table_file() function we read a table stored in a text file and create a flat_table object with a name. Below are the first records of the table. We access the table using the get_table() function for the object of the flat_table class.

ft <- mrs_ft_new |> 
  get_table()

pander::pandoc.table(head(ft), split.table = Inf)
Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 48 12/01/1962 2 NY Buffalo 5 148 10 1 11 31 95
1963 3 01/19/1963 4 IA Des Moines 2 55 1 3 6 17 28
1963 6 02/09/1963 8 CO Pueblo 0 21 0 0 1 4 16
1963 7 02/16/1963 7 TX El Paso 0 39 10 2 3 9 15
1963 25 06/22/1963 1 MA Springfield 5 35 0 0 1 9 25
1964 10 03/07/1964 1 MA Cambridge 3 31 0 0 3 10 18

Generation of star database refresh structures

From the table that we have read, with the information contained in the constellation variable, we can automatically generate refresh structures for the star databases that make up the constellation.

Refresh structures

From the data read in the form of a flat_table object, using the update_according_to() function, we generate a refresh structure of the star database of the mrs_db constellation whose name is indicated.

mrs_db_age_refresh <- mrs_ft_new |>
  update_according_to(mrs_db, star = "mrs_age")

The modification structure contains a star database generated from the read data. We can see its graphical representation below.

db_dm <- mrs_db_age_refresh |>
  get_star_database() |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

In exactly the same way, we generate the refresh structure for the other star database.

mrs_db_cause_refresh <- mrs_ft_new |>
  update_according_to(mrs_db, star = "mrs_cause")

Below is its graphical representation.

db_dm <- mrs_db_cause_refresh |>
  get_star_database() |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

Transformation code

The transformation code to generate the variable mrs_db can be obtained in the vignette titled Obtaining and transforming flat tables, vignette("v05-flat-table-op").

It is also included in the refresh structure for each of the star databases obtained. It can be obtained using the get_transformation_code() and get_transformation_file() functions, in the form of a vector of strings and a file respectively.

mrs_db_age_refresh |>
  get_transformation_code()
#>  [1] "transform_instance_table <- function(instance_df, lookup_ft, definition_fun, star_sch) {"                                                                                                                                    
#>  [2] "  ft <- "                                                                                                                                                                                                                    
#>  [3] "    flat_table("                                                                                                                                                                                                             
#>  [4] "      name = 'mrs',"                                                                                                                                                                                                         
#>  [5] "      instances = instance_df,"                                                                                                                                                                                              
#>  [6] "      unknown_value = 'Not available'"                                                                                                                                                                                       
#>  [7] "    ) |>"                                                                                                                                                                                                                    
#>  [8] "    transform_to_measure("                                                                                                                                                                                                   
#>  [9] "      attributes = c('Pneumonia and Influenza Deaths', 'All Deaths', '<1 year (all cause deaths)', '1-24 years (all cause deaths)', '25-44 years', '45-64 years (all cause deaths)', '65+ years (all cause deaths)'),"       
#> [10] "      k_sep = NULL,"                                                                                                                                                                                                         
#> [11] "      decimal_sep = NULL"                                                                                                                                                                                                    
#> [12] "    ) |>"                                                                                                                                                                                                                    
#> [13] "    transform_attribute_format("                                                                                                                                                                                             
#> [14] "      attributes = 'WEEK',"                                                                                                                                                                                                  
#> [15] "      width = 2,"                                                                                                                                                                                                            
#> [16] "      decimal_places = 0,"                                                                                                                                                                                                   
#> [17] "      k_sep = ',',"                                                                                                                                                                                                          
#> [18] "      decimal_sep = '.',"                                                                                                                                                                                                    
#> [19] "      space_filling = TRUE"                                                                                                                                                                                                  
#> [20] "    ) |>"                                                                                                                                                                                                                    
#> [21] "    replace_empty_values("                                                                                                                                                                                                   
#> [22] "      attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),"                                                                                                                                        
#> [23] "      empty_values = NULL"                                                                                                                                                                                                   
#> [24] "    ) |>"                                                                                                                                                                                                                    
#> [25] "    add_custom_column("                                                                                                                                                                                                      
#> [26] "      name = 'city_state',"                                                                                                                                                                                                  
#> [27] "      definition = definition_fun"                                                                                                                                                                                           
#> [28] "    ) |>"                                                                                                                                                                                                                    
#> [29] "    replace_attribute_values("                                                                                                                                                                                               
#> [30] "      attributes = c('City', 'city_state'),"                                                                                                                                                                                 
#> [31] "      old = c('Wilimington', 'Wilimington DE'),"                                                                                                                                                                             
#> [32] "      new = c('Wilmington', 'Wilmington DE')"                                                                                                                                                                                
#> [33] "    ) |>"                                                                                                                                                                                                                    
#> [34] "    join_lookup_table("                                                                                                                                                                                                      
#> [35] "      fk_attributes = 'city_state',"                                                                                                                                                                                         
#> [36] "      lookup = lookup_ft"                                                                                                                                                                                                    
#> [37] "    ) |>"                                                                                                                                                                                                                    
#> [38] "    select_attributes("                                                                                                                                                                                                      
#> [39] "      attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City', 'city_state', 'status', 'pop', 'lat', 'long')"                                                                                           
#> [40] "    ) |>"                                                                                                                                                                                                                    
#> [41] "    separate_measures("                                                                                                                                                                                                      
#> [42] "      measures = list(c('Pneumonia and Influenza Deaths', 'All Deaths'), c('<1 year (all cause deaths)', '1-24 years (all cause deaths)', '25-44 years', '45-64 years (all cause deaths)', '65+ years (all cause deaths)')),"
#> [43] "      names = c('mrs_cause', 'mrs_age'),"                                                                                                                                                                                    
#> [44] "      na_rm = TRUE"                                                                                                                                                                                                          
#> [45] "    ) |>"                                                                                                                                                                                                                    
#> [46] "    magrittr::extract2('mrs_age') |>"                                                                                                                                                                                        
#> [47] "    transform_to_values("                                                                                                                                                                                                    
#> [48] "      attribute = 'age',"                                                                                                                                                                                                    
#> [49] "      measure = 'all_deaths',"                                                                                                                                                                                               
#> [50] "      id_reverse = NULL,"                                                                                                                                                                                                    
#> [51] "      na_rm = TRUE"                                                                                                                                                                                                          
#> [52] "    ) |>"                                                                                                                                                                                                                    
#> [53] "    snake_case("                                                                                                                                                                                                             
#> [54] "    ) |>"                                                                                                                                                                                                                    
#> [55] "    replace_string("                                                                                                                                                                                                         
#> [56] "      attributes = 'age',"                                                                                                                                                                                                   
#> [57] "      string = ' (all cause deaths)',"                                                                                                                                                                                       
#> [58] "      replacement = NULL"                                                                                                                                                                                                    
#> [59] "    ) |>"                                                                                                                                                                                                                    
#> [60] "    as_star_database("                                                                                                                                                                                                       
#> [61] "      schema = star_sch"                                                                                                                                                                                                     
#> [62] "    )"                                                                                                                                                                                                                       
#> [63] ""                                                                                                                                                                                                                            
#> [64] "  ft"                                                                                                                                                                                                                        
#> [65] "}"                                                                                                                                                                                                                           
#> [66] ""                                                                                                                                                                                                                            
#> [67] "ft <- transform_instance_table(instance_df, lookup_ft, definition_fun, star_sch)"

If additional transformations are needed, it can be modified to our convenience. In the same way we consult it for the other star database, although we don’t show the result here.

code <- mrs_db_cause_refresh |>
  get_transformation_code()

We can see that the selection of the star database is done using the magrittr::extract2() function, which was not used in the original vignette to select the elements of a list. It has been included here to preserve the pipe syntax throughout the transformation.

This code only needs to be used if we want to perform additional transformations. If the previously defined transformations are sufficient, they have already been automatically applied to the data in order to integrate them.

Instances included in the structure

Seeing the new instances that are going to be added to the dimensions can help us determine whether or not we need to modify the transformation code. We can do this using the get_new_dimension_instances() function, as shown below.

mrs_db_age_refresh |>
  get_new_dimension_instances()
#> $when
#> # A tibble: 110 × 3
#>    year  week  week_ending_date
#>    <chr> <chr> <chr>           
#>  1 1962  "48"  12/01/1962      
#>  2 1963  " 3"  01/19/1963      
#>  3 1963  " 7"  02/16/1963      
#>  4 1964  "10"  03/07/1964      
#>  5 1964  "12"  03/21/1964      
#>  6 1964  "20"  05/16/1964      
#>  7 1966  "26"  07/02/1966      
#>  8 1966  "47"  11/26/1966      
#>  9 1967  "10"  03/11/1967      
#> 10 1967  "22"  06/03/1967      
#> # ℹ 100 more rows
#> 
#> $where
#> # A tibble: 2 × 8
#>   region state city      city_state   status        pop         lat   long    
#>   <chr>  <chr> <chr>     <chr>        <chr>         <chr>       <chr> <chr>   
#> 1 1      MA    Boston    Boston MA    state capital "  567,759" 42.3  " -71.0"
#> 2 5      MD    Baltimore Baltimore MD non-capital   "  602,658" 39.3  " -76.6"

If necessary, starting from the code that we have obtained, we can perform transformations on the starting flat_table structure or on the star_database object obtained.

We can see that there are two new cities that were not included in the where dimension of the initial star databases: Baltimore and Boston.

Incremental refresh

The most common thing is that refresh operations only include new instances in fact tables, but it may be the case that repeated instances appear: They may have different values in the measures, but the same values in the dimension foreign keys.

Existing fact instances

To perform the incremental refresh operation, we must determine what happens to the fact table instances that were already included in the original star database. We can query existing fact instances using the get_existing_fact_instances() function, as can be seen below for each of the star databases.

mrs_db_age_refresh |>
  get_existing_fact_instances()
#> # A tibble: 200 × 14
#>    year  week  week_ending_date region state city  city_state status pop   lat  
#>    <chr> <chr> <chr>            <chr>  <chr> <chr> <chr>      <chr>  <chr> <chr>
#>  1 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  2 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  3 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  4 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  5 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  6 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#>  7 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#>  8 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#>  9 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#> 10 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#> # ℹ 190 more rows
#> # ℹ 4 more variables: long <chr>, age <chr>, all_deaths <int>, nrow_agg <int>

mrs_db_cause_refresh |>
  get_existing_fact_instances()
#> # A tibble: 40 × 14
#>    year  week  week_ending_date region state city  city_state status pop   lat  
#>    <chr> <chr> <chr>            <chr>  <chr> <chr> <chr>      <chr>  <chr> <chr>
#>  1 1963  "25"  06/22/1963       1      MA    Spri… Springfie… non-c… "  1… 42.1 
#>  2 1965  " 5"  02/06/1965       4      KS    Wich… Wichita KS non-c… "  3… 37.7 
#>  3 1966  "35"  09/03/1966       9      CA    Long… Long Beac… non-c… "  4… 33.8 
#>  4 1966  "41"  10/15/1966       9      CA    San … San Jose … non-c… "  8… 37.3 
#>  5 1968  "29"  07/20/1968       3      OH    Youn… Youngstow… non-c… "   … 41.1 
#>  6 1971  "28"  07/17/1971       3      IL    Chic… Chicago IL non-c… "2,8… 41.8 
#>  7 1973  " 6"  02/10/1973       2      NY    Syra… Syracuse … non-c… "  1… 43.0 
#>  8 1974  "49"  12/07/1974       5      FL    Tampa Tampa FL   non-c… "  3… 28.0 
#>  9 1976  "48"  12/04/1976       9      CA    Fres… Fresno CA  non-c… "  4… 36.8 
#> 10 1977  "21"  05/28/1977       6      TN    Memp… Memphis TN non-c… "  6… 35.1 
#> # ℹ 30 more rows
#> # ℹ 4 more variables: long <chr>, pneumonia_and_influenza_deaths <int>,
#> #   all_deaths <int>, nrow_agg <int>

In each case it may be interesting to perform a different operation on these previously existing instances. By default, what is done is to ignore them but we can indicate that they are grouped, replaced or deleted, using the existing_instances parameter.

Incremental refresh operation

To better appreciate the update made, we are going to perform the same query before and after the incremental refresh operation: We obtain the design tables, along with the number of instances of each one. For the where dimension table, we get the names of the first cities to check that Baltimore and Boston were not included.

l_db <- mrs_db |>
  as_tibble_list()

names <- names(l_db)
for (i in seq_along(l_db)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 1966 rows
#> name: where, 120 rows
#> name: who, 5 rows
#> name: mrs_cause, 3342 rows
#> name: mrs_age, 16565 rows

head(sort(l_db[['where']]$city), 15)
#>  [1] "Akron"       "Albany"      "Albuquerque" "Allentown"   "Atlanta"    
#>  [6] "Austin"      "Baton Rouge" "Berkeley"    "Birmingham"  "Boise"      
#> [11] "Bridgeport"  "Buffalo"     "Cambridge"   "Camden"      "Canton"

Next, we perform the incremental refresh for each of the star databases in the original constellation. For one of the star databases, through the existing_instances parameter, it has been indicated that the instances of previously existing facts are grouped with the new ones. For the other design, the default option is considered: Ignore fact table previously existing instances.

We will make a copy of the constellation to perform an operation later.

mrs_db_seg <- mrs_db

mrs_db <- mrs_db |>
  incremental_refresh(mrs_db_age_refresh) |>
  incremental_refresh(mrs_db_cause_refresh, existing_instances = "group")

Finally, we consult the same data as before again.

l_db <- mrs_db |>
  as_tibble_list()

names <- names(l_db)
for (i in seq_along(l_db)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows
#> name: mrs_cause, 3677 rows
#> name: mrs_age, 18228 rows

head(sort(l_db[['where']]$city), 15)
#>  [1] "Akron"       "Albany"      "Albuquerque" "Allentown"   "Atlanta"    
#>  [6] "Austin"      "Baltimore"   "Baton Rouge" "Berkeley"    "Birmingham" 
#> [11] "Boise"       "Boston"      "Bridgeport"  "Buffalo"     "Cambridge"

We observe how the number of instances has increased in the facts and in the dimensions and the where dimension already contains the names of the cities that appeared as new in the update operation.

Changes in transformation operations

In this case no problem has been detected in the new dimension data. But let’s assume that we wanted to make some change to the modify operations before doing the update to show how it would be done.

New transformation function

First of all, we obtain the transformation function. In this case we have stored it in a temporary file using the get_transformation_file() function and we have copied it here.

Starting from the tibble of the initial flat_table, the star_schema to define the star_database, the lookup tables and field definition functions that we have used, we can reproduce all the transformations and adapt them to the new situations that hypothetically have arisen.

transform_instance_table <-
  function(instance_df,
           lookup_ft,
           definition_fun,
           star_sch) {
    ft <-
      flat_table(name = 'mrs',
                 instances = instance_df,
                 unknown_value = 'Not available') |>
      transform_to_measure(
        attributes = c(
          'Pneumonia and Influenza Deaths',
          'All Deaths',
          '<1 year (all cause deaths)',
          '1-24 years (all cause deaths)',
          '25-44 years',
          '45-64 years (all cause deaths)',
          '65+ years (all cause deaths)'
        ),
        k_sep = NULL,
        decimal_sep = NULL
      ) |>
      transform_attribute_format(
        attributes = 'WEEK',
        width = 2,
        decimal_places = 0,
        k_sep = ',',
        decimal_sep = '.'
      ) |>
      replace_empty_values(
        attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
        empty_values = NULL
      ) |>
      add_custom_column(name = 'city_state',
                        definition = definition_fun) |>
      replace_attribute_values(
        attributes = c('City', 'city_state'),
        old = c('Wilimington', 'Wilimington DE'),
        new = c('Wilmington', 'Wilmington DE')
      ) |>
      join_lookup_table(fk_attributes = 'city_state',
                        lookup = lookup_ft) |>
      select_attributes(
        attributes = c(
          'Year',
          'WEEK',
          'Week Ending Date',
          'REGION',
          'State',
          'City',
          'city_state',
          'status',
          'pop',
          'lat',
          'long'
        )
      ) |>
      separate_measures(
        measures = list(
          c('Pneumonia and Influenza Deaths', 'All Deaths'),
          c(
            '<1 year (all cause deaths)',
            '1-24 years (all cause deaths)',
            '25-44 years',
            '45-64 years (all cause deaths)',
            '65+ years (all cause deaths)'
          )
        ),
        names = c('mrs_cause', 'mrs_age'),
        na_rm = TRUE
      ) |>
      magrittr::extract2('mrs_cause') |>
      snake_case() |>
      as_star_database(schema = star_sch)
    
    ft
  }

We can get these parameters from the initial definition (in the vignette titled Obtaining and transforming flat tables, vignette("v05-flat-table-op")) or by consulting the refresh structure with the functions available for this purpose.

instance_df <- mrs_ft_new |> 
  get_table()
  
lookup_list <- mrs_db_cause_refresh |> 
  get_lookup_tables()

star_sch <- mrs_db_cause_refresh |> 
  get_star_schema()

# function to define a derived column
city_state <- function(table) {
  paste0(table$City, ' ', table$State)
}

mrs_db_cause_transf <-
  transform_instance_table(
    instance_df = instance_df,
    lookup_ft = lookup_list[['us_cities']],
    definition_fun = city_state,
    star_sch = star_sch
  )

We perform the same operation for the other star database.

transform_instance_table_2 <-
  function(instance_df,
           lookup_ft,
           definition_fun,
           star_sch) {
    ft <-
      flat_table(name = 'mrs',
                 instances = instance_df,
                 unknown_value = 'Not available') |>
      transform_to_measure(
        attributes = c(
          'Pneumonia and Influenza Deaths',
          'All Deaths',
          '<1 year (all cause deaths)',
          '1-24 years (all cause deaths)',
          '25-44 years',
          '45-64 years (all cause deaths)',
          '65+ years (all cause deaths)'
        ),
        k_sep = NULL,
        decimal_sep = NULL
      ) |>
      transform_attribute_format(
        attributes = 'WEEK',
        width = 2,
        decimal_places = 0,
        k_sep = ',',
        decimal_sep = '.'
      ) |>
      replace_empty_values(
        attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
        empty_values = NULL
      ) |>
      add_custom_column(name = 'city_state',
                        definition = definition_fun) |>
      replace_attribute_values(
        attributes = c('City', 'city_state'),
        old = c('Wilimington', 'Wilimington DE'),
        new = c('Wilmington', 'Wilmington DE')
      ) |>
      join_lookup_table(fk_attributes = 'city_state',
                        lookup = lookup_ft) |>
      select_attributes(
        attributes = c(
          'Year',
          'WEEK',
          'Week Ending Date',
          'REGION',
          'State',
          'City',
          'city_state',
          'status',
          'pop',
          'lat',
          'long'
        )
      ) |>
      separate_measures(
        measures = list(
          c('Pneumonia and Influenza Deaths', 'All Deaths'),
          c(
            '<1 year (all cause deaths)',
            '1-24 years (all cause deaths)',
            '25-44 years',
            '45-64 years (all cause deaths)',
            '65+ years (all cause deaths)'
          )
        ),
        names = c('mrs_cause', 'mrs_age'),
        na_rm = TRUE
      ) |>
      magrittr::extract2('mrs_age') |>
      transform_to_values(
        attribute = 'age',
        measure = 'all_deaths',
        id_reverse = NULL,
        na_rm = TRUE
      ) |>
      snake_case() |>
      replace_string(attributes = 'age',
                     string = ' (all cause deaths)',
                     replacement = NULL) |>
      as_star_database(schema = star_sch)
    
    ft
  }

star_sch <- mrs_db_age_refresh |>
  get_star_schema()

mrs_db_age_transf <-
  transform_instance_table_2(
    instance_df = instance_df,
    lookup_ft = lookup_list[['us_cities']],
    definition_fun = city_state,
    star_sch = star_sch
  )

The result of these functions, in each case, is a star_database containing the new transformations.

Refresh structures

Now we have to create the refresh structure for each star database, considering the original constellation (we have made a copy before updating it in mrs_db_seg). To create the refresh structures, we now indicate that the operations are considered from the star database of the sdb_operations parameter.

mrs_db_cause_transf_refresh <- mrs_ft_new |>
  update_according_to(mrs_db_seg, star = "mrs_cause", sdb_operations = mrs_db_cause_transf)

mrs_db_age_transf_refresh <- mrs_ft_new |>
  update_according_to(mrs_db_seg, star = "mrs_age", sdb_operations = mrs_db_age_transf)

That is, with the sdb_operations parameter, it takes the data to be refreshed from the constellation, but the modification operations are taken from this star database (where we have defined the new operations that have been considered necessary).

Incremental refresh operation

To show the result of the incremental refresh operations, we are going to repeat the process carried out previously: We show the name and number of instances of the tables before and after the transformation, as well as the names of the first cities of the where dimension.

l_db <- mrs_db_seg |>
  as_tibble_list()

names <- names(l_db)
for (i in seq_along(l_db)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 1966 rows
#> name: where, 120 rows
#> name: who, 5 rows
#> name: mrs_cause, 3342 rows
#> name: mrs_age, 16565 rows

head(sort(l_db[['where']]$city), 15)
#>  [1] "Akron"       "Albany"      "Albuquerque" "Allentown"   "Atlanta"    
#>  [6] "Austin"      "Baton Rouge" "Berkeley"    "Birmingham"  "Boise"      
#> [11] "Bridgeport"  "Buffalo"     "Cambridge"   "Camden"      "Canton"

To perform the incremental refresh, as the refresh structure has new transformation operations, we can indicate that these become the transformation operations of the constellation. Thus, in future updates, they will be taken into account automatically. This is indicated by the boolean parameter replace_transformations.

mrs_db_seg <- mrs_db_seg |>
  incremental_refresh(mrs_db_age_transf_refresh, replace_transformations = TRUE) |>
  incremental_refresh(
    mrs_db_cause_transf_refresh,
    existing_instances = "group",
    replace_transformations = TRUE
  )

Finally, we consult the same data as before again.

l_db <- mrs_db_seg |>
  as_tibble_list()

names <- names(l_db)
for (i in seq_along(l_db)){
  cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows
#> name: mrs_cause, 3677 rows
#> name: mrs_age, 18228 rows

head(sort(l_db[['where']]$city), 15)
#>  [1] "Akron"       "Albany"      "Albuquerque" "Allentown"   "Atlanta"    
#>  [6] "Austin"      "Baltimore"   "Baton Rouge" "Berkeley"    "Birmingham" 
#> [11] "Boise"       "Boston"      "Bridgeport"  "Buffalo"     "Cambridge"

It can be seen that we have obtained the same results as before.

Conclusions

This document shows the functions supporting incremental refreshing of star databases offered by the rolap package.

Starting from a flat table, the transformation operations through which a star database has been defined are automatically applied. These operations can be obtained and enriched to adapt to new situations that arise. Once the data has been conveniently updated, it can be integrated with the original star database indicating the operation that is considered most appropriate to integrate the data previously existing in the fact table.

This incremental refresh mechanism, although not trivial, facilitates this generally difficult process. It lays the foundations to continue simplifying it and offering new functionalities.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.