The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Intro


What it does

Motivation

Administrative health data data are often stored on database with strict security measures which may disable permission to write temporary tables. Writing queries without being able to cache intermediate results is challenging, especially when the data is too large to be downloaded from database into R (i.e., local memory) without some filtering process. The lack of regular expression support in SQL makes writing queries for searching ICD codes even more painful.

The ‘dbplyr’ package not only translates ‘dplyr’ syntax to SQL but also make nesting sub-queries easy with the pipe operator. With ‘dbplyr’, complex query script can be built from nested sub-queries instead of writing temporary tables at a cost of performance (sub-queries have to be ran again and again if it was called multiple times). Nevertheless, building complex queries with ‘dbplyr’ is a useful approach to push computation onto the SQL server that is often much more efficient and resourceful than the user’s (virtual) machine.

However, ‘dbplyr’ SQL translations failed at some places for the intended use case and needs fixing. The motivation of this package is to ease the pain from these issues for analyst who regularly works with healthcare database.


Installation

Simply run:

devtools::install_github("kevinHzq/healthdb", build_vignettes = TRUE)

We will also need the following packages for this demo.

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
library(dbplyr)
#> Warning: package 'dbplyr' was built under R version 4.3.3
library(lubridate)
#> Warning: package 'lubridate' was built under R version 4.3.2
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3
library(purrr)
#> Warning: package 'purrr' was built under R version 4.3.2
library(healthdb)

Intended use case

Consider the case definition of substance use disorder (SUD) from British Columbia Centre for Disease Control’s Chronic Disease Dashboard,

One or more hospitalization with a substance use disorder diagnostic code, OR Two or more physician visits with a substance use disorder diagnostic code within one year.

We are going to implement this definition. First, let’s make a demo data sets for the two sources:

  1. Physician claims with multiple columns of ICD-9 diagnostic codes

    # make_test_dat() makes either a toy data.frame or database table in memory with known number of rows that satisfy the query we will show later
    claim_db <- make_test_dat(vals_kept = c("303", "304", "305", "291", "292", glue("30{30:59}"), glue("29{10:29}"), noise_val = c("999", "111")), type = "database")
    
    # this is a database table
    # note that in-memory SQLite database stores dates as numbers
    claim_db %>% head()
    #> # Source:   SQL [6 x 6]
    #> # Database: sqlite 3.45.2 [:memory:]
    #>     uid clnt_id dates diagx diagx_1 diagx_2
    #>   <int>   <int> <dbl> <chr> <chr>   <chr>  
    #> 1    51       1 16660 999   999     999    
    #> 2    24       1 17464 3048  3040    <NA>   
    #> 3    14       2 17640 3041  3047    <NA>   
    #> 4    37       3 16948 2913  3035    999    
    #> 5    91       4 16712 999   999     999    
    #> 6    82       4 16760 999   999     <NA>
  2. Hospitalization with ICD-10 codes

    hosp_df <- make_test_dat(vals_kept = c(glue("F{10:19}"), glue("F{100:199}"), noise_val = "999"), type = "data.frame")
    
    # this is a local data.frame/tibble
    hosp_df %>% head()
    #>   uid clnt_id      dates diagx diagx_1 diagx_2
    #> 1  43       1 2015-08-13  F156    F144    F168
    #> 2  16       1 2017-10-25   F12    F132    <NA>
    #> 3   6       2 2018-04-19  F133    F128    <NA>
    #> 4  29       3 2016-05-27  F130    F164     999
    #> 5  83       4 2015-10-04   999     999    <NA>
    #> 6  74       4 2015-11-21   999    <NA>     999

Interactive functions

Let’s focus on the physician claims. Extracting clients with at least two records within a year is not difficult, and involves only a few steps. The codes could look like the following using dplyr, however, it does not work because: 1. SQL does not support multiple patterns in one LIKE operation, 2. dbply currently have issue with translating n_distinct.

## not run
claim_db %>%
  # identify the target codes
  filter(if_any(starts_with("diagx_"), ~ str_like(., c("291%", "292%", "303%", "304%", "305%")))) %>%
  # each clnt has at least 2 records on different dates
  group_by(clnt_id) %>%
  # the n_distinct step is mainly for reducing computation in the next step
  filter(n_distinct(dates) >= 2) %>%
  # any two dates within one year?
  filter((max(dates) - min(dates)) <= 365)
## end

Here’s how you could use healthdb to achieve these steps:

  1. Identify rows contains the target codes

    result1 <- claim_db %>%
      identify_row(
    vars = starts_with("diagx_"),
    match = "start",
    vals = c(291:292, 303:305)
      )
    #> 
    #> Identify records with condition(s): 
    #>  - where at least one of the diagx_1, diagx_2 column(s) in each record 
    #>    - contains a value satisfied SQL LIKE pattern: 291% OR 292% OR 303% OR 304% OR 305% 
    #> 
    #> To see the final query generated by 'dbplyr', use dplyr::show_query() on the output. 
    #> To extract the SQL string, use dbplyr::remote_query().
  2. Bonus: remove clients with exclusion codes

    This step is not in the substance use disorder definition, but other disease definitions often require exclusion of some ICDs that contradicts the ones of interest. Let’s say we want to remove clients with code “111” here.

    We first identify “111” from the source, then exclude clients in the output from the previous step’s result. exclude() take either a data set (via the excl argument) or expression (condition argument) as input. For the former, it performs an anti join matching on the by argument (see dplyr::join_by()). For the latter, it is the opposite of filter, i.e., filter(!(some_expression)).

    result2 <- result1 %>%
      exclude(
    excl = identify_row(claim_db, starts_with("diagx_"), "in", "111"),
    by = "clnt_id"
      )
    #> 
    #> Identify records with condition(s): 
    #>  - where at least one of the diagx_1, diagx_2 column(s) in each record 
    #>    - contains a value exactly matched values in set: "111" 
    #> 
    #> To see the final query generated by 'dbplyr', use dplyr::show_query() on the output. 
    #> To extract the SQL string, use dbplyr::remote_query().
    #> 
    #> Exclude records in `data` through anti_join with `excl` matching on (by argument): "clnt_id"
  3. Restrict the number of records per client

    result3 <- result2 %>% restrict_n(
      clnt_id = clnt_id,
      n_per_clnt = 2,
      count_by = dates,
      # here we use filter mode to remove records that failed the restriction
      mode = "filter"
    )
    #> 
    #> Apply restriction that each client must have at least 2 records with distinct dates. Clients/groups whichdid not meetthe condition were excluded.
  4. Restrict the temporal pattern of diagnoses

    restrict_date() also supports more complicated patterns like having n diagnoses at least i days apart within j years, but the “apart” feature requires relatively expensive computation and implemented for local data.frames only. Note that when SQL interpret order of dates, the result could be not deterministic if there were duplicate dates within client. Therefore, a unique row id colume (uid) has to be supplied to get consistent result.

    result4 <- result3 %>% restrict_date(
      clnt_id = clnt_id,
      date_var = dates,
      n = 2,
      within = 365,
      uid = uid,
      # here we use flag mode to flag records that met the restriction instead of removing those
      mode = "flag"
    )
    #> 
    #> Apply restriction that each client must have 2 records that were  within 365 days. Records that met the condition were flagged.
  5. Fetch variables from other tables by matching common keys

    Up to this point, the result is only a query and have not been downloaded. Hopefully, it has been shrunken to a manageable size for collection.

    # Class of result4
    class(result4)
    #> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
    #> [4] "tbl_lazy"             "tbl"
    
    # execute query and download the result
    result_df <- result4 %>% collect()
    
    # Number of rows in source
    nrow(claim_db %>% collect())
    #> [1] 100
    
    # Number of rows in the current result
    nrow(result_df)
    #> [1] 35

    Our data now only contains diagnoses that are probably not enough for further analyses. Let’s say we want to gather client demographics such as age and sex from other sources. This certainly can be done with multiple dplyr::left_join() calls. Here we provide the fetch_var() function to make the codes more concise.

    # make two look up tables
    age_tab <- data.frame(
      clnt_id = 1:50,
      age = sample(1:90, 50),
      sex = sample(c("F", "M"), 50, replace = TRUE)
    )
    address_tab <- data.frame(
      clnt_id = rep(1:50, 5), year = rep(2016:2020, each = 50),
      area_code = sample(0:200, 50, replace = TRUE)
    )
    
    # get year from dates for matching
    
    result_df <- result_df %>% mutate(year = lubridate::year(as.Date(dates, origin = "1970-01-01")))
    
    # note that keys must be present in all tables
    result_df %>%
      fetch_var(
    keys = c(clnt_id, year),
    linkage = list(
      # |clnt_id means matching on clnt_id only
      age_tab ~ c(age, sex) | clnt_id,
      address_tab ~ area_code
    )
      ) %>%
      head()
    #> 
    #> The data has 35 rows. After joining, variable(s) from age_tab has 35 rows, and variable(s) from address_tab has 35 rows
    #> # A tibble: 6 × 12
    #>     uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n flag_restrict_date
    #>   <int>   <int> <dbl> <chr> <chr>   <chr>             <int>              <int>
    #> 1    35       7 16810 2913  2923    999                   1                  1
    #> 2    22       7 16897 3056  2917    999                   1                  1
    #> 3    47       7 17096 3033  3051    3036                  1                  1
    #> 4    10       7 17250 2923  3057    999                   1                  0
    #> 5    41      10 16954 3033  305     2929                  1                  0
    #> 6    44      10 17788 999   304     292                   1                  0
    #> # ℹ 4 more variables: year <dbl>, age <int>, sex <chr>, area_code <int>

Call-building functions

To complete the definition, we need to repeat the process shown above with hospitalization data. Some studies may use more than a handful of data sources to define their sample. We packed steps 1-4 in one function define_case(), and provide tools to perform batch execution with different data and parameters to meet those needs.

# build the full definition of SUD
sud_def <- build_def(
  # name of definition
  def_lab = "SUD",
  # place holder names for sources
  src_labs = c("claim", "hosp"),
  def_fn = define_case, # you could alter it and supply your own function
  # below are argumets of define_case
  fn_args = list(
    # if length = 1, the single element will be use for every source
    vars = list(starts_with("diagx_")),
    match = "start", # match ICD starts with vals
    vals = list(c(291:292, 303:305), glue("F{10:19}")),
    clnt_id = clnt_id,
    n_per_clnt = c(2, 1),
    date_var = dates,
    within = c(365, NULL),
    uid = uid,
    mode = "flag"
  )
)

sud_def
#> # A tibble: 2 × 5
#>   def_lab src_labs def_fn      fn_args          fn_call   
#>   <chr>   <chr>    <chr>       <list>           <list>    
#> 1 SUD     claim    define_case <named list [9]> <language>
#> 2 SUD     hosp     define_case <named list [9]> <language>

Let’s look inside the fn_call list column. Two calls of define_case() have been made with different parameters. The data arguments are left empty on purpose for re-usability. For example, you may want to repeat the analysis with data from different regions or study periods.

sud_def$fn_call
#> [[1]]
#> define_case(data = , vars = starts_with("diagx_"), match = "start", 
#>     vals = c(291:292, 303:305), clnt_id = clnt_id, n_per_clnt = 2, 
#>     date_var = dates, within = 365, uid = uid, mode = "flag")
#> 
#> [[2]]
#> define_case(data = , vars = starts_with("diagx_"), match = "start", 
#>     vals = glue("F{10:19}"), clnt_id = clnt_id, n_per_clnt = 1, 
#>     date_var = dates, within = NULL, uid = uid, mode = "flag")

Executing the definition is simple. If verbose option is not turned off by options(healthdb.verbose = FALSE), the output message will explain what has been done. You could append multiple build_def() outputs together and execute them all at once. Definition and source labels will be added to the result to identify outputs from different calls.

# execute the definition
result_list <- sud_def %>%
  execute_def(with_data = list(
    claim = claim_db,
    hosp = hosp_df
  ))
#> 
#> Processing source: claim_db
#> --------------Inclusion step--------------
#> 
#> Identify records with condition(s): 
#>  - where at least one of the diagx_1, diagx_2 column(s) in each record 
#>    - contains a value satisfied SQL LIKE pattern: 291% OR 292% OR 303% OR 304% OR 305% 
#> 
#> To see the final query generated by 'dbplyr', use dplyr::show_query() on the output. 
#> To extract the SQL string, use dbplyr::remote_query().
#> 
#> --------------No. rows restriction--------------
#> 
#> Apply restriction that each client must have at least 2 records with distinct dates. Clients/groups whichmetthe condition were flagged. 
#> 
#> --------------Time span restriction--------------
#> 
#> Apply restriction that each client must have 2 records that were  within 365 days. Records that met the condition were flagged. 
#> 
#> -------------- Output all records--------------
#> 
#> Processing source: hosp_df
#> --------------Inclusion step--------------
#> 
#> Identify records with condition(s): 
#>  - where at least one of the diagx_1, diagx_2 column(s) in each record 
#>    - contains a value satisfied regular expression: ^F10|^F11|^F12|^F13|^F14|^F15|^F16|^F17|^F18|^F19 
#> 
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other columns): 
#>  999 F100 F101 F105 F109 F112 F113 F115 F116 F119 F120 F121 F122 F125 F128 F130 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> F131 F132 F134 F140 F142 F144 F145 F148 F150 F157 F158 F159 F161 F162 F164 F167 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> F168 F170 F171 F174 F176 F178 F180 F184 F187 F188 F191 F194 F195 F197 F198 NA's 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> 
#> -------------- Output all records--------------

Let’s check the results!

# view the results
purrr::walk(result_list, ~ head(.) %>% print())
#> # Source:     SQL [6 x 10]
#> # Database:   sqlite 3.45.2 [:memory:]
#> # Ordered by: dates, uid
#>   def   src     uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n
#>   <chr> <chr> <int>   <int> <dbl> <chr> <chr>   <chr>             <int>
#> 1 SUD   claim    24       1 17464 3048  3040    <NA>                  0
#> 2 SUD   claim    14       2 17640 3041  3047    <NA>                  0
#> 3 SUD   claim    37       3 16948 2913  3035    999                   0
#> 4 SUD   claim    18       4 18151 305   3056    999                   0
#> 5 SUD   claim    38       5 18248 3055  3054    999                   0
#> 6 SUD   claim    46       6 17273 2922  3048    2927                  0
#> # ℹ 1 more variable: flag_restrict_date <int>
#>   def  src uid clnt_id      dates diagx diagx_1 diagx_2
#> 1 SUD hosp  43       1 2015-08-13  F156    F144    F168
#> 2 SUD hosp  16       1 2017-10-25   F12    F132    <NA>
#> 3 SUD hosp   6       2 2018-04-19  F133    F128    <NA>
#> 4 SUD hosp  29       3 2016-05-27  F130    F164     999
#> 5 SUD hosp  10       4 2019-09-12  F138    F120     999
#> 6 SUD hosp  30       5 2019-12-18  F192    F100     999

At this point, the result from the claim database (result[[1]]) has not been collected locally. You could collect it manually, do further filtering, and then combine with the result from hospitalization data in any way you want. If you just need a simple row bind, we have bind_source() with convenient naming feature.

bind_source(result_list,
  # output_name = c(names in the list elements)
  src = "src",
  uid = "uid",
  clnt_id = "clnt_id",
  flag_date = c("flag_restrict_date", NA),
  force_proceed = TRUE
)
#> # A tibble: 99 × 5
#>    src_No src     uid clnt_id flag_date
#>     <int> <chr> <int>   <int>     <int>
#>  1      1 claim    24       1         0
#>  2      1 claim    14       2         0
#>  3      1 claim    37       3         0
#>  4      1 claim    18       4         0
#>  5      1 claim    38       5         0
#>  6      1 claim    46       6         0
#>  7      1 claim    35       7         1
#>  8      1 claim    22       7         1
#>  9      1 claim    47       7         1
#> 10      1 claim    10       7         0
#> # ℹ 89 more rows

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.