Programmatic workflow demonstration

This vignette provides an in-depth walk-through of the MRPWorkflow and MRPModel classes in the shinymrp package. You will find practical information about the purpose, arguments, and caveats of each method, enabling you to conduct MRP analyses programmatically.

For an introduction to the package and workflow concepts, begin with the Getting started with shinymrp vignette.

Initializing the workflow

MRP analysis begins by instantiating an MRPWorkflow object with mrp_workflow(). The object-oriented design efficiently manages your data and supports a reproducible workflow without requiring repeated user input. All workflow methods are accessed via the R6 $ operator.

library(shinymrp)
workflow <- mrp_workflow()

1. Data preparation

MRP requires two key data components:

For details on accepted formats and preprocessing, see the Data Preprocessing vignette.

1.1 Preprocessing the sample data

Sample data should be a tidy data frame with your outcome measure and predictors. Time-varying data are also supported, allowing dates or indices (e.g., week, month, or year) to be incorporated for time-varying summaries and estimates.

The example_sample_data() function demonstrates accepted structures and common options:

sample_data <- example_sample_data(
  is_timevar = TRUE,
  is_aggregated = TRUE,
  special_case = NULL,
  family = "binomial"
)
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.
head(sample_data)
#> # A tibble: 6 × 10
#>     zip sex    race  age    time county state date       total positive
#>   <dbl> <chr>  <chr> <chr> <dbl>  <dbl> <dbl> <date>     <dbl>    <dbl>
#> 1 71354 female black 0-17      3  22025    22 2020-01-13     1        0
#> 2 71354 female black 0-17      5  22025    22 2020-01-27     1        1
#> 3 71354 female black 0-17      9  22025    22 2020-02-24     1        0
#> 4 71354 female black 18-34     1  22025    22 2019-12-30     2        1
#> 5 71354 female black 18-34     6  22025    22 2020-02-03     1        0
#> 6 71354 female black 18-34     9  22025    22 2020-02-24     1        1

Data preprocessing is performed with the $preprocess() method, which standardizes formats (e.g., recodes age groups, creates time indices), removes duplicates, and prepares data for modeling.

Note: Only basic cleaning is performed. Conduct preliminary data checks beforehand.

workflow$preprocess(
  sample_data,
  is_timevar = TRUE,
  is_aggregated = TRUE,
  special_case = NULL,
  family = "binomial"
)
#> Preprocessing sample data...
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.

1.2 Assembling poststratification data

1.2.1 Linking to ACS

Use the $link_acs() method to retrieve poststratification data from the American Community Survey (ACS), specifying the desired geographic level and reference year (i.e., the last year of the five-year ACS data collection). Supported levels include ZIP code, county, and state. If not specified, poststratification data are aggregated for the U.S. population.

workflow$link_acs(link_geo = "zip", acs_year = 2021)
#> Linking data to the ACS...
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.

1.2.2 Importing custom poststratification data

You can also import your own poststratification data using $load_pstrat(). Requirements mirror those for sample data.

pstrat_data <- example_pstrat_data()
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.
workflow$load_pstrat(pstrat_data, is_aggregated = TRUE)
#> Preprocessing poststratification data...
#> Warning in value[[3L]](cond): GitHub API check failed: HTTP 403 Forbidden..
#> Proceeding with download.

2. Descriptive statistics

The MRPWorkflow class provides methods for creating visualizations of demographic distributions, geographic coverage, and outcome summaries.

Use $demo_bars() to generate bar plots comparing demographic distributions in the sample and poststratification data.

workflow$demo_bars(demo = "sex")

For analyzing COVID-19 test data with linked geographic predictors, the $covar_hist() method creates frequency histograms of the geographic predictors. This method will give the following error if it is called with incompatible data that do not include geographic predictors.

workflow$covar_hist(covar = "income")
#> Error in workflow$covar_hist(covar = "income"): Covariate data is not available. This method is only available for COVID data.

$sample_size_map() visualizes sample sizes across geographic areas in an interactive choropleth map, powered by highcharter.

workflow$sample_size_map()
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo

$outcome_plot() creates plots of average outcome values, including time variation if present.

workflow$outcome_plot()

$outcome_map() visualizes outcome summaries by geography, with options for the maximum or minimum value across time.

workflow$outcome_map(summary_type = "max")

3. Model building

The MRPModel class is a wrapper around CmdStanR objects, providing a high-level interface for obtaining posterior summaries and diagnostics.

3.1 Model specification

Use $create_model() to specify the model, including the mean structure (fixed and varying effects, two-way interactions) and priors. Varying effects assume normal distributions with unknown standard deviations (with specified priors). Stan code is generated for compilation via CmdStanR. Supported prior types include:

Default priors are used if "" (empty string) is supplied:

model1 <- workflow$create_model(
  intercept_prior = "normal(0, 4)",
  fixed = list(
    sex = "normal(0, 2)",
    race = "normal(0, 2)"
  ),
  varying = list(
    age = "",
    time = ""
  )
)

3.2 Model fitting

The $fit() method runs Stan’s main Markov chain Monte Carlo algorithm via CmdStanR.

model1$fit(
  n_iter = 500,
  n_chains = 2,
  seed = 123,
  show_messages = FALSE,
  show_exceptions = FALSE
)

3.3 Posterior summaries and diagnostics

The fitted MRPModel object extracts draws from the internal CmdStanMCMC object for summaries and diagnostics.

model1$summary()
#> $fixed
#>              Estimate Est.Error    l-95% CI  u-95% CI    R-hat Bulk_ESS
#> intercept  -0.1926188 0.2250724 -0.59452797 0.3132129 1.007664 134.0609
#> sex.male    0.2936252 0.1328935  0.02452109 0.5443322 1.000301 424.1697
#> race.black         NA        NA          NA        NA       NA       NA
#> race.other  0.7855232 0.1859582  0.41538058 1.1373412 1.006413 415.9744
#> race.white  0.8434822 0.1679745  0.47585995 1.1415483 1.001577 458.3526
#>             Tail_ESS
#> intercept   82.97744
#> sex.male   407.90451
#> race.black        NA
#> race.other 337.37038
#> race.white 381.81452
#> 
#> $varying
#>                   Estimate Est.Error   l-95% CI  u-95% CI    R-hat  Bulk_ESS
#> age (intercept)  0.2883083 0.2692683 0.03661777 1.1676160 1.011352  55.09682
#> time (intercept) 0.2704408 0.1275997 0.04227495 0.5847031 1.003678 199.92277
#>                  Tail_ESS
#> age (intercept)  103.3675
#> time (intercept) 180.6145
#> 
#> $other
#> data frame with 0 columns and 0 rows
model1$diagnostics()
#>               Metric
#> 1         Divergence
#> 2 Maximum tree depth
#> 3             E-BFMI
#>                                                       Message
#> 1         0 of 500 (0.0%) transitions ended with a divergence
#> 2 0 of 500 transitions hit the maximum tree depth limit of 10
#> 3                   0 of 2 chains had an E-BFMI less than 0.3
workflow$pp_check(model1)
#> Running posterior predictive check...

model2 <- workflow$create_model(
  intercept_prior = "normal(0, 4)",
  fixed = list(
    sex = "normal(0, 2)",
    race = "normal(0, 2)"
  ),
  varying = list(
    age = "",
    time = ""
  ),
  interaction = list(
    `age:time` = "normal(0, 1)",
    `race:time` = "normal(0, 1)",
    `sex:time` = "normal(0, 1)"
  )
)

model2$fit(
  n_iter = 500,
  n_chains = 2,
  seed = 123,
  show_messages = FALSE,
  show_exceptions = FALSE
)
#> Warning: 1 of 500 (0.0%) transitions ended with a divergence.
#> See https://mc-stan.org/misc/warnings for details.
workflow$compare_models(model1, model2)
#> Running leave-one-out cross-validation...
#>        elpd_diff  se_diff
#> model2  0.000000 0.000000
#> model1 -1.670122 2.579206

3.4 Saving and loading models

You can easily save and restore models using the qs package. The $save() gathers all posterior draws and diagnostics, then calls qs::qsave() internally.

model1$save(file = "model1.qs")

# load back into workspace
model1 <- qs::qread("model1.qs")

4. Result visualization

Use $estimate_plot() to visualize the model estimates, either overall or by group, with a 95% credible interval by default. Users can specify alternative intervals.

workflow$estimate_plot(model1, group = "sex", interval = 0.95)
#> Running poststratification...

For time-varying data, $estimate_map() creates snapshots of geography-based estimates at specific time points. time_index is irrelevant for cross-sectional data.

workflow$estimate_map(model1, geo = "county", interval = 0.95, time_index = 1)

Further reading