
More on data preparation

library(shinymrp)

This vignette details the data requirements, implementation steps, and relevant code references for data preparation in shinymrp.

MRP requires two main data components: the survey or test data and a corresponding poststratification table. The workflow involves two stages: preparing data that meet the requirements below, followed by the preprocessing performed by the app.

Data requirements

Individual-level vs. aggregated data

Data preprocessing accepts data in either individual-level or aggregated format:

For continuous outcomes, only individual-level data are supported.
For binary outcomes, the aggregated format is preferred for computational efficiency; individual-level data are aggregated automatically upon upload. Other data requirements depend on format, primarily regarding outcome measures.
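The automatic aggregation of individual-level binary data can be sketched in base R as follows. This is a minimal illustration, not the package's implementation; the demographic columns and their levels are hypothetical, while the total and positive column names follow the conventions described in this vignette.

```r
# Individual-level binary data: one row per respondent (illustrative columns).
indiv <- data.frame(
  sex      = c("male", "male", "female", "female", "female"),
  age      = c("18-34", "18-34", "18-34", "35-64", "35-64"),
  positive = c(1, 0, 1, 0, 1)
)

# Collapse to one row per demographic cell:
# 'total' = number of individuals, 'positive' = number with a positive outcome.
total_df <- aggregate(positive ~ sex + age, data = indiv, FUN = length)
names(total_df)[3] <- "total"
pos_df <- aggregate(positive ~ sex + age, data = indiv, FUN = sum)
agg <- merge(total_df, pos_df, by = c("sex", "age"))
```

Here the male 18-34 cell collapses to total = 2, positive = 1.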

Required columns and naming conventions

The code expects columns with specific names and values (not case-sensitive):
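A hypothetical aggregated input for a time-varying binary outcome might look like the sketch below. The total, positive, and time column names follow the conventions documented in this vignette's footnotes; the demographic columns, their levels, and the ZIP codes are illustrative only, so consult the app's upload interface for the authoritative column list.

```r
# Illustrative aggregated sample data (column names are lowercase,
# but matching is not case-sensitive).
sample_data <- data.frame(
  sex      = c("male", "female"),
  race     = c("white", "black"),
  age      = c("18-34", "35-64"),
  zip      = c("02138", "02139"),  # smallest geographic unit available
  time     = c(1, 1),              # time index (time-varying data only)
  total    = c(120, 85),           # number of individuals in the cell
  positive = c(14, 9)              # number with a positive outcome
)
str(sample_data)
```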

Data modules

Input data are organized into modules, each with its own requirements and implementation. The two primary categories, time-varying and cross-sectional, each include modules for specific applications as well as general use cases. The following cheatsheet summarizes the requirements and typical preprocessing outputs for each module.


TIME-VARYING


COVID-19 Test Data

  1. Sample data
  2. Poststratification data

General

  1. Sample data
  2. Poststratification data

CROSS-SECTIONAL


Public Opinion Poll Data

  1. Sample data
  2. Poststratification data

General

  1. Sample data
  2. Poststratification data

Data preprocessing steps

The preprocessing pipeline includes:

Code reference: preprocess.

Geographic identifiers and covariates

A major strength of MRP is small area estimation, so it is advisable to include as much geographic information, and as many geographic covariates, as possible.

First, the application infers geographic units at larger scales that are not present in the data: it determines the smallest geographic units in the data and derives the corresponding units at larger scales. For example, if the data contain ZIP codes, the application automatically finds the county and state that have the largest overlap with each ZIP code.
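The largest-overlap rule can be sketched as follows. This is a minimal illustration, not the package's code; the crosswalk data frame, its res_ratio column (a residential-address overlap ratio), and the ZIP and county values are all hypothetical.

```r
# Hypothetical ZIP-to-county crosswalk: each row gives the share of a ZIP's
# residential addresses that fall in a county.
crosswalk <- data.frame(
  zip       = c("02138", "02138", "02139"),
  county    = c("Middlesex", "Suffolk", "Middlesex"),
  res_ratio = c(0.9, 0.1, 1.0)
)

# For each ZIP, keep the county with the largest overlap.
zip_county <- do.call(rbind, lapply(split(crosswalk, crosswalk$zip), function(d) {
  d[which.max(d$res_ratio), c("zip", "county")]
}))
```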

Quantitative measures associated with geographic units are sourced from your data or external datasets. For general use cases, the app scans the data to find quantities that have a one-to-one relationship with the geographic identifier of interest.
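One way to perform such a scan is to keep only columns that take a single value within each geographic unit. The sketch below is an assumption about how such a check could work, not the app's actual logic; the column names are illustrative.

```r
# Hypothetical data: 'urbanicity' is constant within each ZIP (a candidate
# geographic covariate), while 'outcome_val' varies within a ZIP (not one).
dat <- data.frame(
  zip         = c("02138", "02138", "02139"),
  urbanicity  = c(0.95, 0.95, 0.80),
  outcome_val = c(1, 0, 1)
)

# A column has a one-to-one relationship with 'zip' if it has exactly one
# distinct value per ZIP.
is_geo_cov <- sapply(dat[-1], function(x) {
  all(tapply(x, dat$zip, function(v) length(unique(v)) == 1))
})
```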

For the COVID-19 use case, we have identified specific ZIP code-level measures that are informative in modeling COVID-19 test results. We obtain these quantities at the tract level from the ACS and other sources, then aggregate over the tracts that overlap with each ZIP code based on the USPS crosswalk table.

We obtain the following tract-level measures from the ACS and other sources:

Code reference: get_tract_data.

While the ACS reports geography at the levels of census tracts, counties, and states, ZIP codes are defined by the U.S. Postal Service (USPS). We use the ZIP code crosswalk table released by the U.S. Department of Housing and Urban Development and USPS to link ZIP codes to census tracts. ZIP-code-level statistics are then computed by combining all available tract-level measures across the linked tracts, weighted by tract population counts.

Code reference: combine_tracts_covid.
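The population-weighted combination described above can be sketched in base R. This is an illustration under assumed inputs, not the combine_tracts_covid implementation; the median_income measure, population counts, and ZIP codes are hypothetical.

```r
# Hypothetical tract-level measures, with each tract linked to a ZIP code.
tracts <- data.frame(
  zip           = c("02138", "02138", "02139"),
  pop           = c(4000, 1000, 3000),
  median_income = c(60000, 80000, 70000)
)

# ZIP-level value: population-weighted mean over the tracts linked to each ZIP.
zip_level <- sapply(split(tracts, tracts$zip), function(d) {
  weighted.mean(d$median_income, w = d$pop)
})
```

For ZIP 02138 this gives (60000 * 4000 + 80000 * 1000) / 5000 = 64000.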

Poststratification tables

Poststratification tables are computed from ACS data obtained via the tidycensus package and IPUMS; they summarize the size of every subpopulation defined by the demographic and geographic cross-categories. For efficiency, tables are precomputed at the tract level and then aggregated for larger geographies. For each ZIP code, we select the county with the most overlapping residential addresses as the ZIP-linked county and sum over the overlapping tracts to obtain ZIP-code-level population counts.

Code reference: combine_tracts and combine_tracts_covid.
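The tract-to-ZIP aggregation of poststratification counts amounts to summing cell counts over the tracts that overlap each ZIP. The sketch below illustrates this under assumed inputs; it is not the combine_tracts implementation, and the cell counts and ZIP codes are hypothetical.

```r
# Hypothetical tract-level poststratification cells (one demographic cell
# shown, split across the tracts linked to each ZIP).
tract_cells <- data.frame(
  zip = c("02138", "02138", "02139", "02139"),
  sex = c("female", "female", "female", "female"),
  age = c("18-34", "18-34", "18-34", "18-34"),
  n   = c(150, 40, 200, 60)
)

# ZIP-level counts: sum the cell counts over the tracts within each ZIP.
pstrat <- aggregate(n ~ zip + sex + age, data = tract_cells, FUN = sum)
```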


  1. Geographical columns are optional for general use. The app automatically identifies the smallest available geographic scale and infers higher levels.↩︎

  2. For individual-level data, dates are automatically converted to time indices but can be provided explicitly. Aggregated data must include a time column with time indices. Optionally include a date column (first day of each period) for visualization. The interface uses time-invariant poststratification data.↩︎

  3. For continuous outcomes, name your outcome column outcome.↩︎

  4. For binary outcomes, the outcome column in individual-level data must be named positive. For aggregated data, use total (number in cell) and positive (number positive in cell).↩︎

  5. Survey weights must be in a column named weight. If uploaded poststratification data contain weights, they’re used to estimate population counts.↩︎
