
More on data preparation

library(shinymrp)

This vignette details the data requirements, implementation steps, and relevant code references for data preparation in shinymrp.

MRP requires two main data components: the survey or test data and a corresponding poststratification table. The workflow involves two stages: preparing data that meet the requirements below, followed by the preprocessing performed by the app.

Data requirements

Individual-level vs. aggregated data

Data preprocessing accepts data in either individual-level or aggregated format:

For continuous outcomes, only individual-level data are supported.
For binary outcomes, the aggregated format is preferred for computational efficiency; individual-level data are aggregated automatically upon upload. Other data requirements depend on format, primarily regarding outcome measures.
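The automatic aggregation of individual-level binary data can be sketched in base R as follows. This is a minimal illustration, not the package's implementation; the demographic columns and their levels are hypothetical, while the total and positive column names follow the conventions described in this vignette.

```r
# Individual-level binary data: one row per respondent (illustrative columns).
indiv <- data.frame(
  sex      = c("male", "male", "female", "female", "female"),
  age      = c("18-34", "18-34", "18-34", "35-64", "35-64"),
  positive = c(1, 0, 1, 0, 1)
)

# Collapse to one row per demographic cell:
# 'total' = number of individuals, 'positive' = number with a positive outcome.
total_df <- aggregate(positive ~ sex + age, data = indiv, FUN = length)
names(total_df)[3] <- "total"
pos_df <- aggregate(positive ~ sex + age, data = indiv, FUN = sum)
agg <- merge(total_df, pos_df, by = c("sex", "age"))
```

Here the male 18-34 cell collapses to total = 2, positive = 1.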

Required columns and naming conventions

The code expects columns with specific names and values (not case-sensitive):
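A hypothetical aggregated input for a time-varying binary outcome might look like the sketch below. The total, positive, and time column names follow the conventions documented in this vignette's footnotes; the demographic columns, their levels, and the ZIP codes are illustrative only, so consult the app's upload interface for the authoritative column list.

```r
# Illustrative aggregated sample data (column names are lowercase,
# but matching is not case-sensitive).
sample_data <- data.frame(
  sex      = c("male", "female"),
  race     = c("white", "black"),
  age      = c("18-34", "35-64"),
  zip      = c("02138", "02139"),  # smallest geographic unit available
  time     = c(1, 1),              # time index (time-varying data only)
  total    = c(120, 85),           # number of individuals in the cell
  positive = c(14, 9)              # number with a positive outcome
)
str(sample_data)
```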

Data modules

Input data are organized into modules, each with its own requirements and implementation. The two primary categories, time-varying and cross-sectional, each include modules for specific applications as well as general use cases. The following cheatsheet summarizes the requirements and typical preprocessing outputs for each module.


TIME-VARYING


COVID-19 Test Data

  1. Sample data
  2. Poststratification data

General

  1. Sample data
  2. Poststratification data

CROSS-SECTIONAL


Public Opinion Poll Data

  1. Sample data
  2. Poststratification data

General

  1. Sample data
  2. Poststratification data

Data preprocessing steps

The preprocessing pipeline includes:

Code reference: preprocess.

Geographic identifiers and covariates

A major strength of MRP is small area estimation, so it is advisable to include as much geographic information, and as many geographic covariates, as possible.

First, the application infers geographic units at larger scales that are not present in the data: it determines the smallest geographic units in the data and derives the corresponding units at larger scales. For example, if the data contain ZIP codes, the application automatically finds the county and state that have the largest overlap with each ZIP code.
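The largest-overlap rule can be sketched as follows. This is a minimal illustration, not the package's code; the crosswalk data frame, its res_ratio column (a residential-address overlap ratio), and the ZIP and county values are all hypothetical.

```r
# Hypothetical ZIP-to-county crosswalk: each row gives the share of a ZIP's
# residential addresses that fall in a county.
crosswalk <- data.frame(
  zip       = c("02138", "02138", "02139"),
  county    = c("Middlesex", "Suffolk", "Middlesex"),
  res_ratio = c(0.9, 0.1, 1.0)
)

# For each ZIP, keep the county with the largest overlap.
zip_county <- do.call(rbind, lapply(split(crosswalk, crosswalk$zip), function(d) {
  d[which.max(d$res_ratio), c("zip", "county")]
}))
```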

Quantitative measures associated with geographic units are sourced from your data or external datasets. For general use cases, the app scans the data to find quantities that have a one-to-one relationship with the geographic identifier of interest.
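One way to perform such a scan is to keep only columns that take a single value within each geographic unit. The sketch below is an assumption about how such a check could work, not the app's actual logic; the column names are illustrative.

```r
# Hypothetical data: 'urbanicity' is constant within each ZIP (a candidate
# geographic covariate), while 'outcome_val' varies within a ZIP (not one).
dat <- data.frame(
  zip         = c("02138", "02138", "02139"),
  urbanicity  = c(0.95, 0.95, 0.80),
  outcome_val = c(1, 0, 1)
)

# A column has a one-to-one relationship with 'zip' if it has exactly one
# distinct value per ZIP.
is_geo_cov <- sapply(dat[-1], function(x) {
  all(tapply(x, dat$zip, function(v) length(unique(v)) == 1))
})
```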

For the COVID-19 use case, we have identified specific ZIP code-level measures that are informative in modeling COVID-19 test results. We obtain these quantities at the tract level from the ACS and other sources, then aggregate over the tracts that overlap with each ZIP code based on the USPS crosswalk table.

We obtain the following tract-level measures from the ACS and other sources:

Code reference: get_tract_data.

While the ACS reports geography at the levels of census tracts, counties, and states, ZIP codes are defined by the U.S. Postal Service (USPS). We use the ZIP code crosswalk table released by the U.S. Department of Housing and Urban Development and USPS to link ZIP codes to census tracts. ZIP-code-level statistics are then computed by combining all available tract-level measures across the linked tracts, weighted by tract population counts.

Code reference: combine_tracts_covid.
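The population-weighted combination described above can be sketched in base R. This is an illustration under assumed inputs, not the combine_tracts_covid implementation; the median_income measure, population counts, and ZIP codes are hypothetical.

```r
# Hypothetical tract-level measures, with each tract linked to a ZIP code.
tracts <- data.frame(
  zip           = c("02138", "02138", "02139"),
  pop           = c(4000, 1000, 3000),
  median_income = c(60000, 80000, 70000)
)

# ZIP-level value: population-weighted mean over the tracts linked to each ZIP.
zip_level <- sapply(split(tracts, tracts$zip), function(d) {
  weighted.mean(d$median_income, w = d$pop)
})
```

For ZIP 02138 this gives (60000 * 4000 + 80000 * 1000) / 5000 = 64000.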

Poststratification tables

Poststratification tables are computed from ACS data obtained via the tidycensus package and IPUMS; they summarize the size of every subpopulation defined by the demographic and geographic cross-categories. For efficiency, tables are precomputed at the tract level and then aggregated for larger geographies. For each ZIP code, we select the county with the most overlapping residential addresses as the ZIP-linked county and sum over the overlapping tracts to obtain ZIP-code-level population counts.

Code reference: combine_tracts and combine_tracts_covid.
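The tract-to-ZIP aggregation of poststratification counts amounts to summing cell counts over the tracts that overlap each ZIP. The sketch below illustrates this under assumed inputs; it is not the combine_tracts implementation, and the cell counts and ZIP codes are hypothetical.

```r
# Hypothetical tract-level poststratification cells (one demographic cell
# shown, split across the tracts linked to each ZIP).
tract_cells <- data.frame(
  zip = c("02138", "02138", "02139", "02139"),
  sex = c("female", "female", "female", "female"),
  age = c("18-34", "18-34", "18-34", "18-34"),
  n   = c(150, 40, 200, 60)
)

# ZIP-level counts: sum the cell counts over the tracts within each ZIP.
pstrat <- aggregate(n ~ zip + sex + age, data = tract_cells, FUN = sum)
```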


  1. Geographical columns are optional for general use. The app automatically identifies the smallest available geographic scale and infers higher levels.↩︎

  2. For individual-level data, dates are automatically converted to time indices but can be provided explicitly. Aggregated data must include a time column with time indices. Optionally include a date column (first day of each period) for visualization. The interface uses time-invariant poststratification data.↩︎

  3. For continuous outcomes, name your outcome column outcome.↩︎

  4. For binary outcomes, the outcome column in individual-level data must be named positive. For aggregated data, use total (number in cell) and positive (number positive in cell).↩︎

  5. Survey weights must be in a column named weight. If uploaded poststratification data contain weights, they’re used to estimate population counts.↩︎
