Background

What is NHANES?

The National Health and Nutrition Examination Survey (NHANES) is:

  • A nationally representative, complex survey of the U.S. civilian noninstitutionalized population
  • Continuous since 1999, collected in 2-year cycles
  • Combines interviews, physical exams, and laboratory tests
  • A cornerstone dataset for population health research

Data spans:

Category        Datasets
Laboratory           173
Questionnaire         77
Examination           75
Dietary               14
Total               339+

NHANES is powerful, but working with it at scale is painful.

The Problem

The CDC publishes data in two-year cycle fragments with cryptic naming:

DEMO    DEMO_B    DEMO_C    DEMO_D    ...    DEMO_L
(1999)  (2001)    (2003)    (2005)           (2021)

Pain points for researchers:

  • CDC servers are slow and frequently unreliable
  • Every table must be individually downloaded, cleaned, and merged
  • Column types conflict across cycles (numeric in one cycle, factor in another)
  • Inconsistent variable coding: raw numeric codes, not labels
  • Repetitive, error-prone boilerplate code in every new project

For large data science projects and iterative workflows, this becomes a real bottleneck.

The Solution

nhanesdata: What We Built

nhanesdata stands on the shoulders of the excellent nhanesA package (Endres et al.), which made NHANES data far more accessible but still depends on CDC's servers at runtime.

We added:

  • Pre-merged tables across all cycles
  • Harmonized column types and labels
  • Parquet files hosted on a public cloud server (Cloudflare R2)
  • create_design() for CDC-compliant survey weighting
  • Variable and term search utilities

The result:

  • Near-instantaneous downloads
  • No CDC server dependency at runtime
  • Reproducible, consistent data ready to join and analyze
  • Lower barrier to entry for NHANES research

How It Works

Instead of downloading data by cycle, nhanesdata serves it by table. All cycles for a given table are already merged into a single dataset.

For example, read_nhanes("demo") returns one tibble containing NHANES demographics from 1999 through 2023 — every cycle combined, ready to use.

Every dataset always contains two key variables:

  • seqn: the participant ID
  • year: the survey cycle start year
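A quick sketch of what this looks like in practice (column names as documented above; the dplyr calls are standard):

```r
library(nhanesdata)
library(dplyr)

# One call returns demographics for every cycle, keyed by seqn + year
demo <- read_nhanes("demo")

# Each row is one participant-cycle record; `year` marks the cycle start
demo |> count(year)            # rows per cycle, 1999 through 2023
demo |> distinct(seqn, year)   # seqn + year identifies a record
```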

Data Harmonization: The Problem

nhanesA translates numeric codes to labeled factors only when a parseable CDC codebook exists for that cycle. When one doesn’t, the same variable comes back as raw numbers.

Same variable, two cycles:

Cycle with codebook:    BMIWT = "Could not obtain" / "Clothing" / "Medical appliance"
Cycle without:          BMIWT = 1 / 3 / 4

Add R type conflicts (integer vs double, factor vs character) and variables absent from certain cycles, and there is a lot to resolve before bind_rows() will even run.
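The type-conflict failure is easy to reproduce with plain dplyr (this toy example is illustrative, not package code):

```r
library(dplyr)

# Two cycles of the "same" variable, as nhanesA can return them:
cycle_with_codebook <- tibble(bmiwt = factor("Medical appliance"))
cycle_without       <- tibble(bmiwt = 4)   # raw numeric code

# bind_rows() refuses to combine a <factor> column with a <double> column
try(bind_rows(cycle_with_codebook, cycle_without))
# errors: incompatible types for `bmiwt`
```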

Data Harmonization: Our Approach

What we did:

  • Cached translation tables from cycles that do have codebooks and applied them to cycles that don’t
  • Resolved all R type conflicts across cycles before binding
  • Variables not collected in a given cycle are filled with NA, not dropped
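The translation idea, sketched with a hypothetical lookup table (the object names here are illustrative, not the package's internals):

```r
# A translation table cached from a cycle that HAS a codebook...
bmiwt_labels <- c(`1` = "Could not obtain",
                  `3` = "Clothing",
                  `4` = "Medical appliance")

# ...applied to a cycle that returns raw numeric codes
raw_codes  <- c(1, 4, NA, 3)
translated <- unname(bmiwt_labels[as.character(raw_codes)])
translated
# "Could not obtain" "Medical appliance" NA "Clothing"
```

Codes with no codebook entry (and true missing values) stay NA rather than being silently relabeled.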

Warning

We’re human. We did our best. If a variable is critical to your analysis, verify it against the CDC codebook. Use get_url("DEMO_J") for a direct link.

Using nhanesdata

Installation & Loading Data

install.packages("nhanesdata")
library(nhanesdata)

# Load any NHANES table - merged across all cycles, instantly
demo  <- read_nhanes("demo")   # Demographics, 1999-2023
bpx   <- read_nhanes("bpx")    # Blood pressure
bmx   <- read_nhanes("bmx")    # Body measures
tchol <- read_nhanes("tchol")  # Total cholesterol
# Not sure what a table is called? Search by keyword or variable name
term_search("blood pressure")   # keyword search to find table name
var_search("BPXSY1")            # find which cycles contain a variable
get_url("BPX_J")                # open CDC codebook for a specific cycle

Joining & Filtering

library(nhanesdata)
library(dplyr)

demo  <- read_nhanes("demo")
bpx   <- read_nhanes("bpx")
bmx   <- read_nhanes("bmx")

# Join on BOTH seqn AND year - seqn alone is not unique across cycles
analysis <- demo |>
  inner_join(bpx, by = c("seqn", "year")) |>
  inner_join(bmx, by = c("seqn", "year")) |>
  filter(
    ridageyr >= 18,       # adults only
    year >= 2007,         # 2007-2017 cycles
    year <= 2017
  ) |>
  mutate(obese = bmxbmi >= 30)

Filtering and variable creation should happen before creating the survey design object.

Survey Design

NHANES oversamples certain subgroups, so each participant carries a sampling weight. Ignoring weights produces biased population estimates.

Pick the weight type that matches your most restrictive data source:

Weight        Use when your analysis includes…
"interview"   Questionnaire / demographics only
"mec"         Any physical exam or lab data
"fasting"     Fasting lab measurements

design <- create_design(
  dsn      = analysis,
  start_yr = 2007,
  end_yr   = 2017,
  wt_type  = "mec"
)

create_design() handles CDC-required weight scaling across cycles automatically.
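For intuition, here is a hand-rolled version of that scaling under NCHS guidance, assuming the standard NHANES columns wtmec2yr, sdmvpsu, and sdmvstra (lowercased per the package's convention). When combining k post-2002 cycles, each cycle's 2-year MEC weight is divided by k:

```r
library(dplyr)
library(survey)

k <- 6  # 2007-2017 spans six 2-year cycles

analysis_wt <- analysis |>
  mutate(wt_combined = wtmec2yr / k)

# Equivalent survey design built manually with the NHANES design variables
design_manual <- svydesign(
  ids     = ~sdmvpsu,     # masked primary sampling units
  strata  = ~sdmvstra,    # masked strata
  weights = ~wt_combined,
  nest    = TRUE,
  data    = analysis_wt
)
```

Cycles from 1999-2002 require the 4-year weights with a different multiplier, which is exactly the bookkeeping create_design() hides.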

Survey Design: Weighted Analysis

library(srvyr)

# Weighted mean systolic BP and obesity prevalence by age group
design |>
  group_by(age_60plus = ridageyr >= 60) |>
  summarize(
    mean_sbp  = survey_mean(bpxsy1, na.rm = TRUE),
    pct_obese = survey_mean(obese,  na.rm = TRUE),
    n         = survey_total(vartype = NULL)
  )
# A tibble: 2 × 6
  age_60plus mean_sbp mean_sbp_se pct_obese pct_obese_se        n
  <lgl>         <dbl>       <dbl>     <dbl>        <dbl>    <dbl>
1 FALSE          119.       0.315     0.332       0.00531 178456823
2 TRUE           135.       0.481     0.357       0.00902  41234109

Mortality Linkage

One of the most powerful features: read_nhanes("mortality")

library(survival)

demo      <- read_nhanes("demo")
mortality <- read_nhanes("mortality")   # NDI linkage through Dec 31, 2019

survival_data <- demo |>
  inner_join(mortality, by = c("seqn", "year")) |>
  filter(ridageyr >= 40, year >= 1999)

design <- create_design(survival_data, 1999, 2017, wt_type = "interview")

svycoxph(
  Surv(permth_exm / 12, mortstat == 1) ~ ridageyr + riagendr,
  design = design
)

Population-representative survival analyses, entirely in R.

Caveats & What’s Next

What’s Not Included (Yet)

We harmonized the majority of NHANES datasets, but some are excluded:

Currently excluded:

  • Datasets where CDC suffix conventions are not followed
  • Datasets with repeated measures (require special handling)
  • Surplus, pooled, and restricted-access samples
  • The 2019-2020 cycle (COVID-19 disrupted data collection; CDC advises against combining it with standard cycles)

We’re actively working on expanding coverage. If you need a specific dataset not yet available, open a GitHub issue or contribute!

The Bottom Line

Goal: Abstract away the data science overhead so researchers can focus on their science.

We achieved this by:

  • Harmonizing 339+ datasets across 13 cycles
  • Hosting on a public, no-auth cloud server
  • Serving as Parquet files for near-instant access
  • Providing create_design() for CDC-compliant weighting

What it means for you:

  • Go from install to analysis in minutes
  • Consistent, documented data you can trust
  • Reproducible workflows that don’t break when CDC servers go down

Thank You

Get started:

install.packages("nhanesdata")

Resources:

Built on: nhanesA, arrow, srvyr, dplyr

Contributors:

  • Kyle Grealis, MS (lead developer)
  • Natalie Neugaard (Goulett), MPH (presenter)
  • Amrit Baral, PhD, MBBS, MPH
  • Raymond Balise, PhD
  • Johannes Thrul, PhD
  • Janardan Devkota, PhD