Preparing STEPS Data for Analysis

Abhijit Pakhare

2026-05-06

Introduction

This guide helps you prepare your WHO STEPS survey data file for use with the stepssurvey package. It covers the variables the package expects, how auto-detection works, common data quality issues, and how to resolve mismatches between your data and the package expectations.

The package is designed to work with STEPS data from any country, regardless of instrument version (v3.1 or v3.2) or data management system (Epi Info, SPSS, Stata, Excel).

Supported file formats

Format	Extension	Typical source	Reader used
SPSS	`.sav`	WHO STEPS data entry / Epi Info export	`haven::read_spss()`
Stata	`.dta`	WHO analysis template	`haven::read_dta()`
Excel	`.xlsx`	Custom data entry	`readxl::read_excel()`
CSV	`.csv`	Any spreadsheet export	`readr::read_csv()`

Recommendation: Use the .sav file directly as exported from your data management system. The package preserves SPSS variable labels during column detection (to disambiguate codes like A1 that mean different things across versions) and then strips them before analysis to avoid compatibility issues.

Minimum required variables

At a minimum, the package needs age and sex to produce any output. Beyond that, each additional variable you provide enables more indicators and tables.

Essential (required)

Variable	STEPS codes	Description
Age	`C3` (v3.2), `age`, `c1` (v3.1)	Respondent age in completed years
Sex	`C1` (v3.2), `sex`, `gender`, `c2` (v3.1)	Male/Female coding (1/2, M/F, or text)

Strongly recommended

Variable	STEPS codes	Description
Sampling weight (Step 1)	`WStep1`, `wt_final`, `sampleweight`	Probability weight for behavioural module
Sampling weight (Step 2)	`WStep2`	Weight for physical measurements
Sampling weight (Step 3)	`WStep3`	Weight for biochemical measurements
PSU / Cluster	`psu`, `cluster`, `I1`, `ea_id`	Primary sampling unit identifier
Stratum	`stratum`, `strata`, `district`, `region`	Stratification variable

If only one weight column is present, it is used for all three steps. If no weight is found, the package assumes equal weights (simple random sample).

Step 1: Behavioural risk factors

Tobacco:

Variable	v3.1 code	v3.2 code	Description
Current smoker	T1	T1	Currently smoke tobacco (yes/no)
Daily smoker	T2	T2	Smoke daily (yes/no)
Age started	T3	T3	Age of smoking initiation
Cigarettes/day	T5a	T5a	Manufactured cigarettes per day
Quit attempt	T6	T6	Tried to quit in past 12 months
Past smoker	T8	T8	Ever smoked in the past
Smokeless tobacco	T12/T15	T12	Current smokeless use
Second-hand (home)	T17	T17	Exposure to smoke at home
Second-hand (work)	T18	T18	Exposure to smoke at workplace

Alcohol:

Variable	v3.1 code	v3.2 code	Description
Ever consumed	–	A1	Lifetime alcohol consumption
Past 12 months	A2/A4	A2	Consumed in past year
Current (30 days)	A1	A5	Consumed in past 30 days
Occasions (30 days)	A6	A6	Number of drinking occasions
Drinks per occasion	A7	A7	Typical number of drinks
Heavy episodic	–	A9	Times with 6+ drinks (30 days)

Note on A1/A5 ambiguity: In v3.1, A1 means “current drinker (past 30 days)”. In v3.2, A1 means “ever consumed alcohol” and A5 is “past 30 days”. The package uses SPSS variable labels to disambiguate when the column code alone is ambiguous. This is one reason why .sav files (which carry labels) work better than plain CSV.
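
To see which meaning a column carries in your own file, you can inspect its variable label. The following is a minimal sketch, assuming the label sits in the column's `label` attribute (which is where haven places it for .sav files). The toy data frame, the helper name `alcohol_meaning()`, and the keyword patterns are illustrative assumptions, not the package's internal rules.

```r
# Toy data standing in for an imported .sav file; the variable label is
# stored in the "label" attribute of each column.
raw <- data.frame(a1 = c(1, 2, 1), a5 = c(1, 2, 2))
attr(raw$a1, "label") <- "Have you ever consumed any alcohol?"
attr(raw$a5, "label") <- "Consumed alcohol within the past 30 days?"

# Classify a column from its label (hypothetical helper, illustrative keywords).
alcohol_meaning <- function(x) {
  label <- attr(x, "label", exact = TRUE)
  if (is.null(label)) return("unknown (no label)")
  if (grepl("30 days", label, ignore.case = TRUE)) return("current (30 days)")
  if (grepl("ever", label, ignore.case = TRUE)) return("ever consumed")
  "unknown"
}

meanings <- vapply(raw, alcohol_meaning, character(1))
meanings
```

A CSV export of the same data would carry no labels, so every column would come back as "unknown (no label)" and the code alone would have to decide.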

Diet:

Variable	v3.1 code	v3.2 code	Description
Fruit days/week	D1	D1	Days eating fruit in typical week
Fruit servings/day	D2	D2	Servings of fruit on those days
Vegetable days/week	D3	D3	Days eating vegetables
Vegetable servings/day	D4	D4	Servings of vegetables on those days
Salt at table	D5	D5	Frequency of adding salt
Processed salt food	D7	D7	Frequency of processed salty food

Physical Activity (GPAQ):

Variable	v3.2 code	Description
Vigorous work (y/n)	P1	Does vigorous work activity
Vigorous work days	P2	Days per week
Vigorous work hours	P3a	Hours per day
Vigorous work minutes	P3b	Minutes per day
Moderate work (y/n)	P4	Does moderate work activity
Moderate work days	P5	Days per week
Moderate work hours	P6a	Hours per day
Moderate work minutes	P6b	Minutes per day
Transport (y/n)	P7	Walks or cycles for transport
Transport days	P8	Days per week
Transport hours	P9a	Hours per day
Transport minutes	P9b	Minutes per day
Vigorous recreation (y/n)	P10	Does vigorous recreational activity
Vigorous recreation days	P11	Days per week
Vigorous recreation hours	P12a	Hours per day
Vigorous recreation minutes	P12b	Minutes per day
Moderate recreation (y/n)	P13	Does moderate recreational activity
Moderate recreation days	P14	Days per week
Moderate recreation hours	P15a	Hours per day
Moderate recreation minutes	P15b	Minutes per day
Sedentary hours	P16a	Sitting time, hours per day
Sedentary minutes	P16b	Sitting time, minutes per day

The package computes MET-minutes/week from these raw items using WHO MET multipliers: vigorous activities × 8 MET, moderate and transport activities × 4 MET. The insufficient_pa indicator (< 600 MET-minutes/week) is then derived automatically.

If your dataset already has a pre-computed met_total variable, the package uses it directly instead of calculating from raw items.
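
The calculation described above can be sketched in base R as follows. This is an illustration rather than the package's own code: the helper names `mins_per_day()` and `domain_met()` are hypothetical, the column names assume the v3.2 codes in lower case, and treating a missing item as zero is a simplifying assumption.

```r
# Minutes per day from an hours column and a minutes column
# (assumption: a missing value counts as 0).
mins_per_day <- function(hours, minutes) {
  ifelse(is.na(hours), 0, hours) * 60 + ifelse(is.na(minutes), 0, minutes)
}

# MET-minutes/week for one domain: days x minutes/day x MET multiplier.
domain_met <- function(days, hours, minutes, met) {
  ifelse(is.na(days), 0, days) * mins_per_day(hours, minutes) * met
}

# Two illustrative respondents.
d <- data.frame(
  p2  = c(3, NA), p3a  = c(1, NA), p3b  = c(0, NA),   # vigorous work
  p5  = c(0, 2),  p6a  = c(0, 0),  p6b  = c(0, 30),   # moderate work
  p8  = c(5, 5),  p9a  = c(0, 0),  p9b  = c(20, 10),  # transport
  p11 = c(0, 0),  p12a = c(0, 0),  p12b = c(0, 0),    # vigorous recreation
  p14 = c(1, 0),  p15a = c(1, 0),  p15b = c(30, 0)    # moderate recreation
)

d$met_total <- with(d,
  domain_met(p2,  p3a,  p3b,  8) +
  domain_met(p5,  p6a,  p6b,  4) +
  domain_met(p8,  p9a,  p9b,  4) +
  domain_met(p11, p12a, p12b, 8) +
  domain_met(p14, p15a, p15b, 4)
)
d$insufficient_pa <- d$met_total < 600

d[, c("met_total", "insufficient_pa")]
```

The first respondent reaches 1440 + 400 + 360 = 2200 MET-minutes/week and is classed as sufficiently active. The second reaches 240 + 200 = 440 and falls below the 600 threshold.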

Step 2: Physical measurements

Variable	v3.1 code	v3.2 code	Description
Height (cm)	M1	M11	Standing height
Weight (kg)	M2	M12	Body weight
Waist (cm)	M3	M14	Waist circumference
Hip (cm)	–	M15	Hip circumference
SBP reading 1	B1	M4a	First systolic BP
SBP reading 2	B3	M5a	Second systolic BP
SBP reading 3	B5	M6a	Third systolic BP
DBP reading 1	B2	M4b	First diastolic BP
DBP reading 2	B4	M5b	Second diastolic BP
DBP reading 3	B6	M6b	Third diastolic BP
BP medication	B7/H3	M7	Currently on antihypertensives
Heart rate 1	–	M16a	First heart rate reading
Heart rate 2	–	M16b	Second heart rate reading
Heart rate 3	–	M16c	Third heart rate reading

Blood pressure: The package averages the last two of three readings (WHO protocol). If only two readings are available, their average is used. If only one reading is available, it is used directly.
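
The averaging rule above can be written as a short base-R function. This is a sketch of the rule as described, not the package's implementation, and `mean_bp()` is a hypothetical name.

```r
# Mean BP per respondent:
#   3 readings -> mean of the last two
#   2 readings -> mean of both
#   1 reading  -> that reading
#   0 readings -> NA
mean_bp <- function(r1, r2, r3) {
  readings <- cbind(r1, r2, r3)
  apply(readings, 1, function(x) {
    x <- x[!is.na(x)]                      # keep the readings that exist
    if (length(x) == 0) return(NA_real_)
    mean(utils::tail(x, 2))                # last two, or fewer if not available
  })
}

sbp1 <- c(150, 142, 130, NA)
sbp2 <- c(140, 138, NA,  NA)
sbp3 <- c(138, NA,  NA,  NA)

mean_sbp <- mean_bp(sbp1, sbp2, sbp3)
mean_sbp   # 139 140 130 NA
```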

Step 3: Biochemical measurements

Variable	v3.1 code	v3.2 code	Description
Fasting glucose	C1	B5	Fasting blood glucose (mmol/L)
Diabetes meds	C5	B6/H8	Currently on diabetes medication
Total cholesterol	C6	B8	Total cholesterol (mmol/L)
Cholesterol meds	C10	B9/H14	Currently on cholesterol medication
HDL cholesterol	–	B17	HDL cholesterol (mmol/L)
Triglycerides	–	B16	Fasting triglycerides (mmol/L)

Health history (H-codes)

Variable	Code	Description
BP ever measured	H1	Ever had BP measured by health worker
BP diagnosed	H2a	Ever told by doctor that BP is raised
Glucose ever measured	H6	Ever had blood sugar measured
DM diagnosed	H7a	Ever told by doctor that blood sugar is raised
Cholesterol ever measured	H12	Ever had cholesterol measured
Cholesterol diagnosed	H13a	Ever told by doctor that cholesterol is raised
CVD history	H17	History of heart attack, angina, or stroke
Aspirin use	H18	Currently taking aspirin regularly
Statin use	H19	Currently taking statins regularly
Advised: quit tobacco	H20a	Doctor/health worker advised to quit tobacco
Advised: reduce salt	H20b	Advised to reduce salt intake
Advised: eat fruit/veg	H20c	Advised to eat more fruit/vegetables
Advised: reduce fat	H20d	Advised to reduce dietary fat
Advised: more PA	H20e	Advised to increase physical activity
Advised: healthy weight	H20f	Advised to maintain healthy body weight

How auto-detection works

When you call detect_steps_columns(data) (or upload a file in the Shiny app), the package searches for each expected variable using a prioritised alias list. For example, the fasting glucose variable is searched for as:

b5, b5_mmol, c1_mmol, fasting_glucose, glucose_fasting, fbg, fpg

The search is case-insensitive and uses the column names after janitor::clean_names() has standardised them.

For ambiguous codes (like A1 which means different things in v3.1 vs v3.2), the package also checks the SPSS variable label to disambiguate. This is why .sav files produce the most reliable auto-detection.
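
A prioritised alias search of this kind can be sketched in a few lines of base R. This is an illustration only: `find_column()` is a hypothetical helper, and `tolower()` stands in for the fuller name standardisation done by janitor::clean_names().

```r
# Return the column matched by the highest-priority alias,
# or NULL if no alias matches. Matching is case-insensitive.
find_column <- function(col_names, aliases) {
  cleaned <- tolower(col_names)
  hit <- match(tolower(aliases), cleaned)  # position of each alias, NA if absent
  hit <- hit[!is.na(hit)]                  # still in alias priority order
  if (length(hit) == 0) return(NULL)
  col_names[hit[1]]
}

glucose_aliases <- c("b5", "b5_mmol", "c1_mmol", "fasting_glucose",
                     "glucose_fasting", "fbg", "fpg")

find_column(c("PID", "C3", "FBG", "B5"), glucose_aliases)  # "B5" (b5 outranks fbg)
find_column(c("pid", "age", "sex"), glucose_aliases)       # NULL
```

Because the aliases are tried in order, a dataset containing both a coded column and a descriptively named one resolves to the coded column.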

After detection, you can inspect the mapping:

library(stepssurvey)

raw  <- import_steps_data("my_data.sav")
cols <- detect_steps_columns(raw)

# See all detected columns
str(cols[!sapply(cols, is.null)])

# See what was NOT detected
names(cols[sapply(cols, is.null)])

Common data issues and solutions

Issue: Sex coded as numeric without labels

Some datasets code sex as 1/2 without clear labels. The package handles this automatically using the WHO STEPS convention: 1 = Male, 2 = Female. If your data uses a different coding, recode before analysis or override the column.
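
As an example of recoding before analysis, the sketch below assumes a hypothetical dataset in which sex is coded 0 = Female and 1 = Male, and maps it onto the WHO STEPS convention. The column name `sex` and the source coding are assumptions for illustration.

```r
# Hypothetical dataset where sex is coded 0 = Female, 1 = Male.
raw <- data.frame(sex = c(0, 1, 1, 0, NA))

# Map onto the WHO STEPS convention (1 = Male, 2 = Female);
# anything else, including NA, stays NA.
raw$sex <- ifelse(raw$sex == 1, 1, ifelse(raw$sex == 0, 2, NA))

table(raw$sex, useNA = "ifany")
```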

Issue: Yes/No variables coded inconsistently

STEPS datasets use various codings for binary variables: 1/2 (yes/no), 0/1, “Yes”/“No”, “Y”/“N”. The recode_yn() function handles all of these automatically. It treats 1 = Yes and 2 = No (the WHO convention), as well as 0/1 where 1 = Yes.
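
To make the mapping concrete, here is a minimal base-R sketch of such a recoder. It is not the package's recode_yn(); `yn_to_logical()` is a hypothetical helper that follows the codings listed above and returns NA for any other value.

```r
# Illustrative recoder: TRUE for yes, FALSE for no, NA for anything else.
yn_to_logical <- function(x) {
  x <- tolower(trimws(as.character(x)))
  out <- rep(NA, length(x))
  out[x %in% c("1", "yes", "y")] <- TRUE
  out[x %in% c("2", "0", "no", "n")] <- FALSE
  out
}

yn_to_logical(c(1, 2, 1, NA))             # TRUE FALSE TRUE NA
yn_to_logical(c(0, 1))                    # FALSE TRUE
yn_to_logical(c("Yes", "no", "Y", "77"))  # TRUE FALSE TRUE NA
```

Both numeric schemes agree that 1 means yes, which is why 1/2 and 0/1 data can share one rule.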

Issue: Glucose values in mg/dL instead of mmol/L

The package expects biochemical values in mmol/L (the WHO standard). If your data uses mg/dL, convert before analysis:

# Glucose: mg/dL to mmol/L (divide by 18)
raw$b5 <- raw$b5 / 18

# Cholesterol: mg/dL to mmol/L (divide by 38.67)
raw$b8 <- raw$b8 / 38.67

Issue: Multiple datasets for different STEPS steps

Some surveys store Step 1 (interview), Step 2 (measurements), and Step 3 (blood tests) in separate files. Merge them by respondent ID before importing:

step1 <- haven::read_spss("step1_interview.sav")
step2 <- haven::read_spss("step2_measurements.sav")
step3 <- haven::read_spss("step3_biochemistry.sav")

combined <- dplyr::left_join(step1, step2, by = "pid") |>
  dplyr::left_join(step3, by = "pid")

# Save combined file
haven::write_sav(combined, file.path(tempdir(), "steps_combined.sav"))

# Or skip the file and use the combined data frame directly
raw <- combined |> janitor::clean_names()
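
Before joining, it can help to check the ID column in each file, because duplicate IDs multiply rows and unmatched IDs produce NA values. The sketch below uses toy data frames in place of the three files and assumes the respondent ID column is called `pid`, as in the merge above.

```r
# Toy stand-ins for the three files.
step1 <- data.frame(pid = c(1, 2, 3, 4), t1  = c(1, 2, 2, 1))
step2 <- data.frame(pid = c(1, 2, 3),    m11 = c(160, 172, 168))
step3 <- data.frame(pid = c(1, 3, 3),    b5  = c(5.1, 6.4, 6.6))

# IDs that occur more than once in each file.
dup_ids <- lapply(list(step1 = step1, step2 = step2, step3 = step3),
                  function(d) unique(d$pid[duplicated(d$pid)]))
dup_ids$step3                  # 3

# Step 1 respondents with no record in a later step.
setdiff(step1$pid, step2$pid)  # 4
setdiff(step1$pid, step3$pid)  # 2 4
```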

Issue: Missing sampling weights

If your dataset does not include sampling weights, the package proceeds with equal weights (equivalent to assuming a simple random sample). Point estimates are then unbiased only if every respondent had the same probability of selection, and confidence intervals will not reflect the true survey design.

For proper analysis, you need at minimum one weight variable. The WHO STEPS toolkit computes three step-specific weights (WStep1, WStep2, WStep3) that account for non-response at each step. Contact your survey statistician if weights are not in the data file.
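
A small base-R illustration of why the weights matter, using made-up numbers. The data frame and the column names `smoker`, `area`, and `wt` are assumptions for this example only. Urban respondents are oversampled here, so each one represents fewer people than a rural respondent does.

```r
# Illustrative data: 6 urban respondents (weight 100 each) and
# 2 rural respondents (weight 900 each).
d <- data.frame(
  smoker = c(1, 1, 0, 0, 0, 0, 1, 0),
  area   = c(rep("urban", 6), rep("rural", 2)),
  wt     = c(rep(100, 6), rep(900, 2))
)

unweighted <- mean(d$smoker)                 # every row counts equally
weighted   <- weighted.mean(d$smoker, d$wt)  # every row counts for wt people

round(c(unweighted = unweighted, weighted = weighted), 3)  # 0.375 0.458
```

The two estimates differ because smoking prevalence differs between the oversampled and undersampled groups. This sketch covers point estimates only and does not address variance estimation.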

Issue: Variables detected as wrong type

If a numeric variable is stored as character (common with Epi Info exports), the cleaning step will attempt to convert it. If conversion fails, the variable is set to NA with a warning message. Check the console for messages like “NAs introduced by coercion”.
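
To find out which entries cause the coercion warning, compare the original text with the converted values. The sketch below uses a made-up height column; the variable names are illustrative.

```r
# A height column exported as text, with a stray unit and a decimal comma.
height_chr <- c("162", "170.5", "158 cm", "165,2", NA)

# Convert, silencing the "NAs introduced by coercion" warning.
height_num <- suppressWarnings(as.numeric(height_chr))

# Entries that were present but could not be converted.
failed <- !is.na(height_chr) & is.na(height_num)
height_chr[failed]   # "158 cm" "165,2"
```

Fixing these entries in the source file, or with a targeted replacement before import, avoids losing them as NA.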

Pre-flight checklist

Before running the analysis, verify these items:

File format: Preferably .sav (preserves variable labels for better auto-detection). CSV works but may require more manual column mapping.
One row per respondent: The data should be in wide format with one row per survey participant and columns for each variable.
Age and sex present: These are the only truly required variables. Verify they contain reasonable values (age should be numeric, sex should have exactly two levels).
Weights present: Check for columns named WStep1, WStep2, WStep3, or similar. If using a single weight, ensure it maps to weight_step1.
Biochemical units: Glucose should be in mmol/L (typical values 3–15), cholesterol in mmol/L (typical values 2–10). Values around 100 or higher suggest mg/dL units that need conversion.
Blood pressure readings: Should be in mmHg. The plausible SBP range is 80–250 and the plausible DBP range is 40–150. Values outside these ranges are set to NA during cleaning.
No duplicate respondent IDs: If the same person appears multiple times, prevalence estimates will be biased.
Consistent coding: Binary variables should use the same coding scheme throughout (ideally 1 = Yes, 2 = No per WHO convention).
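
Several of these checks can be scripted. The following is a minimal base-R sketch on toy data; the column names `pid`, `b5`, and `m4a` are assumed (v3.2 codes in lower case), and the glucose cut-off of 30 is an illustrative choice rather than a package rule.

```r
# Toy data standing in for an imported file.
raw <- data.frame(
  pid = c(1, 2, 3, 3),
  b5  = c(5.4, 98, 6.1, 110),   # fasting glucose
  m4a = c(128, 300, 75, 142)    # first systolic reading
)

# Duplicate respondent IDs
dup <- raw$pid[duplicated(raw$pid)]
dup                                          # 3

# Glucose far above the typical mmol/L range suggests mg/dL
mgdl_rows <- which(raw$b5 > 30)
mgdl_rows                                    # 2 4

# Systolic readings outside 80-250 mmHg
bad_sbp <- which(raw$m4a < 80 | raw$m4a > 250)
bad_sbp                                      # 2 3
```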

Quick diagnostic script

Run this after importing to check data quality:

library(stepssurvey)

raw  <- import_steps_data("my_data.sav")
cols <- detect_steps_columns(raw)

# Summary
cat("Rows:", nrow(raw), "\n")
cat("Columns:", ncol(raw), "\n")
cat("Detected:", sum(!sapply(cols, is.null)), "/", length(cols), "\n")

# Check key variables
if (!is.null(cols$age)) {
  cat("\nAge range:", range(raw[[cols$age]], na.rm = TRUE), "\n")
  cat("Age NAs:", sum(is.na(raw[[cols$age]])), "\n")
}
if (!is.null(cols$sex)) {
  cat("\nSex distribution:\n")
  print(table(raw[[cols$sex]], useNA = "ifany"))
}
if (!is.null(cols$weight_step1)) {
  wt <- raw[[cols$weight_step1]]
  cat("\nWeight range:", round(range(wt, na.rm = TRUE), 3), "\n")
  cat("Weight NAs:", sum(is.na(wt)), "\n")
}

# List undetected variables (named `undetected` to avoid masking base::missing)
undetected <- names(cols[sapply(cols, is.null)])
cat("\nUndetected variables (", length(undetected), "):\n")
cat(paste(" ", undetected, collapse = "\n"), "\n")
