The Survey of Consumer Finances (SCF) is a triennial survey of U.S. household finances conducted by the Federal Reserve Board. It is among the most detailed and methodologically sophisticated data sources on U.S. households’ personal finances.
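To ensure valid estimation and inference, the SCF incorporates two key methodological features: replicate weights, which support variance estimation under the survey's complex sample design, and multiple imputation, which represents missing values through several completed copies of the data (implicates).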
These design features demand appropriate statistical handling. Analysts unfamiliar with replicate weighting and imputation pooling may inadvertently produce biased or misleading results. In practice, these barriers have discouraged even quantitatively competent researchers from working directly with SCF microdata.
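As a rough illustration of what replicate weighting involves, the base R sketch below estimates the standard error of a weighted mean from a set of bootstrap-style replicate weights. Everything in it is an assumption made for illustration: the data are simulated, the replicate weights are built by resampling rather than taken from the SCF, and the variance formula is one common convention that may differ from what the scf package applies.

```r
# Toy data: a skewed "wealth" variable and main sampling weights (both simulated).
set.seed(42)
n <- 500   # households in the toy sample
R <- 200   # number of hypothetical replicate weights

x <- rlnorm(n, meanlog = 11, sdlog = 1.5)
w <- runif(n, 50, 500)

# Hypothetical replicate weights: main weight times a bootstrap resampling count.
# The result is an n x R matrix, one column per replicate.
rep_w <- replicate(R, w * tabulate(sample.int(n, n, replace = TRUE), nbins = n))

# Point estimate uses the main weight; each replicate gives an alternative estimate.
theta_hat <- weighted.mean(x, w)
theta_rep <- apply(rep_w, 2, function(wr) weighted.mean(x, wr))

# Replicate-based standard error: spread of replicate estimates around the point estimate.
se_rep <- sqrt(mean((theta_rep - theta_hat)^2))

c(estimate = theta_hat, se = se_rep)
```

Ignoring the replicate weights and treating the data as a simple random sample would generally give a different, and usually too small, standard error.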
The scf package aims to reduce this friction. It provides a structured and reproducible R interface for downloading, transforming, and analyzing SCF data using methods appropriate to its design. The package handles replicate weights and Rubin’s Rules transparently and consistently across descriptive statistics, hypothesis testing, regression modeling, and visualization.
This vignette introduces the core analytic workflow supported by the package. For detailed methodological discussion, see Cohen (2025a).
Download raw SCF data and load it into a valid multiply-imputed survey object using scf_download() and scf_load(). The result is an scf_mi_survey object that contains replicate-weighted survey designs for each implicate.
# Using the mock data distributed with the package
td <- tempdir()
src <- system.file("extdata", "scf2022_mock_raw.rds", package = "scf")
file.copy(src, file.path(td, "scf2022.rds"), overwrite = TRUE)
#> [1] TRUE
scf2022 <- scf_load(2022, data_directory = td)
# Using real SCF data (uncomment to run)
# scf2022 <- scf_download(2022)
# scf2022 <- scf_load(scf2022)
Before taking logs, bottom-code income and net worth at $1. Otherwise log(0) returns -Inf and the log of a negative value (net worth can be negative) returns NaN. The scf_update() function safely adds or modifies variables across all implicates.
scf2022 <- scf_update(scf2022,
senior = age >= 65,
female = factor(hhsex, levels = 1:2, labels = c("Male", "Female")),
rich = networth > 1e6,
networth = ifelse(networth > 1, networth, 1),
log_networth = log(networth),
income = ifelse(income > 1, income, 1),
log_income = log(income),
npeople = x101
)
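The effect of the bottom-coding step can be checked in base R without any SCF data. The vector below is made up for illustration; pmax() is used here as an equivalent alternative to the ifelse() calls above.

```r
# Made-up values covering the problem cases: negative, zero, between 0 and 1, and ordinary.
x <- c(-5000, 0, 0.5, 1, 250000)

# Without bottom-coding: NaN for the negative value, -Inf for zero.
raw_log <- suppressWarnings(log(x))

# Bottom-code at 1, then take logs: every result is finite and non-negative.
floored  <- pmax(x, 1)
safe_log <- log(floored)

data.frame(x, raw_log, floored, safe_log)
```

Note that bottom-coding changes the variable itself, so statistics computed on networth or income after this step describe the floored values.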
Use names(scf2022$mi_design[[1]]$variables) to inspect variables.
Use scf_mean(), scf_median(), and scf_percentile() to calculate pooled estimates with Rubin’s Rules. Use by = for grouped statistics.
scf_mean(scf2022, ~networth, by = ~senior)
#> Multiply-Imputed, Replicate-Weighted Mean Estimate
#>
#> group variable estimate se min max
#> FALSE networth 809777.2 23912.24 771184 821626.9
#> TRUE networth 1595858.8 203948.10 1355046 1812726.0
scf_median(scf2022, ~income, by = ~female)
#> Multiply-Imputed Median Estimate
#>
#> group variable quantile estimate se min max
#> Female income 0.5 49721.94 0.0000 49721.94 49721.94
#> Male income 0.5 85824.40 592.0398 85392.03 86472.95
scf_percentile(scf2022, ~networth, q = 0.9)
#> Multiply-Imputed Percentile Estimate
#>
#> variable quantile estimate se min max
#> networth 0.9 1197722 116410.8 1039600 1360000
scf_percentile(scf2022, ~networth, q = 0.75, by = ~female)
#> Multiply-Imputed Percentile Estimate
#>
#> group variable quantile estimate se min max
#> Female networth 0.75 238504 1842.52 237680 241800
#> Male networth 0.75 642660 31637.37 623800 698700
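To make the idea of a weighted percentile concrete, here is a minimal base R sketch of a weighted quantile within a single dataset. The helper weighted_quantile() is hypothetical and not part of the scf package; it uses the simplest "first value whose cumulative weight share reaches q" rule, whereas survey software typically offers several interpolation rules, so its results need not match scf_percentile().

```r
# Hypothetical helper: smallest x whose cumulative share of weight is at least q.
weighted_quantile <- function(x, w, q) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= q)[1]]
}

values <- c(1, 2, 3, 4)

# Equal weights: the median falls at the second value.
weighted_quantile(values, w = c(1, 1, 1, 1), q = 0.5)

# Heavy weight on the largest value pulls the median up to it.
weighted_quantile(values, w = c(1, 1, 1, 5), q = 0.5)
```

In a multiply-imputed setting, a calculation like this would be repeated within each implicate and the results pooled.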
Conduct t-tests and proportion tests on pooled SCF data. These tests return interpretable outputs with correct degrees of freedom and pooled standard errors.
scf_ttest(scf2022, ~networth, mu = 250000)
#> SCF One-sample t-test
#> Alternative hypothesis: mean is not equal to 250000
#>
#> Estimate: 1000482.40
#> Standard Error: 153600.47
#> t = 4.89, df = 744.0, p = 0.0000 ***
#> CI (95%): [698940.49, 1302024.31]
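The one-sample output above can be checked by hand. The sketch below plugs the reported estimate, standard error, and degrees of freedom into the usual t-statistic and confidence-interval formulas; it assumes the package builds its interval from a t distribution with the reported df, which the printed numbers are consistent with.

```r
# Values copied from the one-sample output above.
estimate <- 1000482.40
se       <- 153600.47
mu0      <- 250000
df       <- 744

# t statistic and two-sided p-value.
t_stat  <- (estimate - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval based on the t distribution.
ci <- estimate + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```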
scf_ttest(scf2022, ~networth, group = ~senior)
#> SCF Two-sample t-test
#> Alternative hypothesis: mean is not equal to 0
#>
#> Group means:
#> group mean
#> FALSE 809777.2
#> TRUE 1595858.8
#>
#> Estimate: -786081.60
#> Standard Error: 507708.12
#> t = -1.55, df = 114.2, p = 0.1243
#> CI (95%): [-1791827.18, 219663.98]
scf_prop_test(scf2022, ~senior, p = 0.25)
#>
#> One-sample proportion test
#> Null hypothesis: proportion = 0.25
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#>
#> estimate std.error z.value p.value conf.low conf.high stars
#> 0.2423 0.0022 -3.5822 3e-04 0.2381 0.2465 ***
scf_prop_test(scf2022, ~rich, ~female)
#>
#> Two-sample proportion test
#> Null hypothesis: proportion difference = 0.5
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#>
#> estimate std.error z.value p.value conf.low conf.high stars
#> 0.1626 0.0253 -13.3593 0 0.1131 0.2121 ***
#>
#> Estimated group proportions:
#> group proportion
#> Male 0.1842
#> Female 0.0216
Fit linear or generalized linear models with Rubin-aware pooling. Logistic models can return odds ratios if requested.
scf_ols(scf2022, networth ~ age + log_income)
#> OLS Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) -22709564 3541475 -6.412 1.134e-09 ***
#> age 35813 8843 4.050 6.274e-05 ***
#> log_income 1965105 300888 6.531 7.193e-10 ***
#>
#> Model Fit Statistics:
#> Mean R-squared: 0.1389 (SD: 0.0156)
#> Mean AIC: 2251.665 (SD: 3295.433)
#>
#> Note: Implicate-level model objects are stored in `object$imps`
#> Use `summary(object$imps[[1]])` to inspect them.
scf_logit(scf2022, rich ~ age + log_income)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) 0.0000 0.0000 -4.0575 <2e-16 ***
#> age 1.0964 0.0409 2.4634 0.0138 *
#> log_income 14.1789 9.3306 4.0297 0.0001 ***
#>
#> Model Fit Diagnostics:
#> Pseudo R-squared: 0.5069
#> Mean AIC: 28.381
#>
#> Notes:
#> - Estimates are reported on the Odds Ratio scale.
#> - Implicate-level models are stored in `object$imps`
scf_logit(scf2022, rich ~ age + log_income, odds = TRUE)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error t.value p.value stars
#> (Intercept) 0.0000 0.0000 -4.0575 <2e-16 ***
#> age 1.0964 0.0409 2.4634 0.0138 *
#> log_income 14.1789 9.3306 4.0297 0.0001 ***
#>
#> Model Fit Diagnostics:
#> Pseudo R-squared: 0.5069
#> Mean AIC: 28.381
#>
#> Notes:
#> - Estimates are reported on the Odds Ratio scale.
#> - Implicate-level models are stored in `object$imps`
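Estimates on the odds ratio scale are exponentiated logit coefficients, so they combine multiplicatively. The short sketch below uses the age estimate from the mock-data output above to show the conversion; the ten-year comparison is an illustrative choice, not something the package reports.

```r
# Odds ratio for age, copied from the output above (mock data).
or_age <- 1.0964

# Back on the log-odds (coefficient) scale.
log_odds_age <- log(or_age)

# Percent change in the odds of rich == TRUE per additional year of age,
# holding log_income constant.
pct_change <- (or_age - 1) * 100

# Odds ratio for a ten-year age difference: add on the log scale, then exponentiate.
or_10yr <- exp(10 * log_odds_age)

c(log_odds = log_odds_age, pct_change = pct_change, or_10yr = or_10yr)
```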
scf_glm(scf2022, own ~ age , family = binomial())
#> Generalized Linear Model (Multiply-Imputed SCF)
#> --------------------------------------------------
#> term estimate std.error z.value p.value stars
#> (Intercept) 1.4508 0.6710 2.1621 0.03061 *
#> age 0.0070 0.0127 0.5494 0.58270
#>
#> Model Fit Diagnostics:
#> Pseudo R-squared: 0.002 (SD: 0.000)
#> Mean AIC: 56 (SD: 80)
#>
#> Note: Model fit pooled across implicates via Rubin's Rules.
#> Inspect individual models via `object$models[[i]]`.
Note on Warnings
When running logistic regression with scf_logit() or other functions that use family = binomial(), you may see warnings like:
`Warning: non-integer #successes in a binomial glm!`
This warning is harmless. It appears because survey::svyglm() fits the model with survey weights, and non-integer weights produce fractional counts of successes. The coefficient estimates are unaffected. For more background and discussion, see Stack Overflow thread.
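The warning can be reproduced outside the survey setting with base R's glm() and fractional weights. The sketch below uses simulated data and shows that the binomial and quasibinomial families give the same coefficients, with only the former triggering the warning. Whether the scf functions accept a quasibinomial family is not confirmed here.

```r
# Simulated binary outcome with fractional (non-integer) weights.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
w <- runif(n, 0.5, 3)

# Fit with binomial(): capture warnings instead of printing them.
msgs <- character(0)
fit_binom <- withCallingHandlers(
  glm(y ~ x, family = binomial(), weights = w),
  warning = function(cond) {
    msgs <<- c(msgs, conditionMessage(cond))
    invokeRestart("muffleWarning")
  }
)
msgs  # contains the "non-integer #successes" message

# quasibinomial() skips the integer check but fits the same mean model.
fit_quasi <- glm(y ~ x, family = quasibinomial(), weights = w)

all.equal(coef(fit_binom), coef(fit_quasi))
```

The two fits differ only in how the dispersion is handled, which is why the point estimates agree.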
Produce publication-quality plots using multiply-imputed data. All visuals account for weights and imputations.
Use scf_implicates() to inspect individual implicate estimates for sensitivity analysis.
freq_table <- scf_freq(scf2022, ~rich)
scf_implicates(freq_table, long = TRUE)
#> implicate group category est var estimate se
#> richFALSE 1 NA FALSE 0.8730810 0.0004432830 0.8730810 0.02105429
#> richTRUE 1 NA TRUE 0.1269190 0.0004432830 0.1269190 0.02105429
#> richFALSE1 2 NA FALSE 0.8531922 0.0005488627 0.8531922 0.02342782
#> richTRUE1 2 NA TRUE 0.1468078 0.0005488627 0.1468078 0.02342782
#> richFALSE2 3 NA FALSE 0.8725839 0.0004395554 0.8725839 0.02096558
#> richTRUE2 3 NA TRUE 0.1274161 0.0004395554 0.1274161 0.02096558
#> richFALSE3 4 NA FALSE 0.8794327 0.0003846508 0.8794327 0.01961252
#> richTRUE3 4 NA TRUE 0.1205673 0.0003846508 0.1205673 0.01961252
#> richFALSE4 5 NA FALSE 0.8827906 0.0004057971 0.8827906 0.02014441
#> richTRUE4 5 NA TRUE 0.1172094 0.0004057971 0.1172094 0.02014441
#> lower upper cv
#> richFALSE 0.83181461 0.9143474 0.02411493
#> richTRUE 0.08565258 0.1681854 0.16588761
#> richFALSE1 0.80727369 0.8991107 0.02745902
#> richTRUE1 0.10088926 0.1927263 0.15958158
#> richFALSE2 0.83149137 0.9136764 0.02402700
#> richTRUE2 0.08632357 0.1685086 0.16454418
#> richFALSE3 0.84099216 0.9178732 0.02230133
#> richTRUE3 0.08212678 0.1590078 0.16266860
#> richFALSE4 0.84330761 0.9222737 0.02281901
#> richTRUE4 0.07772632 0.1566924 0.17186689
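The implicate-level rows above are the inputs that Rubin's Rules combine. As a minimal sketch, the code below pools the five richTRUE estimates and variances from that output using the textbook formulas. The package may apply a different convention for the within-implicate variance, so this result is not guaranteed to match what scf_freq() reports.

```r
# Implicate-level estimates and variances for richTRUE, copied from the output above.
est <- c(0.1269190, 0.1468078, 0.1274161, 0.1205673, 0.1172094)
v   <- c(0.0004432830, 0.0005488627, 0.0004395554, 0.0003846508, 0.0004057971)
m   <- length(est)

q_bar <- mean(est)   # pooled point estimate
u_bar <- mean(v)     # average within-implicate variance
b     <- var(est)    # between-implicate variance

# Total variance under Rubin's Rules, and the pooled standard error.
total_var <- u_bar + (1 + 1 / m) * b
pooled_se <- sqrt(total_var)

c(estimate = q_bar, within = u_bar, between = b, se = pooled_se)
```

The between-implicate term is what a single-implicate analysis leaves out, which is why analyzing only one implicate tends to understate uncertainty.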
For more details on the SCF methodology and the scf package, see Cohen (2025a).