Analyzing the Survey of Consumer Finances

Introduction

The Survey of Consumer Finances (SCF) is a triennial survey of U.S. household finances conducted by the Federal Reserve Board. It is among the most detailed and methodologically sophisticated data sources on U.S. households’ personal finances.

To ensure valid estimation and inference, the SCF incorporates two key methodological features:

  1. Complex Survey Design: The SCF uses a dual-frame design that combines a geographic national sample with a list sample of wealthy people selected from IRS records. Each implicate includes 999 replicate weights constructed via balanced repeated replication (BRR), which enable design-consistent variance estimation.
  2. Multiple Imputation: The SCF addresses item nonresponse through multiple imputation. Each release includes five implicates—plausible, complete versions of the dataset with different imputed values for missing items.
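
The replicate-weight idea in the first item can be sketched in base R. The example below is a minimal illustration on simulated data, not SCF data and not the package's internal code: the variable names (`y`, `w`, `rep_w`) are hypothetical, and the replicate weights are generated with a simple bootstrap as a stand-in for the SCF's actual replicate-weight construction. The point is only the mechanics: re-estimate the statistic once per replicate weight, then summarize how much the replicate estimates spread around the full-sample estimate.

```r
# Simulated data: values and weights are illustrative, not SCF data.
set.seed(42)
n     <- 500
n_rep <- 999
y <- rlnorm(n, meanlog = 11, sdlog = 1.5)  # hypothetical net worth
w <- runif(n, 50, 150)                     # hypothetical main weight

# Stand-in replicate weights: resample households with replacement and
# scale each main weight by the number of times the household was drawn.
rep_w <- replicate(
  n_rep,
  w * tabulate(sample.int(n, n, replace = TRUE), nbins = n)
)

# Full-sample estimate and one estimate per replicate weight
theta_hat <- weighted.mean(y, w)
theta_rep <- apply(rep_w, 2, function(wr) weighted.mean(y, wr))

# Replicate variance: mean squared deviation from the full-sample estimate
var_rep <- mean((theta_rep - theta_hat)^2)
se_rep  <- sqrt(var_rep)

c(estimate = theta_hat, se = se_rep)
```

With five implicates, this calculation is repeated within each implicate, and the five results are then combined with Rubin's Rules.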

These design features demand appropriate statistical handling. Analysts unfamiliar with replicate weighting and imputation pooling may inadvertently produce biased or misleading results. In practice, these barriers have discouraged even quantitatively competent researchers from working directly with SCF microdata.

The scf package aims to reduce this friction. It provides a structured and reproducible R interface for downloading, transforming, and analyzing SCF data using methods appropriate to its design. The package handles replicate weights and Rubin’s Rules transparently and consistently across descriptive statistics, hypothesis testing, regression modeling, and visualization.

This vignette introduces the core analytic workflow supported by the package. For detailed methodological discussion, see Cohen (2025a).

Workflow

1. Downloading and Loading the Data

Download raw SCF data and load it into a valid multiply-imputed survey object using scf_download() and scf_load(). The result is an scf_mi_survey object that contains replicate-weighted survey designs for each implicate.

# Using the mock data distributed with the package
td  <- tempdir()
src <- system.file("extdata", "scf2022_mock_raw.rds", package = "scf")
file.copy(src, file.path(td, "scf2022.rds"), overwrite = TRUE)
#> [1] TRUE
scf2022 <- scf_load(2022, data_directory = td)

# Using real SCF data (uncomment to run)
# scf2022 <- scf_download(2022)
# scf2022 <- scf_load(scf2022)

2. Creating and Transforming Variables

Before taking logarithms, bottom-code income and net worth at $1. Otherwise log() returns -Inf for zero values and NaN for negative values, both of which break downstream estimates. The scf_update() function safely adds or modifies variables across all implicates.


scf2022 <- scf_update(scf2022,
  senior = age >= 65,
  female = factor(hhsex, levels = 1:2, labels = c("Male", "Female")),
  rich = networth > 1e6,
  networth = ifelse(networth > 1, networth, 1),
  log_networth = log(networth),
  income = ifelse(income > 1, income, 1),
  log_income = log(income),
  npeople = x101
)

Use names(scf2022$mi_design[[1]]$variables) to inspect variables.
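
To see why bottom-coding matters before a log transform, the base-R sketch below applies log() to a small hypothetical vector (the values are made up for illustration). It also shows that pmax(x, 1) gives the same result as the ifelse(x > 1, x, 1) pattern used above.

```r
# Hypothetical net worth values, including negative and zero amounts
networth <- c(-25000, 0, 0.5, 1, 350000)

# Without bottom-coding: NaN for the negative value, -Inf for zero
raw_log <- suppressWarnings(log(networth))
raw_log

# Bottom-code at 1, then take logs: every value is now finite
bottom_coded <- pmax(networth, 1)
log_networth <- log(bottom_coded)
log_networth

# Same result as the ifelse() form used in scf_update() above
identical(bottom_coded, ifelse(networth > 1, networth, 1))
```

Note that bottom-coding collapses all debts and zero balances to the same value, so the logged variable no longer distinguishes between them.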

3. Univariate and Bivariate Distributions

Use scf_mean(), scf_median(), and scf_percentile() to calculate pooled estimates with Rubin’s Rules. Use by = for grouped statistics.

scf_mean(scf2022, ~networth, by = ~senior)
#> Multiply-Imputed, Replicate-Weighted Mean Estimate
#> 
#>  group variable  estimate        se     min       max
#>  FALSE networth  809777.2  23912.24  771184  821626.9
#>   TRUE networth 1595858.8 203948.10 1355046 1812726.0
scf_median(scf2022, ~income, by = ~female)
#> Multiply-Imputed Median Estimate
#> 
#>   group variable quantile estimate       se      min      max
#>  Female   income      0.5 49721.94   0.0000 49721.94 49721.94
#>    Male   income      0.5 85824.40 592.0398 85392.03 86472.95
scf_percentile(scf2022, ~networth, q = 0.9)
#> Multiply-Imputed Percentile Estimate
#> 
#>  variable quantile estimate       se     min     max
#>  networth      0.9  1197722 116410.8 1039600 1360000
scf_percentile(scf2022, ~networth, q = 0.75, by = ~female)
#> Multiply-Imputed Percentile Estimate
#> 
#>   group variable quantile estimate       se    min    max
#>  Female networth     0.75   238504  1842.52 237680 241800
#>    Male networth     0.75   642660 31637.37 623800 698700
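
For intuition about what a weighted percentile is, the following base-R sketch implements the simplest definition: sort the values, accumulate weight shares, and return the first value whose cumulative share reaches q. The helper name `weighted_quantile` is hypothetical, and this is not necessarily the interpolation rule that scf_percentile() or the survey package applies.

```r
# Simplest weighted quantile: first value whose cumulative weight share >= q
weighted_quantile <- function(x, w, q) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= q)[1]]
}

x <- c(10, 20, 30, 40, 50)
w <- c(1, 1, 1, 1, 6)  # the largest value represents six households

weighted_quantile(x, w, 0.5)          # weighted median
weighted_quantile(x, rep(1, 5), 0.5)  # equal weights, for comparison
```

The heavily weighted observation pulls the weighted median to the top of the range, which is why ignoring the SCF weights can badly distort distributional statistics.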

4. Hypothesis Tests

Conduct t-tests and proportion tests on pooled SCF data. These tests return interpretable outputs with correct degrees of freedom and pooled standard errors.

scf_ttest(scf2022, ~networth, mu = 250000)
#> SCF One-sample t-test 
#> Alternative hypothesis: mean is not equal to 250000 
#> 
#> Estimate: 1000482.40 
#> Standard Error: 153600.47 
#> t = 4.89, df = 744.0, p = 0.0000 *** 
#> CI (95%): [698940.49, 1302024.31]
scf_ttest(scf2022, ~networth, group = ~senior)
#> SCF Two-sample t-test 
#> Alternative hypothesis: mean is not equal to 0 
#> 
#> Group means:
#>  group      mean
#>  FALSE  809777.2
#>   TRUE 1595858.8
#> 
#> Estimate: -786081.60 
#> Standard Error: 507708.12 
#> t = -1.55, df = 114.2, p = 0.1243  
#> CI (95%): [-1791827.18, 219663.98]
scf_prop_test(scf2022, ~senior, p = 0.25)
#> 
#> One-sample proportion test
#> Null hypothesis: proportion = 0.25
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#> 
#>  estimate std.error z.value p.value conf.low conf.high stars
#>    0.2423    0.0022 -3.5822   3e-04   0.2381    0.2465   ***
scf_prop_test(scf2022, ~rich, ~female)
#> 
#> Two-sample proportion test
#> Null hypothesis: proportion difference = 0.5
#> Alternative hypothesis: two.sided
#> Confidence level: 95%
#> 
#>  estimate std.error  z.value p.value conf.low conf.high stars
#>    0.1626    0.0253 -13.3593       0   0.1131    0.2121   ***
#> 
#> Estimated group proportions:
#>   group proportion
#>    Male     0.1842
#>  Female     0.0216
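
The one-sample t-test output above can be checked by hand from the pooled estimate, its standard error, and the reported degrees of freedom. The sketch below uses only the numbers printed in that output and assumes a standard two-sided t-test; small differences in the last digits are expected because the printed values are rounded.

```r
# Values copied from the one-sample scf_ttest() output above
estimate <- 1000482.40
se       <- 153600.47
mu0      <- 250000
df       <- 744

# Test statistic, two-sided p-value, and 95% confidence interval
t_stat  <- (estimate - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- estimate + c(-1, 1) * qt(0.975, df) * se

round(t_stat, 2)
p_value
round(ci, 2)
```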

5. Regression Modeling

Fit linear or generalized linear models with Rubin-aware pooling. Logistic models can return odds ratios if requested.

scf_ols(scf2022, networth ~ age + log_income)
#> OLS Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term  estimate std.error t.value   p.value stars
#>  (Intercept) -22709564   3541475  -6.412 1.134e-09   ***
#>          age     35813      8843   4.050 6.274e-05   ***
#>   log_income   1965105    300888   6.531 7.193e-10   ***
#> 
#> Model Fit Statistics:
#>   Mean R-squared: 0.1389 (SD: 0.0156)
#>   Mean AIC:       2251.665 (SD: 3295.433)
#> 
#> Note: Implicate-level model objects are stored in `object$imps`
#>       Use `summary(object$imps[[1]])` to inspect them.
scf_logit(scf2022, rich ~ age + log_income)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term estimate std.error t.value p.value stars
#>  (Intercept)   0.0000    0.0000 -4.0575  <2e-16   ***
#>          age   1.0964    0.0409  2.4634  0.0138     *
#>   log_income  14.1789    9.3306  4.0297  0.0001   ***
#> 
#> Model Fit Diagnostics:
#>   Pseudo R-squared:  0.5069 
#>   Mean AIC:          28.381 
#> 
#> Notes:
#>  - Estimates are reported on the Odds Ratio scale.
#>  - Implicate-level models are stored in `object$imps`
scf_logit(scf2022, rich ~ age + log_income, odds = TRUE)
#> Logistic Regression Results (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term estimate std.error t.value p.value stars
#>  (Intercept)   0.0000    0.0000 -4.0575  <2e-16   ***
#>          age   1.0964    0.0409  2.4634  0.0138     *
#>   log_income  14.1789    9.3306  4.0297  0.0001   ***
#> 
#> Model Fit Diagnostics:
#>   Pseudo R-squared:  0.5069 
#>   Mean AIC:          28.381 
#> 
#> Notes:
#>  - Estimates are reported on the Odds Ratio scale.
#>  - Implicate-level models are stored in `object$imps`
scf_glm(scf2022, own ~ age, family = binomial())
#> Generalized Linear Model (Multiply-Imputed SCF)
#> --------------------------------------------------
#>         term estimate std.error z.value p.value stars
#>  (Intercept)   1.4508    0.6710  2.1621 0.03061     *
#>          age   0.0070    0.0127  0.5494 0.58270      
#> 
#> Model Fit Diagnostics:
#>   Pseudo R-squared: 0.002 (SD: 0.000)
#>   Mean AIC:         56 (SD: 80)
#> 
#> Note: Model fit pooled across implicates via Rubin's Rules.
#>       Inspect individual models via `object$models[[i]]`.

Note on Warnings
When you run logistic regression with scf_logit(), or any other function that uses family = binomial(), you may see warnings like:

`Warning: non-integer #successes in a binomial glm!`

This warning is harmless. It appears because survey::svyglm() fits the model with non-integer survey weights, so the weighted counts of successes are fractional. The coefficient estimates are unaffected. For more background, see the Stack Overflow discussion of this warning.
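
The warning can be reproduced outside the package with base R's glm() and non-integer weights. The sketch below uses simulated data (all names and values are hypothetical) and shows two things: fitting with quasibinomial() avoids the warning while giving the same coefficients, and exponentiating logit coefficients yields odds ratios. It illustrates the general mechanism only and makes no claim about how scf_logit() is implemented.

```r
# Simulated data with non-integer weights (illustrative only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
w <- runif(n, 0.5, 3)

# binomial() warns about non-integer successes; record and muffle it
warned <- FALSE
fit_bin <- withCallingHandlers(
  glm(y ~ x, family = binomial(), weights = w),
  warning = function(cnd) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# quasibinomial() fits the same mean model without the warning
fit_quasi <- glm(y ~ x, family = quasibinomial(), weights = w)

warned
all.equal(coef(fit_bin), coef(fit_quasi))

# Odds ratios are the exponentiated logit coefficients
exp(coef(fit_quasi))
```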

6. Visualization

Produce publication-quality plots using multiply-imputed data. All visuals account for weights and imputations.

scf_plot_dbar(scf2022, ~senior)

scf_plot_bbar(scf2022, ~female, ~rich, scale = "percent")

scf_plot_cbar(scf2022, ~networth, ~edcl, stat = "median")

scf_plot_dist(scf2022, ~age, bins = 10) 

scf_plot_smooth(scf2022, ~age)

scf_plot_hex(scf2022, ~income, ~networth)

7. Inspecting Implicates and Pooled Objects

Use scf_implicates() to inspect individual implicate estimates for sensitivity analysis.

freq_table <- scf_freq(scf2022, ~rich)
scf_implicates(freq_table, long = TRUE)
#>            implicate group category       est          var  estimate         se
#> richFALSE          1    NA    FALSE 0.8730810 0.0004432830 0.8730810 0.02105429
#> richTRUE           1    NA     TRUE 0.1269190 0.0004432830 0.1269190 0.02105429
#> richFALSE1         2    NA    FALSE 0.8531922 0.0005488627 0.8531922 0.02342782
#> richTRUE1          2    NA     TRUE 0.1468078 0.0005488627 0.1468078 0.02342782
#> richFALSE2         3    NA    FALSE 0.8725839 0.0004395554 0.8725839 0.02096558
#> richTRUE2          3    NA     TRUE 0.1274161 0.0004395554 0.1274161 0.02096558
#> richFALSE3         4    NA    FALSE 0.8794327 0.0003846508 0.8794327 0.01961252
#> richTRUE3          4    NA     TRUE 0.1205673 0.0003846508 0.1205673 0.01961252
#> richFALSE4         5    NA    FALSE 0.8827906 0.0004057971 0.8827906 0.02014441
#> richTRUE4          5    NA     TRUE 0.1172094 0.0004057971 0.1172094 0.02014441
#>                 lower     upper         cv
#> richFALSE  0.83181461 0.9143474 0.02411493
#> richTRUE   0.08565258 0.1681854 0.16588761
#> richFALSE1 0.80727369 0.8991107 0.02745902
#> richTRUE1  0.10088926 0.1927263 0.15958158
#> richFALSE2 0.83149137 0.9136764 0.02402700
#> richTRUE2  0.08632357 0.1685086 0.16454418
#> richFALSE3 0.84099216 0.9178732 0.02230133
#> richTRUE3  0.08212678 0.1590078 0.16266860
#> richFALSE4 0.84330761 0.9222737 0.02281901
#> richTRUE4  0.07772632 0.1566924 0.17186689
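
The implicate-level rows above are the raw ingredients for Rubin's Rules. As a worked illustration, the sketch below pools the five `richTRUE` estimates and variances copied from that table, using the textbook formula: total variance equals the average within-implicate variance plus (1 + 1/m) times the between-implicate variance. Whether the package applies exactly this formula (for example, in how it treats the within-implicate term) is an assumption here, so the result may not match the package's pooled output digit for digit.

```r
# richTRUE estimates and variances for implicates 1 to 5, from the table above
est <- c(0.1269190, 0.1468078, 0.1274161, 0.1205673, 0.1172094)
v   <- c(0.0004432830, 0.0005488627, 0.0004395554, 0.0003846508, 0.0004057971)
m   <- length(est)

q_bar <- mean(est)  # pooled point estimate
u_bar <- mean(v)    # average within-implicate (sampling) variance
b     <- var(est)   # between-implicate (imputation) variance

# Textbook Rubin's Rules total variance
total_var <- u_bar + (1 + 1 / m) * b
pooled_se <- sqrt(total_var)

c(estimate = q_bar, se = pooled_se)
```

The pooled standard error is larger than any single implicate's standard error because it also reflects the uncertainty introduced by imputation.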

Learn More

For more details on the SCF methodology and the scf package, see:
