Efficient Storage of Imputed Data

2026-02-16

1 Introduction

When performing multiple imputation with {rbmi} using many imputations (e.g., 100-1000), the full imputed dataset can become very large. However, most of this data is redundant: observed values are identical across all imputations.

The {rbmiUtils} package provides two functions to address this:

  1. reduce_imputed_data(): keeps only the rows whose outcome values were originally missing
  2. expand_imputed_data(): reconstructs the full imputed dataset when it is needed for analysis

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.

2 The Storage Problem

Consider a typical clinical trial dataset:

  - 500 subjects with 5 post-baseline visits each (2,500 rows)
  - 5% of outcome values missing (125 values)
  - 1,000 imputations

Full storage: 2,500 rows × 1,000 imputations = 2.5 million rows

Reduced storage: 125 missing values × 1,000 imputations = 125,000 rows (5% of full size)
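
As a quick sanity check, the arithmetic for this hypothetical scenario can be reproduced in a few lines of R:

subjects <- 500
visits <- 5
missing_prop <- 0.05
n_imp <- 1000

full_rows <- subjects * visits * n_imp                    # 2,500,000
reduced_rows <- subjects * visits * missing_prop * n_imp  # 125,000
reduced_rows / full_rows                                  # 0.05, i.e. 5% of the full size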

3 Setup

library(dplyr)
library(rbmi)
library(rbmiUtils)

4 Example with Package Data

The {rbmiUtils} package includes example datasets we can use:

data("ADMI", package = "rbmiUtils")  # Full imputed dataset
data("ADEFF", package = "rbmiUtils") # Original data with missing values

# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
#> Full imputed dataset (ADMI): 100000 rows
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
#> Number of imputations: 100

5 Reducing Imputed Data

First, prepare the original data to match the imputed data structure:

original <- ADEFF |>
 mutate(
   TRT = TRT01P,
   USUBJID = as.character(USUBJID)
 )

# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
#> Missing values in original data: 44

Define the variables specification:

vars <- set_vars(
 subjid = "USUBJID",
 visit = "AVISIT",
 group = "TRT",
 outcome = "CHG"
)

Now reduce the imputed data:

reduced <- reduce_imputed_data(ADMI, original, vars)

cat("Full imputed rows:", nrow(ADMI), "\n")
#> Full imputed rows: 100000
cat("Reduced rows:", nrow(reduced), "\n")
#> Reduced rows: 4400
cat("Compression ratio:", round(100 * nrow(reduced) / nrow(ADMI), 1), "%\n")
#> Compression ratio: 4.4 %

6 What’s in the Reduced Data?

The reduced dataset contains only the rows that were originally missing:

# First few rows
head(reduced)
#> # A tibble: 6 × 12
#>   IMPID STRATA REGION  REGIONC TRT    BASE   CHG AVISIT USUBJID CRIT1FLN CRIT1FL
#>   <dbl> <chr>  <chr>     <dbl> <chr> <dbl> <dbl> <chr>  <chr>      <dbl> <chr>  
#> 1     1 A      North …       1 Plac…    12 -1.96 Week … ID011          0 N      
#> 2     1 A      Europe        3 Drug…     3 -3.71 Week … ID014          0 N      
#> 3     1 B      Europe        3 Drug…     9 -1.96 Week … ID018          0 N      
#> 4     1 A      Asia          4 Drug…    10 -5.55 Week … ID033          0 N      
#> 5     1 A      Asia          4 Drug…     0 -1.28 Week … ID061          0 N      
#> 6     1 A      South …       2 Plac…     5 -2.60 Week … ID071          0 N      
#> # ℹ 1 more variable: CRIT <chr>

# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
#> 
#> Columns in reduced data:
cat(paste(names(reduced), collapse = ", "))
#> IMPID, STRATA, REGION, REGIONC, TRT, BASE, CHG, AVISIT, USUBJID, CRIT1FLN, CRIT1FL, CRIT

Each row represents an imputed value for a specific subject-visit-imputation combination.
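
As a quick sanity check on the reduced object created above, each imputation-subject-visit key should appear at most once:

reduced |>
 count(IMPID, USUBJID, AVISIT) |>
 filter(n > 1) |>
 nrow()
# Expect 0 duplicated keys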

7 Expanding Back to Full Data

When you need to run analyses, expand the reduced data back to full form:

expanded <- expand_imputed_data(reduced, original, vars)

cat("Expanded rows:", nrow(expanded), "\n")
#> Expanded rows: 100000
cat("Original ADMI rows:", nrow(ADMI), "\n")
#> Original ADMI rows: 100000

8 Verifying Data Integrity

Let’s verify that the round-trip preserves data integrity:

# Sort both datasets for comparison
admi_sorted <- ADMI |>
 arrange(IMPID, USUBJID, AVISIT)

expanded_sorted <- expanded |>
 arrange(IMPID, USUBJID, AVISIT)

# Compare CHG values
all_equal <- all.equal(
 admi_sorted$CHG,
 expanded_sorted$CHG,
 tolerance = 1e-10
)

cat("Data integrity check:", all_equal, "\n")
#> Data integrity check: TRUE
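
For a stricter check, you can also confirm that the key columns line up row-for-row in the sorted objects:

identical(admi_sorted$USUBJID, expanded_sorted$USUBJID)
identical(admi_sorted$AVISIT, expanded_sorted$AVISIT)
# Both should return TRUE if the round trip is exact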

9 Practical Workflow

Here’s how to integrate efficient storage into your workflow:

9.1 Save Reduced Data

# After imputation
impute_obj <- impute(draws_obj, references = c("Placebo" = "Placebo", "Drug A" = "Placebo"))
full_imputed <- get_imputed_data(impute_obj)

# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)

# Save both (reduced is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")
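
To see the saving on disk, you can also write out the full dataset temporarily and compare file sizes (the imputed_full.rds name below is purely illustrative):

# Optional: save the full dataset only to compare sizes
saveRDS(full_imputed, "imputed_full.rds")
file.size("imputed_full.rds")
file.size("imputed_reduced.rds")  # typically a small fraction of the full file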

9.2 Load and Analyse

# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")

# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)

# Run analysis
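# `method` is the rbmi method object (e.g. from method_bayes()) used for the original draws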
ana_obj <- analyse_mi_data(
 data = full_imputed,
 vars = vars,
 method = method,
 fun = ancova
)

10 Storage Comparison

Here’s a comparison of storage requirements for different scenarios:

Subjects  Visits  Missing %  Imputations  Full Rows  Reduced Rows  Savings
500       5       5%         100          250,000    12,500        95%
500       5       5%         1,000        2,500,000  125,000       95%
1,000     8       10%        500          4,000,000  400,000       90%
200       4       20%        1,000        800,000    160,000       80%
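
These figures follow directly from the row counts. A small helper function (illustrative only, not part of {rbmiUtils}) makes the relationship explicit:

storage_savings <- function(subjects, visits, missing_prop, n_imp) {
 full <- subjects * visits * n_imp
 reduced <- subjects * visits * missing_prop * n_imp
 c(full_rows = full, reduced_rows = reduced, savings_pct = 100 * (1 - reduced / full))
}

storage_savings(subjects = 500, visits = 5, missing_prop = 0.05, n_imp = 1000)
# full_rows = 2,500,000; reduced_rows = 125,000; savings_pct = 95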

The savings scale with:

  - the proportion of missing data: the relative savings equal the share of observed values, so a dataset with 5% missingness saves about 95%
  - the number of imputations: the absolute number of rows avoided grows with every additional imputation

11 When to Use This Approach

Use reduced storage when:

  - you generate a large number of imputations (e.g., hundreds to thousands)
  - storage or memory is constrained
  - only a small proportion of the data is missing, so the savings are large

Keep full data when:

  - the dataset is small and storage is not a concern
  - you analyse the imputations immediately and do not need to save them

12 Edge Cases

12.1 No Missing Data

If the original data has no missing values, reduce_imputed_data() returns an empty data.frame:

# If original has no missing values
reduced <- reduce_imputed_data(full_imputed, complete_data, vars)
nrow(reduced)
#> [1] 0

# expand_imputed_data handles this correctly
expanded <- expand_imputed_data(reduced, complete_data, vars)
# Returns original data with IMPID = "1"

12.2 Single Imputation

The functions work with any number of imputations, including just one.
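
For example, a single imputation from the example data (reusing the ADMI, original, and vars objects from the earlier sections) round-trips in the same way:

single <- ADMI |>
 filter(IMPID == 1)

reduced_single <- reduce_imputed_data(single, original, vars)
expanded_single <- expand_imputed_data(reduced_single, original, vars)

nrow(reduced_single)                   # 44, one row per originally missing value
nrow(expanded_single) == nrow(single)  # expected TRUE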

13 Summary

The reduce_imputed_data() and expand_imputed_data() functions provide an efficient way to store imputed datasets:

  1. Reduce after imputation to store only what’s necessary
  2. Expand before analysis to reconstruct full datasets
  3. Verify data integrity is preserved through round-trip

This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.

For the complete analysis workflow using imputed data, see vignette('pipeline').
