When performing multiple imputation with {rbmi}
using many imputations (e.g., 100-1000), the full imputed dataset can
become very large. However, most of this data is redundant: observed
values are identical across all imputations.
The {rbmiUtils} package provides two functions to address this:

- reduce_imputed_data(): Extract only the imputed values (originally missing)
- expand_imputed_data(): Reconstruct the full dataset when needed

This approach can reduce storage requirements by 90% or more, depending on the proportion of missing data.
Consider a typical clinical trial dataset:

- Full storage: 2,500 rows × 1,000 imputations = 2.5 million rows
- Reduced storage: 125 missing values × 1,000 imputations = 125,000 rows (5% of full size)
The {rbmiUtils} package includes example datasets we can
use:
data("ADMI", package = "rbmiUtils") # Full imputed dataset
data("ADEFF", package = "rbmiUtils") # Original data with missing values
# Check dimensions
cat("Full imputed dataset (ADMI):", nrow(ADMI), "rows\n")
#> Full imputed dataset (ADMI): 100000 rows
cat("Number of imputations:", length(unique(ADMI$IMPID)), "\n")
#> Number of imputations: 100

First, prepare the original data to match the imputed data structure:
library(dplyr)

# Align the treatment variable and subject ID type with the imputed data
original <- ADEFF |>
  mutate(
    TRT = TRT01P,
    USUBJID = as.character(USUBJID)
  )
# Count missing values
n_missing <- sum(is.na(original$CHG))
cat("Missing values in original data:", n_missing, "\n")
#> Missing values in original data: 44

Define the variables specification:
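A minimal sketch using rbmi::set_vars(); the covariates chosen here (BASE, STRATA) are illustrative assumptions and should match the variables used in your imputation model:

library(rbmi)

# Variable roles used by the reduce/expand functions
# (covariates below are assumed for illustration)
vars <- set_vars(
  subjid = "USUBJID",
  visit = "AVISIT",
  outcome = "CHG",
  group = "TRT",
  covariates = c("BASE", "STRATA")
)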
Now reduce the imputed data:
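A sketch of the call, following the argument order shown in the workflow example later in this article (full imputed data, original data, variables specification):

library(rbmiUtils)

# Keep only the rows whose CHG value was missing in `original`
reduced <- reduce_imputed_data(ADMI, original, vars)

# Expected size: n_missing x n_imputations = 44 x 100 = 4,400 rows
nrow(reduced)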
The reduced dataset contains only the rows that were originally missing:
# First few rows
head(reduced)
#> # A tibble: 6 × 12
#> IMPID STRATA REGION REGIONC TRT BASE CHG AVISIT USUBJID CRIT1FLN CRIT1FL
#> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
#> 1 1 A North … 1 Plac… 12 -1.96 Week … ID011 0 N
#> 2 1 A Europe 3 Drug… 3 -3.71 Week … ID014 0 N
#> 3 1 B Europe 3 Drug… 9 -1.96 Week … ID018 0 N
#> 4 1 A Asia 4 Drug… 10 -5.55 Week … ID033 0 N
#> 5 1 A Asia 4 Drug… 0 -1.28 Week … ID061 0 N
#> 6 1 A South … 2 Plac… 5 -2.60 Week … ID071 0 N
#> # ℹ 1 more variable: CRIT <chr>
# Structure matches original imputed data
cat("\nColumns in reduced data:\n")
#>
#> Columns in reduced data:
cat(paste(names(reduced), collapse = ", "))
#> IMPID, STRATA, REGION, REGIONC, TRT, BASE, CHG, AVISIT, USUBJID, CRIT1FLN, CRIT1FL, CRIT

Each row represents an imputed value for a specific subject-visit-imputation combination.
When you need to run analyses, expand the reduced data back to full form:
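A sketch mirroring the reduction step, using the same argument order as the workflow example below:

# Observed values come from `original`; imputed values come from `reduced`
expanded <- expand_imputed_data(reduced, original, vars)

# Should match nrow(ADMI)
nrow(expanded)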
Let’s verify that the round-trip preserves data integrity:
# Sort both datasets for comparison
admi_sorted <- ADMI |>
arrange(IMPID, USUBJID, AVISIT)
expanded_sorted <- expanded |>
arrange(IMPID, USUBJID, AVISIT)
# Compare CHG values
all_equal <- all.equal(
admi_sorted$CHG,
expanded_sorted$CHG,
tolerance = 1e-10
)
cat("Data integrity check:", all_equal, "\n")
#> Data integrity check: TRUE

Here’s how to integrate efficient storage into your workflow:
# After imputation
impute_obj <- impute(draws_obj, references = c("Placebo" = "Placebo", "Drug A" = "Placebo"))
full_imputed <- get_imputed_data(impute_obj)
# Reduce for storage
reduced <- reduce_imputed_data(full_imputed, original_data, vars)
# Save both (reduced is much smaller)
saveRDS(reduced, "imputed_reduced.rds")
saveRDS(original_data, "original_data.rds")

# Load saved data
reduced <- readRDS("imputed_reduced.rds")
original_data <- readRDS("original_data.rds")
# Expand when needed for analysis
full_imputed <- expand_imputed_data(reduced, original_data, vars)
# Run analysis
ana_obj <- analyse_mi_data(
data = full_imputed,
vars = vars,
method = method,
fun = ancova
)

Here’s a comparison of storage requirements for different scenarios:
| Subjects | Visits | Missing % | Imputations | Full Rows | Reduced Rows | Savings |
|---|---|---|---|---|---|---|
| 500 | 5 | 5% | 100 | 250,000 | 12,500 | 95% |
| 500 | 5 | 5% | 1,000 | 2,500,000 | 125,000 | 95% |
| 1,000 | 8 | 10% | 500 | 4,000,000 | 400,000 | 90% |
| 200 | 4 | 20% | 1,000 | 800,000 | 160,000 | 80% |
The savings scale with:

- The proportion of observed data: the relative saving equals the share of values that were not missing
- The number of imputations: the absolute number of rows saved grows linearly with it
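This is easy to check numerically; a small illustration of the formula implied by the table above:

# Relative saving = proportion of observed (non-missing) values
storage_savings <- function(missing_prop) 1 - missing_prop

storage_savings(c(0.05, 0.10, 0.20))
#> [1] 0.95 0.90 0.80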
Use reduced storage when:

- You run many imputations (hundreds to thousands)
- The proportion of missing data is low
- Storage or memory is constrained

Keep full data when:

- The dataset is small enough that storage is not a concern
- A large share of values is missing, so the savings are modest
- You analyse the data repeatedly and want to avoid re-expanding it
If the original data has no missing values,
reduce_imputed_data() returns an empty data.frame:
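For example (a hypothetical sketch; `imputed_complete` and `complete_data` are placeholder objects for a dataset with no missing outcome values):

# Nothing was missing, so there are no imputed values to extract
reduced_empty <- reduce_imputed_data(imputed_complete, complete_data, vars)
nrow(reduced_empty)
#> [1] 0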
The functions work with any number of imputations, including just one.
The reduce_imputed_data() and expand_imputed_data() functions provide an efficient way to store imputed datasets:

- reduce_imputed_data() keeps only the values that were originally missing
- expand_imputed_data() reconstructs the full dataset on demand
- The round trip preserves data integrity
This approach is particularly valuable when working with large numbers of imputations or when storage and memory are constrained.
For the complete analysis workflow using imputed data, see
vignette('pipeline').