{slideimp} is a lightweight R package for fast K-NN and
PCA imputation of missing values in high-dimensional numeric matrices
(samples in rows, variables in columns). Two “meta-callers” wrap the
workhorse functions knn_imp() and
pca_imp():
group_imp(): group-wise K-NN or PCA imputation
(e.g., by chromosome for Illumina DNAm microarrays such as 450K, EPIC,
EPICv2, MSA).
slide_imp(): sliding window K-NN or PCA imputation
for whole-genome methylation data (WGBS, EM-seq). Note:
not intended for Illumina microarrays; use group_imp()
instead.
pca_imp(): Optimized reimplementation of missMDA::imputePCA()
for high-dimensional matrices. Supports a LOBPCG eigensolver that
warm-starts each EM iteration from the previous eigenblock and search
direction.
knn_imp(): Full K-NN imputation with
{RcppThread} multi-threading. Supports subset
imputation to restrict work to specific columns (e.g., clock CpGs) for
very fast targeted imputation.
The stable version of {slideimp} can be installed from
CRAN using:
install.packages("slideimp")
You can install the development version of {slideimp}
with:
pak::pkg_install("hhp94/slideimp")
You can install the optional slideimp.extra
package (which provides lightweight Illumina manifests) with:
pak::pkg_install("hhp94/slideimp.extra")
If you find {slideimp} helpful in your work, please cite
our Bioinformatics paper:
@article{10.1093/bioinformatics/btag318,
author = {Pham, Hung and Lombroso, Adam P and Cevik, Esma Cansu and Taylor, Hugh S and O'Donnell, Kieran J},
title = {slideimp: efficient imputation of {DNA} methylation data},
journal = {Bioinformatics},
volume = {42},
number = {6},
pages = {btag318},
year = {2026},
doi = {10.1093/bioinformatics/btag318},
month = jun,
}
dim(MSA_beta_matrix)
# [1] 20 281797
MSA_beta_matrix[1:4, 1:4]
# cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1 NA 0.5023435 0.3835431 NA
# sample2 0.4907466 0.5095459 0.9025816 0.4313347
# sample3 0.6885036 NA 0.7646753 0.4498772
# sample4 0.0000000 0.0000000 NA 0.0000000
Chromosome-wise imputation of Illumina microarrays can be
performed with a single function call (requires the Illumina manifests,
conveniently provided by the slideimp.extra
package).
This example demonstrates PCA imputation. To use K-NN imputation
instead, supply the k argument.
library(slideimp.extra)
library(slideimp)
imputed <- group_imp(
obj = MSA_beta_matrix,
group = "MSA", # <- this feature requires the `slideimp.extra` package
ncp = 10, # <- change to `k` for K-NN imputation. Optimal value requires `tune_imp()`.
clamp = c(0, 1), # <- PCA imputation can generate values outside of `c(0, 1)`
.progress = FALSE # <- turn on to monitor progress of longer running jobs
)
# Found cleaned manifest for 'MSA'
# ! 1 group dropped: no features remaining after matching `obj_cn`.
# ℹ Dropped group indices: 24
# Imputing 25 groups using PCA.
# Running mode: sequential
print(imputed, n = 4, p = 4)
# Method: group_imp (PCA imputation)
# Dimensions: 20 x 281797
#
# cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1 0.1517542 0.5023435 0.38354308 0.2067731
# sample2 0.4907466 0.5095459 0.90258164 0.4313347
# sample3 0.6885036 0.7339375 0.76467530 0.4498772
# sample4 0.0000000 0.0000000 0.05230101 0.0000000
# # Showing 4 of 20 rows and 4 of 281797 columns
group_imp() fails with unmapped probes: your matrix
likely contains sesame-generated
control probes (prefixed ctl_) or you picked the wrong
manifest ("EPICv2" vs. "EPICv2_deduped").
Check the manifest, remove "ctl" probes with
obj <- obj[, !grepl("^ctl", colnames(obj))], or pass
allow_unmapped = TRUE to bypass.
slideimp.extra::dedup_matrix().
For solver choice and LOBPCG warm-start,
plus threshold, scale,
lobpcg_control, and {mirai}/BLAS threading,
see Speeding
up PCA imputation.
Use clamp = c(0, 1) to restrict PCA-imputed values
to the valid range.
Set pin_blas = TRUE so BLAS threads and
{mirai} workers do not compete for cores. This requires
{RhpcBLASctl}.
K-NN imputation is parallelized through the cores argument,
which uses {RcppThread}’s parallelFor, instead
of {mirai}. If you only need clock CpGs, pass
subset to skip all other probes.
{slideimp} functions expect the input to
be a numeric matrix where variables are stored in the columns.
library(slideimp)
set.seed(1234)
sim_obj <- sim_mat(n = 20, p = 100, n_col_groups = 2)
# `input` holds the simulated matrix with missing data
obj <- sim_obj$input
# `col_group` holds the column-wise grouping
group_df <- sim_obj$col_group
Hyperparameters are tuned using tune_imp(). We
evaluate the following options with grid search (K-NN imputation):
Number of nearest neighbors (k): 5 or 20
Distance power (dist_pow) for weighted
averaging: 1 or 2 (higher values assign lower weights to more distant
neighbors)
Tuning is performed on a subset of the data. We use 10 repeats
(n_reps = 10) of cross-validation for evaluation. We
artificially mask 50 observed values (num_na = 50) to
compute the RMSE and MAE. Use larger values for both n_reps
and num_na in real analyses for more reliable error
estimates.
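The mask-and-score idea behind these settings can be sketched in base R. This is a minimal illustration, not the internals of tune_imp(): the toy matrix and the column-mean imputer impute_col_mean() are hypothetical stand-ins, chosen only so the example runs without any packages.

```r
# Base-R sketch of mask-and-score evaluation (not the internals of tune_imp()).
set.seed(1)

# Toy matrix: 20 samples (rows) x 10 variables (columns), values in [0, 1]
x <- matrix(runif(200), nrow = 20, ncol = 10)

# Hypothetical stand-in imputer: replace each NA with its column mean
impute_col_mean <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

num_na <- 50  # observed values hidden in each repeat
n_reps <- 10  # number of repeats

errors <- t(vapply(seq_len(n_reps), function(i) {
  observed <- which(!is.na(x))
  hidden <- sample(observed, num_na)  # positions to mask
  masked <- x
  masked[hidden] <- NA
  imputed <- impute_col_mean(masked)
  diff <- imputed[hidden] - x[hidden]  # error against the known truth
  c(rmse = sqrt(mean(diff^2)), mae = mean(abs(diff)))
}, c(rmse = 0, mae = 0)))

colMeans(errors)  # average RMSE and MAE across repeats
```

Because the hidden values are known, the error is measured directly. Each row of errors corresponds to one repeat, which is why more repeats and more masked values give steadier averages.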
Note: Parallelization via the cores
argument (which uses {RcppThread}) is only available for
K-NN imputation.
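To make the dist_pow option in the grid concrete, here is a small worked example of distance-weighted averaging. It assumes weights of the form 1 / distance^dist_pow, which is a common inverse-distance scheme; the exact formula used by knn_imp() may differ, and the neighbor values and distances are made up for illustration.

```r
# Assumed weighting scheme for illustration: weight = 1 / distance^dist_pow
neighbor_values <- c(0.20, 0.50, 0.80)  # values from three hypothetical neighbors
neighbor_dist   <- c(1, 2, 4)           # their distances to the target

weighted_estimate <- function(values, dist, dist_pow) {
  w <- 1 / dist^dist_pow      # closer neighbors get larger weights
  sum(w * values) / sum(w)    # weighted average
}

est_pow1 <- weighted_estimate(neighbor_values, neighbor_dist, dist_pow = 1)
est_pow2 <- weighted_estimate(neighbor_values, neighbor_dist, dist_pow = 2)
c(dist_pow_1 = est_pow1, dist_pow_2 = est_pow2)
```

Under this assumption the estimate with dist_pow = 2 sits closer to the nearest neighbor's value (0.20) than the estimate with dist_pow = 1, because distant neighbors are down-weighted more strongly.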
knn_params <- expand.grid(k = c(5, 20), dist_pow = c(1, 2))
group2_columns <- subset(group_df, group == "group2")
group2_only <- obj[, group2_columns$feature]
tune_knn <- tune_imp(
group2_only,
parameters = knn_params,
.f = "knn_imp",
cores = 8, # <- only available for K-NN imputation
n_reps = 10,
num_na = 50
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: threaded (8 cores)
#> Step 2/2: Tuning
Compute metrics with compute_metrics() or
{yardstick} functions.
metrics <- compute_metrics(tune_knn)
# Equivalently: dplyr::summarize(metrics, n = n(), mean_error = mean(.estimate),
# sd_error = sd(.estimate), .by = c(k, dist_pow, .metric))
sum_metrics <- do.call(
data.frame,
aggregate(
.estimate ~ k + dist_pow + .metric,
data = metrics,
FUN = function(x) c(n = length(x), mean_error = mean(x), sd_error = sd(x))
)
)
sum_metrics[order(sum_metrics$.estimate.mean_error), ]
#> k dist_pow .metric .estimate.n .estimate.mean_error .estimate.sd_error
#> 2 20 1 mae 10 0.1501427 0.01599598
#> 4 20 2 mae 10 0.1519464 0.01638402
#> 1 5 1 mae 10 0.1656087 0.01683512
#> 3 5 2 mae 10 0.1669792 0.01666404
#> 6 20 1 rmse 10 0.1897722 0.01915653
#> 8 20 2 rmse 10 0.1918765 0.01936392
#> 5 5 1 rmse 10 0.2081833 0.01949624
#> 7 5 2 rmse 10 0.2101955 0.01884442
PCA imputation is parallelized with mirai::daemons(). See Speeding
up PCA imputation for additional PCA performance-tuning
guidance.
Set pin_blas = TRUE when tuning PCA imputation in parallel
to avoid thrashing.
mirai::daemons(2) # 2 Cores
# PCA imputation.
pca_params <- data.frame(ncp = c(1, 5))
# For machines with multi-threaded BLAS, turn on `pin_blas = TRUE`
tune_pca <- tune_imp(obj, parameters = pca_params, .f = "pca_imp", n_reps = 10, num_na = 50)
mirai::daemons(0) # Close daemons
Then impute with group_imp() using the best
parameters.
knn_group_results <- group_imp(obj, group = group_df, k = 20, dist_pow = 1, cores = 2)
#> Imputing 2 groups using KNN.
#> Running mode: threaded (2 cores)
knn_group_results
#> Method: group_imp (KNN imputation)
#> Dimensions: 20 x 100
#>
#> feature1 feature2 feature3 feature4 feature5 feature6
#> sample1 0.3486482 0.7385414 0.4077444 0.1607935 0.3924661 0.2434143
#> sample2 0.5338935 0.4724364 0.9663621 0.4788070 0.5061132 0.3923603
#> sample3 0.7185848 0.7351035 0.6724479 0.3162537 0.7634236 1.0000000
#> sample4 0.1734418 0.0000000 0.0000000 0.0000000 0.0000000 0.2106358
#> sample5 0.5388440 0.5306182 0.5685354 0.5383513 0.4680080 0.8518388
#> sample6 0.3768380 0.5570723 0.8764909 0.5276245 0.6722794 0.5740639
#> # Showing 6 of 20 rows and 6 of 100 columns
group_imp() can be parallelized
with {mirai}, as with tune_imp().
mirai::daemons(2)
pca_group_results <- group_imp(obj, group = group_df, ncp = 10)
mirai::daemons(0)
Full-matrix imputation can be performed with knn_imp() or pca_imp().
full_knn_results <- knn_imp(obj = obj, k = 20)
full_pca_results <- pca_imp(obj = obj, ncp = 10)
slide_imp() performs sliding window imputation.
For Illumina microarrays, use group_imp().
Split the data by chromosome (i.e., run slide_imp() separately on each chromosome) before
imputation. See the package vignette for more details.
Tune window_size
and overlap_size with tune_imp().
# Simulate some data
chr1_beta <- sim_mat(n = 10, p = 2000)$input
slide_imp() parameters:
location (required): a sorted numeric vector of length
ncol(obj) specifying the position of each column (e.g.,
genomic coordinates in bp).
window_size (required): width of each sliding window
(same units as location).
overlap_size: overlap width between consecutive windows
(same units as location). Must be strictly less than
window_size.
min_window_n (required): minimum number of columns a
window must contain to be imputed. Windows with fewer columns than this
threshold are dropped. Must be greater than k (for K-NN) or
ncp (for PCA).
dry_run: return only the calculated windows to inspect
which are included.
k (required for K-NN imputation):
number of nearest neighbors to use inside each window.
ncp (required for PCA imputation):
number of principal components to retain. Use this instead of
k when performing sliding-window PCA imputation.
subset: impute only a subset of features (e.g., clock
CpGs).
flank: build flanking windows of
window_size around features provided in
subset.
First, let’s perform a dry run to examine the windows that will
be imputed by slide_imp().
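As a rough mental model for the table a dry run returns, the sketch below builds windows in base R. It assumes that windows start at the smallest location, advance by window_size - overlap_size, and are clipped at the largest location; slide_imp() may handle edge cases differently, so treat the dry-run output as authoritative.

```r
# Base-R sketch of fixed-width sliding windows. This is an assumption-based
# illustration, not the package's actual windowing code.
location     <- 1:2000  # sorted positions, one per column
window_size  <- 50
overlap_size <- 5
min_window_n <- 11

step  <- window_size - overlap_size  # windows advance by 45 positions
span  <- max(location) - min(location) + 1
n_win <- max(ceiling((span - window_size) / step), 0) + 1

starts <- min(location) + step * (seq_len(n_win) - 1)
ends   <- pmin(starts + window_size - 1, max(location))  # clip the last window

# Count the columns whose position falls inside each window
window_n <- vapply(
  seq_along(starts),
  function(i) sum(location >= starts[i] & location <= ends[i]),
  integer(1)
)

windows <- data.frame(start = starts, end = ends, window_n = window_n)
windows <- windows[windows$window_n >= min_window_n, ]  # drop sparse windows
head(windows, 3)
```

For this simulated input the sketch yields 45 windows beginning at positions 1, 46, and 91, in line with the dry-run table shown in this README.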
location <- seq_len(ncol(chr1_beta)) # 1, 2, ..., 2000 for this simulated chromosome
slide_imp(
obj = chr1_beta,
location = location,
window_size = 50, # select with `tune_imp()`
overlap_size = 5,
min_window_n = 11, # must be > k = 10
dry_run = TRUE
)
#> # slideimp table: 45 x 4
#> start end window_n subset_local
#> 1 50 50 <double [50]>
#> 46 95 50 <double [50]>
#> 91 140 50 <double [50]>
#> 136 185 50 <double [50]>
#> 181 230 50 <double [50]>
#> 226 275 50 <double [50]>
#> 271 320 50 <double [50]>
#> 316 365 50 <double [50]>
#> 361 410 50 <double [50]>
#> 406 455 50 <double [50]>
#> # ... with 35 more rows
slide_imp(
obj = chr1_beta,
location = location,
window_size = 50, # select with `tune_imp()`
overlap_size = 5,
min_window_n = 11, # must be > k = 10
k = 10, # select with `tune_imp()`
cores = 2,
.progress = FALSE
)
#> Method: slide_imp (KNN imputation)
#> Dimensions: 10 x 2000
#>
#> feature1 feature2 feature3 feature4 feature5 feature6
#> sample1 0.7500638 0.5323295 0.6095626 0.96762386 0.5149855 1.0000000
#> sample2 0.2809107 0.8695599 0.6324029 0.62469147 0.6040981 0.1583159
#> sample3 0.9409348 0.5445597 0.6432675 1.00000000 0.5613868 0.3054879
#> sample4 0.5946795 0.0000000 0.5837423 0.07237333 0.4410413 0.6101870
#> sample5 0.8664253 0.6206139 0.3444691 0.52025046 0.5220036 0.7464794
#> sample6 0.8157626 0.3053222 0.7227880 0.57711498 0.4576367 0.0000000
#> # Showing 6 of 10 rows and 6 of 2000 columns
col_vars() / mean_imp_col(): Fast
column-wise variance and mean imputation via
{RcppArmadillo} and {RcppThread}.
mat_miss(): Efficient column or row missing
statistics.