The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
{slideimp} is a lightweight R package for fast K-NN and
PCA imputation of missing values in high-dimensional numeric
matrices.
The stable version of {slideimp} can be installed from
CRAN using:
install.packages("slideimp")You can install the development version of {slideimp}
with:
pak::pkg_install("hhp94/slideimp")You can install the optional {slideimp.extra}
package (which provides lightweight Illumina manifests) with:
pak::pkg_install("hhp94/slideimp.extra")dim(MSA_beta_matrix)
# [1] 20 281797
MSA_beta_matrix[1:4, 1:4]
# cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1 NA 0.5023435 0.3835431 NA
# sample2 0.4907466 0.5095459 0.9025816 0.4313347
# sample3 0.6885036 NA 0.7646753 0.4498772
# sample4 0.0000000 0.0000000 NA 0.0000000Chromosome-wise imputation of Illumina microarrays can be
performed with a single function call (requires the Illumina manifests,
conveniently provided by the slideimp.extra
package).
This example demonstrates PCA imputation. To use K-NN imputation
instead, supply the k argument.
library(slideimp.extra)
library(slideimp)
library(mirai)
imputed <- group_imp(
obj = MSA_beta_matrix,
group = "MSA", # <- this feature requires the `slideimp.extra` package
ncp = 10, # <- change to `k` for K-NN imputation
.progress = FALSE # <- turn on to monitor progress of longer running jobs
)
# Found cleaned manifest for 'MSA'
# Groups 24 dropped: no features remaining after matching obj columns.
# Imputing 25 group(s) using PCA.
# Running Mode: sequential ...
print(imputed, n = 4, p = 4)
# Method: group_imp (PCA imputation)
# Dimensions: 20 x 281797
#
# cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1 0.1517542 0.5023435 0.38354308 0.2067731
# sample2 0.4907466 0.5095459 0.90258164 0.4313347
# sample3 0.6885036 0.7339375 0.76467530 0.4498772
# sample4 0.0000000 0.0000000 0.05230101 0.0000000
#
# # Showing [1:4, 1:4] of full matrixobj <- obj[, !grepl("^ctl", colnames(obj))] or set
allow_unmapped = TRUE to ignore them.slideimp.extra::dedup_matrix().cores
argument instead of {mirai} (available by default on
Windows and Linux). If you only need clock CpGs, provide the
subset argument to ignore all other probes.{slideimp} functions expect the input to
be a numeric matrix where variables are stored in the columns.library(slideimp)
set.seed(1234)
sim_obj <- sim_mat(n = 20, p = 100, n_col_groups = 2)
# Extract the $input attribute
obj <- sim_obj$input
# Extract metadata to map the features to groups
group_df <- sim_obj$col_groupHyperparameters are tuned using tune_imp(). We
evaluate the following options with grid search:
k): 5 or 20dist_pow) for weighted
averaging: 1 or 2 (higher values assign lower weights to more distant
neighbors)Tuning is performed on a subset of the data. We use 10 repeats
(n_reps = 10) of cross-validation for evaluation. We
artificially mask 50 observed values (num_na = 50) to
compute the RMSE and MAE. Use larger values for both n_reps
and num_na in real analyses for more reliable error
estimates.
Note: Parallelization via the cores
argument is only available for K-NN imputation with OpenMP.
knn_params <- expand.grid(k = c(5, 20), dist_pow = c(1, 2))
group2_columns <- subset(group_df, group == "group2")
group2_only <- obj[, group2_columns$feature]
tune_knn <- tune_imp(
group2_only,
parameters = knn_params,
.f = "knn_imp",
cores = 8,
n_reps = 10,
num_na = 50
)
#> Tuning knn_imp
#> Step 1/2: Resolving NA locations
#> Running Mode: parallel...
#> Step 2/2: Tuningcompute_metrics() or {yardstick}
functions.metrics <- compute_metrics(tune_knn)
# equivalently: dplyr::summarize(metrics, ..., .by = c(k, dist_pow, .metric))
sum_metrics <- do.call(
data.frame,
aggregate(
.estimate ~ k + dist_pow + .metric,
data = metrics,
FUN = function(x) {
c(
n = length(x),
mean_error = mean(x),
sd_error = sd(x)
)
}
)
)
sum_metrics[order(sum_metrics$.estimate.mean_error), ]
#> k dist_pow .metric .estimate.n .estimate.mean_error .estimate.sd_error
#> 2 20 1 mae 10 0.1501427 0.01599598
#> 4 20 2 mae 10 0.1519464 0.01638402
#> 1 5 1 mae 10 0.1656087 0.01683512
#> 3 5 2 mae 10 0.1669792 0.01666404
#> 6 20 1 rmse 10 0.1897722 0.01915653
#> 8 20 2 rmse 10 0.1918765 0.01936392
#> 5 5 1 rmse 10 0.2081833 0.01949624
#> 7 5 2 rmse 10 0.2101955 0.01884442mirai::daemons().pin_blas = TRUE when tuning PCA imputation in parallel
to avoid thrashing.mirai::daemons(2) # 2 Cores
# PCA imputation.
pca_params <- data.frame(ncp = c(1, 5))
# For machines with multi-threaded BLAS, turn on `pin_blas = TRUE`
tune_pca <- tune_imp(obj, parameters = pca_params, .f = "pca_imp", n_reps = 10, num_na = 50)
mirai::daemons(0) # Close daemonsgroup_imp() using the best
parameters.knn_group_results <- group_imp(obj, group = group_df, k = 20, dist_pow = 1, cores = 2)
#> Imputing 2 group(s) using KNN.
#> Running Mode: parallel (OpenMP within groups)...
knn_group_results
#> Method: group_imp (KNN imputation)
#> Dimensions: 20 x 100
#>
#> feature1 feature2 feature3 feature4 feature5 feature6
#> sample1 0.3486482 0.7385414 0.4077444 0.1607935 0.3924661 0.2434143
#> sample2 0.5338935 0.4724364 0.9663621 0.4788070 0.5061132 0.3923603
#> sample3 0.7185848 0.7351035 0.6724479 0.3162537 0.7634236 1.0000000
#> sample4 0.1734418 0.0000000 0.0000000 0.0000000 0.0000000 0.2106358
#> sample5 0.5388440 0.5306182 0.5685354 0.5383513 0.4680080 0.8518388
#> sample6 0.3768380 0.5570723 0.8764909 0.5276245 0.6722794 0.5740639
#>
#> # Showing [1:6, 1:6] of full matrixgroup_imp() can be parallelized
using the {mirai} package, similar to
tune_imp(). pin_blas = TRUE can help PCA
imputation on multi-threaded BLAS machines.# Similar to `tune_imp`, parallelization is controlled by `mirai::daemons()`
mirai::daemons(2)
pca_group_results <- group_imp(obj, group = group_df, ncp = 10)
mirai::daemons(0)knn_imp() or pca_imp().full_knn_results <- knn_imp(obj = obj, k = 20)
full_pca_results <- pca_imp(obj = obj, ncp = 10)slide_imp() performs sliding window imputation.group_imp().slide_imp() separately on each chromosome) before
imputation. See the package vignette for more details.window_size
and overlap_size with tune_imp().# Simulate some data
chr1_beta <- sim_mat(n = 10, p = 2000)$input
dim(chr1_beta)
#> [1] 10 2000
chr1_beta[1:5, 1:5]
#> feature1 feature2 feature3 feature4 feature5
#> sample1 0.7500638 0.5323295 0.6095626 0.96762386 0.5149855
#> sample2 0.2809107 0.8695599 NA NA 0.6040981
#> sample3 0.9409348 0.5445597 0.6432675 1.00000000 0.5613868
#> sample4 0.5946795 0.0000000 NA 0.07237333 0.4410413
#> sample5 0.8664253 0.6206139 0.3444691 0.52025046 0.5220036slide_imp() parameters:
location: required - a sorted numeric
vector of length ncol(obj) specifying the position of each
column (e.g., genomic coordinates in bp).window_size: required - width of each
sliding window (same unit as location).overlap_size: optional - overlap width
between consecutive windows (same units as location). Must
be strictly less than window_size.min_window_n: required - minimum
number of columns a window must contain to be imputed. Windows with
fewer columns than this threshold are dropped. Must be greater than
k (for K-NN) or ncp (for PCA).dry_run: optional - return just the
calculated windows to examine which windows are included.k: required - (specifying K-NN
imputation) number of nearest neighbors to use inside each
window.ncp: required - (specifying PCA
imputation) number of principal components to retain. Use this
instead of k when performing sliding-window PCA
imputation.subset: optional - impute just a
subset of features (i.e., just clock CpGs).flank: optional - build flanking
windows of window_size around features provided in
subset.First, let’s perform a dry run to examine the windows that will
be imputed by slide_imp.
location <- seq_len(ncol(chr1_beta)) # 1, 2, ..., 2000 for this simulated chromosome
slide_imp(
obj = chr1_beta,
location = location,
window_size = 50, # select with `tune_imp()`
overlap_size = 5,
min_window_n = 11, # must be > k = 10
dry_run = TRUE
)
#> # slideimp table: 45 x 4
#> start end window_n subset_local
#> 1 50 50 <double [50]>
#> 46 95 50 <double [50]>
#> 91 140 50 <double [50]>
#> 136 185 50 <double [50]>
#> 181 230 50 <double [50]>
#> 226 275 50 <double [50]>
#> 271 320 50 <double [50]>
#> 316 365 50 <double [50]>
#> 361 410 50 <double [50]>
#> 406 455 50 <double [50]>
#> # ... with 35 more rowsslide_imp(
obj = chr1_beta,
location = location,
window_size = 50, # select with `tune_imp()`
overlap_size = 5,
min_window_n = 11, # must be > k = 10
k = 10, # select with `tune_imp()`
cores = 2,
.progress = FALSE
)
#> Method: slide_imp (KNN imputation)
#> Dimensions: 10 x 2000
#>
#> feature1 feature2 feature3 feature4 feature5 feature6
#> sample1 0.7500638 0.5323295 0.6095626 0.96762386 0.5149855 1.0000000
#> sample2 0.2809107 0.8695599 0.6324029 0.62469147 0.6040981 0.1583159
#> sample3 0.9409348 0.5445597 0.6432675 1.00000000 0.5613868 0.3054879
#> sample4 0.5946795 0.0000000 0.5837423 0.07237333 0.4410413 0.6101870
#> sample5 0.8664253 0.6206139 0.3444691 0.52025046 0.5220036 0.7464794
#> sample6 0.8157626 0.3053222 0.7227880 0.57711498 0.4576367 0.0000000
#>
#> # Showing [1:6, 1:6] of full matrixgroup_imp(): Parallelizable group-wise (e.g., by
chromosomes or column clusters) K-NN or PCA imputation with optional
auxiliary features and group-wise parameters. The group
argument supports:
"EPICv2" or "MSA"
(requires the {slideimp.extra}
package).data.frame containing a feature
column (character strings of feature names) and a group
column (defining the group).prep_groups() to create groups based on
a mapping data.frame (i.e., Illumina manifests) with custom group-wise
parameters.slide_imp(): Sliding window K-NN or PCA imputation for
extremely high-dimensional numeric matrices (WGBS/EM-seq) with ordered
features (i.e., by genomic position).tune_imp(): Parallelizable hyperparameter tuning with
repeated cross-validation; works with built-in or custom imputation
functions.knn_imp(): Full-matrix K-NN imputation with multi-core
parallelization, {mlpack}
KD/Ball-Tree nearest neighbor implementation (for data with very low
missing rates and extremely high dimensions), and optional subset
imputation (ideal for epigenetic clock calculations).pca_imp(): Optimized version of missMDA::imputePCA()
for high-dimensional numeric matrices.col_vars() and mean_imp_col() provide fast
column-wise variance and mean imputation using the high-performance
{RcppArmadillo} backend with full OpenMP parallelization
support.These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.