opdisDownsampling is an R package for optimal, distribution-preserving, class-proportional down-sampling of biomedical data. It reduces dataset size while preserving class proportions and the statistical structure of the original data.
This repository contains the package source and documentation.
Figure adapted from: Lötsch J, Malkusch S, Ultsch A (2021). Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLOS ONE. https://doi.org/10.1371/journal.pone.0255838 (see “Reference” paragraph below)
You can install opdisDownsampling from CRAN, from GitHub, or from a local clone.
From CRAN:
install.packages("opdisDownsampling")
From this GitHub repository:
remotes::install_github("JornLotsch/opdisDownsampling")
Or manually by cloning the repository and running:
devtools::install("path/to/opdisDownsampling")
The main function is opdisDownsampling().
library(opdisDownsampling)
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, Seed = 42, MaxCores = 1)
set.seed(42)
# Small synthetic dataset for the first example
large_dataset <- data.frame(
class = sample(c("A", "B"), 2000, replace = TRUE),
x1 = rnorm(2000),
x2 = runif(2000),
x3 = rpois(2000, lambda = 3)
)
# Smaller synthetic dataset for the second example
my_data <- data.frame(
class = sample(c("A", "B"), 300, replace = TRUE),
x1 = rnorm(300),
x2 = runif(300)
)
# Automatic memory optimisation for large datasets (for demonstration purposes, a relatively small 'large' dataset is generated).
LargeDataSample <- opdisDownsampling(
Data = large_dataset[,2:ncol(large_dataset)],
Size = 0.1,
Seed = 42,
nTrials = 5000,
JobSize = NULL,
verbose = TRUE
)
# Custom chunk size for fine-tuned memory control
CustomSample <- opdisDownsampling(
Data = my_data[,2:ncol(my_data)],
Size = 100,
Seed = 42,
nTrials = 2000,
JobSize = 500
)

| Argument | Description |
|---|---|
| Data | Numeric data frame or matrix to downsample |
| Cls | Class membership vector; if missing, all data are assigned to one class |
| Size | Proportion (0–1) or absolute number of rows to class-proportionally retain |
| Seed | Seed control. Options: "auto" for seed recovery, "simple" to generate and report a seed using the current RNG state, or an integer for exact reproducibility. Use integers for systematic testing and fully reproducible analyses. |
| nTrials | Number of sampling trials. Default: 1000 |
| TestStat | Statistical test for distribution comparison. Default: "ad". Available options: "ad", "kuiper", "cvm", "wass", "dts", "ks", "kld", "amrdd", "euc", "nent". |
| MaxCores | Maximum cores for parallel processing |
| PCAimportance | Use PCA for variable selection |
| JobSize | Number of trials per chunk. Use 0 for no chunking, NULL for automatic memory-aware chunk-size calculation, or a positive integer for manual chunking. |
| verbose | Print diagnostic information about memory usage and chunking |
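To illustrate how a Size value might translate into class-proportional counts, here is a minimal base R sketch. The helper class_counts() is hypothetical and not part of the package; it assumes that values below 1 are proportions and larger values are absolute row counts, and the package's own rounding may differ.

```r
# Hypothetical helper (not from opdisDownsampling): convert a Size value
# into the number of rows to keep per class, proportional to class frequency.
class_counts <- function(cls, size) {
  n <- length(cls)
  # Assumption: size < 1 is a proportion, otherwise an absolute row count
  total <- if (size < 1) size * n else size
  round(table(cls) / n * total)
}

# iris has three classes of 50 rows each
counts_prop <- class_counts(iris$Species, 0.2)  # 20 % of 150 rows
counts_abs  <- class_counts(iris$Species, 60)   # 60 rows in total
print(counts_prop)
print(counts_abs)
```

With unbalanced classes or totals that do not divide evenly, simple rounding like this can miss the requested total by a row or two, so the counts are best treated as approximate.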
TestStat options

| Value | Description |
|---|---|
| "ad" | Anderson–Darling statistic |
| "kuiper" | Kuiper statistic |
| "cvm" | Cramér–von Mises statistic |
| "wass" | Wasserstein distance |
| "dts" | Distributional Transform Statistic |
| "ks" | Kolmogorov–Smirnov statistic |
| "kld" | Kullback–Leibler divergence (via KullbLeiblKLD2()) |
| "amrdd" | Average Mean Root of Distributional Differences (via amrdd()) |
| "euc" | Euclidean distance (via EucDist()) |
| "nent" | Absolute normalized entropy difference (via abs_norm_entropy_diff()) |
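The role of nTrials and TestStat can be pictured as a "best of many trials" search: draw several random subsamples and keep the one whose distribution is closest to the full data. The base R sketch below shows this idea for a single variable using the Kolmogorov–Smirnov statistic. It is a simplified illustration under these assumptions, not the package's implementation, and all variable names are made up for the example.

```r
# Sketch: among several random subsamples, pick the one whose distribution
# is closest to the full data according to the Kolmogorov-Smirnov statistic.
set.seed(42)
x <- rnorm(1000)   # one numeric variable standing in for the original data
n_keep <- 100      # rows kept in each candidate subsample
n_trials <- 50     # number of candidate subsamples

# One column of row indices per trial (n_keep x n_trials matrix)
candidates <- replicate(n_trials, sample(length(x), n_keep))

# KS statistic between each candidate subsample and the full variable
ks_stat <- apply(candidates, 2, function(idx) {
  unname(suppressWarnings(ks.test(x[idx], x)$statistic))
})

# The trial with the smallest statistic preserves the distribution best
best_trial <- which.min(ks_stat)
best_rows <- candidates[, best_trial]
```

Swapping ks.test() for another distance would correspond to choosing a different TestStat value; with several variables, the per-variable statistics would additionally need to be combined into one score per trial.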
The package offers PCA-based variable selection approaches:

#### PCA-based Selection (PCAimportance = TRUE)

- Identifies variables with high loadings in principal components
- Focuses on variables that capture the most variance in the data
- Useful for dimensionality reduction while preserving data structure
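As a rough illustration of loading-based variable ranking, the following base R sketch uses prcomp() on the iris measurements. The 90 % cumulative-variance cutoff and the squared-loading score are assumptions chosen for this example; the criterion applied when PCAimportance = TRUE may be different.

```r
# Sketch: rank variables by how strongly they load on the leading
# principal components (illustrative criterion, not the package's own).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Keep the smallest set of leading components explaining >= 90 % of variance
eig <- pca$sdev^2
cum_var <- cumsum(eig) / sum(eig)
keep_pc <- seq_len(which(cum_var >= 0.9)[1])

# Importance score: sum of squared loadings on the retained components
loadings <- pca$rotation[, keep_pc, drop = FALSE]
importance <- rowSums(loadings^2)

# Variables ordered from most to least important
ranked_vars <- names(sort(importance, decreasing = TRUE))
print(ranked_vars)
```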
The package automatically optimizes memory usage through intelligent chunking:

- Automatic chunk sizing: Considers data dimensions, available system memory, and number of processor cores
- Memory-constrained processing: Prevents memory exhaustion on large datasets or high trial counts
- Adaptive strategy: Uses larger chunks for small datasets (efficiency) and smaller chunks for large datasets (memory safety)
- Diagnostic output: Enable verbose = TRUE to understand memory usage patterns

Memory optimization is particularly beneficial for:

- Large datasets (>100MB)
- High trial counts (>1000 trials)
- Memory-constrained systems
- Datasets with many variables or observations
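The effect of a manual JobSize can be sketched as splitting the trial indices into fixed-size chunks that are processed one after another. The helper split_trials() below is hypothetical and only mirrors the documented meaning of 0 (no chunking) and a positive integer (manual chunking); the automatic, memory-aware sizing used for JobSize = NULL is not reproduced here.

```r
# Hypothetical helper (not from opdisDownsampling): split trial indices
# into chunks of at most job_size trials; 0 means a single batch.
split_trials <- function(n_trials, job_size) {
  trials <- seq_len(n_trials)
  if (job_size == 0) {
    return(list(trials))                    # no chunking
  }
  split(trials, ceiling(trials / job_size)) # chunk 1, 2, ... by position
}

chunks <- split_trials(2000, 500)   # same numbers as the CustomSample example
print(lengths(chunks))

single_batch <- split_trials(2000, 0)
print(length(single_batch))
```

Smaller chunks mean that fewer candidate samples are held in memory at once, at the cost of more sequential rounds.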
Returns a list containing:
- ReducedData: Down-sampled data frame
- RemovedData: Data not included in the sample
- ReducedInstances: Row names of the reduced data
- RemovedInstances: Row names of the removed data

opdisDownsampling() also works with data containing missing values.
library(opdisDownsampling)
set.seed(42)
iris_data <- data.frame(iris[, 1:4])
n_na <- round(0.05 * nrow(iris_data) * ncol(iris_data))
na_pos <- sample(nrow(iris_data) * ncol(iris_data), n_na)
x <- as.matrix(iris_data)
x[na_pos] <- NA
iris_with_missing <- as.data.frame(x)
iris_with_missing
downsampled_missing <- opdisDownsampling(
Data = iris_with_missing,
Cls = iris$Species,
Size = 0.8,
Seed = 42
)
downsampled_missing

- Use JobSize = NULL to enable automatic memory-aware chunk-size calculation.
- Set verbose = TRUE to monitor memory usage and chunking diagnostics.
- Use JobSize = NULL for automatic chunk-size calculation, or manually set smaller values such as JobSize = 10 or JobSize = 25.
- Lower the MaxCores value to reduce parallel memory overhead.
- JobSize = 0 processes all trials in a single batch.

See the CRAN package page for full documentation and the reference manual.
If you use this package, please cite the CRAN package and the original paper:
Lötsch J, Malkusch S, Ultsch A. Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838. PMID: 34352006; PMCID: PMC8341664.