Package {opdisDownsampling}


Type: Package
Title: Optimal Distribution Preserving Down-Sampling of Bio-Medical Data
Version: 1.6
Description: An optimized method for distribution-preserving class-proportional down-sampling of bio-medical data <doi:10.1371/journal.pone.0255838>.
Depends: R (≥ 3.5.0)
Imports: parallel, graphics, methods, stats, caTools, pracma, twosamples, doParallel, pbmcapply, foreach, utils, Rcpp (≥ 1.0.0)
LazyData: true
LinkingTo: Rcpp
License: GPL-3
URL: https://github.com/JornLotsch/opdisDownsampling
Encoding: UTF-8
Maintainer: Jorn Lotsch <j.lotsch@em.uni-frankfurt.de>
NeedsCompilation: yes
Packaged: 2026-06-25 05:55:34 UTC; joern
Author: Jorn Lotsch [aut, cre], Sebastian Malkusch [aut], Alfred Ultsch [aut]
Repository: CRAN
Date/Publication: 2026-06-25 06:40:02 UTC

Example data of hematologic marker expression.

Description

Data set of 6 flow cytometry-based lymphoma markers from 55,843 cells from healthy subjects (class 1) and 55,843 cells from lymphoma patients (class 2).

Usage

data("FlowcytometricData")

Details

Size 111686 x 6, stored in FlowcytometricData$[Var_1,Var_2,Var_3,Var_4,Var_5,Var_6]

Classes 2, stored in FlowcytometricData$Cls

Examples

data(FlowcytometricData)
str(FlowcytometricData)

Example data of an artificial Gaussian mixture.

Description

Dataset of 30000 instances with 10 variables that are Gaussian mixtures and belong to classes Cls = 1, 2, or 3, with different means and standard deviations and class weights of 0.5, 0.4, and 0.1, respectively.

Usage

data("GMMartificialData")

Details

Size 30000 x 10, stored in GMMartificialData$[X1,X2,X3,X4,X5,X6,X7,X8,X9,X10]

Classes 3, stored in GMMartificialData$Cls

Examples

data(GMMartificialData)
str(GMMartificialData)

Random Seed Recovery from RNG State

Description

Functions for recovering the original seed value that produced the current random number generator state. Provides both R and C++ implementations with the C++ version offering significantly improved performance for large search spaces.

Usage

get_seed(range = NULL, fallback_seed = 42, max_search = 2147483647,
  step_size = 50000, use_cpp = TRUE, ...)

get_seed_cpp(range = NULL, fallback_seed = 42, max_search = 2147483647,
  step_size = 50000, batch_size = 10000, verbose = TRUE)

Arguments

range

Optional integer vector of specific seed values to search. If provided, only these seeds will be tested instead of systematic range searching.

fallback_seed

Integer seed value to return if no matching seed is found during the search process (default: 42).

max_search

Maximum seed value to search up to when performing systematic range searching. Must be a positive integer within the valid range for R's random number generator (default: 2147483647).

step_size

Step size for systematic range searching when no specific range is provided. Larger values speed up search but may miss the target seed if it falls between steps (default: 50000).

use_cpp

Logical; if TRUE, uses the fast C++ implementation via get_seed_cpp(). If FALSE, falls back to the slower R implementation with a warning (default: TRUE).

batch_size

Integer specifying the number of seeds to process in each C++ batch operation. Larger batches are more efficient but require more RAM. Only used in get_seed_cpp() (default: 10000).

verbose

Logical; if TRUE, prints progress information during the search process including batch progress and timing information. Only used in get_seed_cpp() (default: TRUE).

...

Additional arguments passed to get_seed_cpp() when use_cpp = TRUE.

Details

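The functions work by systematically testing seed values to find one that reproduces the current RNG state stored in .Random.seed. If no tested seed reproduces the state, fallback_seed is returned.

Performance Considerations:

The C++ implementation (get_seed_cpp()) is significantly faster than the R implementation, particularly for large search spaces.

Search Strategy:

If range is provided, only those seed values are tested. Otherwise, seeds up to max_search are searched systematically using step_size.

Memory Management:

The C++ implementation processes seeds in batches whose size is controlled by batch_size. Larger batches require more RAM.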

Value

Returns an integer representing the seed value that reproduces the current random number generator state. If no matching seed is found within the search parameters, returns the fallback_seed value.

Note

See Also

set.seed, .Random.seed

Examples

## Basic seed recovery directly after setting a seed
set.seed(123)
recovered_seed <- get_seed()
print(recovered_seed)
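
The brute-force idea behind seed recovery can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: recover_seed_sketch() is a hypothetical helper that tests an explicit vector of candidate seeds (comparable to the range argument) and returns a fallback value when none of them reproduces the current state.

```r
# Hypothetical helper, not part of opdisDownsampling: brute-force seed recovery.
recover_seed_sketch <- function(candidates, fallback_seed = 42L) {
  # Without an initialised RNG state there is nothing to match against
  if (!exists(".Random.seed", envir = globalenv())) return(fallback_seed)
  # Snapshot of the RNG state that should be reproduced
  target <- get(".Random.seed", envir = globalenv())
  # Restore the original state on exit, because set.seed() overwrites it
  on.exit(assign(".Random.seed", target, envir = globalenv()), add = TRUE)
  for (s in candidates) {
    set.seed(s)
    if (identical(get(".Random.seed", envir = globalenv()), target)) {
      return(as.integer(s))
    }
  }
  fallback_seed
}

set.seed(123)
recovered <- recover_seed_sketch(1:1000)
print(recovered)
```

The sketch only succeeds when the state is compared before any random numbers have been drawn, since drawing numbers advances the state away from the one that set.seed() produced.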


Optimal Distribution Preserving Down-Sampling of Bio-Medical Data

Description

The package provides functions for optimal distribution-preserving down-sampling of large (bio-medical) data sets. It draws statistically representative subsets of data while preserving the class proportions and original data distribution.
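
To make "class-proportional" concrete, the following base-R sketch draws a random subset in which every class keeps its share of the data. It is an illustration only: proportional_sample() is a hypothetical helper, and the rounding rule for per-class counts is an assumption that may differ from what the package does.

```r
# Hypothetical helper: draw row indices so that each class keeps its proportion.
proportional_sample <- function(Cls, Size) {
  idx_by_class <- split(seq_along(Cls), Cls)
  # Size < 1 is read as a proportion, otherwise as an absolute number of instances
  frac <- if (Size < 1) Size else Size / length(Cls)
  unlist(lapply(idx_by_class, function(idx) {
    n <- round(length(idx) * frac)       # assumed rounding rule
    idx[sample.int(length(idx), n)]      # sample positions, then map back to row indices
  }), use.names = FALSE)
}

set.seed(42)
sel <- proportional_sample(iris$Species, Size = 0.2)
table(iris$Species[sel])
```

With the three equally sized iris classes, a proportion of 0.2 selects 10 instances per class.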

Usage

opdisDownsampling(Data, Cls, Size, Seed = "simple", nTrials = 1000,
  TestStat = "ad", MaxCores = getOption("mc.cores", 2L),
  PCAimportance = FALSE, JobSize = 0, verbose = FALSE)

Arguments

Data

Numeric data as a vector, matrix, or data frame. Each row represents an instance, each column a variable.

Cls

Optional vector with class labels for each instance in Data. If missing, all instances are treated as belonging to the same class.

Size

The number (integer) or proportion (0<Size<1) of instances to draw from the dataset. The reduction is class proportional and aims to preserve the variable distributions.

Seed

Seed value. Options: "auto" for seed recovery, "simple" to generate and report a seed using the current RNG state, or an integer for exact reproducibility. Use integers for systematic testing and fully reproducible analyses.

nTrials

Number of random sampling trials used to find the optimal subset (default: 1000).

TestStat

Character string defining the statistical test used to assess distribution similarity.

Available options are:

  • "ad": Anderson–Darling statistic

  • "kuiper": Kuiper statistic

  • "cvm": Cramér–von Mises statistic

  • "wass": Wasserstein distance

  • "dts": Distributional Transform Statistic

  • "ks": Kolmogorov–Smirnov statistic

  • "kld": Kullback–Leibler divergence (via KullbLeiblKLD2())

  • "amrdd": Average Mean Root of Distributional Differences (via amrdd())

  • "euc": Euclidean distance (via EucDist())

  • "nent": Absolute normalized entropy difference (via abs_norm_entropy_diff())
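
How such a statistic can be used to rank candidate subsets is sketched below in base R with a hand-written two-sample Kolmogorov-Smirnov statistic. The helpers ks_stat() and best_subset() are hypothetical, class proportionality is omitted for brevity, and scoring each trial by its largest per-variable statistic is an assumed aggregation rule rather than a documented detail of the package.

```r
# Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs.
ks_stat <- function(x, y) {
  grid <- sort(unique(c(x, y)))
  max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
}

# Sketch: draw nTrials random subsets and keep the one most similar to the full data.
best_subset <- function(Data, Size, nTrials = 100, seed = 42) {
  set.seed(seed)
  trials <- lapply(seq_len(nTrials), function(i) sample.int(nrow(Data), Size))
  # Score each trial by its worst (largest) statistic over all variables (assumption)
  scores <- vapply(trials, function(idx) {
    max(vapply(seq_len(ncol(Data)),
               function(j) ks_stat(Data[idx, j], Data[, j]),
               numeric(1)))
  }, numeric(1))
  list(ReducedInstances = trials[[which.min(scores)]],
       score = min(scores), scores = scores)
}

res <- best_subset(iris[, 1:4], Size = 50)
summary(res$scores)   # spread of the statistic across trials
res$score             # statistic of the selected subset
```

Any of the listed statistics could take the place of ks_stat() in this scheme, as long as smaller values indicate more similar distributions.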

MaxCores

Maximum number of CPU cores to use for parallel computing (default is value stored in getOption("mc.cores"), or 2 if missing).

PCAimportance

Logical; if TRUE, only variables deemed important by principal component analysis are used in computing similarity statistics.

JobSize

Integer specifying the number of trials to process in each chunk. If 0 (default), no chunking is applied. If NULL, an automatic chunk size is calculated based on data dimensions, number of trials, available system memory, and number of processor cores. A positive integer manually sets the chunk size.

verbose

Logical; if TRUE, prints diagnostic information about chunk-size selection, including data dimensions, estimated memory usage, and the chosen chunking strategy. Useful for understanding memory usage patterns and debugging performance issues.

Details

Chunked processing can be used to reduce memory usage when dealing with large datasets or high numbers of trials. Set JobSize = NULL to enable automatic memory-aware chunk-size calculation. The automatic chunking strategy considers the data dimensions, the number of trials, the available system memory, and the number of processor cores.

Set JobSize = 0 to process all trials in a single batch. Set JobSize to a positive integer to manually define the number of trials processed per chunk.
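
The effect of a manual chunk size can be illustrated with a short base-R sketch. make_chunks() is a hypothetical helper that only covers JobSize = 0 and positive integers. The automatic, memory-aware sizing used for JobSize = NULL is not modelled here.

```r
# Hypothetical helper: split trial numbers into chunks of at most JobSize trials.
make_chunks <- function(nTrials, JobSize = 0) {
  trial_ids <- seq_len(nTrials)
  if (JobSize == 0) return(list(trial_ids))          # single batch, no chunking
  split(trial_ids, ceiling(trial_ids / JobSize))     # consecutive chunks
}

chunks <- make_chunks(nTrials = 1000, JobSize = 300)
lengths(chunks)   # three chunks of 300 trials and a final chunk of 100
```

Only the results of one chunk need to be held in memory at a time, which is why smaller chunks can lower peak memory usage at the cost of more iterations.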

Variable Selection Method:

If PCAimportance = TRUE, PCA-based variable selection is used. Variables are ranked by their loadings in the first principal components, and variables with higher importance scores are used for distribution comparisons.
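
One way to turn PCA loadings into a variable ranking is sketched below with stats::prcomp(). The scoring rule (absolute loadings weighted by explained variance), the 80% variance threshold, and the mean-based cut-off are all assumptions made for illustration. The package's actual criteria are not documented in this section.

```r
# Sketch: rank variables by their loadings on the leading principal components.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the first components that together explain at least 80% of the variance (assumed)
n_pc <- which(cumsum(explained) >= 0.8)[1]

# Importance: absolute loadings weighted by explained variance (assumed scoring rule)
loadings <- abs(pca$rotation[, seq_len(n_pc), drop = FALSE])
importance <- sort((loadings %*% explained[seq_len(n_pc)])[, 1], decreasing = TRUE)

# Variables scoring at least the mean importance are retained (assumed cut-off)
selected <- names(importance)[importance >= mean(importance)]
print(importance)
print(selected)
```

Only the variables in selected would then enter the distribution comparison, which reduces the number of statistics computed per trial.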

Value

Returns a list with the following elements:

ReducedData

Down-sampled data set (as data frame or matrix) including only the selected instances.

RemovedData

Data not included in the sample.

ReducedInstances

Row indices (or names) of the selected instances from the original data set.

RemovedInstances

Row indices (or names) of the unselected instances from the original data set.

Author(s)

Jorn Lotsch

References

Lotsch, J., Malkusch, S., Ultsch, A. (2021): Optimal distribution-preserving downsampling of large biomedical data sets. PLoS ONE 16(8): e0255838. doi:10.1371/journal.pone.0255838

Examples

## Example: Down-sample the Iris dataset to 50 points
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, Seed = 42, MaxCores = 1)

## Example: Down-sample with custom chunk size and verbose output
data(iris)
Iris50percent <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, Seed = 42, MaxCores = 1, JobSize = 25, verbose = TRUE)

## Example: Use PCA-based variable selection
data(iris)
Iris_pca <- opdisDownsampling(Data = iris[,1:4], Cls = as.integer(iris$Species),
  Size = 50, Seed = 42, PCAimportance = TRUE, MaxCores = 1)

## Example: Memory-efficient processing of large dataset with many trials
## Not run: 
# For large datasets, automatic chunking can reduce memory usage
LargeDataSample <- opdisDownsampling(Data = large_dataset,
  Size = 0.1, Seed = 42, nTrials = 5000, JobSize = NULL, verbose = TRUE)

## End(Not run)
