
BigDataStatMeth


Overview

BigDataStatMeth provides efficient statistical methods and linear algebra operations for large-scale data analysis using block-wise algorithms and HDF5 storage. Designed for genomic, transcriptomic, and multi-omic data analysis, it enables processing datasets that exceed available RAM through intelligent data partitioning and disk-based computation.

The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.
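To illustrate the block-wise idea (this is a minimal base-R sketch of the concept, not the package's internals or API): a tall matrix is processed one row block at a time, so only a single block needs to be in memory, and the cross-product accumulates across blocks.

```r
# Block-wise cross-product: t(X) %*% X accumulated over row blocks.
# In BigDataStatMeth, each block would be read from HDF5 instead of
# sliced from an in-memory matrix.
blockwise_crossprod <- function(X, block_size = 100) {
  acc <- matrix(0, ncol(X), ncol(X))
  for (s in seq(1, nrow(X), by = block_size)) {
    e <- min(s + block_size - 1, nrow(X))
    blk <- X[s:e, , drop = FALSE]   # only this block is "in memory"
    acc <- acc + crossprod(blk)     # t(blk) %*% blk
  }
  acc
}

set.seed(1)
X <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)
all.equal(blockwise_crossprod(X), crossprod(X))  # TRUE
```

The same accumulate-over-blocks pattern underlies the disk-based operations described below, with HDF5 reads replacing the in-memory slicing.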

Key Features

- Block-wise algorithms for datasets larger than available RAM
- On-disk HDF5 storage for input matrices and results
- Parallel execution for computationally intensive operations
- Both R and C++ APIs for flexible workflow integration

Installation

From CRAN (Stable Release)

install.packages("BigDataStatMeth")

From GitHub (Development Version)

# Install devtools if needed
install.packages("devtools")

# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")

System Requirements

R packages:

- Matrix
- rhdf5 (Bioconductor)
- RcppEigen
- RSpectra

System dependencies:

- HDF5 library (>= 1.8)
- C++11-compatible compiler
- Windows: Rtools

Install Bioconductor dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install(c("rhdf5", "HDF5Array"))

Quick Start

Basic Workflow: PCA on Large Genomic Data

library(BigDataStatMeth)
library(rhdf5)

# Create HDF5 file from matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
  filename = "genomics.hdf5",
  group = "data",
  dataset = "genotypes",
  k = 4,              # Number of blocks
  bcenter = TRUE,     # Center data
  bscale = FALSE,     # Don't scale
  threads = 4         # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop

Working with HDF5 Files

# Matrix operations directly on HDF5
result <- bdblockmult_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A",
  B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  dataset = "matrix_A",
  k = 8,
  threads = 4
)

Core Functionality

Linear Algebra Operations

| Operation | R Function | Features |
|---|---|---|
| Matrix multiplication | bdblockmult_hdf5() | Block-wise, parallel, HDF5 |
| Cross-product | bdCrossprod_hdf5() | t(A) %*% A, t(A) %*% B |
| Transposed cross-product | bdtCrossprod_hdf5() | A %*% t(A), A %*% t(B) |
| SVD | bdSVD_hdf5() | Block-wise, hierarchical |
| QR decomposition | bdQR_hdf5() | Block-wise |
| Cholesky | bdCholesky_hdf5() | For positive-definite matrices |
| Matrix inversion | bdInvCholesky_hdf5() | Via Cholesky decomposition |
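In base-R terms, the two cross-product variants correspond to crossprod() and tcrossprod(); the small in-memory check below shows the identities the HDF5 versions compute block-wise.

```r
# crossprod(A)  == t(A) %*% A  (the bdCrossprod_hdf5 operation)
# tcrossprod(A) == A %*% t(A)  (the bdtCrossprod_hdf5 operation)
A <- matrix(1:6, nrow = 2, ncol = 3)
all.equal(crossprod(A),  t(A) %*% A)   # TRUE (3 x 3 result)
all.equal(tcrossprod(A), A %*% t(A))   # TRUE (2 x 2 result)
```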

Statistical Methods

| Method | R Function | Description |
|---|---|---|
| Principal Component Analysis | bdPCA_hdf5() | Block-wise PCA with centering/scaling |
| Singular Value Decomposition | bdSVD_hdf5() | Hierarchical block-wise SVD |
| Canonical Correlation Analysis | bdCCA_hdf5() | Multi-omic data integration |
| Linear Regression | bdlm_hdf5() | Large-scale regression models |
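PCA and SVD appear together above because PCA scores can be recovered from the SVD of the centered matrix, which is why a block-wise SVD also yields a block-wise PCA. A small in-memory sanity check of that relationship (using base R, not the package's functions):

```r
# PCA scores equal U %*% diag(d) from the SVD of the centered matrix,
# up to per-component sign flips (the usual SVD sign ambiguity).
set.seed(42)
M  <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)
Mc <- scale(M, center = TRUE, scale = FALSE)

sv         <- svd(Mc)
scores_svd <- sv$u %*% diag(sv$d)
scores_pca <- prcomp(M, center = TRUE, scale. = FALSE)$x

all.equal(abs(scores_svd), abs(scores_pca),
          check.attributes = FALSE)  # TRUE
```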

Data Management

| Operation | R Function | Purpose |
|---|---|---|
| Create HDF5 dataset | bdCreate_hdf5_matrix() | Initialize HDF5 files |
| Normalize data | bdNormalize_hdf5() | Center and/or scale |
| Remove low-quality data | bdRemovelowdata_hdf5() | Filter by missing values |
| Impute missing values | bdImputeSNPs_hdf5() | Mean/median imputation |
| Split datasets | bdSplit_matrix_hdf5() | Partition into blocks |
| Merge datasets | bdBind_hdf5_datasets() | Combine by rows/columns |
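Normalization on out-of-memory data relies on the same streaming pattern: column statistics are accumulated one row block at a time. A base-R sketch of that pattern (illustrative only; the package reads blocks from HDF5 instead):

```r
# Streaming column means over row blocks, as a block-wise normalization
# step would compute them without loading the full matrix.
streaming_colmeans <- function(X, block_size = 16) {
  sums <- numeric(ncol(X))
  n    <- 0
  for (s in seq(1, nrow(X), by = block_size)) {
    e    <- min(s + block_size - 1, nrow(X))
    blk  <- X[s:e, , drop = FALSE]
    sums <- sums + colSums(blk)
    n    <- n + nrow(blk)
  }
  sums / n
}

set.seed(7)
M <- matrix(rnorm(50 * 4, mean = 5), nrow = 50, ncol = 4)
all.equal(streaming_colmeans(M), colMeans(M))  # TRUE
```

Column standard deviations can be accumulated the same way, after which centering and scaling are applied block by block.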

Utility Functions

| Function | Purpose |
|---|---|
| bdgetDim_hdf5() | Get dataset dimensions |
| bdExists_hdf5_element() | Check if dataset exists |
| bdgetDatasetsList_hdf5() | List all datasets in group |
| bdRemove_hdf5_element() | Delete dataset or group |
| bdImportTextFile_hdf5() | Import text files to HDF5 |

Documentation

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/

Sections

Vignettes

# List available vignettes
vignette(package = "BigDataStatMeth")

# View specific vignette
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")

Performance

BigDataStatMeth is designed for efficiency: block-wise algorithms keep peak memory use bounded by the block size rather than the full matrix, and computationally intensive operations run in C++ (via RcppEigen) with optional multithreading.

Use Cases

BigDataStatMeth is particularly suited for:

- GWAS-scale genotype analysis (e.g., PCA for population structure)
- Multi-omic integration (e.g., CCA of gene expression and methylation)
- Any linear-algebra workflow on matrices larger than available RAM

Examples

Example 1: Genomic PCA with Quality Control

library(BigDataStatMeth)

# Load genomic data
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")

# Quality control
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps", 
                     pcent = 0.05, bycols = TRUE)  # Remove SNPs >5% missing

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered", 
                  k = 8, bcenter = TRUE, threads = 4)

# Plot results
plot(pca$components[,1], pca$components[,2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")

Example 2: Multi-Omic CCA

# Prepare data
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")

# Normalize
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes", 
                 bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
                 bcenter = TRUE, bscale = TRUE)

# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
  filename = "multi_omic.hdf5",
  X = "NORMALIZED/data/genes",
  Y = "NORMALIZED/data/cpgs",
  m = 10  # Number of blocks
)

# Extract canonical correlations
correlations <- h5read("multi_omic.hdf5", "Results/cor")

Example 3: Custom Method Development (C++ API)

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
  
  hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
  ds->openDataset();
  
  // Your custom algorithm using BigDataStatMeth functions
  // Block-wise processing, matrix operations, etc.
  
  delete ds;
}

See Developing Methods for complete examples.

Citation

If you use BigDataStatMeth in your research, please cite:

Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods 
for Big Data Using Block-wise Algorithms and HDF5 Storage. 
R package version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth

BibTeX entry:

@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

Getting Help

License

MIT License - see LICENSE file for details.

Authors

Dolors Pelegri-Siso
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Juan R. Gonzalez
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Acknowledgments

Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).
