BigDataStatMeth provides efficient statistical methods and linear algebra operations for large-scale data analysis using block-wise algorithms and HDF5 storage. Designed for genomic, transcriptomic, and multi-omic data analysis, it enables processing datasets that exceed available RAM through intelligent data partitioning and disk-based computation.
The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.
install.packages("BigDataStatMeth")# Install devtools if needed
install.packages("devtools")
# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")R packages: - Matrix - rhdf5 (Bioconductor) - RcppEigen - RSpectra
System dependencies:

- HDF5 library (>= 1.8)
- C++11 compatible compiler
- For Windows: Rtools
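The CRAN packages can be installed in one call:

```r
# Matrix, RcppEigen and RSpectra are all available from CRAN
install.packages(c("Matrix", "RcppEigen", "RSpectra"))
```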
Install Bioconductor dependencies:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("rhdf5", "HDF5Array"))
```

```r
library(BigDataStatMeth)
library(rhdf5)

# Create HDF5 file from matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
    filename = "genomics.hdf5",
    object = genotype_matrix,
    group = "data",
    dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
    filename = "genomics.hdf5",
    group = "data",
    dataset = "genotypes",
    k = 4,          # Number of blocks
    bcenter = TRUE, # Center data
    bscale = FALSE, # Don't scale
    threads = 4     # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop
```
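The returned proportions can be summarized directly. For example, a base-R sketch (assuming `variance_prop` is a numeric vector with one proportion per component):

```r
# Scree-style plot of the variance explained by each component
barplot(variance_explained,
        names.arg = seq_along(variance_explained),
        xlab = "Component", ylab = "Proportion of variance")
```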
Matrix operations can be applied directly to datasets stored in HDF5:

```r
# Block-wise matrix multiplication
result <- bdblockmult_hdf5(
    filename = "data.hdf5",
    group = "matrices",
    A = "matrix_A",
    B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
    filename = "data.hdf5",
    group = "matrices",
    A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
    filename = "data.hdf5",
    group = "matrices",
    dataset = "matrix_A",
    k = 8,
    threads = 4
)
```

| Operation | R Function | Features |
|---|---|---|
| Matrix multiplication | `bdblockmult_hdf5()` | Block-wise, parallel, HDF5 |
| Cross-product | `bdCrossprod_hdf5()` | `t(A) %*% A`, `t(A) %*% B` |
| Transposed cross-product | `bdtCrossprod_hdf5()` | `A %*% t(A)`, `A %*% t(B)` |
| SVD | `bdSVD_hdf5()` | Block-wise, hierarchical |
| QR decomposition | `bdQR_hdf5()` | Block-wise |
| Cholesky | `bdCholesky_hdf5()` | For positive-definite matrices |
| Matrix inversion | `bdInvCholesky_hdf5()` | Via Cholesky decomposition |
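For instance, the Cholesky-based inversion follows the same calling pattern as the operations above. A sketch (the argument names are assumptions modeled on the other `bd*_hdf5` calls; see `?bdInvCholesky_hdf5` for the exact signature):

```r
# Invert a positive-definite matrix via its Cholesky factor
# (hypothetical arguments following the filename/group/dataset pattern)
inv_result <- bdInvCholesky_hdf5(
    filename = "data.hdf5",
    group = "matrices",
    dataset = "matrix_A"
)
```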
| Method | R Function | Description |
|---|---|---|
| Principal Component Analysis | `bdPCA_hdf5()` | Block-wise PCA with centering/scaling |
| Singular Value Decomposition | `bdSVD_hdf5()` | Hierarchical block-wise SVD |
| Canonical Correlation Analysis | `bdCCA_hdf5()` | Multi-omic data integration |
| Linear Regression | `bdlm_hdf5()` | Large-scale regression models |
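As an illustration, a large-scale regression might be set up in the same style as the `bdCCA_hdf5()` example further below. This is a hypothetical sketch; the `X` and `Y` argument names are assumptions, not the documented signature:

```r
# Hypothetical: regress an outcome dataset on predictors stored
# in the same HDF5 file (argument names assumed, check ?bdlm_hdf5)
fit <- bdlm_hdf5(
    filename = "data.hdf5",
    X = "data/predictors",
    Y = "data/outcome"
)
```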
| Operation | R Function | Purpose |
|---|---|---|
| Create HDF5 dataset | `bdCreate_hdf5_matrix()` | Initialize HDF5 files |
| Normalize data | `bdNormalize_hdf5()` | Center and/or scale |
| Remove low-quality data | `bdRemovelowdata_hdf5()` | Filter by missing values |
| Impute missing values | `bdImputeSNPs_hdf5()` | Mean/median imputation |
| Split datasets | `bdSplit_matrix_hdf5()` | Partition into blocks |
| Merge datasets | `bdBind_hdf5_datasets()` | Combine by rows/columns |
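For example, normalizing a stored dataset uses the filename/group/dataset pattern (the same call appears in the multi-omic workflow below):

```r
# Center and scale the "genotypes" dataset created in the quick start
bdNormalize_hdf5("genomics.hdf5", "data", "genotypes",
                 bcenter = TRUE, bscale = TRUE)
```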
| Function | Purpose |
|---|---|
| `bdgetDim_hdf5()` | Get dataset dimensions |
| `bdExists_hdf5_element()` | Check if dataset exists |
| `bdgetDatasetsList_hdf5()` | List all datasets in group |
| `bdRemove_hdf5_element()` | Delete dataset or group |
| `bdImportTextFile_hdf5()` | Import text files to HDF5 |
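A typical check-before-compute pattern might look like the following sketch; the argument forms are assumptions based on the filename-plus-path convention used elsewhere in the package:

```r
# Hypothetical usage (argument forms assumed)
if (bdExists_hdf5_element("genomics.hdf5", "data/genotypes")) {
    dims <- bdgetDim_hdf5("genomics.hdf5", "data/genotypes")
    print(dims)
}
```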
Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/
```r
# List available vignettes
vignette(package = "BigDataStatMeth")

# View specific vignettes
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")
```

BigDataStatMeth is designed for efficiency, combining block-wise algorithms, disk-based HDF5 storage, and multithreaded C++ routines. It is particularly suited to analyses whose data exceed available RAM, such as the genomic and multi-omic workflows below.
```r
library(BigDataStatMeth)

# Load genomic data
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")

# Quality control: remove SNPs with > 5% missing values
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps",
                     pcent = 0.05, bycols = TRUE)

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered",
                  k = 8, bcenter = TRUE, threads = 4)

# Plot results
plot(pca$components[, 1], pca$components[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")
```
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")
# Normalize
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes",
bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
bcenter = TRUE, bscale = TRUE)
# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
filename = "multi_omic.hdf5",
X = "NORMALIZED/data/genes",
Y = "NORMALIZED/data/cpgs",
m = 10 # Number of blocks
)
# Extract canonical correlations
correlations <- h5read("multi_omic.hdf5", "Results/cor")#include <Rcpp.h>
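Because results are written to the HDF5 file, the rhdf5 helpers can be used to see what a run produced:

```r
# List the groups and datasets in the results file
library(rhdf5)
h5ls("multi_omic.hdf5")
```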
#include "BigDataStatMeth.hpp"
using namespace BigDataStatMeth;
// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
ds->openDataset();
// Your custom algorithm using BigDataStatMeth functions
// Block-wise processing, matrix operations, etc.
delete ds;
}See Developing Methods for complete examples.
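To compile and try the snippet above, one option is Rcpp's `sourceCpp()`. This sketch assumes the code is saved as `custom_analysis.cpp` and that the BigDataStatMeth headers are on the include path (for instance via an `// [[Rcpp::depends(BigDataStatMeth)]]` attribute in the file):

```r
# Compile and load the exported function, then run it on the
# quick-start file (file and dataset names from the examples above)
Rcpp::sourceCpp("custom_analysis.cpp")
custom_analysis("genomics.hdf5", "data/genotypes")
```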
If you use BigDataStatMeth in your research, please cite:
Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods
for Big Data Using Block-wise Algorithms and HDF5 Storage.
R package version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth
BibTeX entry:
```bibtex
@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}
```

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Please run `R CMD check` before submitting.

MIT License - see LICENSE file for details.
Dolors Pelegri-Siso
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health
Juan R. Gonzalez
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health
Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).