Matrix operations, algebra and Statistical Methods for Big Data

Dolors Pelegri1,2,3* and Juan R. Gonzalez1,2,3**

1Universitat Autonoma de Barcelona - Universitat Autonoma de Barcelona (UAB)
2ISGlobal, Centre for Research in Environmental Epidemiology (ISGlobal)
3Bioinformatics Research Group in Epidemiolgy (BRGE)

*dolors.pelegri@isglobal.org
**juanr.gonzalez@isglobal.org

2025-12-19

Abstract

Description of functions to perform matrix operations, algebra and basic statistical analyses using HDF5 data files.

1 Overview

This package implements several matrix operations using Matrix objects as well as HDF5 data files. Some basic algebra operations that can also be computed that are useful to implement statistical analyses using standard methodologies such as principal component analyses (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic data analysis such as lasso regression.

2 Prerequisites

The package requires other packages from CRAN and Bioconductor to be installed.

CRAN: Matrix, RcppEigen and RSpectra.
Bioconductor: HDF5array, rhdf5

The user can execute this code to install the required packages:

# Install BiocManager (if not previously installed)
install.packages("BiocManager") 

# Install required packages
BiocManager::install(c("Matrix", "RcppEigen", "RSpectra",
                       "HDF5Array", "rhdf5"))

Our package needs to be installed from source code. In such cases, a collection of software (e.g. C, C++, Fortran, …) are required, mainly for Windows users. These programs can be installed using Rtools.

3 Install package

Once required packages and Rtools are installed, BigDataStatMeth package can be installed from our GitHub repository as follows:

# Install devtools and load library (if not previously installed)
install.packages("devtools") 
library(devtools)

# Install BigDataStatMeth 
install_github("isglobal-brge/BigDataStatMeth")

4 Getting started

First, let us start by loading the required packages to describe the main capabilities of the package

library(rhdf5)
library(BigDataStatMeth)

5 Previous knowledge

5.1 HDF5 data files

Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. It is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

HDF5 is a technology suite that makes possible the management of extremely large and complex data collections, can accommodate virtually every kind of data in a single file, sequences, images, SNP matrices, and every other type of data and metadata associated with an experiment.

There is no limit on the number or size of data objects in the collection, giving great flexibility for omic data. Is high-performance I/O with a rich set of integrated performance features that allow for access time and storage space optimizations

HDF5 file structure include only two major types of object:

Datasets, which are multidimensional arrays of a homogeneous type. For example, datasets for omics data could be to genomics, transcriptomics, epigenomics, proteomics and/or metabolomics experiments
Groups, which are container structures which can hold datasets and other groups

This results in a truly hierarchical, filesystem-like data format

Figure 1: HDF5 hierarchical structure

5.2 Basics in hdf5 files using R

Create hdf5 file

We have implemented the bdCreate_hdf5_matrix_file () function to create an hdf5 file with a group and a dataset in a single step. This function allows to create datasets from a standard R matrices from Matrix objects.

library(rhdf5)

set.seed(5234)
n <- 500
m <- 600
A <- matrix(rnorm(n*m,mean=0,sd=1), n,m)

# We also can create a dataset from R matrix object
bdCreate_hdf5_matrix(filename = "robject.hdf5", 
                     object = A,
                     group = "INPUT", 
                     dataset = "A",
                     overwriteFile = TRUE)
$fn
[1] "robject.hdf5"

$ds
[1] "INPUT/A"

We see 0 in the console, indicating that no errors where found when creating the hdf5 file. Notice that a file called “robject.hdf5” will be created in the working directory

list.files(pattern = "*.hdf5")
[1] "robject.hdf5"

Add datasets in hdf5 file

The function bdCreate_hdf5_matrix() also allows to add a dataset in a existing hdf5 file. We can create the dataset in any group, if group doesn’t exists in file, the group is created before append the dataset.

set.seed(5234)
n <- 500
m <- 1000
A <- matrix(rnorm(n*m,mean=0,sd=1), n, m)

set.seed(5234)
n <- 1000
m <- 12000
B <- matrix(rnorm(n*m,mean=3,sd=0.5), n, m)

# Path to HDF5 file
example_fn <- "delayed.hdf5"

# We create another data file (delayed.hdf5) with a matrix A.
# The group is called INPUT, overwriteFile is set to true to 
# overwrite a file if exists
bdCreate_hdf5_matrix(filename = example_fn, 
                     object = A, 
                     group = "INPUT", 
                     dataset = "A", 
                     overwriteFile = TRUE)
$fn
[1] "delayed.hdf5"

$ds
[1] "INPUT/A"

# And them, we add another matrix B to the same group
bdCreate_hdf5_matrix(object = B, 
                filename = example_fn, 
                group = "INPUT", 
                dataset = "B")
$fn
[1] "delayed.hdf5"

$ds
[1] "INPUT/B"

Open and get hdf5 content file

We can open an existing file show contents and access data using functions from rhdf5 package. rhdf5 is an R interface for HDF5. The file must always be opened before working with it.

With h5ls() function we get the the list of an hdf5 content without the need of open the file.

# Examine hierarchy before open file
h5ls(example_fn)
   group  name       otype dclass          dim
0      / INPUT   H5I_GROUP                    
1 /INPUT     A H5I_DATASET  FLOAT   500 x 1000
2 /INPUT     B H5I_DATASET  FLOAT 1000 x 12000

Once opened, it can be seen just typing its name. In that case, we only get the current level in the hierarchical tree

# Open file
h5fdelay <- H5Fopen(example_fn)
# Show hdf5 hierarchy (groups)
h5fdelay
HDF5 FILE 
        name /
    filename 

   name     otype dclass dim
0 INPUT H5I_GROUP

NOTE: We can also use hdfview

Access datasets data

The $ operator can be used to access the next group level, this operator reads the object from disk. We can assign the dataset contents to an R object in order to work with it.

bdata <- h5fdelay$INPUT$B
bdata[1:3,1:5]
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 3.562285 2.343400 2.952424 2.871667 3.569449
[2,] 2.702681 2.455068 3.006128 3.371495 3.256128
[3,] 3.057644 2.993982 3.833475 3.758272 2.448493

And finally, we close the opened files.

h5closeAll()

6 Convert text file to HDF5

Usually we have files with a large amount of data to be loaded into R. For Big data problem, R can’t deal with this data so we cannot efficiently work with then. In order to resolve this issue, you can work directly from data on disk with datasets stored in an HDF5 data file.

You can easily change your text files to HDF5 using the function bdImportTextFile_hdf5 (), to do that, we only have to define the input file with the filename parameter and the HDF5 destination dataset with parameters outputfile for the output file name, outGroup with the group to store the dataset and the dataset name with the parameter outDataset. It should be considered that:

This function only allows to import numeric data except for rownames and colnames where character are allowed.
Data can be delimited by different characters. By default, we use tabs (“\t”) but it can be changed easily with sep parameter.

In this example we convert the “colesterol.csv” file and we will save the data to a file ‘colesterol_file.hdf5’ under the group ‘COLESTEROLGROUP’ in the dataset ‘COLESTEROLDATASET’. In this example the text file contains column names so we set the parameter header = TRUE.

import_hdf5 <- bdImportTextFile_hdf5(filename = "colesterol.csv",
                                     sep=',', 
                                     outputfile = "colesterol_file.hdf5", 
                                     outGroup = "COLESTEROL", 
                                     outDataset = "COLESTEROLDATA", 
                                     header = TRUE,
                                     overwrite = TRUE)

If we observe the new file colesterol_file.hdf5 and its content


# Show content
h5ls(import_hdf5$fn)
                                 group                     name       otype
0                                    /               COLESTEROL   H5I_GROUP
1                          /COLESTEROL .COLESTEROLDATA_dimnames   H5I_GROUP
2 /COLESTEROL/.COLESTEROLDATA_dimnames                        2 H5I_DATASET
3                          /COLESTEROL           COLESTEROLDATA H5I_DATASET
    dclass       dim
0                   
1                   
2 COMPOUND        10
3    FLOAT 1000 x 10

# We can open the file and have access to the data
res_hdf5 <- h5read(import_hdf5$fn, import_hdf5$ds)
colnames_hdf5 <- h5read(import_hdf5$fn, import_hdf5$ds_cols)[,1]

# Show hdf5 content dataset
res_hdf5[1:5, 1:6]
         [,1]     [,2]     [,3]      [,4]     [,5]     [,6]
[1,] 223.7348 55.25039 14.90246 0.9026095 10.22067 1.450970
[2,] 248.1820 53.47404 25.12592 0.8710345 17.10225 1.002655
[3,] 180.2071 58.54268 12.95197 0.8882435 11.21530 1.073924
[4,] 200.1522 62.58284 28.27819 0.9361357 12.79399 1.129373
[5,] 234.3281 68.70254 13.21404 0.7775238 11.39689 1.235655

# Show colnames 
head(colnames_hdf5)
[1] "TCholesterol" "Age"          "Insulin"      "Creatinine"   "BUN"         
[6] "LLDR"        

# Show data with colnames
colnames(res_hdf5) <- colnames_hdf5
res_hdf5[1:5, 1:6]
     TCholesterol      Age  Insulin Creatinine      BUN     LLDR
[1,]     223.7348 55.25039 14.90246  0.9026095 10.22067 1.450970
[2,]     248.1820 53.47404 25.12592  0.8710345 17.10225 1.002655
[3,]     180.2071 58.54268 12.95197  0.8882435 11.21530 1.073924
[4,]     200.1522 62.58284 28.27819  0.9361357 12.79399 1.129373
[5,]     234.3281 68.70254 13.21404  0.7775238 11.39689 1.235655

NOTE: We can overwrite an existing hdf5 dataset with new data by setting overwrite = TRUE and file wit overwriteFile = TRUE.

7 Operation with HDF5 files

Once our data are available as hdf5 files, we can operate directly on it without having data on memory (i.e we have access from disk. In BigDataStatMeth we have implemented most of the common matrix operations and algebra functions to help users how are not familiar with this type of implementation. Next section provides several examples

7.1 HDF5 Matrix Multiplication

If we have the matrices stored in hdf5 data file, we can use the function bdblockmult_hdf5(). This function allows to perform a matrix multiplication from two matrices stored in a file.

The bdblockmult_hdf5() function in the BigDataStatMeth package performs efficient matrix multiplication by blocks, directly on datasets stored in an HDF5 file. This allows scalable computation even when the matrices do not fit into memory.

If the outgroup is not specified, the result is written to the "OUTPUT" group by default. Likewise, if the outdataset is not provided, the output dataset name is constructed as "A_x_B", where A and B are the names of the input datasets.


# Perform blockwise matrix multiplication
res <- bdblockmult_hdf5(filename = example_fn, group = "INPUT",
                        A = "A", B = "B", outgroup = "HDF5_RES")

# Show the content of the HDF5 file
h5ls(res$fn)
      group     name       otype dclass          dim
0         / HDF5_RES   H5I_GROUP                    
1 /HDF5_RES    A_x_B H5I_DATASET  FLOAT  500 x 12000
2         /    INPUT   H5I_GROUP                    
3    /INPUT        A H5I_DATASET  FLOAT   500 x 1000
4    /INPUT        B H5I_DATASET  FLOAT 1000 x 12000

We can then extract a portion of the result to validate correctness:

# Extract the result from HDF5
result_hdf5 <- h5read(res$fn, res$ds)[1:3, 1:5]
result_hdf5
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 146.46071 168.18088 185.33397 213.72110 198.21560
[2,] 165.83147 192.13943 197.74114 181.76050 175.02919
[3,] -29.83376 -48.24969 -67.16883 -53.60578 -46.85563

# Compute the same multiplication in R
result_r <- (A %*% B)[1:3, 1:5]
result_r
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 146.46071 168.18088 185.33397 213.72110 198.21560
[2,] 165.83147 192.13943 197.74114 181.76050 175.02919
[3,] -29.83376 -48.24969 -67.16883 -53.60578 -46.85563

# Compare both results
all.equal((A %*% B), h5read(res$fn, res$ds))
[1] TRUE

This confirms that the result obtained with the C++ HDF5-based backend is numerically identical to the result obtained using R’s in-memory multiplication.

7.2 Crossproduct and Transposed Crossproduct in HDF5

The functions bdCrossprod_hdf5() and bdtCrossprod_hdf5() from the BigDataStatMeth package implement efficient versions of the crossproduct operations using blockwise computation on HDF5 datasets. These functions allow working with large matrices that may not fit in memory, by performing the computation in chunks.

Both functions can operate on a single matrix or on two matrices:

If only matrix A is provided:
- bdCrossprod_hdf5() computes t(A) %*% A
- bdtCrossprod_hdf5() computes A %*% t(A)
If both matrices A and B are provided:
- bdCrossprod_hdf5() computes t(A) %*% B
- bdtCrossprod_hdf5() computes A %*% t(B)

By default, the output is stored in the "OUTPUT" group. If outdataset is not specified, the default dataset names are: - "Crossprod_A_x_B" for bdCrossprod_hdf5() - "tCrossprod_A_x_B" for bdtCrossprod_hdf5() where A and B refer to the names of the input datasets provided via the A and B arguments, respectively. If only matrix A is specified (i.e., B is missing), then the name becomes “Crossprod_A” or “tCrossprod_A”. This naming convention ensures traceability of the operation and the involved datasets within the HDF5 file structure.

Crossproduct of a single matrix


# Create example matrices
set.seed(123)
A <- matrix(rnorm(1000 * 200), nrow = 1000, ncol = 200)
B <- matrix(rnorm(1000 * 150), nrow = 1000, ncol = 150)
C <- matrix(rnorm(800 * 200),  nrow = 800,  ncol = 200) 

# Save matrices to HDF5 file using BigDataStatMeth
example_fn <- "delayed.hdf5"

bdCreate_hdf5_matrix(filename = example_fn, 
                     object = A,
                     group = "INPUT", 
                     dataset =  "A", 
                     overwriteFile = TRUE)
$fn
[1] "delayed.hdf5"

$ds
[1] "INPUT/A"

bdCreate_hdf5_matrix(filename = example_fn,
                     object = B,
                     group = "INPUT", 
                     dataset =  "B", 
                     overwriteFile = FALSE)
$fn
[1] "delayed.hdf5"

$ds
[1] "INPUT/B"

bdCreate_hdf5_matrix(filename = example_fn,
                     object = C,
                     group = "INPUT", 
                     dataset = "C", 
                     overwriteFile = FALSE)
$fn
[1] "delayed.hdf5"

$ds
[1] "INPUT/C"

# Compute t(A) %*% A
res_cross <- bdCrossprod_hdf5(filename = example_fn, 
                              group = "INPUT", 
                              A = "A")

# Show where the result is stored
h5ls(res_cross$fn)
    group            name       otype dclass        dim
0       /           INPUT   H5I_GROUP                  
1  /INPUT               A H5I_DATASET  FLOAT 1000 x 200
2  /INPUT               B H5I_DATASET  FLOAT 1000 x 150
3  /INPUT               C H5I_DATASET  FLOAT  800 x 200
4       /          OUTPUT   H5I_GROUP                  
5 /OUTPUT CrossProd_A_x_A H5I_DATASET  FLOAT  200 x 200

# Compare with R's crossprod
res_hdf5 <- h5read(res_cross$fn, res_cross$ds)
res_r <- crossprod(A)

all.equal(res_r, res_hdf5)
[1] TRUE

For bdtCrossprod_hdf5(), it is "tcrossprod_A".

Crossproduct of two matrices

# Compute t(A) %*% B
res_cross2 <- bdCrossprod_hdf5(filename = example_fn, 
                               group = "INPUT", 
                               A = "A", 
                               B = "B")

# Compare with R
res_hdf5 <- h5read(res_cross2$fn, res_cross2$ds)
res_r <- crossprod(A, B)

all.equal(res_r, res_hdf5)
[1] TRUE

Transposed Crossproduct of a single matrix

# Compute A %*% t(A)
res_tcross <- bdtCrossprod_hdf5(filename = example_fn, 
                                group = "INPUT", 
                                A = "A")

h5ls(res_tcross$fn)
    group             name       otype dclass         dim
0       /            INPUT   H5I_GROUP                   
1  /INPUT                A H5I_DATASET  FLOAT  1000 x 200
2  /INPUT                B H5I_DATASET  FLOAT  1000 x 150
3  /INPUT                C H5I_DATASET  FLOAT   800 x 200
4       /           OUTPUT   H5I_GROUP                   
5 /OUTPUT  CrossProd_A_x_A H5I_DATASET  FLOAT   200 x 200
6 /OUTPUT  CrossProd_A_x_B H5I_DATASET  FLOAT   200 x 150
7 /OUTPUT tCrossProd_A_x_A H5I_DATASET  FLOAT 1000 x 1000

res_hdf5_t <- h5read(res_tcross$fn, res_tcross$ds)
res_r_t <- tcrossprod(A)

all.equal(res_r_t, res_hdf5_t)
[1] TRUE

Transposed Crossproduct of two matrices

# Compute A %*% t(B)
res_tcross2 <- bdtCrossprod_hdf5(filename = example_fn, 
                                 group = "INPUT", 
                                 A = "A", 
                                 B = "C")

res_hdf5_t <- h5read(res_tcross2$fn, res_tcross2$ds)
res_r_t <- tcrossprod(A, C)

all.equal(res_r_t, res_hdf5_t)
[1] TRUE

These examples confirm that both crossproduct and transposed crossproduct computations done with the HDF5-based C++ backend are equivalent to the results produced by R’s crossprod() and tcrossprod() functions, while allowing better scalability and performance for large datasets.

Crossproduct Combinations Supported

The following table summarizes the types of crossproduct operations supported by the functions bdCrossprod_hdf5() and bdtCrossprod_hdf5() in the BigDataStatMeth package.

Function	Matrix A	Matrix B (optional)	Operation
`bdCrossprod_hdf5()`	A	—	`t(A) %*% A`
`bdCrossprod_hdf5()`	A	B	`t(A) %*% B`
`bdtCrossprod_hdf5()`	A	—	`A %*% t(A)`
`bdtCrossprod_hdf5()`	A	B	`A %*% t(B)`

These operations are executed efficiently using C++ backends and blockwise computation on HDF5 datasets, making them suitable for large-scale matrix operations.

7.3 Blockwise Matrix Subtraction in HDF5

The bdblockSubstract_hdf5() function performs blockwise element-wise subtraction of two matrices stored in an HDF5 file. It computes $A - B$, writing the result back into the file, without requiring either matrix to be fully loaded into memory.


# Create two matrices of the same dimensions
set.seed(42)
A_sub <- matrix(rnorm(1000 * 300), nrow = 1000, ncol = 300)
B_sub <- matrix(rnorm(1000 * 300), nrow = 1000, ncol = 300)

# Save them to HDF5
fn_sub <- "subtraction_example.hdf5"

bdCreate_hdf5_matrix(filename = fn_sub, object = A_sub,
                     group = "INPUT", dataset = "A_sub",
                     overwriteFile = TRUE)
$fn
[1] "subtraction_example.hdf5"

$ds
[1] "INPUT/A_sub"

bdCreate_hdf5_matrix(filename = fn_sub, object = B_sub,
                     group = "INPUT", dataset = "B_sub",
                     overwriteFile = FALSE)
$fn
[1] "subtraction_example.hdf5"

$ds
[1] "INPUT/B_sub"

# Perform subtraction: A - B
res_sub <- bdblockSubstract_hdf5(filename = fn_sub,
                                  group = "INPUT",
                                  A = "A_sub", B = "B_sub")

# Compare a subset with R
result_hdf5 <- h5read(res_sub$fn, res_sub$ds)
result_r <- A_sub - B_sub

all.equal(result_r, result_hdf5)
[1] TRUE

The result stored in HDF5 is numerically identical to the in-memory computation of $A - B$ in R, validating the blockwise approach.

7.4 Blockwise Matrix Addition in HDF5

The bdblockSum_hdf5() function performs blockwise element-wise addition of two matrices stored in HDF5 format. This is useful for efficiently computing $A + B$ on large matrices.

# Create two compatible matrices
set.seed(99)
A_add <- matrix(rnorm(800 * 250), nrow = 800, ncol = 250)
B_add <- matrix(rnorm(800 * 250), nrow = 800, ncol = 250)

# Save them to HDF5
fn_add <- "addition_example.hdf5"

bdCreate_hdf5_matrix(filename = fn_add, object = A_add,
                     group = "INPUT", dataset = "A",
                     overwriteFile = TRUE)
$fn
[1] "addition_example.hdf5"

$ds
[1] "INPUT/A"

bdCreate_hdf5_matrix(filename = fn_add, object = B_add,
                     group = "INPUT", dataset = "B",
                     overwriteFile = FALSE)
$fn
[1] "addition_example.hdf5"

$ds
[1] "INPUT/B"

# Perform addition: A + B
res_add <- bdblockSum_hdf5(filename = fn_add,
                            group = "INPUT",
                            A = "A", B = "B")

# Compare result with R
result_hdf5 <- h5read(res_add$fn, res_add$ds)
result_r <- A_add + B_add

all.equal(result_r, result_hdf5)
[1] TRUE

Like subtraction, the addition operation yields identical results to standard R arithmetic but can scale to matrices stored on disk using the HDF5 format.

8 Matrix Decompositions and Inverses

8.1 Singular Value Decomposition (SVD)

The SVD of an $m \times n$ real or complex matrix $A$ is a factorization of the form:

\[U\Sigma { V }^{ T }\]

where :

$U$ is a $m \times m$ real or complex unitary matrix
$\Sigma$ is a $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal
$V$ is a $n \times n$ real or complex unitary matrix.

Notice that:

The diagonal entries $\sigma_i$ of $\Sigma$ are known as the singular values of $A$.
The columns of $U$ are called the left-singular vectors of $A$.
The columns of $V$ are called the right-singular vectors of $A$.

8.1.1 Block Singular Values Decomposition

This method was developed by M. A. Iwen and B. W. Ong (2016). The authors introduced a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A, by first, partitioning it by columns. This creates a set of submatrices of A with the same number of rows, but only some of its columns. After that, the SVD of each of the submatrices is computed. The final step consists of combining the results obtained by merging them again and computing the SVD of the resulting matrix.

Figure 2: Flowchart for a two-level hierarchical Block SVD algorithm

This method is implemented in bdSVD_hdf5 function, this function works directly on hdf5 data format, loading in memory only the data to perform calculations and saving the results again in the hdf5 file for later use.

We have to indicate the file to work with, the dataset name and the group where the dataset is located :

# Create dataframe data with 'odata' matrix in delayed hdf5 file at OMIC group
set.seed(5234)
n <- 100
m <- 15000
omicdata <- matrix(rnorm(n*m, mean=0, sd=1), n,m)

bdCreate_hdf5_matrix(filename = example_fn, 
                     object = omicdata, 
                     group = "OMICS", 
                     dataset = "data", 
                     overwriteDataset = TRUE)
$fn
[1] "delayed.hdf5"

$ds
[1] "OMICS/data"

# Direct from hdf5 data file
svdh5 <- bdSVD_hdf5( filename = example_fn, 
                     group = "OMICS", 
                     dataset = "data", 
                     overwrite  = TRUE)

# get results svd (d) from hdf5 data file
svd_hdf5_d <- h5read(svdh5$fn, svdh5$ds_d)

# Results in hdf5 file for d
svd_hdf5_d[1:7]
[1] 131.8042 131.1643 130.6787 130.6179 130.5750 130.3000 130.1092

svd <- svd(scale(omicdata))
svd$d[1:7]
[1] 131.8042 131.1643 130.6787 130.6179 130.5750 130.3000 130.1092

Like in Simple Singular Value Decomposition we can normalize, center or scale data before proceed with SVD decomposition with bscale and bcenter parameters, by default this parameter are TRUE, data is normalized before SVD decomposition. To proceed with SVD without normalization :

# Direct from hdf5 data file (using only one thread, serial execution)
svdh5 <- bdSVD_hdf5( filename = example_fn, 
                     group = "OMICS", 
                     dataset = "data",
                     bcenter = FALSE, 
                     bscale = FALSE,
                     overwrite  = TRUE)

# get results svd (d)
svd_hdf5_d <- h5read(svdh5$fn, svdh5$ds_d)[1:7]
# SVD (d) from file - data not normalized
svd_hdf5_d
[1] 132.0820 131.4080 130.9686 130.8555 130.6781 130.4269 130.3248

# with R implementation from data in memory
svd <- svd(omicdata)
svd$d[1:7]
[1] 132.0820 131.4080 130.9686 130.8555 130.6781 130.4269 130.3248

In the SVD decomposition by blocks we can indicate the number of decomposition levels and number of local SVDs to concatenate at each level with parameters q and k respectively, by default q = 1 one level with k=2.

# Block decomposition with 1 level and 4 local SVDs at each level using 
# two threads (as maximum)
svdh5 <- bdSVD_hdf5( filename = example_fn, 
                     group = "OMICS", 
                     dataset = "data",
                     q = 1, 
                     k = 4, 
                     threads = 2,
                     overwrite  = TRUE)

# get results svd (d)
svd_hdf5_d <- h5read(svdh5$fn, svdh5$ds_d)[1:7]

# SVD (d) from file - data not normalized
svd_hdf5_d
[1] 131.8042 131.1643 130.6787 130.6179 130.5750 130.3000 130.1092

# with R implementation from data in memory
svd <- svd(scale(omicdata))
svd$d[1:7]
[1] 131.8042 131.1643 130.6787 130.6179 130.5750 130.3000 130.1092

8.2 Cholesky Decomposition

The Cholesky decomposition is a matrix factorization technique applicable to symmetric, positive-definite matrices. It decomposes a matrix $A$ into the product of a lower triangular matrix $L$ and its transpose, such that: \[ A = L \cdot L^\top \] This method is particularly useful for solving systems of linear equations, inverting matrices, and performing efficient numerical computations.

The BigDataStatMeth package provides the function bdInvCholesky_hdf5() to compute the inverse of a symmetric, positive-definite matrix stored in HDF5 format using the Cholesky decomposition. This function operates directly on disk-stored data, minimizing memory usage while maintaining computational efficiency. If the input matrix meets the necessary conditions (Hermitian and positive-definite), the function performs the factorization and inversion in a blockwise manner, making it suitable for large-scale data.

Here’s how to use bdInvCholesky_hdf5():

N <- 100
set.seed(5234)
Y <- matrix(rnorm(N*N), N, N)
Ycp <- crossprod(Y)

bdCreate_hdf5_matrix(filename = example_fn, 
                     object = Ycp, 
                     group = "chol", 
                     dataset = "data",
                     transp = FALSE,
                     overwriteFile = TRUE, overwriteDataset = TRUE, 
                     unlimited = FALSE)
$fn
[1] "delayed.hdf5"

$ds
[1] "chol/data"

cholh5 <- bdCholesky_hdf5(filename = example_fn, 
                          group = "chol", 
                          dataset = "data",
                          outdataset = "matrixDec", 
                          outgroup = "Cholesky_Dec",
                          overwrite = TRUE)
Warning: 'memory.size()' is Windows-specific
choldesc_hdf5 <-  h5read(cholh5$fn, cholh5$ds)
choldesc_hdf5[1:3,1:5]
         [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 9.726702  2.161678  1.3276203 -0.8431624  0.6066569
[2,] 0.000000 10.339840 -0.1125209 -0.3722274  0.2753345
[3,] 0.000000  0.000000  9.9295256  0.9489195 -0.3208095

choldesc_r <- chol(Ycp)
choldesc_r[1:3,1:5]
         [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 9.726702  2.161678  1.3276203 -0.8431624  0.6066569
[2,] 0.000000 10.339840 -0.1125209 -0.3722274  0.2753345
[3,] 0.000000  0.000000  9.9295256  0.9489195 -0.3208095

all.equal(choldesc_hdf5, choldesc_r)
[1] TRUE

The computation can be further optimized by specifying the elementsBlock parameter, which determines the approximate number of elements to include in each processing block. For example, setting elementsBlock = 50 instructs the function to process the matrix in chunks of approximately 50 elements, balancing speed and memory use.

cholh5 <- bdCholesky_hdf5(filename = example_fn, 
                          group = "chol", 
                          dataset = "data",
                          outdataset = "matrixDec_50", 
                          outgroup = "Cholesky_Dec",
                          elementsBlock = 50,
                          overwrite = TRUE)
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific

# Result dataset
cholh5$ds
[1] "Cholesky_Dec/matrixDec_50"

choldesc_hdf5 <-  h5read(cholh5$fn, cholh5$ds)
choldesc_hdf5[1:3,1:5]
         [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 9.726702  2.161678  1.3276203 -0.8431624  0.6066569
[2,] 0.000000 10.339840 -0.1125209 -0.3722274  0.2753345
[3,] 0.000000  0.000000  9.9295256  0.9489195 -0.3208095

choldesc_r <- chol(Ycp)
choldesc_r[1:3,1:5]
         [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 9.726702  2.161678  1.3276203 -0.8431624  0.6066569
[2,] 0.000000 10.339840 -0.1125209 -0.3722274  0.2753345
[3,] 0.000000  0.000000  9.9295256  0.9489195 -0.3208095

all.equal(choldesc_hdf5, choldesc_r)
[1] TRUE

8.3 QR Decomposition

The QR decomposition is a matrix factorization technique in which a matrix $A$ is decomposed into the product of an orthogonal matrix $Q$ and an upper triangular matrix $R$, such that:

\[ A = Q \cdot R \]

This decomposition is widely used in numerical linear algebra for solving least squares problems, eigenvalue computations, and matrix inversion. In the context of large-scale data, computing the QR decomposition efficiently and in a memory-safe manner becomes essential.

The BigDataStatMeth package implements two functions to compute the inverse of a matrix using QR decomposition:

bdQR() for in-memory matrices.
bdQR_hdf5() for matrices stored in HDF5 format.

The computation is done in blocks, allowing it to scale to datasets that do not fit in memory. The result can be used in further downstream statistical procedures, such as CCA or regression.

In this example, we apply the QR decomposition to the same matrix data inside the group chol that was created and stored in HDF5 format in the previous section.

QRh5 <- bdQR_hdf5(filename = example_fn,
                  group = "chol",
                  dataset = "data",
                  outgroup = "QR_Dec",thin = TRUE,
                  overwrite = TRUE)

# Result dataset
QR_Q_hdf5 <-  h5read(QRh5$fn, QRh5$ds_Q)
QR_R_hdf5 <-  h5read(QRh5$fn, QRh5$ds_R)

# Q matrix
QR_Q_hdf5[1:3,1:5]
            [,1]        [,2]        [,3]         [,4]         [,5]
[1,] -0.68551057  0.04546917  0.01531785 0.0008194793  0.034269378
[2,] -0.15234901 -0.74590341  0.02851012 0.0182487534  0.005718736
[3,] -0.09356695  0.01443328 -0.70676468 0.0584085661 -0.032816534

# R matrix
QR_R_hdf5[1:3,1:5]
          [,1]       [,2]        [,3]        [,4]       [,5]
[1,] -138.0121  -40.05205  -22.151218  11.2513535 -14.627475
[2,]    0.0000 -141.41677   -3.081843   0.6397727  -3.862321
[3,]    0.0000    0.00000 -139.144722 -25.7224753   8.435610

# Q matrix in R
QR_Q_r <- qr.Q(qr(Ycp))
QR_Q_r[1:3,1:5]
            [,1]        [,2]        [,3]         [,4]         [,5]
[1,] -0.68551057  0.04546917  0.01531785 0.0008194793  0.034269378
[2,] -0.15234901 -0.74590341  0.02851012 0.0182487534  0.005718736
[3,] -0.09356695  0.01443328 -0.70676468 0.0584085661 -0.032816534

all.equal(QR_Q_hdf5, QR_Q_r)
[1] TRUE

In this example, we show how to control the block size used during the computation by setting the block_size parameter. Adjusting this value can help improve performance depending on the available memory and system architecture.

QRh5 <- bdQR_hdf5(filename = example_fn,
                  group = "chol",
                  dataset = "data",
                  outgroup = "QR_Dec",
                  block_size = 256,
                  overwrite = TRUE)

# Result dataset
QR_Q_hdf5 <-  h5read(QRh5$fn, QRh5$ds_Q)
QR_R_hdf5 <-  h5read(QRh5$fn, QRh5$ds_R)

# Q matrix
QR_Q_hdf5[1:3,1:5]
            [,1]        [,2]        [,3]         [,4]         [,5]
[1,] -0.68551057  0.04546917  0.01531785 0.0008194793  0.034269378
[2,] -0.15234901 -0.74590341  0.02851012 0.0182487534  0.005718736
[3,] -0.09356695  0.01443328 -0.70676468 0.0584085661 -0.032816534

# R matrix
QR_R_hdf5[1:3,1:5]
          [,1]       [,2]        [,3]        [,4]       [,5]
[1,] -138.0121  -40.05205  -22.151218  11.2513535 -14.627475
[2,]    0.0000 -141.41677   -3.081843   0.6397727  -3.862321
[3,]    0.0000    0.00000 -139.144722 -25.7224753   8.435610

# Q matrix in R
QR_Q_r <- qr.Q(qr(Ycp))
QR_Q_r[1:3,1:5]
            [,1]        [,2]        [,3]         [,4]         [,5]
[1,] -0.68551057  0.04546917  0.01531785 0.0008194793  0.034269378
[2,] -0.15234901 -0.74590341  0.02851012 0.0182487534  0.005718736
[3,] -0.09356695  0.01443328 -0.70676468 0.0584085661 -0.032816534

all.equal(QR_Q_hdf5, QR_Q_r)
[1] TRUE

8.4 Cholesky-based Matrix Inversion

The Cholesky decomposition is a matrix factorization technique applicable to symmetric, positive-definite matrices. It expresses a matrix $A$ as the product of a lower triangular matrix $L$ and its transpose:

\[ A = L \cdot L^\top \]

This decomposition is particularly useful for solving linear systems and computing matrix inverses in a numerically stable and efficient way.

The BigDataStatMeth package implements two functions to compute the inverse of a matrix using Cholesky decomposition:

bdInvCholesky() for in-memory matrices.
bdInvCholesky_hdf5() for matrices stored in HDF5 format.

These functions are especially useful when dealing with large covariance matrices in genomics or other high-dimensional applications.

invCholh5 <- bdInvCholesky_hdf5( filename = example_fn,
                                 group = "chol",
                                 dataset = "data",
                                 outdataset = "invmatrix", 
                                 outgroup = "InvCholesky", 
                                 fullMatrix = FALSE, 
                                 overwrite = TRUE)
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific

# Result dataset
invChol_hdf5 <-  h5read(invCholh5$fn, invCholh5$ds)
invChol_hdf5[1:5,1:5]
         [,1]      [,2]      [,3]       [,4]       [,5]
[1,] 70.55445 -59.13169  5.832314 -20.482406 -57.516370
[2,]  0.00000  50.61402 -4.833963  17.358636  49.337577
[3,]  0.00000   0.00000  0.915268  -1.606462  -4.476971
[4,]  0.00000   0.00000  0.000000   6.222298  16.988061
[5,]  0.00000   0.00000  0.000000   0.000000  48.897665

The argument fullMatrix is a logical flag that controls how the inverse is stored: - If fullMatrix = TRUE, the entire inverse matrix is stored in the output file. - If fullMatrix = FALSE (default), only the lower triangular part is stored, which reduces disk usage and is sufficient when the matrix is symmetric.

invCholh5 <- bdInvCholesky_hdf5( filename = example_fn,
                                 group = "chol",
                                 dataset = "data",
                                 outdataset = "invmatrix", 
                                 outgroup = "InvCholesky", 
                                 fullMatrix = TRUE, 
                                 overwrite = TRUE)
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific
Warning: 'memory.size()' is Windows-specific

# Result dataset
invChol_hdf5 <-  h5read(invCholh5$fn, invCholh5$ds)
invChol_hdf5[1:5,1:5]
           [,1]       [,2]      [,3]       [,4]       [,5]
[1,]  70.554448 -59.131685  5.832314 -20.482406 -57.516370
[2,] -59.131685  50.614021 -4.833963  17.358636  49.337577
[3,]   5.832314  -4.833963  0.915268  -1.606462  -4.476971
[4,] -20.482406  17.358636 -1.606462   6.222298  16.988061
[5,] -57.516370  49.337577 -4.476971  16.988061  48.897665

# Inverse Cholesky matrix in R
invChol_r <- solve(Ycp)
invChol_r[1:5,1:5]
           [,1]       [,2]      [,3]       [,4]       [,5]
[1,]  70.554448 -59.131685  5.832314 -20.482406 -57.516370
[2,] -59.131685  50.614021 -4.833963  17.358636  49.337577
[3,]   5.832314  -4.833963  0.915268  -1.606462  -4.476971
[4,] -20.482406  17.358636 -1.606462   6.222298  16.988061
[5,] -57.516370  49.337577 -4.476971  16.988061  48.897665

all.equal(invChol_hdf5, invChol_r)
[1] TRUE

9 Session information

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BigDataStatMeth_1.0.3 rhdf5_2.52.1          knitr_1.50           
[4] BiocStyle_2.36.0     

loaded via a namespace (and not attached):
 [1] cli_3.6.5           rlang_1.1.6         xfun_0.52          
 [4] data.table_1.17.8   jsonlite_2.0.0      RCurl_1.98-1.17    
 [7] htmltools_0.5.8.1   sass_0.4.10         rmarkdown_2.29     
[10] evaluate_1.0.4      jquerylib_0.1.4     bitops_1.0-9       
[13] fastmap_1.2.0       yaml_2.3.10         lifecycle_1.0.4    
[16] Rhdf5lib_1.30.0     bookdown_0.43       BiocManager_1.30.26
[19] compiler_4.5.1      Rcpp_1.1.0          rhdf5filters_1.20.0
[22] rstudioapi_0.17.1   digest_0.6.37       R6_2.6.1           
[25] bslib_0.9.0         tools_4.5.1         cachem_1.1.0