
Big-memory workflows with bigPLScox

Frédéric Bertrand

Cedric, Cnam, Paris
frederic.bertrand@lecnam.net

2025-11-06

Motivation

A central feature of bigPLScox is the ability to operate on file-backed bigmemory::big.matrix objects. This vignette demonstrates how to prepare such datasets, fit models with big_pls_cox() and big_pls_cox_gd(), and integrate them with cross-validation helpers. The examples complement the introductory vignette “Getting started with bigPLScox”.

Preparing a big.matrix

We simulate a moderately large design matrix and persist it to disk via bigmemory::filebacked.big.matrix(). Using file-backed storage allows models to train on datasets that exceed the available RAM.

library(bigPLScox)
library(bigmemory)

set.seed(2024)
n_obs <- 5000
n_pred <- 100

X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)

big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
  nrow = n_obs,
  ncol = n_pred,
  backingpath = big_dir,
  backingfile = "X.bin",
  descriptorfile = "X.desc",
  init = X_dense
)

The resulting big.matrix can be reopened in future sessions via its descriptor file. All big-memory modelling functions accept either an in-memory matrix or a big.matrix reference.
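As a minimal sketch, reattaching the matrix in a later session only requires the descriptor file (this assumes the backing files created above are still on disk):

```r
library(bigmemory)

# Reattach the file-backed matrix; no data is copied into RAM,
# the returned object is a reference into the on-disk backing file.
X_big2 <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big2)
```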

Fitting big-memory models

big_pls_cox() runs the classical PLS-Cox algorithm while streaming data from disk.

fit_big <- big_pls_cox(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5
)

head(fit_big$scores)
#>               [,1]          [,2]          [,3]          [,4]          [,5]
#> [1,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [2,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [3,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [4,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [5,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [6,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
str(fit_big)
#> List of 9
#>  $ scores  : num [1:5000, 1:5] -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 ...
#>  $ loadings: num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 ...
#>  $ weights : num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 ...
#>  $ center  : num [1:100] 0.982 0.982 0.982 0.982 0.982 ...
#>  $ scale   : num [1:100] 1.65e-07 1.65e-07 1.65e-07 1.65e-07 1.65e-07 ...
#>  $ cox_fit :List of 10
#>   ..$ coefficients     : num [1:5] NA NA 1.82e+41 NA NA
#>   ..$ var              : num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ loglik           : num [1:2] -26230 -26230
#>   ..$ score            : num 8.72e-20
#>   ..$ iter             : int 1
#>   ..$ linear.predictors: num [1:5000] 0.000392 0.000392 0.000392 0.000392 0.000392 ...
#>   ..$ residuals        : Named num [1:5000] 0.8605 0.7913 -0.0129 0.1737 0.9958 ...
#>   .. ..- attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
#>   ..$ means            : num [1:5] -4.45e-06 -1.67e-19 -1.67e-32 -1.71e-46 -1.15e-59
#>   ..$ method           : chr "efron"
#>   ..$ class            : chr "coxph"
#>  $ keepX   : int [1:5] 0 0 0 0 0
#>  $ time    : num [1:5000] 1.024 1.5182 0.1047 5.9911 0.0424 ...
#>  $ status  : num [1:5000] 1 1 0 1 1 1 1 1 0 1 ...
#>  - attr(*, "class")= chr "big_pls_cox"

The gradient-descent variant big_pls_cox_gd() uses stochastic optimisation and is well-suited for very large datasets.

fit_big_gd <- big_pls_cox_gd(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5,
  max_iter = 100,
  tol = 1e-4
)

head(fit_big_gd$scores)
str(fit_big_gd)

Both functions return objects that expose the latent scores and loading vectors, allowing downstream visualisations and diagnostics identical to their in-memory counterparts.
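As a sketch of such a diagnostic (assuming the `fit_big` object from above; the component names `scores` and `loadings` follow the structure printed earlier), the latent scores can be summarised and plotted with base R just like any numeric matrix:

```r
# Scores and loadings are ordinary numeric matrices, so the usual
# diagnostics apply; here we summarise and plot the first two components.
scores <- fit_big$scores
summary(scores[, 1:2])
plot(scores[, 1], scores[, 2],
     xlab = "Component 1", ylab = "Component 2",
     main = "PLS-Cox latent scores")
```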

Cross-validation on big matrices

Cross-validation for big-memory models is supported through the list interface. This enables streaming each fold directly from disk.

set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
  data_big,
  nt = 5,
  ncores = 1,
  ind.block.x = c(10, 40)
)
cv_big$opt_nt

For large experiments, consider combining foreach::foreach() with doParallel::registerDoParallel() to parallelise the folds.
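A minimal sketch of that setup is shown below. The body of the loop here is a placeholder; in practice each iteration would fit one resampling split (how the fold indices are built is up to the user and not prescribed by bigPLScox):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)   # two worker processes
registerDoParallel(cl)

# Placeholder fold computation: each iteration runs on its own worker
# and could instead refit the model on one training split.
fold_ids <- 1:5
res <- foreach(k = fold_ids, .combine = c) %dopar% {
  k^2
}

stopCluster(cl)
res
```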

Timing snapshot

The native C++ solvers substantially reduce wall-clock time compared to fitting through the R interface alone. The bench package provides convenient instrumentation; the chunk below only runs when it is available.

if (requireNamespace("bench", quietly = TRUE)) {
  bench::mark(
    streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
    gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
    iterations = 5,
    check = FALSE
  )
}
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 streaming    30.4ms   30.5ms      32.7    8.81MB     8.19
#> 2 gd             13ms   13.2ms      75.2    6.79MB    18.8

Deviance residuals with big matrices

Once a model has been fitted, we can evaluate deviance residuals using the C++ backend. Supplying the linear predictor avoids recomputing it in R and works with any matrix backend.

eta_big <- predict(fit_big, type = "link")
dr_cpp <- computeDR(time, status, engine = "cpp", eta = eta_big)
max(abs(dr_cpp - computeDR(time, status)))
#> [1] 3.877472

Cleaning up

Temporary backing files can be removed after the analysis. In production pipelines you will typically keep the descriptor file alongside the binary data.

rm(X_big)
file.remove(file.path(big_dir, "X.bin"))
#> [1] TRUE
file.remove(file.path(big_dir, "X.desc"))
#> [1] TRUE
unlink(big_dir, recursive = TRUE)

Additional resources
