A central feature of bigPLScox is the ability to
operate on file-backed bigmemory::big.matrix objects. This
vignette demonstrates how to prepare such datasets, fit models with
big_pls_cox() and big_pls_cox_gd(), and
integrate them with cross-validation helpers. The examples complement
the introductory vignette “Getting started with bigPLScox”.
We simulate a moderately large design matrix and persist it to disk
via bigmemory::filebacked.big.matrix(). Using file-backed
storage allows models to train on datasets that exceed the available
RAM.
library(bigPLScox)
library(bigmemory)
set.seed(2024)
n_obs <- 5000
n_pred <- 100
X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)
big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
nrow = n_obs,
ncol = n_pred,
backingpath = big_dir,
backingfile = "X.bin",
descriptorfile = "X.desc",
init = X_dense
)
The resulting big.matrix can be reopened in future
sessions via its descriptor file. All big-memory modelling functions
accept either an in-memory matrix or a big.matrix
reference.
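For example, a matrix written in one session can be reattached in a later one from its descriptor file (a minimal sketch using bigmemory's attach.big.matrix(); big_dir and "X.desc" are the backing path and descriptor file created above):

```r
library(bigmemory)

# Reattach the file-backed matrix from its on-disk descriptor.
X_big2 <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big2)  # same dimensions as the original matrix
```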
big_pls_cox() runs the classical PLS-Cox algorithm while
streaming data from disk.
fit_big <- big_pls_cox(
X = X_big,
time = time,
status = status,
ncomp = 5
)
head(fit_big$scores)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [2,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [3,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [4,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [5,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [6,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
str(fit_big)
#> List of 9
#> $ scores : num [1:5000, 1:5] -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 ...
#> $ loadings: num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ weights : num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 ...
#> $ center : num [1:100] 0.982 0.982 0.982 0.982 0.982 ...
#> $ scale : num [1:100] 1.65e-07 1.65e-07 1.65e-07 1.65e-07 1.65e-07 ...
#> $ cox_fit :List of 10
#> ..$ coefficients : num [1:5] NA NA 1.82e+41 NA NA
#> ..$ var : num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ loglik : num [1:2] -26230 -26230
#> ..$ score : num 8.72e-20
#> ..$ iter : int 1
#> ..$ linear.predictors: num [1:5000] 0.000392 0.000392 0.000392 0.000392 0.000392 ...
#> ..$ residuals : Named num [1:5000] 0.8605 0.7913 -0.0129 0.1737 0.9958 ...
#> .. ..- attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
#> ..$ means : num [1:5] -4.45e-06 -1.67e-19 -1.67e-32 -1.71e-46 -1.15e-59
#> ..$ method : chr "efron"
#> ..$ class : chr "coxph"
#> $ keepX : int [1:5] 0 0 0 0 0
#> $ time : num [1:5000] 1.024 1.5182 0.1047 5.9911 0.0424 ...
#> $ status : num [1:5000] 1 1 0 1 1 1 1 1 0 1 ...
#> - attr(*, "class")= chr "big_pls_cox"
The gradient-descent variant big_pls_cox_gd() uses
stochastic optimisation and is well suited to very large datasets.
fit_big_gd <- big_pls_cox_gd(
X = X_big,
time = time,
status = status,
ncomp = 5,
max_iter = 100,
tol = 1e-4
)
head(fit_big_gd$scores)
str(fit_big_gd)
Both functions return objects that expose the latent scores and loading vectors, allowing downstream visualisations and diagnostics identical to their in-memory counterparts.
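As an illustration, the latent scores can be inspected with ordinary base-R graphics (a sketch; it assumes the fit_big object and status vector from above):

```r
# Plot the first two latent components, coloured by event status.
plot(
  fit_big$scores[, 1], fit_big$scores[, 2],
  col = ifelse(status == 1, "firebrick", "steelblue"),
  pch = 16, cex = 0.4,
  xlab = "Component 1", ylab = "Component 2"
)
legend("topright", legend = c("event", "censored"),
       col = c("firebrick", "steelblue"), pch = 16)
```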
Cross-validation for big-memory models is supported through the list interface. This enables streaming each fold directly from disk.
set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
data_big,
nt = 5,
ncores = 1,
ind.block.x = c(10, 40)
)
cv_big$opt_nt
For large experiments, consider combining
foreach::foreach() with
doParallel::registerDoParallel() to parallelise folds.
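A minimal setup for such a parallel backend might look as follows (a sketch; it assumes the cross-validation helper dispatches fold-level work through foreach):

```r
library(doParallel)

cl <- parallel::makeCluster(2)  # adjust the worker count to your machine
registerDoParallel(cl)

# ... run the cross-validation here ...

parallel::stopCluster(cl)
```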
The native C++ solvers substantially reduce wall-clock time compared
to fitting through the R interface alone. The bench package
provides convenient instrumentation; the chunk below only runs when it
is available.
if (requireNamespace("bench", quietly = TRUE)) {
bench::mark(
streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
iterations = 5,
check = FALSE
)
}
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 streaming 30.4ms 30.5ms 32.7 8.81MB 8.19
#> 2 gd           13ms   13.2ms      75.2    6.79MB     18.8
Once a model has been fitted we can evaluate deviance residuals using the new C++ backend. Supplying the linear predictor avoids recomputing it in R and works with any matrix backend.
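Because the fitted object stores a coxph fit on the latent components, one way to obtain deviance residuals is through the survival package's residuals() method (shown here as a generic sketch, not the package's dedicated C++ entry point):

```r
# Deviance residuals from the stored Cox fit on the latent scores.
# residuals.coxph() is provided by the survival package.
dev_res <- residuals(fit_big$cox_fit, type = "deviance")
summary(dev_res)
```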
Temporary backing files can be removed after the analysis. In production pipelines you will typically keep the descriptor file alongside the binary data.
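For the temporary files created in this vignette, cleanup is a one-liner:

```r
# Remove the backing file, descriptor, and the temporary directory itself.
unlink(big_dir, recursive = TRUE)
```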
help(big_pls_cox) and help(big_pls_cox_gd)
document all tuning parameters for the big-memory solvers. Fitted model
objects can be persisted with saveRDS() to
avoid recomputing large models when iterating on analyses.
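A sketch of that workflow (note that the big.matrix backing data stays on disk and is not part of the serialised object; reattach it from its descriptor file in the new session):

```r
# Persist the fitted model for later reuse.
saveRDS(fit_big, file.path(big_dir, "fit_big.rds"))

# Later session: reload the model and reattach the design matrix.
fit_big <- readRDS(file.path(big_dir, "fit_big.rds"))
X_big   <- bigmemory::attach.big.matrix(file.path(big_dir, "X.desc"))
```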