Benchmarking bigPLScox

Frédéric Bertrand

Cedric, Cnam, Paris
frederic.bertrand@lecnam.net

2025-11-06

Motivation

High-dimensional survival datasets can be computationally demanding. bigPLScox implements algorithms that scale to large numbers of predictors and observations via component-based models, sparse penalties, and stochastic gradient descent routines. This vignette demonstrates how to benchmark the package against baseline approaches using the bench package.

We focus on simulated data to illustrate reproducible comparisons between the classical coxgpls() solver, its big-memory counterparts, and the survival::coxph() implementation.

Setup

The examples below require recent versions of bench and survival, together with bigmemory for the big-memory solvers and plsRcox for the reference PLS Cox implementation.

library(bigPLScox)
library(survival)
library(bench)
library(bigmemory)
library(plsRcox)

The helper dataCox() simulates survival outcomes with right censoring. We work with a moderately sized problem here, but larger values for n and p can be used to stress test performance.

set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)

cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)
X_big <- bigmemory::as.big.matrix(cox_data$x)
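
Before running the benchmarks, a quick sanity check confirms the dimensions of the design matrix and the proportion of observed events. This is a minimal sketch that only reuses the objects created above.

dim(cox_data$x)        # predictors stored as an ordinary matrix
dim(X_big)             # the same data held in a bigmemory big.matrix
mean(cox_data$status)  # proportion of uncensored (event) observations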

Running the benchmark

We compare the classical coxgpls() solver, the big-memory routines big_pls_cox() and big_pls_cox_gd(), and the reference survival::coxph() fit. The bench::mark() helper executes each estimator multiple times and records timing statistics alongside memory usage.

bench_res <- bench::mark(
  coxgpls = coxgpls(
    cox_data$x,
    cox_data$time,
    cox_data$status,
    ncomp = 5,
    ind.block.x = c(3, 10)
  ),
  big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status, ncomp = 5, max_iter = 100),
  survival = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x, ties = "breslow"),
  iterations = 100,
  check = FALSE
)
bench_res
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 coxgpls     17.64ms  18.02ms      55.3   30.95MB   383.  
#> 2 big_pls      7.72ms   7.88ms     127.     3.54MB    12.5 
#> 3 big_pls_gd  11.67ms  11.71ms      85.3    2.75MB     7.42
#> 4 survival    26.94ms  27.32ms      36.4   13.39MB    23.3

bench_summary <- bench_res[, c("expression", "median", "itr/sec")]
bench_summary
#> # A tibble: 4 × 3
#>   expression   median `itr/sec`
#>   <bch:expr> <bch:tm>     <dbl>
#> 1 coxgpls     18.02ms      55.3
#> 2 big_pls      7.88ms     127. 
#> 3 big_pls_gd  11.71ms      85.3
#> 4 survival    27.32ms      36.4

The resulting tibble reports elapsed time, memory allocations, and garbage collection statistics for each estimator. The itr/sec column is often the most useful indicator when comparing multiple implementations. The bench_summary object summarises the median runtime and iterations per second.
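
A convenient follow-up is the relative summary offered by bench, which rescales each column by its best value so that the ranking of the estimators becomes explicit. This reuses the bench_res object created above.

summary(bench_res, relative = TRUE)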

Visualising the results

bench provides ggplot2-based helpers to visualise the distributions of elapsed times and memory usage.

plot(bench_res, type = "jitter")

Additional geometries, such as ridge plots, are available via autoplot(bench_res, type = "ridge").
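
A minimal sketch of the ridge variant, assuming the ggplot2 and ggridges packages are installed, is shown below; the title is an arbitrary illustrative choice.

library(ggplot2)  # supplies the autoplot() generic used for bench results
autoplot(bench_res, type = "ridge") +
  labs(title = "Distribution of elapsed times per estimator")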

Accuracy on a shared test set

We use a common train/test split and compare the predictive performance of three estimators using the survival concordance index. The test set is identical for each method to ensure a fair comparison.

set.seed(4242)
split_id <- sample(c("train", "test"), size = nrow(cox_data$x), replace = TRUE, prob = c(0.7, 0.3))
train_idx <- split_id == "train"
test_idx <- !train_idx

train_x <- cox_data$x[train_idx, , drop = FALSE]
test_x <- cox_data$x[test_idx, , drop = FALSE]
train_time <- cox_data$time[train_idx]
test_time <- cox_data$time[test_idx]
train_status <- cox_data$status[train_idx]
test_status <- cox_data$status[test_idx]

big_fit <- big_pls_cox(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5)
gd_fit <- big_pls_cox_gd(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5, max_iter = 120)
pls_fit <- plsRcox::plsRcox(train_x, time = train_time, event = train_status, nt = 5, verbose = FALSE)

lp_big <- predict(big_fit, newdata = test_x, type = "link")
lp_gd <- predict(gd_fit, newdata = test_x, type = "link")
lp_pls <- predict(pls_fit, newdata = test_x)

surv_test <- survival::Surv(test_time, test_status)
acc_results <- data.frame(
  method = c("big_pls_cox", "big_pls_cox_gd", "plsRcox"),
  c_index = c(
    survival::concordance(surv_test ~ lp_big)$concordance,
    survival::concordance(surv_test ~ lp_gd)$concordance,
    survival::concordance(surv_test ~ lp_pls)$concordance
  )
)
acc_results
#>           method    c_index
#> 1    big_pls_cox 0.06829377
#> 2 big_pls_cox_gd 0.57433234
#> 3        plsRcox 0.06821958

All methods operate on the same train/test split, and the concordance index facilitates a direct comparison with the reference implementation from plsRcox.

Exporting benchmark results

Use write.csv() to store the benchmarking table as part of a reproducible pipeline. For larger studies, consider varying the number of latent components, sparsity constraints, or the dataset dimensions; a sketch of such a study follows the export example below.

if (!dir.exists("inst/benchmarks/results")) {
  dir.create("inst/benchmarks/results", recursive = TRUE)
}
bench_table <- bench_res[, 1:9]
# The expression column is a list of calls; convert it to character before writing.
bench_table$expression <- as.character(bench_table$expression)
write.csv(bench_table, file = "inst/benchmarks/results/benchmarking-demo.csv", row.names = FALSE)
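
As an illustration of such a study, the sketch below varies the number of latent components passed to coxgpls() and gathers the timings into a single table. The grid values and the iteration count are arbitrary choices for this example.

ncomp_grid <- c(2, 5, 10)
grid_res <- lapply(ncomp_grid, function(k) {
  res <- bench::mark(
    coxgpls(cox_data$x, cox_data$time, cox_data$status,
            ncomp = k, ind.block.x = c(3, 10)),
    iterations = 20,
    check = FALSE
  )
  res$ncomp <- k
  res
})
grid_res <- do.call(rbind, grid_res)
grid_res[, c("ncomp", "median", "itr/sec", "mem_alloc")]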

Reusing the benchmarking scripts

The package also ships with standalone scripts under inst/benchmarks/ that mirror this vignette while exposing additional configuration points. Run them from the repository root as:

Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R

Each script accepts environment variables to adjust the problem size and stores results under inst/benchmarks/results/ with time-stamped filenames.
