The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider. If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Random Forest Survival Analysis with ggRandomForests
John Ehrlinger
2026-04-28
Work in progress
This vignette is under active development. Code examples and narrative may change before the next release.
Introduction
Random forests (Breiman 2001) are a non-parametric ensemble method that requires no distributional assumptions on the relation between covariates and the response. Random survival forests (RSF) (Ishwaran et al. 2008; Ishwaran and Kogalur 2007) extend the method to right-censored, time-to-event data by growing trees with a log-rank splitting rule and aggregating Kaplan–Meier estimates within terminal nodes.
The randomForestSRC package (Ishwaran and Kogalur 2024) provides a unified implementation for survival, regression, and classification forests. ggRandomForests extracts tidy data objects from rfsrc fits and renders them with ggplot2(Wickham 2016), making it straightforward to explore how a forest is constructed, which variables matter, and how the response depends on individual predictors.
This vignette demonstrates a complete random survival forest workflow on the primary biliary cirrhosis (PBC) data set (Fleming and Harrington 1991):
Data preparation and exploration — cleaning, EDA, Kaplan–Meier curves
Growing the forest — fitting an RSF, checking convergence and OOB error
Variable selection — VIMP and minimal depth via max.subtree()
Dependence plots — variable dependence and partial dependence via gg_variable() and gg_partial_rfsrc()
Variable interactions — conditioning plots and interactive 3-D partial dependence surfaces with plotly
library(ggplot2)library(dplyr)library(tidyr)library(randomForestSRC)library(survival)if (requireNamespace("ggRandomForests", quietly =TRUE)) {library(ggRandomForests)} elseif (requireNamespace("pkgload", quietly =TRUE)) { pkgload::load_all(export_all =FALSE, helpers =FALSE, attach_testthat =FALSE)} else {stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")}theme_set(theme_bw())# Plotting constantsevent_marks <-c(1, 4)event_labels <-c("Censored", "Death")event_colors <-c("steelblue", "firebrick")
Data: Primary Biliary Cirrhosis (PBC)
The PBC study consists of 424 patients referred to Mayo Clinic between 1974 and 1984, of whom 312 were randomized into a trial of D-penicillamine (DPCA) versus placebo. The data is described in Fleming and Harrington (1991) Chapter 0.2, with a proportional hazards model developed in Chapter 4.4. We use the copy bundled with randomForestSRC.
data("pbc", package ="randomForestSRC")
Data cleaning
We convert days to years, recode treatment as a factor, and coerce low-cardinality numeric columns (five or fewer unique values, including binary 0/1 indicators) to factors. We avoid converting binary columns to logical because randomForestSRC::partial.rfsrc() does not handle logical predictors correctly in survival forests.
pbc <- pbc |>mutate(years = days /365.25,age = age /365.25,treatment =factor(ifelse(treatment ==1, "DPCA",ifelse(treatment ==2, "Placebo", NA)),levels =c("DPCA", "Placebo") ) ) |>select(-days)# Low-cardinality numerics (including binary 0/1) to factor.# NOTE: do NOT convert to logical — partial.rfsrc() fails with logical# predictors in survival forests (randomForestSRC <= 3.5.1).# Exclude the response columns (status, years) from conversion.resp_cols <-c("status", "years")for (nm insetdiff(names(pbc), resp_cols)) { v <- pbc[[nm]]if (is.numeric(v) &&!is.factor(v) &&length(unique(v[!is.na(v)])) <=5) { pbc[[nm]] <-factor(v) }}
Good practice before modeling: scan categorical variables as stacked histograms over follow-up time, and continuous variables as scatter plots colored by event status.
KM survival estimates by treatment group with 95% confidence bands.
The curves largely overlap, consistent with the original finding that DPCA offered no clear survival benefit over placebo (Fleming and Harrington 1991).
plot(gg_km, type ="cum_haz") +labs(y ="Cumulative Hazard", x ="Time (years)",color ="Treatment", fill ="Treatment") +theme(legend.position =c(0.2, 0.8)) +coord_cartesian(ylim =c(-0.02, 1.22))
KM cumulative hazard estimates by treatment group.
We can also stratify on continuous variables. Here we reproduce the bilirubin groupings from Fleming and Harrington (1991) Figure 4.4.2:
pbc_bili <- pbc_trial |>mutate(bili_grp =cut(bili, breaks =c(0, 0.8, 1.3, 3.4, 29)))plot(gg_survival(interval ="years", censor ="status",by ="bili_grp", data = pbc_bili),error ="none") +labs(y ="Survival Probability", x ="Time (years)", color ="Bilirubin")
KM survival stratified by bilirubin groups.
Higher bilirubin strongly predicts worse survival — an effect the random forest will rediscover without any prior specification.
Growing a Random Survival Forest
Several predictors in the PBC trial data contain missing values (cholesterol, copper, triglycerides, and others). We handle these with a two-step approach: first impute using impute.rfsrc(), which uses the random forest proximity structure to fill in missing values, then fit the survival forest on the complete imputed data. This keeps the fitted forest object free of imputation state, which is required for partial.rfsrc() to work correctly.
# Step 1: impute missing values via random forest proximitypbc_imputed <-impute.rfsrc(Surv(years, status) ~ .,data = pbc_trial,ntree =500,nimpute =2)# Step 2: grow the survival forest on the complete imputed datarfsrc_pbc <-rfsrc(Surv(years, status) ~ .,data = pbc_imputed,nsplit =10,tree.err =TRUE,importance =TRUE)
The forest grew 500 trees, splitting on 5 randomly selected candidate variables at each node, and stopping at a minimum terminal node size of 15.
OOB error convergence
plot(gg_error(rfsrc_pbc))
OOB prediction error vs. number of trees.
The error stabilizes well before 1000 trees, indicating the forest is large enough for reliable predictions.
OOB predicted survival curves. Blue = censored, red = death events.
Each curve represents one patient’s OOB prediction, extended to the last follow-up time. Red (death event) curves generally fall faster, confirming the forest discriminates between risk groups.
plot(gg_rfsrc(rfsrc_pbc, by ="treatment")) +theme(legend.position =c(0.2, 0.2)) +labs(y ="Survival Probability", x ="Time (years)") +coord_cartesian(ylim =c(-0.01, 1.01))
Median predicted survival by treatment group with 95% confidence bands.
Test set predictions
The non-trial patients (pbc_test) have substantial missing data. predict.rfsrc() handles these transparently at prediction time via na.action = "na.impute" — this is distinct from training-time imputation and does not affect the fitted forest object.
Predicted survival for non-trial patients (test set).
Because the training curves use OOB estimates, both plots represent out-of-sample predictions and are directly comparable.
Variable Selection
Random forest uses all available predictors. To understand which matter most, we examine variable importance (VIMP) and minimal depth.
Variable importance (VIMP)
VIMP measures the increase in prediction error when a variable is randomly permuted. Large positive values mean the variable is essential for accuracy; negative values suggest noise is more informative than the variable.
Variable importance ranking. Blue = positive VIMP, red = negative.
Bilirubin ranks highest, followed by copper, prothrombin time, albumin, and age — closely matching the variables selected in the Fleming and Harrington (1991) proportional hazards model.
Minimal depth
Minimal depth (Ishwaran et al. 2010) ranks variables by how close to the root node they first split, on average. Variables that partition large portions of the population early are considered most important.
md_pbc <-max.subtree(rfsrc_pbc)
The max.subtree() function computes minimal depth for each variable. The threshold is 5.83, selecting 8 variables: age, ascites, edema, bili, chol, albumin, copper, prothrombin.
Both selection methods agree on the key predictors: bili, albumin, copper, prothrombin, and age. We add edema (selected by the Fleming and Harrington (1991) model) for the remainder of the analysis.
Variable dependence shows each patient’s predicted survival at a given time horizon plotted against a predictor of interest. Points are colored by event status; a loess smooth indicates the trend.
Variable dependence of survival at 1 and 3 years on bilirubin.
Survival drops sharply with increasing bilirubin, and the 3-year curve drops further than the 1-year curve, suggesting a non-proportional hazards effect.
Variable dependence at 1 and 3 years for continuous predictors.
The plots confirm survival increases with albumin and decreases with copper, prothrombin time, and age. The divergence between time curves for copper further supports a non-proportional hazards mechanism.
Patients with edema = 1 (edema despite diuretics) have markedly lower predicted survival.
Partial dependence
Warning
Known issue (draft):randomForestSRC::partial.rfsrc() currently fails for survival forests in randomForestSRC ≥ 3.3. The partial dependence and interactive surface sections below will show an error until this upstream bug is resolved. All other sections of this vignette are fully functional.
Partial dependence integrates out the effects of other covariates, giving a risk-adjusted view of how each predictor influences the response (Friedman 2001). We use gg_partial_rfsrc() which calls randomForestSRC::partial.rfsrc() directly and returns a gg_partial_rfsrc object. For survival forests, the result includes a time column so plot(pd) produces one curve per predictor value over time, faceted by variable name.
# partial.rfsrc() requires times that match the model's time.interest grid;# gg_partial_rfsrc() snaps the requested values to the nearest observed times.ti <- rfsrc_pbc$time.interestt1yr <- ti[which.min(abs(ti -1))]t3yr <- ti[which.min(abs(ti -3))]pd <-gg_partial_rfsrc(rfsrc_pbc, xvar.names = xvar,partial.time =c(t1yr, t3yr))# Quick S3 plot — survival forests produce time-series curves per predictor valueplot(pd)
Partial dependence of survival at approximately 1 and 3 years on continuous predictors.
For a publication-ready layout with custom colour scale, access pd$continuous directly:
ggplot(pd$continuous, aes(x = x, y = yhat,color =factor(round(time, 2)),group =factor(time))) +geom_line(linewidth =1) +facet_wrap(~name, scales ="free_x") +labs(y ="Predicted Survival", x ="", color ="Year") +scale_color_manual(values =setNames(c("steelblue", "firebrick"),as.character(round(c(t1yr, t3yr), 2)))) +theme_bw()
Partial dependence (custom styling).
The partial dependence curves at approximately 1 and 3 years confirm the variable dependence findings and support the log-transforms used in the Fleming and Harrington (1991) model for bili, albumin, and prothrombin. The divergence between time curves for bili and copper supports non-proportional hazards.
Conditional dependence
To investigate interactions, we can condition variable dependence on group membership in another variable. Here we examine the dependence of survival on bilirubin, stratified by edema group.
Variable dependence on bilirubin, conditional on edema group.
The decreasing trend with bilirubin holds across all edema groups, but survival is uniformly lower in the edema = 1 panel, confirming an additive effect.
We can also condition on a continuous variable by binning into quantile groups:
Variable dependence on bilirubin, conditional on albumin groups.
The effect of bilirubin attenuates at higher albumin levels, suggesting an interaction between these two liver function markers.
Interactive Partial Dependence Surfaces
For a richer view of the interaction between bilirubin and albumin, we construct a partial dependence surface. We compute partial dependence on a grid of 25 albumin values, each evaluated at 25 bilirubin points.
# Create grid of albumin valuesalb_grid <-quantile_pts(pbc_trial$albumin, groups =25)# For each albumin value, compute partial dependence on bili at ~1 yearsurface_list <-lapply(alb_grid, function(alb_val) { newx <- pbc_trial[, rfsrc_pbc$xvar.names, drop =FALSE] newx$albumin <- alb_val pd_alb <-gg_partial_rfsrc(rfsrc_pbc, xvar.names ="bili",newx = newx, partial.time = t1yr) df <- pd_alb$continuous df$albumin <- alb_val df})surface_df <-bind_rows(surface_list)
if (!exists("surface_df")) {message("surface_df not available — skipping plotly surface (see surface-data chunk error above).")} elseif (requireNamespace("plotly", quietly =TRUE)) {# Reshape for surfacelibrary(plotly) surface_wide <- surface_df |>select(bili = x, albumin, survival = yhat) |>arrange(albumin, bili)# Create matrix form bili_vals <-sort(unique(surface_wide$bili)) alb_vals <-sort(unique(surface_wide$albumin)) z_matrix <-matrix(surface_wide$survival,nrow =length(alb_vals),ncol =length(bili_vals),byrow =TRUE)plot_ly(x = bili_vals, y = alb_vals, z = z_matrix) |>add_surface(colorscale ="Viridis", showscale =TRUE) |>layout(scene =list(xaxis =list(title ="Bilirubin"),yaxis =list(title ="Albumin"),zaxis =list(title ="Survival") ) )} else {message("Install the plotly package for interactive 3D surfaces.")# Fallback: contour plot with ggplot2ggplot(surface_df, aes(x = x, y = albumin, fill = yhat)) +geom_tile() +scale_fill_viridis_c(name ="Survival") +labs(x ="Bilirubin", y ="Albumin") +theme_bw()}
Interactive partial dependence surface: survival as a function of bilirubin and albumin.
The surface shows that survival is highest when bilirubin is low and albumin is high (upper-left corner), and drops steeply as bilirubin increases or albumin decreases. The non-planar shape of the surface — particularly the steep gradient at low albumin and high bilirubin — confirms the interaction detected in the conditional plots.
Conclusion
This vignette demonstrated a complete random survival forest analysis using randomForestSRC and ggRandomForests:
Kaplan–Meier curves via gg_survival() confirmed no treatment effect in the PBC trial, consistent with Fleming and Harrington (1991).
OOB error via gg_error() showed the forest stabilized quickly.
VIMP via gg_vimp() and minimal depth via max.subtree() both identified bilirubin, albumin, copper, prothrombin, and age as key predictors — matching the proportional hazards model.
Variable dependence via gg_variable() revealed non-proportional hazards effects for bilirubin and copper.
Partial dependence via gg_partial_rfsrc() provided risk-adjusted confirmation and supported the log-transforms used in parametric models.
Conditional plots and interactive surfaces exposed the bilirubin–albumin interaction.
The ggRandomForests design separates data extraction from plotting: every gg_*() function returns a tidy data frame that can be plotted with the package’s plot() methods or fed directly into custom ggplot2 workflows.
Fleming, Thomas R., and David P. Harrington. 1991. Counting Processes and Survival Analysis. Wiley Series in Probability and Statistics. John Wiley & Sons.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.”The Annals of Statistics 29 (5): 1189–232. https://doi.org/10.1214/aos/1013203451.
Ishwaran, Hemant, and Udaya B. Kogalur. 2007. “Random Survival Forests for R.”R News 7 (2): 25–31.
Ishwaran, Hemant, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. 2008. “Random Survival Forests.”The Annals of Applied Statistics 2 (3): 841–60. https://doi.org/10.1214/08-AOAS169.
Ishwaran, Hemant, Udaya B. Kogalur, Eiran Z. Gorodeski, Andy J. Minn, and Michael S. Lauer. 2010. “High-Dimensional Variable Selection for Survival Data.”Journal of the American Statistical Association 105 (489): 205–17. https://doi.org/10.1198/jasa.2009.tm08622.
These binaries (installable software) and packages are in development. They may not be fully stable and should be used with caution. We make no claims about them.