The csmpv R package offers a comprehensive array of functions covering biomarker confirmation, variable selection, modeling, predictive analysis, and validation. Its primary objectives are described below.

To simplify the modeling process, we designed an all-in-one function that manages predictive model development, prediction, and validation for all eight methods in this package across the three outcome types. This versatile function allows a concise implementation with just a single function call, and it can handle a single method with one or multiple outcome variables. Moreover, if a validation dataset is available, the prediction and validation steps can be seamlessly integrated into a single operation.

In addition to these core functionalities, the csmpv package introduces a unique approach for building binary classification models based on survival models. This feature enables predicting binary outcomes for new datasets using the developed model. Please note that external validation of such a model is limited because new datasets lack the corresponding binary classification variable. Despite this limitation, the predicted binary classification can serve as a surrogate biomarker, and its association with survival outcomes in new datasets can be tested when survival outcome information is available.

The package excels in handling various outcome variable types—binary, continuous, and time-to-event data.

To enhance the user experience, the csmpv R package focuses on streamlining coding efforts. Each user-facing function acts as a comprehensive wrapper, condensing multiple analyses into a single function call. Additionally, result files are conveniently saved locally, further simplifying the analytical process.

I Installation

The csmpv package is available on CRAN, and it can be directly installed in R using the following command:

install.packages("csmpv")

Alternatively, csmpv can be installed from GitHub using either the devtools or the remotes R package.

# Install devtools package if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")

# Install csmpv package from GitHub with devtools
# (force = TRUE ensures installation, overriding any existing version)
devtools::install_github("ajiangsfu/csmpv", force = TRUE)

# Alternatively, install remotes package if not already installed
install.packages("remotes")
# Install csmpv package from GitHub with remotes
remotes::install_github("ajiangsfu/csmpv", force = TRUE)

Both methods will download and install the csmpv package from the GitHub repository. Please ensure an active internet connection and the necessary dependencies for a successful installation.

II Example Code

In this section, we present example code; before that, we introduce the example data.

1. Example data

The example data was extracted from our in-house diffuse large B-cell lymphoma (DLBCL) dataset, specifically utilizing supplemental Table S1 from Alduaij et al. (2023, DOI: 10.1182/blood.2022018248).

Upon identifying a substantial amount of missing data (only 38% of cases were complete), we conducted Little’s MCAR test, which revealed that the values were not missing completely at random. This directed our focus toward handling the missing values rather than excluding incomplete cases. Multiple imputation is a robust and versatile strategy for addressing this issue across various missing data scenarios; however, for illustrative purposes, we generated only one imputation.

Furthermore, to ensure compatibility with all eight modeling methods within csmpv, we transformed all categorical variables into binary format, overcoming limitations in XGBoost and LASSO when dealing with categorical variables with more than two levels.
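
As an aside, one common way to binarize a multi-level categorical variable is indicator (one-hot) expansion. The minimal sketch below uses base R’s model.matrix on a hypothetical factor to illustrate the idea; it is not the exact preprocessing that produced datlist:

# Hypothetical example: expand a three-level factor into 0/1 indicator columns
df = data.frame(subtype = factor(c("GCB", "ABC", "UNC")))
model.matrix(~ subtype - 1, data = df)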

Following these procedures, an object named datlist was generated and is included in csmpv; it can be accessed directly after installing and loading the package, as demonstrated below.

library(csmpv)
data("datlist", package = "csmpv")
tdat = datlist$training
dim(tdat)
## [1] 216  22
vdat = datlist$validation
dim(vdat)
## [1] 217  22

Subsequently, we defined three outcome variables and their respective independent variables.

To illustrate the different outcome types, we define one example variable for each category:

  • Binary: DZsig (dark zone signature)
  • Continuous: Age
  • Time-to-event: FFP (freedom from progression)

For binary and time-to-event variables, independent variables are defined as:

Xvars = c("highIPI","B.Symptoms","MYC.IHC","BCL2.IHC", "CD10.IHC","BCL6.IHC",
 "MUM1.IHC","Male","AgeOver60", "stage3_4","PS1","LDH.Ratio1",
 "Extranodal1","Bulk10cm","HANS_GCB", "DTI")

For the continuous variable, the corresponding independent variables align with those above, excluding AgeOver60 due to its correlation with the outcome variable Age:

AgeXvars = setdiff(Xvars, "AgeOver60")

To enhance reproducibility and minimize variability from random number generation, we established and set a specific random seed:

set.seed(12345)

Users can define their own temporary directory to save all results. If not, tempdir() can be used to get the system’s temporary directory.

temp_dir = tempdir()
# setwd(temp_dir) # this only affects the current chunk, not other parts
knitr::opts_knit$set(root.dir = temp_dir)

2. Biomarker confirmation/validation

Whether this procedure is labeled as biomarker confirmation, validation, or testing, the fundamental aspect involves regular regression analyses on both single and multiple variables across three distinct outcome categories: binary, continuous, and time-to-event. In this context, our objective is to assess the presence of an association between outcomes and a set of independent variables. It’s important to note that this differs from model validation, which will be covered subsequently.

2.1 Binary outcome

To confirm biomarkers for binary outcomes:

bconfirm = confirmVars(data = tdat, biomks = Xvars, Y = "DZsig",
                       outfile = "confirmBinary")

The confirmVars function acts as a wrapper, invoking various functions to perform regression analysis based on different outcome types. By default, the outcome type is binary, requiring no explicit specification when handling binary outcomes.

Upon execution, the bconfirm object comprises a multivariable model and a list of two forest plots. The first plot consolidates individual forest plots for each single variable, while the second represents the forest plot for the multivariable model. These outputs are locally saved, along with a combined table containing models for each single variable.

print(bconfirm$fit)
## 
## Call:  glm(formula = f1, family = "binomial", data = datain)
## 
## Coefficients:
## (Intercept)      highIPI   B.Symptoms      MYC.IHC     BCL2.IHC     CD10.IHC  
##   -27.89875     -2.61008     -1.69697      3.72794      1.26593      4.24328  
##    BCL6.IHC     MUM1.IHC         Male    AgeOver60     stage3_4          PS1  
##     1.61152     -3.03434      1.88499      0.82520      1.74159      3.78197  
##  LDH.Ratio1  Extranodal1     Bulk10cm     HANS_GCB          DTI  
##     2.24558      2.05693     -1.03546     16.23813     -0.02331  
## 
## Degrees of Freedom: 215 Total (i.e. Null);  199 Residual
## Null Deviance:       154.8 
## Residual Deviance: 63.14     AIC: 97.14
bconfirm$allplot[[2]]

For instance, the initial output showcases a multivariable model. In the subsequent section, single-variable models are presented with associated forest plots, all amalgamated into a comprehensive display.

2.2 Continuous outcome

To confirm biomarkers for continuous outcomes:

cconfirm = confirmVars(data = tdat, biomks = AgeXvars, Y = "Age",
                       outcomeType = "continuous",
                       outfile = "confirmContinuous")

The same confirmVars function is called; however, this time, we specify the outcome type as continuous.

In a similar fashion, the cconfirm object comprises two elements: a multivariable model and a list of two forest plots. The first plot consolidates all forest plots for each single variable, while the second represents the forest plot for the multivariable model. All these outputs are saved locally, accompanied by a combined table containing models for each single variable.

Below, you’ll find the multivariable model and a combined forest plot for each variable with raw p-values:

print(cconfirm$fit)
## 
## Call:  glm(formula = f1, data = datain)
## 
## Coefficients:
## (Intercept)      highIPI   B.Symptoms      MYC.IHC     BCL2.IHC     CD10.IHC  
##    64.10855      9.52589     -3.74092      1.95808      1.58400     -0.35961  
##    BCL6.IHC     MUM1.IHC         Male     stage3_4          PS1   LDH.Ratio1  
##     1.78772      1.32447     -1.51572     -3.87195      3.31566     -1.03366  
## Extranodal1     Bulk10cm     HANS_GCB          DTI  
##    -7.65469     -1.86334     -0.88036     -0.03459  
## 
## Degrees of Freedom: 215 Total (i.e. Null);  200 Residual
## Null Deviance:       41610 
## Residual Deviance: 35760     AIC: 1751
cconfirm$allplot[[2]]

2.3 Time-to-event outcome

To confirm biomarkers for time-to-event outcomes:

tconfirm = confirmVars(data = tdat, biomks = Xvars,
                       time = "FFP..Years.", event = "Code.FFP",
                       outcomeType = "time-to-event",
                       outfile = "confirmSurvival")

The confirmVars function is called once again, this time with the outcome type specified as time-to-event, necessitating the inclusion of both time and event variable names.

Similarly, two PDF and two table files are saved, accompanied by locally stored Kaplan-Meier plots. A single Kaplan-Meier plot is generated for each independent categorical variable with no more than four levels. In this example dataset, 15 Kaplan-Meier plots are produced.
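
For context, a Kaplan-Meier plot of this kind can also be produced directly with the survival package. The sketch below, using stage3_4 as an example grouping variable, approximates what the saved plots show, though the package’s own plots may be styled differently:

# Kaplan-Meier curves of FFP stratified by one binary variable
library(survival)
km = survival::survfit(survival::Surv(FFP..Years., Code.FFP) ~ stage3_4,
                       data = tdat)
plot(km, col = c(1, 2), xlab = "Years", ylab = "FFP probability")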

The tconfirm object continues to store two elements: a multivariable model and a list of two forest plots. Below, you’ll find the multivariable model and a combined forest plot for each variable, including raw p-values:

print(tconfirm$fit)
## Call:
## survival::coxph(formula = as.formula(paste(survY, survX, sep = " ~ ")), 
##     data = datain)
## 
##                  coef exp(coef)  se(coef)      z        p
## highIPI     -0.603018  0.547158  0.446953 -1.349 0.177281
## B.Symptoms   0.264292  1.302508  0.256034  1.032 0.301954
## MYC.IHC      0.321325  1.378954  0.240911  1.334 0.182273
## BCL2.IHC     0.580115  1.786243  0.308232  1.882 0.059826
## CD10.IHC    -0.368733  0.691610  0.388518 -0.949 0.342583
## BCL6.IHC    -0.061321  0.940521  0.304312 -0.202 0.840302
## MUM1.IHC     0.267188  1.306286  0.322775  0.828 0.407793
## Male         0.522793  1.686733  0.240032  2.178 0.029405
## AgeOver60    0.419517  1.521226  0.289702  1.448 0.147590
## stage3_4     1.032559  2.808244  0.309732  3.334 0.000857
## PS1          0.840254  2.316956  0.304235  2.762 0.005747
## LDH.Ratio1   1.387365  4.004285  0.338434  4.099 4.14e-05
## Extranodal1  0.191007  1.210468  0.305195  0.626 0.531411
## Bulk10cm    -0.323524  0.723595  0.276848 -1.169 0.242567
## HANS_GCB     0.255505  1.291113  0.490277  0.521 0.602267
## DTI         -0.003866  0.996142  0.007651 -0.505 0.613405
## 
## Likelihood ratio test=81.78  on 16 df, p=7.953e-11
## n= 216, number of events= 85
tconfirm$allplot[[2]]

3. Biomarker discovery with variable selection

This section details the process of biomarker discovery through variable selection, utilizing three distinct methods: LASSO2, LASSO2plus, and LASSO_plus.

3.1 Variable selection with LASSO2

The variable selection process using our customized LASSO algorithm, LASSO2, employs a tailored approach distinct from the conventional LASSO (Least Absolute Shrinkage and Selection Operator) algorithm. This adjustment aims to address the randomness introduced by cross-validation splits and to guarantee the inclusion of at least two variables.

This process utilizes glmnet::cv.glmnet for cross-validation-based variable selection. It determines the largest lambda value at which the error remains within 1 standard error of the minimum. However, as indicated in cv.glmnet’s help file, results can vary due to the randomness inherent in cross-validation splits.

To counteract this variability, our new function, LASSO2, conducts 10 runs of 10-fold cv.glmnet. The average lambda value from these runs becomes the final lambda used for the regularized regression on the complete dataset.

It’s important to note that since LASSO2 selects the largest lambda within 1 standard error of the minimum, following the default behavior of cv.glmnet, it may yield a smaller number of selected variables compared to the lambda that minimizes the mean cross-validated error. This more conservative approach could potentially result in only one or no selected variables.

To address this potential issue, when LASSO2 identifies only one or no variables, it defaults to selecting the first lambda that results in at least two variables being chosen from the full dataset. This strategy ensures the inclusion of at least two variables, striking a balance between model complexity and the necessity for meaningful variable inclusion.
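
The following minimal sketch illustrates the averaged-lambda idea using glmnet directly on the binary outcome example. It is for intuition only and is not the package’s internal implementation; in particular, the fallback logic for fewer than two selected variables is omitted:

library(glmnet)
X = as.matrix(tdat[, Xvars])
y = tdat$DZsig
# run 10-fold cv.glmnet ten times and record lambda.1se from each run
lambdas = replicate(10, glmnet::cv.glmnet(X, y, family = "binomial",
                                          nfolds = 10)$lambda.1se)
final_lambda = mean(lambdas)  # average lambda across the ten runs
fit = glmnet::glmnet(X, y, family = "binomial", lambda = final_lambda)
coef(fit)  # shrunken coefficients at the averaged lambda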

3.1.1 Binary outcome

For binary outcomes, no additional specification is needed for outcomeType, as it is the default value.

bl = LASSO2(data = tdat, biomks = Xvars, Y = "DZsig",
            outfile = "binaryLASSO2")

One figure and one text file are saved locally.

bl$coefs
##    MYC.IHC   CD10.IHC   MUM1.IHC 
##  0.8923274  1.5137059 -0.7274479

This displays the selected variables and their corresponding shrunken coefficients.

3.1.2 Continuous outcome

For variable selection involving a continuous outcome variable, specify outcomeType = “continuous”:

cl = LASSO2(data = tdat, biomks = AgeXvars,
            outcomeType = "continuous", Y = "Age",
            outfile = "continuousLASSO2")

Similar to before, one figure and one text file are saved locally.

cl$coefs
##    highIPI        PS1 
## 0.02137912 1.07621511

This shows the selected variables and their associated shrunken coefficients for the continuous outcome.

3.1.3 Time-to-event outcome

For variable selection with a time-to-event outcome, set outcomeType = “time-to-event”, and ensure you provide the variable names for both time and event:

tl = LASSO2(data = tdat, biomks = Xvars,
            outcomeType = "time-to-event",
            time = "FFP..Years.",event = "Code.FFP",
            outfile = "survivalLASSO2")

In a similar fashion, one figure and one text file are saved locally.

tl$coefs
##    highIPI   stage3_4        PS1 LDH.Ratio1 
## 0.16770489 0.04166427 0.02757391 0.43226052

This shows the selected variables and their associated shrunken coefficients for the time-to-event outcome.

3.2 Variable selection with LASSO2plus

LASSO2plus is an innovative approach that combines LASSO2, a modified LASSO algorithm, with other techniques. It selects variables in three steps:

    1. applying LASSO2, which differs slightly from the standard LASSO as discussed in Section 3.1;
    2. fitting a simple regression model for each variable and adjusting the p-values using the Benjamini-Hochberg method (1995);
    3. performing a stepwise variable selection procedure on the combined list of variables from the previous steps.

Therefore, LASSO2plus incorporates both the regularization and the significance-testing aspects of variable selection.

All parameter settings for LASSO2plus are the same as for LASSO2.
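
To build intuition for step 2, the sketch below fits a single-variable logistic model per candidate and applies the Benjamini-Hochberg adjustment with base R. It is illustrative only and is not the package’s internal code:

# fit one logistic model per candidate variable and collect raw p-values
pvals = sapply(Xvars, function(v) {
  f = glm(as.formula(paste("DZsig ~", v)), family = "binomial", data = tdat)
  summary(f)$coefficients[2, 4]  # p-value for the single predictor
})
padj = p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjustment
names(padj)[padj < 0.05]               # variables significant after adjustment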

Binary outcome

For binary outcomes, no additional specification is needed for outcomeType, as it is the default value.

b2fit = LASSO2plus(data = tdat, biomks = Xvars, Y = "DZsig",
        outfile = "binaryLASSO2plus")
## Start:  AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
## 
##            Df Deviance    AIC
## <none>           93.72 101.72
## - MUM1.IHC  1   107.09 113.09
## - MYC.IHC   1   115.41 121.41
## - CD10.IHC  1   119.33 125.33
## file saved to binaryLASSO2plusLASSO2plus_varaibleSelection.pdf
b2fit$fit$coefficients
## (Intercept)     MYC.IHC    CD10.IHC    MUM1.IHC 
##   -4.778565    2.503030    3.188996   -2.553409

The coefficients are shown above. Two figures and two tables are stored locally.

Continuous outcome

For variable selection involving a continuous outcome variable, specify outcomeType = “continuous”:

c2fit = LASSO2plus(data = tdat, biomks = AgeXvars,
                   outcomeType = "continuous", Y = "Age",
                   outfile = "continuousLASSO2plus")
## Start:  AIC=1745.15
## Age ~ highIPI + PS1
## 
##           Df Deviance    AIC
## - highIPI  1    39626 1744.8
## <none>          39331 1745.2
## - PS1      1    39848 1746.0
## 
## Step:  AIC=1744.76
## Age ~ PS1
## 
##        Df Deviance    AIC
## <none>       39626 1744.8
## - PS1   1    41606 1753.3
## file saved to continuousLASSO2plusLASSO2plus_varaibleSelection.pdf
c2fit$fit$coefficients
## (Intercept)     highIPI         PS1 
##   62.004372    3.134816    4.311964

Again, the coefficients are shown above, and two figures and two tables are stored locally.

Time-to-event outcome

For variable selection with a time-to-event outcome, set outcomeType = “time-to-event”, and ensure you provide the variable names for both time and event:

t2fit = LASSO2plus(data = tdat, biomks = Xvars,
                   outcomeType = "time-to-event",
                   time = "FFP..Years.",event = "Code.FFP",
                   outfile = "survivalLASSO2plus")
## Start:  AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC + 
##     MUM1.IHC
## 
##              Df    AIC
## - HANS_GCB    1 813.18
## - B.Symptoms  1 813.36
## - highIPI     1 813.41
## - DTI         1 813.73
## - CD10.IHC    1 813.89
## - MUM1.IHC    1 814.17
## <none>          815.14
## - PS1         1 818.22
## - stage3_4    1 822.90
## - LDH.Ratio1  1 824.45
## 
## Step:  AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - B.Symptoms  1 811.39
## - highIPI     1 811.43
## - DTI         1 811.78
## - CD10.IHC    1 812.30
## - MUM1.IHC    1 812.38
## <none>          813.18
## - PS1         1 816.24
## - stage3_4    1 821.44
## - LDH.Ratio1  1 822.45
## 
## Step:  AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - highIPI     1 809.66
## - DTI         1 810.17
## - MUM1.IHC    1 810.59
## - CD10.IHC    1 810.62
## <none>          811.39
## - PS1         1 815.20
## - stage3_4    1 819.75
## - LDH.Ratio1  1 820.80
## 
## Step:  AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - DTI         1 808.46
## - MUM1.IHC    1 808.86
## - CD10.IHC    1 808.94
## <none>          809.66
## - PS1         1 814.13
## - stage3_4    1 819.20
## - LDH.Ratio1  1 820.45
## 
## Step:  AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - CD10.IHC    1 807.66
## - MUM1.IHC    1 807.79
## <none>          808.46
## - PS1         1 813.12
## - stage3_4    1 818.06
## - LDH.Ratio1  1 824.55
## 
## Step:  AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     MUM1.IHC
## 
##              Df    AIC
## <none>          807.66
## - MUM1.IHC    1 808.22
## - PS1         1 813.79
## - stage3_4    1 817.87
## - LDH.Ratio1  1 824.33
## file saved to survivalLASSO2plusLASSO2plus_varaibleSelection.pdf
t2fit$fit$coefficients
##   stage3_4        PS1 LDH.Ratio1   MUM1.IHC 
##  0.8231937  0.6543237  1.0529572  0.3508003

Similar to the other types of outcomes, the coefficients are displayed above, and two figures along with two tables are stored locally.

3.3 Variable selection with LASSO_plus

LASSO_plus is another innovative approach that builds on the LASSO algorithm with additional techniques. It differs from the LASSO2plus approach described in Section 3.2 in its initial step. It selects variables in three steps:

    1. using a “Modified LASSO” instead of LASSO2, which selects a stable variable list that also matches a predefined target number;
    2. fitting a simple regression model for each variable and adjusting the p-values using the Benjamini-Hochberg method (1995);
    3. performing a stepwise variable selection procedure on the combined list of variables from the previous steps.

Therefore, LASSO_plus also incorporates both the regularization and the significance-testing aspects of variable selection.

In LASSO_plus, all parameters from LASSO2 and LASSO2plus are retained, with the addition of the unique parameter topN. Please be aware that the topN parameter in LASSO_plus serves as a guide for variable selection.

Binary outcome

Setting the topN parameter to 5 aims to include the top 5 variables in the final model. However, it’s important to note that the resulting model may not always precisely consist of 5 variables. The LASSO_plus method’s selection criteria involve considering variables that appear at least twice across different lambda values. Consequently, even when using the same topN value for different datasets, the number of selected variables may vary.
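
The counting criterion just described can be illustrated with glmnet directly: fit the full lambda path and count, for each variable, at how many lambda values its coefficient is nonzero. This sketch is for intuition only; the actual Modified LASSO logic inside LASSO_plus may differ in detail:

library(glmnet)
X = as.matrix(tdat[, Xvars])
y = tdat$DZsig
path = glmnet::glmnet(X, y, family = "binomial")   # full lambda path
# number of lambda values at which each variable has a nonzero coefficient
counts = rowSums(as.matrix(path$beta) != 0)
head(sort(counts, decreasing = TRUE), 5)           # candidates guided by topN = 5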

For binary outcomes, outcome type specification is unnecessary, as it defaults to this type.

bfit = LASSO_plus(data = tdat, biomks = Xvars, Y = "DZsig",
                  outfile = "binaryLASSO_plus", topN = 5)
## Start:  AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
## 
##            Df Deviance    AIC
## <none>           93.72 101.72
## - MUM1.IHC  1   107.09 113.09
## - MYC.IHC   1   115.41 121.41
## - CD10.IHC  1   119.33 125.33
## file saved to binaryLASSO_plus_LASSO_plus_varaibleSelection.pdf
bfit$fit$coefficients
## (Intercept)     MYC.IHC    CD10.IHC    MUM1.IHC 
##   -4.778565    2.503030    3.188996   -2.553409

The identified variables and their corresponding coefficients are displayed above. A figure and a table are locally stored.

Continuous outcome

For continuous outcome variables, ensure you specify outcomeType = “continuous”:

cfit = LASSO_plus(data = tdat, biomks = AgeXvars,
                  outcomeType = "continuous", Y = "Age",
                  outfile = "continuousLASSO_plus", topN = 5)
## Start:  AIC=1738.58
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + PS1 + Extranodal1
## 
##               Df Deviance    AIC
## - PS1          1    36851 1737.1
## - Male         1    36881 1737.3
## - MUM1.IHC     1    37040 1738.2
## <none>              36766 1738.6
## - stage3_4     1    37491 1740.8
## - Extranodal1  1    37999 1743.7
## - highIPI      1    38311 1745.5
## 
## Step:  AIC=1737.09
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + Extranodal1
## 
##               Df Deviance    AIC
## - Male         1    36975 1735.8
## - MUM1.IHC     1    37160 1736.9
## <none>              36851 1737.1
## - stage3_4     1    37696 1740.0
## - Extranodal1  1    38198 1742.8
## - highIPI      1    40369 1754.8
## 
## Step:  AIC=1735.81
## Age ~ highIPI + MUM1.IHC + stage3_4 + Extranodal1
## 
##               Df Deviance    AIC
## <none>              36975 1735.8
## - MUM1.IHC     1    37336 1735.9
## - stage3_4     1    37902 1739.2
## - Extranodal1  1    38335 1741.6
## - highIPI      1    40706 1754.6
## file saved to continuousLASSO_plus_LASSO_plus_varaibleSelection.pdf
cfit$fit$coefficients
## (Intercept)     highIPI    MUM1.IHC    stage3_4 Extranodal1 
##   63.259326   10.360268    2.601408   -4.854750   -6.927381

The identified variables and their corresponding coefficients are displayed above. A figure and a table are stored locally.

Time-to-event outcome

When dealing with time-to-event outcomes, set outcomeType = “time-to-event”, and ensure you provide the names of variables for both time and event:

tfit = LASSO_plus(data = tdat, biomks = Xvars,
                  outcomeType = "time-to-event",
                  time = "FFP..Years.",event = "Code.FFP",
                  outfile = "survivalLASSO_plus", topN = 5)
## Start:  AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC + 
##     MUM1.IHC
## 
##              Df    AIC
## - HANS_GCB    1 813.18
## - B.Symptoms  1 813.36
## - highIPI     1 813.41
## - DTI         1 813.73
## - CD10.IHC    1 813.89
## - MUM1.IHC    1 814.17
## <none>          815.14
## - PS1         1 818.22
## - stage3_4    1 822.90
## - LDH.Ratio1  1 824.45
## 
## Step:  AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - B.Symptoms  1 811.39
## - highIPI     1 811.43
## - DTI         1 811.78
## - CD10.IHC    1 812.30
## - MUM1.IHC    1 812.38
## <none>          813.18
## - PS1         1 816.24
## - stage3_4    1 821.44
## - LDH.Ratio1  1 822.45
## 
## Step:  AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - highIPI     1 809.66
## - DTI         1 810.17
## - MUM1.IHC    1 810.59
## - CD10.IHC    1 810.62
## <none>          811.39
## - PS1         1 815.20
## - stage3_4    1 819.75
## - LDH.Ratio1  1 820.80
## 
## Step:  AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - DTI         1 808.46
## - MUM1.IHC    1 808.86
## - CD10.IHC    1 808.94
## <none>          809.66
## - PS1         1 814.13
## - stage3_4    1 819.20
## - LDH.Ratio1  1 820.45
## 
## Step:  AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - CD10.IHC    1 807.66
## - MUM1.IHC    1 807.79
## <none>          808.46
## - PS1         1 813.12
## - stage3_4    1 818.06
## - LDH.Ratio1  1 824.55
## 
## Step:  AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     MUM1.IHC
## 
##              Df    AIC
## <none>          807.66
## - MUM1.IHC    1 808.22
## - PS1         1 813.79
## - stage3_4    1 817.87
## - LDH.Ratio1  1 824.33
## file saved to survivalLASSO_plus_LASSO_plus_varaibleSelection.pdf
tfit$fit$coefficients
##   stage3_4        PS1 LDH.Ratio1   MUM1.IHC 
##  0.8231937  0.6543237  1.0529572  0.3508003

Displayed above are the identified variables and their corresponding coefficients. A figure and a table are stored locally.

4. Predictive model development

Predictive model development is a crucial aspect of the csmpv R package, involving eight distinct approaches:

  • Use shrunk coefficients directly from LASSO2.
  • Select variables with LASSO2, then run a regular regression model.
  • Extract coefficients directly from LASSO_plus output.
  • Extract coefficients directly from LASSO2plus output.
  • Build a machine learning model with XGBoost.
  • Utilize LASSO2 for variable selection and build an XGBoost model.
  • Utilize LASSO_plus for variable selection and build an XGBoost model.
  • Utilize LASSO2plus for variable selection and build an XGBoost model.

4.1 LASSO2

Directly use the shrunken coefficients from the LASSO2 output, as shown in Section 3.1.

4.2 LASSO2 + regular regression

The approach involves utilizing the variables selected by LASSO2 to conduct a standard regression model. Rather than relying on the shrunken coefficients obtained from LASSO2, this method opts for a conventional regression analysis with the chosen variables.

While it is feasible to manually extract variables from a LASSO2 object and run a regular regression based on the outcome type, the LASSO2_reg function simplifies this process for coding convenience and efficiency.

All parameter settings are the same as for LASSO2.

Binary outcome

blr = LASSO2_reg(data = tdat, biomks = Xvars, Y = "DZsig",
                 outfile = "binaryLASSO2_reg")
blr$fit$coefficients
## (Intercept)     MYC.IHC    CD10.IHC    MUM1.IHC         PS1 
##   -5.814411    2.918828    3.593155   -2.798703    1.583795

Continuous outcome

clr = LASSO2_reg(data = tdat, biomks = AgeXvars,
                 outcomeType = "continuous", Y = "Age",
                 outfile = "continuousLASSO2_reg")
clr$fit$coefficients
## (Intercept)     highIPI         PS1 
##   62.004372    3.134816    4.311964

Time-to-event outcome

tlr = LASSO2_reg(data = tdat, biomks = Xvars,
                 outcomeType = "time-to-event",
                 time = "FFP..Years.",event = "Code.FFP",
                 outfile = "survivalLASSO2_reg")
tlr$fit$coefficients
##    highIPI LDH.Ratio1 
##  0.6925077  1.0186194

The selected variables and their coefficients are shown above. For each outcome type, three figure files, one text file, and two tables are saved locally. Additionally, for time-to-event outcome variables, Kaplan-Meier plots are generated and saved locally. A single Kaplan-Meier plot is generated for each independent categorical variable with no more than four levels. In this example dataset, 15 Kaplan-Meier plots are generated.

4.3 LASSO_plus

Directly use the coefficients from the LASSO_plus output, as shown in Section 3.3.

4.4 LASSO2plus

Directly use coefficients from the LASSO2plus output, as described in Section 3.2.

4.5 XGBoost

XGBoost is a powerful machine learning algorithm recognized for its boosting capabilities. The XGBtraining function within the csmpv package leverages the strengths of XGBoost for model training. As XGBoost doesn’t inherently feature a dedicated variable selection procedure, you’ll need to manually define or select a set of variables using other methods. Once you have a predefined set of variables for constructing an XGBoost model, the XGBtraining function in the csmpv package streamlines this process.
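
Conceptually, XGBtraining wraps a call to the xgboost package. The minimal sketch below, for a binary outcome, shows the kind of call involved; the parameter choices here are assumptions for illustration and may differ from XGBtraining’s actual defaults:

library(xgboost)
dtrain = xgboost::xgb.DMatrix(data = as.matrix(tdat[, Xvars]),
                              label = tdat$DZsig)
params = list(objective = "binary:logistic", eval_metric = "logloss")
# five boosting rounds, printing train-logloss at each iteration
fit = xgboost::xgb.train(params = params, data = dtrain, nrounds = 5,
                         watchlist = list(train = dtrain))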

Binary outcome

bxfit = XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
                    outfile = "binary_XGBoost")
## [1]  train-logloss:0.511255 
## [2]  train-logloss:0.408615 
## [3]  train-logloss:0.343507 
## [4]  train-logloss:0.298065 
## [5]  train-logloss:0.264358
head(bxfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 0.2255560 0.1195362 0.7725156 0.1140642 0.1708665 0.1634810

The output from the above code consists of training log-loss values for specific iterations of the model. Log-loss, a widely used loss function in classification tasks, assesses the alignment between the model’s predicted probabilities and the actual class labels. By default, XGBtraining runs for 5 iterations, and the output is saved locally as a text file.
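
For reference, given predicted probabilities \(p_i\) and observed labels \(y_i\) over \(n\) samples, the log-loss is defined as:

\[ \mathrm{logloss} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right] \]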

The bxfit object contains four components:

  • XGBoost object.
  • XGBoost scores for all entries in the tdat dataset. Notably, XGBoost operates as a black box model and doesn’t return coefficients; however, it provides model scores. For binary outcomes, these scores represent the probability of the positive class.
  • Observed outcome.
  • Outcome type.

Continuous outcome

cxfit = XGBtraining(data = tdat, biomks = AgeXvars,
                    outcomeType = "continuous", Y = "Age",
                    outfile = "continuous_XGBoost")
## [1]  train-rmse:47.112278 
## [2]  train-rmse:34.492776 
## [3]  train-rmse:26.071191 
## [4]  train-rmse:20.554692 
## [5]  train-rmse:17.100220
head(cxfit$XGBoost_score)
##    pt103    pt246    pt874    pt219    pt138    pt328 
## 55.28930 51.14508 55.28930 52.75171 52.75171 54.72178

The reported train-rmse values are the root mean squared error (RMSE) calculated at each iteration of the XGBoost model on the training set. RMSE measures the average discrepancy between predicted and actual values within the training set, where lower values indicate better model performance. The output is saved locally as a text file.
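
For reference, with predicted values \(\hat{y}_i\) and observed values \(y_i\) over \(n\) samples:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2} \]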

These metrics illustrate the iterative nature of training the XGBoost model, where each iteration aims to minimize the RMSE on the training set. The diminishing RMSE values signify the model’s learning process, showcasing its progressive improvement in predictive accuracy during training.

Within cxfit, there are four elements:

  • XGBoost object.
  • XGBoost scores for all entries in tdat. Notably, XGBoost, functioning as a black box model, does not yield coefficients but provides model scores. For continuous outcomes, these scores represent the estimated continuous values.
  • Observed outcome.
  • Outcome type.

Time-to-event outcome

txfit = XGBtraining(data = tdat, biomks = Xvars,
                    outcomeType = "time-to-event",
                    time = "FFP..Years.",event = "Code.FFP",
                    outfile = "survival_XGBoost")
## [1]  train-cox-nloglik:4.859160 
## [2]  train-cox-nloglik:4.736148 
## [3]  train-cox-nloglik:4.648801 
## [4]  train-cox-nloglik:4.563267 
## [5]  train-cox-nloglik:4.507056
head(txfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 1.1142874 0.2993234 1.8398050 0.2993234 0.1824380 0.4074278

The negative log-likelihood, displayed in the output, serves as a standard loss function in survival analysis, notably prominent in Cox proportional hazards models. It quantifies the disparity between predicted survival probabilities and observed survival times and events within the training data. Minimizing this metric is crucial, as lower values signify a better fit of the model to the training data. The resulting output is saved locally as a text file.
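
For reference, with model risk scores \(\eta_i\), event indicators \(\delta_i\), and risk set \(R(t_i)\) at each event time, the negative Cox partial log-likelihood takes the form below; train-cox-nloglik corresponds to this quantity up to scaling:

\[ -\ell(\eta) = -\sum_{i:\,\delta_i = 1}\left[\, \eta_i - \log \sum_{j \in R(t_i)} e^{\eta_j} \,\right] \]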

By monitoring the negative log-likelihood throughout the training process, you can evaluate the model’s learning progress and its convergence toward an optimal solution. Ideally, a decreasing trend in the negative log-likelihood indicates the model’s improved fit to the training data across iterations.

In txfit, there are six components:

  • XGBoost object.
  • XGBoost scores for all entries in tdat. Notably, XGBoost operates as a black box model and does not yield coefficients but provides model scores. For time-to-event outcomes, these scores represent the risk score.
  • Baseline hazard table.
  • Observed time.
  • Event.
  • Outcome type.

4.6 LASSO2 + XGBoost

Combine LASSO2 variable selection with XGBoost modeling using the LASSO2_XGBtraining function, which selects variables via LASSO2 but constructs an XGBoost model rather than relying on the shrunken coefficients. The resulting objects maintain the output format of the XGBtraining function; a conceptual two-step sketch follows.
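
The sketch below shows a conceptually equivalent manual pipeline using the two functions already introduced. LASSO2_XGBtraining packages these steps into one call, and its internal details may differ:

# step 1: select variables with LASSO2
sel = names(LASSO2(data = tdat, biomks = Xvars, Y = "DZsig",
                   outfile = "manual_step1_LASSO2")$coefs)
# step 2: train an XGBoost model on the selected variables only
manual_fit = XGBtraining(data = tdat, biomks = sel, Y = "DZsig",
                         outfile = "manual_step2_XGBoost")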

Binary outcome

blxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
                            outfile = "binary_LASSO2_XGBoost")
## [1]  train-logloss:0.511725 
## [2]  train-logloss:0.410850 
## [3]  train-logloss:0.348831 
## [4]  train-logloss:0.308959 
## [5]  train-logloss:0.283560
head(blxfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558

Continuous outcome

clxfit = LASSO2_XGBtraining(data = tdat, biomks = AgeXvars,
                            outcomeType = "continuous", Y = "Age",
                            outfile = "continuous_LASSO2_XGBoost")
## [1]  train-rmse:47.112278 
## [2]  train-rmse:34.492776 
## [3]  train-rmse:26.089994 
## [4]  train-rmse:20.707653 
## [5]  train-rmse:17.442132
head(clxfit$XGBoost_score)
##    pt103    pt246    pt874    pt219    pt138    pt328 
## 56.47782 52.21841 56.47782 52.21841 52.21841 53.58232

Time-to-event outcome

tlxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars,
                            outcomeType = "time-to-event",
                            time = "FFP..Years.",event = "Code.FFP",
                            outfile = "survival_LASSO2_XGBoost")
## [1]  train-cox-nloglik:4.940429 
## [2]  train-cox-nloglik:4.870696 
## [3]  train-cox-nloglik:4.833368 
## [4]  train-cox-nloglik:4.811892 
## [5]  train-cox-nloglik:4.803468
head(tlxfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 0.9911962 0.2521406 0.9911962 0.2521406 0.2521406 0.9911962

4.7 LASSO_plus + XGBoost

To combine LASSO_plus variable selection with XGBoost modeling, the LASSO_plus_XGBtraining R function is employed. This approach selects variables using LASSO_plus but does not utilize the coefficients from LASSO_plus to construct the model; instead, it generates an XGBoost model. The resulting output mirrors that of the XGBtraining function.

The output and format of the returned objects are identical to those of the XGBtraining function. Furthermore, for each outcome type, one figure file, one text file, and one table file are saved locally.

Binary outcome

blpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
                                 topN = 5,outfile = "binary_LASSO_plus_XGBoost")
## Start:  AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
## 
##            Df Deviance    AIC
## <none>           93.72 101.72
## - MUM1.IHC  1   107.09 113.09
## - MYC.IHC   1   115.41 121.41
## - CD10.IHC  1   119.33 125.33
## file saved to binary_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1]  train-logloss:0.511725 
## [2]  train-logloss:0.410850 
## [3]  train-logloss:0.348831 
## [4]  train-logloss:0.308959 
## [5]  train-logloss:0.283560
head(blpxfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558

The majority of the outputs stem from LASSO_plus, with the final portion being attributed to XGBoost. Each line within the XGBoost output denotes the training log-loss value for a specific iteration of the model. Log-loss, a widely used loss function in classification tasks, gauges the alignment between the model’s predicted probabilities and the actual class labels. The default number of iterations in XGBtraining is 5.

The blpxfit output comprises four items: the first item corresponds to the XGBoost object, while the second item presents the XGBoost scores for all entries in the tdat dataset. Notably, XGBoost is a black box model that does not yield coefficients; however, model scores are provided. For binary outcomes, the model score pertains to the probability of the positive class. The remaining two items are the observed outcome and the outcome type.

Continuous outcome

clpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = AgeXvars,
                                 outcomeType = "continuous", Y = "Age",
                                 topN = 5,outfile = "continuous_LASSO_plus_XGBoost")
## Start:  AIC=1738.58
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + PS1 + Extranodal1
## 
##               Df Deviance    AIC
## - PS1          1    36851 1737.1
## - Male         1    36881 1737.3
## - MUM1.IHC     1    37040 1738.2
## <none>              36766 1738.6
## - stage3_4     1    37491 1740.8
## - Extranodal1  1    37999 1743.7
## - highIPI      1    38311 1745.5
## 
## Step:  AIC=1737.09
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + Extranodal1
## 
##               Df Deviance    AIC
## - Male         1    36975 1735.8
## - MUM1.IHC     1    37160 1736.9
## <none>              36851 1737.1
## - stage3_4     1    37696 1740.0
## - Extranodal1  1    38198 1742.8
## - highIPI      1    40369 1754.8
## 
## Step:  AIC=1735.81
## Age ~ highIPI + MUM1.IHC + stage3_4 + Extranodal1
## 
##               Df Deviance    AIC
## <none>              36975 1735.8
## - MUM1.IHC     1    37336 1735.9
## - stage3_4     1    37902 1739.2
## - Extranodal1  1    38335 1741.6
## - highIPI      1    40706 1754.6
## file saved to continuous_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1]  train-rmse:47.112278 
## [2]  train-rmse:34.492776 
## [3]  train-rmse:26.064544 
## [4]  train-rmse:20.629215 
## [5]  train-rmse:17.273109
head(clpxfit$XGBoost_score)
##    pt103    pt246    pt874    pt219    pt138    pt328 
## 54.92717 51.91544 53.46028 51.91544 52.28172 56.04781

Similar to the previous scenario, the primary outputs stem from LASSO_plus, while the concluding section originates from XGBoost. Within the XGBoost output, the train-rmse values reflect the root mean squared error (RMSE) metric calculated during each iteration of the XGBoost model. The RMSE gauges the average discrepancy between the predicted and actual values in the training set, with lower values signifying improved model performance.

These lines indicate that the XGBoost model undergoes iterative training, with each iteration aimed at minimizing the RMSE on the training set. The declining RMSE values suggest that the model progressively learns from the data, enhancing its predictive capabilities.

The clpxfit output includes four components: the first represents the XGBoost object, and the second offers XGBoost scores for all entries in the tdat dataset. Similar to before, XGBoost is a black box model that does not yield coefficients; however, model scores are provided. For continuous outcomes, the model score pertains to the estimated continuous values. The remaining two components are the observed outcome and the outcome type.

Time-to-event outcome

tlpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars,
                                 outcomeType = "time-to-event",
                                 time = "FFP..Years.",event = "Code.FFP",
                                 topN = 5,outfile = "survival_LASSO_plus_XGBoost")
## Start:  AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC + 
##     MUM1.IHC
## 
##              Df    AIC
## - HANS_GCB    1 813.18
## - B.Symptoms  1 813.36
## - highIPI     1 813.41
## - DTI         1 813.73
## - CD10.IHC    1 813.89
## - MUM1.IHC    1 814.17
## <none>          815.14
## - PS1         1 818.22
## - stage3_4    1 822.90
## - LDH.Ratio1  1 824.45
## 
## Step:  AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - B.Symptoms  1 811.39
## - highIPI     1 811.43
## - DTI         1 811.78
## - CD10.IHC    1 812.30
## - MUM1.IHC    1 812.38
## <none>          813.18
## - PS1         1 816.24
## - stage3_4    1 821.44
## - LDH.Ratio1  1 822.45
## 
## Step:  AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - highIPI     1 809.66
## - DTI         1 810.17
## - MUM1.IHC    1 810.59
## - CD10.IHC    1 810.62
## <none>          811.39
## - PS1         1 815.20
## - stage3_4    1 819.75
## - LDH.Ratio1  1 820.80
## 
## Step:  AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - DTI         1 808.46
## - MUM1.IHC    1 808.86
## - CD10.IHC    1 808.94
## <none>          809.66
## - PS1         1 814.13
## - stage3_4    1 819.20
## - LDH.Ratio1  1 820.45
## 
## Step:  AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - CD10.IHC    1 807.66
## - MUM1.IHC    1 807.79
## <none>          808.46
## - PS1         1 813.12
## - stage3_4    1 818.06
## - LDH.Ratio1  1 824.55
## 
## Step:  AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     MUM1.IHC
## 
##              Df    AIC
## <none>          807.66
## - MUM1.IHC    1 808.22
## - PS1         1 813.79
## - stage3_4    1 817.87
## - LDH.Ratio1  1 824.33
## file saved to survival_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1]  train-cox-nloglik:4.873247 
## [2]  train-cox-nloglik:4.789556 
## [3]  train-cox-nloglik:4.743813 
## [4]  train-cox-nloglik:4.718117 
## [5]  train-cox-nloglik:4.695957
head(tlpxfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 1.5001757 0.2840731 1.5001757 0.3679212 0.1763741 0.7963076

Analogous to the previous cases, the bulk of the outputs originate from LASSO_plus, while the final segment is attributed to XGBoost. Within the XGBoost output, the negative log-likelihood serves as a prevalent loss function in survival analysis, encompassing Cox proportional hazards models. It quantifies the dissimilarity between the projected survival probabilities and the observed survival times and events in the training data. The objective is to minimize this metric, as lower values denote a superior fit of the model to the training data.

Monitoring the negative log-likelihood throughout training enables the assessment of the model’s capacity to learn from the data and its convergence towards an optimal solution. Ideally, a diminishing trend in the negative log-likelihood signifies an improved fit of the model to the training data across iterations.

The tlpxfit output comprises six components: the first represents the XGBoost object, and the second provides XGBoost scores for all entries in the tdat dataset. Similar to earlier instances, XGBoost is a black box model that does not yield coefficients; however, model scores are provided. For time-to-event outcomes, the model score pertains to the risk score. The remaining four components encompass the baseline hazard table, observed time, event, and outcome type.

4.8 LASSO2plus + XGBoost

To seamlessly integrate LASSO2plus variable selection with XGBoost modeling, we leverage the LASSO2plus_XGBtraining R function. This hybrid approach utilizes LASSO2plus for variable selection but diverges from using its coefficients to construct the model. Instead, it generates an XGBoost model, producing an output akin to that of the XGBtraining function.

The output and format of the returned objects mirror those of the XGBtraining function. Furthermore, for each outcome type, the process generates two figures, two text files, and one table, saving them locally.

Binary outcome

bl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
                                 outfile = "binary_LASSO2plus_XGBoost")
## Start:  AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
## 
##            Df Deviance    AIC
## <none>           93.72 101.72
## - MUM1.IHC  1   107.09 113.09
## - MYC.IHC   1   115.41 121.41
## - CD10.IHC  1   119.33 125.33
## file saved to binary_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1]  train-logloss:0.511725 
## [2]  train-logloss:0.410850 
## [3]  train-logloss:0.348831 
## [4]  train-logloss:0.308959 
## [5]  train-logloss:0.283560
head(bl2xfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558

The primary outputs stem from LASSO2plus, while the latter part pertains to XGBoost. In the XGBoost output, each line denotes the training log-loss value for a specific model iteration. Log-loss, a widely used classification loss function, assesses the alignment between predicted probabilities and actual class labels. By default, the XGBtraining runs for 5 iterations.

The bl2xfit output comprises four components: the first being the XGBoost object, followed by the XGBoost scores for all entries in the tdat dataset. Notably, XGBoost, being a black box model, doesn’t yield coefficients but provides model scores. For binary outcomes, these scores represent the probability of the positive class. The remaining two items include the observed outcome and the outcome type.

Continuous outcome

cl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = AgeXvars,
                                 outcomeType = "continuous", Y = "Age",
                                 outfile = "continuous_LASSO2plus_XGBoost")
## Start:  AIC=1745.15
## Age ~ highIPI + PS1
## 
##           Df Deviance    AIC
## - highIPI  1    39626 1744.8
## <none>          39331 1745.2
## - PS1      1    39848 1746.0
## 
## Step:  AIC=1744.76
## Age ~ PS1
## 
##        Df Deviance    AIC
## <none>       39626 1744.8
## - PS1   1    41606 1753.3
## file saved to continuous_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1]  train-rmse:47.112278 
## [2]  train-rmse:34.492776 
## [3]  train-rmse:26.089994 
## [4]  train-rmse:20.707653 
## [5]  train-rmse:17.442132
head(cl2xfit$XGBoost_score)
##    pt103    pt246    pt874    pt219    pt138    pt328 
## 56.47782 52.21841 56.47782 52.21841 52.21841 53.58232

Similar to the previous case, the primary outputs arise from LASSO2plus, while the final section pertains to XGBoost. Within the XGBoost output, the train-rmse values signify the root mean squared error (RMSE) calculated during each iteration. RMSE measures the average discrepancy between predicted and actual values in the training set, with lower values indicating improved model performance.

The declining RMSE values showcase the iterative training of the XGBoost model, where each iteration aims to minimize the RMSE on the training set, indicating progressive learning and enhanced predictive abilities.

The cl2xfit output also includes four components: the XGBoost object, XGBoost scores for all entries in tdat, observed outcome, and outcome type.

Time-to-event outcome

tl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars,
                                 outcomeType = "time-to-event",
                                 time = "FFP..Years.", event = "Code.FFP",
                                 outfile = "survival_LASSO2plus_XGBoost")
## Start:  AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC + 
##     MUM1.IHC
## 
##              Df    AIC
## - HANS_GCB    1 813.18
## - B.Symptoms  1 813.36
## - highIPI     1 813.41
## - DTI         1 813.73
## - CD10.IHC    1 813.89
## - MUM1.IHC    1 814.17
## <none>          815.14
## - PS1         1 818.22
## - stage3_4    1 822.90
## - LDH.Ratio1  1 824.45
## 
## Step:  AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - B.Symptoms  1 811.39
## - highIPI     1 811.43
## - DTI         1 811.78
## - CD10.IHC    1 812.30
## - MUM1.IHC    1 812.38
## <none>          813.18
## - PS1         1 816.24
## - stage3_4    1 821.44
## - LDH.Ratio1  1 822.45
## 
## Step:  AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 + 
##     PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - highIPI     1 809.66
## - DTI         1 810.17
## - MUM1.IHC    1 810.59
## - CD10.IHC    1 810.62
## <none>          811.39
## - PS1         1 815.20
## - stage3_4    1 819.75
## - LDH.Ratio1  1 820.80
## 
## Step:  AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     DTI + CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - DTI         1 808.46
## - MUM1.IHC    1 808.86
## - CD10.IHC    1 808.94
## <none>          809.66
## - PS1         1 814.13
## - stage3_4    1 819.20
## - LDH.Ratio1  1 820.45
## 
## Step:  AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     CD10.IHC + MUM1.IHC
## 
##              Df    AIC
## - CD10.IHC    1 807.66
## - MUM1.IHC    1 807.79
## <none>          808.46
## - PS1         1 813.12
## - stage3_4    1 818.06
## - LDH.Ratio1  1 824.55
## 
## Step:  AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 + 
##     MUM1.IHC
## 
##              Df    AIC
## <none>          807.66
## - MUM1.IHC    1 808.22
## - PS1         1 813.79
## - stage3_4    1 817.87
## - LDH.Ratio1  1 824.33
## file saved to survival_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1]  train-cox-nloglik:4.873247 
## [2]  train-cox-nloglik:4.789556 
## [3]  train-cox-nloglik:4.743813 
## [4]  train-cox-nloglik:4.718117 
## [5]  train-cox-nloglik:4.695957
head(tl2xfit$XGBoost_score)
##     pt103     pt246     pt874     pt219     pt138     pt328 
## 1.5001757 0.2840731 1.5001757 0.3679212 0.1763741 0.7963076

Similarly, most outputs arise from LASSO2plus, while the final section pertains to XGBoost. In the XGBoost output, the negative log-likelihood serves as a prevalent loss function in survival analysis, encompassing Cox proportional hazards models. It quantifies dissimilarity between projected survival probabilities and observed survival times and events in the training data, aiming to minimize this metric for a better fit.

Monitoring the negative log-likelihood throughout training allows assessment of the model’s learning from data, ideally showcasing a decreasing trend signifying an improved fit to training data.

The tl2xfit output contains six components: XGBoost object, XGBoost scores for tdat, baseline hazard table, observed time, event, and outcome type.

5. Model prediction

In this section, we outline the prediction process for the six different modeling approaches included in this package when given the input variables (X) in a new dataset.

5.1 LASSO2 prediction

We begin by discussing predictions for LASSO2 model outcomes.

Binary outcome

To predict binary outcomes using LASSO2, we use the following code snippet:

pbl = LASSO2_predict(bl, newdata = vdat, outfile = "pred_LASSO2_binary")
head(pbl)
##        pt3       pt10       pt20       pt25       pt30       pt52 
## 0.20970033 0.18367987 0.04718620 0.04718620 0.10784060 0.02336746

The pbl object holds the predicted probabilities for the positive group for each entry/sample.
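
If hard class labels are needed, these probabilities can be thresholded; the 0.5 cutoff below is a hypothetical choice for illustration, and an application-appropriate cutoff should be chosen in practice:

pred_class = as.integer(pbl > 0.5)  # 1 = predicted positive class
table(pred_class)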

Continuous outcome

For predicting continuous outcomes, the code snippet is as follows:

pcl = LASSO2_predict(cl, newdata = vdat, outfile = "pred_LASSO2_cont")
head(pcl)

The pcl object holds the predicted Y values for each entry/sample.

Time-to-event outcome

When predicting time-to-event outcomes, we use the code:

ptl = LASSO2_predict(tl, newdata = vdat,
                     outfile = "pred_LASSO2_time_to_event")
head(ptl)

The ptl object holds predicted risk scores for each entry/sample.

5.2 LASSO2 + regular regression prediction

Moving forward, let’s explore predictions concerning the combined LASSO2 and regular regression model outcomes. The function rms_model specifically caters to model prediction when utilizing a regular modeling object like those produced by LASSO2_reg. Upon performing predictions for binary and continuous outcomes, this step generates one figure and five tables. Additionally, for time-to-event outcomes, an extra table is generated. These resulting files are all saved locally for convenient access.

Binary outcome

To predict binary outcomes using the LASSO2 + regular regression model:

pblr = rms_model(blr$fit, newdata = vdat, outfile = "pred_LASSO2reg_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.842931937  0.851528862  0.82351204  0.028016820      0.81491512
## R2         0.527718998  0.552879991  0.48756661  0.065313386      0.46240561
## Intercept  0.000000000  0.000000000 -0.15365199  0.153651991     -0.15365199
## Slope      1.000000000  1.000000000  0.80921368  0.190786316      0.80921368
## Emax       0.000000000  0.000000000  0.07527519  0.075275194      0.07527519
## D          0.310084555  0.322462320  0.28277788  0.039684442      0.27040011
## U         -0.009259259 -0.009259259  0.01821248 -0.027471744      0.01821248
## Q          0.319343814  0.331721579  0.26456539  0.067156186      0.25218763
## B          0.060659654  0.056392481  0.06413595 -0.007743470      0.06840312
## g          3.075229015  4.158616075  3.02608229  1.132533785      1.94269523
## gp         0.173738980  0.170714350  0.16773921  0.002975144      0.17076384
## Cindex     0.921465969  0.925764431  0.91175602  0.014008410      0.90745756
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200
head(pblr)
##       pt3      pt10      pt20      pt25      pt30      pt52 
## -2.101131 -2.221256 -5.814410 -4.230616 -2.895583 -8.613113

Continuous outcome

For continuous outcomes prediction:

pclr = rms_model(clr$fit, newdata = vdat,
                 outfile = "pred_LASSO2reg_continuous")

##           index.orig training     test optimism index.corrected   n
## R-square      0.0547   0.0632   0.0433   0.0199          0.0348 200
## MSE         182.0879 176.5652 184.2864  -7.7212        189.8091 200
## g             3.3604   3.4145   3.2358   0.1787          3.1817 200
## Intercept     0.0000   0.0000  -2.0414   2.0414         -2.0414 200
## Slope         1.0000   1.0000   1.0298  -0.0298          1.0298 200
head(pclr)
##      pt3     pt10     pt20     pt25     pt30     pt52 
## 62.00437 62.00437 62.00437 69.45115 65.13919 65.13919

Time-to-event outcome

To predict time-to-event outcomes:

ptlr = rms_model(tlr$fit, data = tdat, newdata = vdat,
                outfile = "pred_LASSO2reg_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.426967889  0.433890466 0.426121768  0.007768699     0.419199190 200
## R2      0.200249909  0.210292163 0.195782136  0.014510028     0.185739882 200
## Slope   1.000000000  1.000000000 0.972636972  0.027363028     0.972636972 200
## D       0.053514663  0.056885491 0.052163776  0.004721715     0.048792948 200
## U      -0.002312546 -0.002318816 0.001148229 -0.003467045     0.001154499 200
## Q       0.055827209  0.059204307 0.051015547  0.008188760     0.047638450 200
## g       0.813999013  0.832952948 0.797517544  0.035435404     0.778563609 200
## Cindex  0.713483944  0.716945233 0.713060884  0.003884349     0.709599595 200
head(ptlr)
##         pt3        pt10        pt20        pt25        pt30        pt52 
##  0.28074825  0.28074825 -0.73787179 -0.04536548  0.97325456  0.97325456

For time-to-event outcomes, rms_model requires the training dataset to be supplied via the data argument, as shown above.

5.3 LASSO_plus prediction

We also use rms_model to predict LASSO_plus model outcomes.

Binary outcome

To predict binary outcomes using the LASSO_plus model:

pbfit = rms_model(bfit$fit, newdata = vdat,
                  outfile = "pred_LASSOplus_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.790157068  0.800075272  0.77798534  0.022089931      0.76806714
## R2         0.481472357  0.497669844  0.45071403  0.046955810      0.43451655
## Intercept  0.000000000  0.000000000 -0.15094005  0.150940045     -0.15094005
## Slope      1.000000000  1.000000000  0.86220502  0.137794977      0.86220502
## Emax       0.000000000  0.000000000  0.06093911  0.060939114      0.06093911
## D          0.278185437  0.290630349  0.25787890  0.032751452      0.24543398
## U         -0.009259259 -0.009259259  0.01501785 -0.024277111      0.01501785
## Q          0.287444696  0.299889609  0.24286105  0.057028562      0.23041613
## B          0.064166124  0.062790107  0.06655954 -0.003769434      0.06793556
## g          2.702828842  3.576024982  2.70732762  0.868697361      1.83413148
## gp         0.162611026  0.164791936  0.15920238  0.005589553      0.15702147
## Cindex     0.895078534  0.900037636  0.88899267  0.011044966      0.88403357
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

Continuous outcome

For continuous outcomes prediction:

pcfit = rms_model(cfit$fit, newdata = vdat,
                  outfile = "pred_LASSOplus_continuous")

##           index.orig training     test optimism index.corrected   n
## R-square      0.1113   0.1240   0.0915   0.0325          0.0788 200
## MSE         171.1808 168.0692 175.0057  -6.9365        178.1174 200
## g             5.1667   5.2993   4.8105   0.4888          4.6780 200
## Intercept     0.0000   0.0000   4.9015  -4.9015          4.9015 200
## Slope         1.0000   1.0000   0.9248   0.0752          0.9248 200

Time-to-event outcome

To predict time-to-event outcomes:

ptfit = rms_model(tfit$fit, data = tdat, newdata = vdat,
                  outfile = "pred_LASSOplus_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.513853367  0.515499635 0.500640097  0.014859538     0.498993829 200
## R2      0.265362936  0.274046535 0.254619687  0.019426848     0.245936089 200
## Slope   1.000000000  1.000000000 0.954016306  0.045983694     0.954016306 200
## D       0.074222355  0.078065651 0.070698255  0.007367396     0.066854959 200
## U      -0.002312546 -0.002325827 0.001801053 -0.004126880     0.001814334 200
## Q       0.076534901  0.080391477 0.068897201  0.011494276     0.065040625 200
## g       1.078420120  1.107354866 1.044136718  0.063218149     1.015201972 200
## Cindex  0.756926684  0.757749818 0.750320048  0.007429769     0.749496915 200

5.4 LASSO2plus prediction

Similarly, we use rms_model to predict LASSO2plus model outcomes.

Binary outcome

To predict binary outcomes using the LASSO2plus model:

p2bfit = rms_model(b2fit$fit, newdata = vdat,
                   outfile = "pred_LASSO2plus_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.790157068  0.803563241  0.78029110  0.023272142      0.76688493
## R2         0.481472357  0.499574865  0.45148724  0.048087624      0.43338473
## Intercept  0.000000000  0.000000000 -0.10915293  0.109152931     -0.10915293
## Slope      1.000000000  1.000000000  0.85642426  0.143575743      0.85642426
## Emax       0.000000000  0.000000000  0.05395268  0.053952682      0.05395268
## D          0.278185437  0.289302762  0.25842693  0.030875828      0.24730961
## U         -0.009259259 -0.009259259  0.01815098 -0.027410241      0.01815098
## Q          0.287444696  0.298562022  0.24027595  0.058286069      0.22915863
## B          0.064166124  0.061466464  0.06658390 -0.005117440      0.06928356
## g          2.702828842  3.675632786  2.72701452  0.948618263      1.75421058
## gp         0.162611026  0.162674650  0.15950938  0.003165273      0.15944575
## Cindex     0.895078534  0.901781621  0.89014555  0.011636071      0.88344246
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

Continuous outcome

For continuous outcomes prediction:

p2cfit = rms_model(c2fit$fit, newdata = vdat,
                   outfile = "pred_LASSO2plus_continuous")

##           index.orig training     test optimism index.corrected   n
## R-square      0.0547   0.0654   0.0415   0.0239          0.0308 200
## MSE         182.0879 181.0783 184.6321  -3.5538        185.6417 200
## g             3.3604   3.5298   3.2081   0.3217          3.0387 200
## Intercept     0.0000   0.0000   1.6415  -1.6415          1.6415 200
## Slope         1.0000   1.0000   0.9755   0.0245          0.9755 200

Time-to-event outcome

To predict time-to-event outcomes:

p2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat,
                   outfile = "pred_LASSO2plus_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.513853367  0.517035127 0.501410912  0.015624214      0.49822915 200
## R2      0.265362936  0.275890472 0.254297677  0.021592794      0.24377014 200
## Slope   1.000000000  1.000000000 0.950393088  0.049606912      0.95039309 200
## D       0.074222355  0.078564634 0.070597812  0.007966822      0.06625553 200
## U      -0.002312546 -0.002332982 0.001318394 -0.003651376      0.00133883 200
## Q       0.076534901  0.080897616 0.069279418  0.011618198      0.06491670 200
## g       1.078420120  1.102956596 1.039869324  0.063087272      1.01533285 200
## Cindex  0.756926684  0.758517563 0.750705456  0.007812107      0.74911458 200

5.5 XGBoost prediction

Continuing, we discuss predictions for the XGBoost model outcomes.

Binary outcome

To predict binary outcomes using the XGBoost model:

pbxfit = XGBtraining_predict(bxfit, newdata = vdat,
                             outfile = "pred_XGBoost_binary")

Continuous outcome

For continuous outcomes prediction:

pcxfit = XGBtraining_predict(cxfit, newdata = vdat,
                             outfile = "pred_XGBoost_cont")

Time-to-event outcome

To predict time-to-event outcomes:

ptxfit = XGBtraining_predict(txfit, newdata = vdat,
                             outfile = "pred_XGBoost_time_to_event")

5.6 LASSO2 + XGBoost prediction

Next, we explore predictions for the combined LASSO2 and XGBoost model outcomes.

Binary outcome

To predict binary outcomes:

pblxfit = XGBtraining_predict(blxfit, newdata = vdat,
                              outfile = "pred_LXGBoost_binary")

Continuous outcome

To predict continuous outcomes:

pclxfit = XGBtraining_predict(clxfit, newdata = vdat,
                              outfile = "pred_LXGBoost_cont")

Time-to-event outcome

To predict time-to-event outcomes:

ptlxfit = XGBtraining_predict(tlxfit, newdata = vdat,
                              outfile = "pred_LXGBoost_time_to_event")

5.7 LASSO_plus + XGBoost prediction

Next, we discuss predictions for the combined LASSO_plus and XGBoost model outcomes.

Binary outcome

To predict binary outcomes:

pblpxfit = XGBtraining_predict(blpxfit, newdata = vdat,
                               outfile = "pred_LpXGBoost_binary")

Continuous outcome

For continuous outcomes prediction:

pclpxfit = XGBtraining_predict(clpxfit, newdata = vdat,
                               outfile = "pred_LpXGBoost_cont")

Time-to-event outcome

To predict time-to-event outcomes:

ptlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat,
                               outfile = "pred_LpXGBoost_time_to_event")

5.8 LASSO2plus + XGBoost prediction

Binary outcome

To predict binary outcomes using the LASSO2plus + XGBoost model:

pbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat,
                               outfile = "pred_L2XGBoost_binary")

Continuous outcome

For continuous outcomes prediction:

pcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat,
                               outfile = "pred_L2XGBoost_cont")

Time-to-event outcome

To predict time-to-event outcomes:

ptl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat,
                               outfile = "pred_L2XGBoost_time_to_event")

6. (External) Model Validation

In the validation phase, we assess our models' effectiveness on a fresh dataset that includes the outcome variable. This separate dataset, known as the validation dataset, is distinct from the one used for training, and assessing a model on it is termed external validation. This distinction is crucial, setting it apart from internal validation methods such as sampling, cross-validation, leave-one-out, and bootstrapping.

It’s important to emphasize that while the same functions are used for both prediction and validation, the validation process requires the inclusion of an outcome variable. This distinction prompts additional analyses and comparisons beyond mere prediction.

All generated validation plots and associated result files are stored locally for easy reference.

6.1 LASSO2 validation

We conduct validation for the LASSO2 model with different types of outcome variables.

Binary outcome

vbl = LASSO2_predict(bl, newdata = vdat, newY = TRUE,
                     outfile = "valid_LASSO2_binary")

The returned object vbl also holds the predicted probabilities for the ’DZsig’ positive group; in addition, a validation performance figure is saved locally.
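
If a numeric summary is desired in addition to the saved figure, the AUC can be computed directly from the predicted probabilities. A minimal base-R sketch using the rank (Wilcoxon) formulation, assuming vbl is ordered as the rows of vdat and that DZsig is coded 0/1:

# rank-based AUC: probability that a random positive outranks a random negative
y = vdat$DZsig
n1 = sum(y == 1); n0 = sum(y == 0)
(sum(rank(vbl)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)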

Continuous outcome

vcl = LASSO2_predict(cl, newdata = vdat, newY = TRUE,
                     outfile = "valid_LASSO2_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Similarly, the returned object vcl holds the predicted values, and a validation performance plot is saved.
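
Agreement between predicted and observed values can also be quantified directly. A minimal base-R sketch, assuming vcl is ordered as the rows of vdat and that Age is the observed outcome, as in our modeling examples:

# root-mean-square error and Pearson correlation of predicted vs. observed
obs = vdat$Age
sqrt(mean((vcl - obs)^2, na.rm = TRUE))
cor(vcl, obs, use = "complete.obs")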

Time-to-event outcome

vtl = LASSO2_predict(tl, newdata = vdat, newY = TRUE,
               outfile = "valid_LASSO2_time_to_event")

The returned object vtl holds the predicted risk scores, and the locally saved validation results include a calibration plot and a table containing performance statistics.
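
Discrimination of the predicted risk scores can likewise be checked with a concordance index from the survival package (a csmpv dependency). A minimal sketch; reverse = TRUE reflects that higher risk scores should correspond to shorter survival:

library(survival)
# Harrell's C-index for the predicted risk scores on the validation cohort
concordance(Surv(FFP..Years., Code.FFP) ~ vtl, data = vdat, reverse = TRUE)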

6.2 LASSO2 + regular regression validation

Similar to the prediction step, we use rms_model to validate the combined LASSO2 and regular regression model.

Binary outcome

vblr = rms_model(blr$fit, newdata = vdat, newY = TRUE,
                 outfile = "valid_LASSO2reg_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.842931937  0.852864366  0.82443141  0.028432952      0.81449898
## R2         0.527718998  0.556745099  0.48959034  0.067154755      0.46056424
## Intercept  0.000000000  0.000000000 -0.16861874  0.168618736     -0.16861874
## Slope      1.000000000  1.000000000  0.80183609  0.198163913      0.80183609
## Emax       0.000000000  0.000000000  0.08034152  0.080341516      0.08034152
## D          0.310084555  0.329349113  0.28405687  0.045292238      0.26479232
## U         -0.009259259 -0.009259259  0.02034229 -0.029601553      0.02034229
## Q          0.319343814  0.338608372  0.26371458  0.074893791      0.24445002
## B          0.060659654  0.056924343  0.06396689 -0.007042547      0.06770220
## g          3.075229015  4.241118946  3.01611986  1.224999086      1.85022993
## gp         0.173738980  0.173862233  0.16806814  0.005794088      0.16794489
## Cindex     0.921465969  0.926432183  0.91221571  0.014216476      0.90724949
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

The above code generates and saves two figures and five tables, some of which duplicate those from the prediction step.

Continuous outcome

vclr = rms_model(clr$fit, newdata = vdat, newY = TRUE,
                 outfile = "valid_LASSO2reg_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

##           index.orig training     test optimism index.corrected   n
## R-square      0.0547   0.0634   0.0421   0.0213          0.0333 200
## MSE         182.0879 181.9932 184.5155  -2.5223        184.6102 200
## g             3.3604   3.4535   3.2035   0.2500          3.1105 200
## Intercept     0.0000   0.0000  -0.1968   0.1968         -0.1968 200
## Slope         1.0000   1.0000   1.0027  -0.0027          1.0027 200

The above code also generates and saves two figures and five tables, some of which duplicate those from the prediction step.

Time-to-event outcome

vtlr = rms_model(tlr$fit, data = tdat, newdata = vdat, newY = TRUE,
                 outfile = "valid_LASSO2reg_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.426967889  0.437033445 0.426916738  0.010116707     0.416851182 200
## R2      0.200249909  0.212654732 0.195929693  0.016725039     0.183524870 200
## Slope   1.000000000  1.000000000 0.964331482  0.035668518     0.964331482 200
## D       0.053514663  0.057353587 0.052206694  0.005146893     0.048367771 200
## U      -0.002312546 -0.002306487 0.001012522 -0.003319010     0.001006464 200
## Q       0.055827209  0.059660074 0.051194172  0.008465902     0.047361307 200
## g       0.813999013  0.838751394 0.798348552  0.040402842     0.773596172 200
## Cindex  0.713483944  0.718516722 0.713458369  0.005058353     0.708425591 200

As in the prediction step, validation for a time-to-event outcome requires the training data as well. The above code generates and saves two figures and six tables, some of which duplicate those from the prediction step.

6.3 LASSO_plus validation

Next, we utilize the same rms_model function for validating the LASSO_plus models. The parameter settings and outputs mirror those detailed for the combined LASSO2 and regular regression validation in Section 6.2.

Binary outcome

vbfit = rms_model(bfit$fit, newdata = vdat, newY = TRUE,
                  outfile = "valid_LASSOplus_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.790157068  0.803388279  0.77933822  0.024050059      0.76610701
## R2         0.481472357  0.499006824  0.45086750  0.048139323      0.43333303
## Intercept  0.000000000  0.000000000 -0.15539422  0.155394220     -0.15539422
## Slope      1.000000000  1.000000000  0.85343547  0.146564526      0.85343547
## Emax       0.000000000  0.000000000  0.06404446  0.064044461      0.06404446
## D          0.278185437  0.289255131  0.25796669  0.031288436      0.24689700
## U         -0.009259259 -0.009259259  0.01900001 -0.028259273      0.01900001
## Q          0.287444696  0.298514390  0.23896668  0.059547709      0.22789699
## B          0.064166124  0.062653488  0.06658645 -0.003932959      0.06809908
## g          2.702828842  3.697474633  2.71932938  0.978145248      1.72468359
## gp         0.162611026  0.163854502  0.15944139  0.004413114      0.15819791
## Cindex     0.895078534  0.901694140  0.88966911  0.012025030      0.88305350
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

Continuous outcome

vcfit = rms_model(cfit$fit, newdata = vdat, newY = TRUE,
                  outfile = "valid_LASSOplus_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

##           index.orig training     test optimism index.corrected   n
## R-square      0.1113   0.1256   0.0923   0.0332          0.0781 200
## MSE         171.1808 166.7987 174.8346  -8.0360        179.2168 200
## g             5.1667   5.3244   4.8371   0.4872          4.6795 200
## Intercept     0.0000   0.0000   5.0827  -5.0827          5.0827 200
## Slope         1.0000   1.0000   0.9191   0.0809          0.9191 200

Time-to-event outcome

vtfit = rms_model(tfit$fit, data = tdat, newdata = vdat, newY = TRUE,
                  outfile = "valid_LASSOplus_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.513853367  0.520177786 0.500051861  0.020125925     0.493727442 200
## R2      0.265362936  0.278929001 0.254352540  0.024576461     0.240786475 200
## Slope   1.000000000  1.000000000 0.942296628  0.057703372     0.942296628 200
## D       0.074222355  0.079377064 0.070615670  0.008761394     0.065460961 200
## U      -0.002312546 -0.002325518 0.001531321 -0.003856839     0.001544293 200
## Q       0.076534901  0.081702582 0.069084349  0.012618233     0.063916668 200
## g       1.078420120  1.119889538 1.042314482  0.077575056     1.000845064 200
## Cindex  0.756926684  0.760088893 0.750025931  0.010062963     0.746863721 200

6.4 LASSO2plus validation

Additionally, we leverage the same rms_model function to validate the LASSO2plus models. The parameter configurations and outputs align with those outlined for the combined LASSO2 and regular regression validation detailed in Section 6.2.

Binary outcome

v2bfit = rms_model(b2fit$fit, newdata = vdat, newY = TRUE,
                   outfile = "valid_LASSO2plus_binary")

##             index.orig     training        test     optimism index.corrected
## Dxy        0.790157068  0.808066446  0.77752042  0.030546028      0.75961104
## R2         0.481472357  0.508765669  0.44657735  0.062188322      0.41928404
## Intercept  0.000000000  0.000000000 -0.19134450  0.191344497     -0.19134450
## Slope      1.000000000  1.000000000  0.81785604  0.182143959      0.81785604
## Emax       0.000000000  0.000000000  0.08073939  0.080739386      0.08073939
## D          0.278185437  0.298129804  0.25516929  0.042960514      0.23522492
## U         -0.009259259 -0.009259259  0.01631215 -0.025571409      0.01631215
## Q          0.287444696  0.307389063  0.23885714  0.068531924      0.21891277
## B          0.064166124  0.061327218  0.06655706 -0.005229840      0.06939596
## g          2.702828842  3.794266494  2.74176003  1.052506467      1.65032238
## gp         0.162611026  0.166459228  0.15876477  0.007694462      0.15491656
## Cindex     0.895078534  0.904033223  0.88876021  0.015273014      0.87980552
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

Continuous outcome

v2cfit = rms_model(c2fit$fit, newdata = vdat, newY = TRUE,
                   outfile = "valid_LASSO2plus_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

##           index.orig training     test optimism index.corrected   n
## R-square      0.0547   0.0639   0.0415   0.0224          0.0323 200
## MSE         182.0879 178.9352 184.6205  -5.6852        187.7732 200
## g             3.3604   3.4703   3.1967   0.2736          3.0868 200
## Intercept     0.0000   0.0000   0.1922  -0.1922          0.1922 200
## Slope         1.0000   1.0000   0.9974   0.0026          0.9974 200

Time-to-event outcome

v2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat, newY = TRUE,
                   outfile = "valid_LASSO2plus_time_to_event")

##          index.orig     training        test     optimism index.corrected   n
## Dxy     0.513853367  0.521392165 0.499849389  0.021542776     0.492310591 200
## R2      0.265362936  0.280418373 0.254040389  0.026377985     0.238984951 200
## Slope   1.000000000  1.000000000 0.936835025  0.063164975     0.936835025 200
## D       0.074222355  0.080365460 0.070513329  0.009852131     0.064370224 200
## U      -0.002312546 -0.002338996 0.001542323 -0.003881319     0.001568773 200
## Q       0.076534901  0.082704456 0.068971006  0.013733449     0.062801452 200
## g       1.078420120  1.122557100 1.042425107  0.080131993     0.998288127 200
## Cindex  0.756926684  0.760696083 0.749924695  0.010771388     0.746155295 200

6.5 XGBoost validation

The XGBtraining_predict function introduced in Section 5.5, as indicated by its name, also serves for model validation when the outcome variable is present in the validation cohort. The parameter settings and outputs are the same as those for the LASSO2_predict function detailed in Section 6.1.

Binary outcome

vbxfit = XGBtraining_predict(bxfit, newdata = vdat, newY = TRUE,
                             outfile = "valid_XGBoost_binary")

Predicted probability for the positive group is given for each entry/sample.
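
A cross-tabulation of thresholded predictions against the observed labels gives a quick error breakdown. A minimal sketch, assuming a 0.5 cutoff and that vbxfit holds the probability vector aligned with the rows of vdat:

# confusion matrix for an assumed 0.5 probability cutoff
table(predicted = vbxfit > 0.5, observed = vdat$DZsig)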

Continuous outcome

vcxfit = XGBtraining_predict(cxfit, newdata = vdat, newY = TRUE,
                             outfile = "valid_XGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Time-to-event outcome

vtxfit = XGBtraining_predict(txfit, newdata = vdat, newY = TRUE,
                             outfile = "valid_XGBoost_time_to_event")

6.6 LASSO2 + XGBoost validation

The same XGBtraining_predict function is employed for LASSO2 + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs.

Binary outcome

vblxfit = XGBtraining_predict(blxfit, newdata = vdat, newY = TRUE,
                              outfile = "valid_LXGBoost_binary")

Continuous outcome

vclxfit = XGBtraining_predict(clxfit, newdata = vdat, newY = TRUE,
                              outfile = "valid_LXGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Time-to-event outcome

vtlxfit = XGBtraining_predict(tlxfit, newdata = vdat, newY = TRUE,
                              outfile = "valid_LXGBoost_time_to_event")

6.7 LASSO_plus + XGBoost validation

The same XGBtraining_predict function is employed for LASSO_plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs.

Binary outcome

vblpxfit = XGBtraining_predict(blpxfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_LpXGBoost_binary")

Continuous outcome

vclpxfit = XGBtraining_predict(clpxfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_LpXGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Time-to-event outcome

vtlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_LpXGBoost_time_to_event")

6.8 LASSO2plus + XGBoost validation

The same XGBtraining_predict function is employed for LASSO2plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with consistent parameter settings and identical outputs.

Binary outcome

vbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_L2XGBoost_binary")

Continuous outcome

vcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_L2XGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Time-to-event outcome

vtl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat, newY = TRUE,
                               outfile = "valid_L2XGBoost_time_to_event")

7. All-in-one!

If you find it challenging to call various functions separately, the all-in-one function provides a simplified solution. It efficiently manages predictive model development and validation for all eight methods integrated into this package, spanning three distinct outcome types, with a single function call. Moreover, you can employ this versatile function for a single method with one or more outcome variables, offering flexibility to suit your specific needs. If a validation dataset is at your disposal, the function seamlessly incorporates the validation process within the same operation.

modelout = csmpvModelling(tdat = tdat, vdat = vdat,
                          Ybinary = "DZsig", varsBinary = Xvars,
                          Ycont = "Age", varsCont = AgeXvars,
                          time = "FFP..Years.", event = "Code.FFP",
                          varsSurvival = Xvars,
                          outfileName= "all_in_one")

This single function call generates all models and provides predictions and validations for each of them. To save space, the running results are hidden. In other words, this one call can replace everything discussed in Sections 4, 5, and 6. The models are returned, and all 179 result files are saved locally when the function is run with our example training data (tdat) and validation data (vdat).

Of course, we can also use this all-in-one function for one outcome variable and one model at a time, for example:

DZlassoreg = csmpvModelling(tdat = tdat, vdat = vdat,
                            Ybinary = "DZsig", varsBinary = Xvars,
                            methods = "LASSO2_reg",
                            outfileName= "just_one")
## Resized limits to included dashed line in forest panel
## Resized limits to included dashed line in forest panel
## Resized limits to included dashed line in forest panel
## file saved to just_one_binary_LASSO2reg_LASSO_reg.pdf
## file saved to just_one_binary_LASSO2reg_LASSO_regallMarks.pdf

##             index.orig     training        test     optimism index.corrected
## Dxy        0.790157068  0.809430853  0.77916440  0.030266455      0.75989061
## R2         0.481472357  0.513673005  0.45482537  0.058847637      0.42262472
## Intercept  0.000000000  0.000000000 -0.15184124  0.151841242     -0.15184124
## Slope      1.000000000  1.000000000  0.83708041  0.162919589      0.83708041
## Emax       0.000000000  0.000000000  0.06748667  0.067486675      0.06748667
## D          0.278185437  0.302503596  0.26051003  0.041993564      0.23619187
## U         -0.009259259 -0.009259259  0.01651497 -0.025774229      0.01651497
## Q          0.287444696  0.311762855  0.24399506  0.067767793      0.21967690
## B          0.064166124  0.060846460  0.06618349 -0.005337033      0.06950316
## g          2.702828842  3.711119190  2.71891485  0.992204335      1.71062451
## gp         0.162611026  0.167573743  0.16005494  0.007518802      0.15509222
## Cindex     0.895078534  0.904715427  0.88958220  0.015133228      0.87994531
##             n
## Dxy       200
## R2        200
## Intercept 200
## Slope     200
## Emax      200
## D         200
## U         200
## Q         200
## B         200
## g         200
## gp        200
## Cindex    200

This is equivalent to using LASSO2_reg for modeling, followed by prediction and validation with rms_model for the classification task “DZsig”. Six result files are then saved locally.

8. Special modelling

In the preceding sections, the target model type consistently matched the provided outcome variable. However, scenarios can emerge where they do not necessarily correspond.

For instance, situations might arise in which we aim to construct a risk classification model even when our training cohort lacks risk classification data but includes survival information.

To undertake this specialized modeling, let's assume that we possess a set of variables associated with survival outcomes. This variable list could stem from other research and be validated within the given training dataset, or it could be established through variable selection techniques such as LASSO2, LASSO_plus, and LASSO2plus.

By employing the same variable list, denoted as Xvars, we can invoke the XGpred function, with the option to perform variable selection with LASSO2. This wrapper function applies XGBoost and Cox modeling to derive high- and low-risk groups from the survival data. These groups are then filtered and used to construct both an XGpred (linear prediction score) model and an empirical Bayesian-based binary risk classification model.

Build the XGpred object for the training cohort:

xgobj = XGpred(data = tdat, varsIn = Xvars, 
               selection = TRUE,
               time = "FFP..Years.",
               event = "Code.FFP", outfile = "XGpred")

The XGpred output object, xgobj, contains all the necessary information for risk classification, including that of the training cohort.

To observe the performance of the risk classification in the training set, we can generate a KM plot using the confirmVars function:

tdat$XGpred_class = xgobj$XGpred_prob_class
training_risk_confirm = confirmVars(data = tdat, biomks = "XGpred_class",
                                    time = "FFP..Years.", event = "Code.FFP",
                                    outfile = "training_riskSurvival",
                                    outcomeType = "time-to-event")
training_risk_confirm[[3]]

Then we can predict the risk classification for a validation cohort:

xgNew = XGpred_predict(newdat = vdat, XGpredObj = xgobj)

While the default calibration shift (scoreShift) is set to 0, you can adjust it based on model scores if there’s a platform/batch difference between the training and validation cohorts.
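
One simple way to estimate such a shift is to compare the centers of the model scores across cohorts. The sketch below is purely illustrative: XGpred_score is a hypothetical component name, so please check names(xgobj) and names(xgNew) for the actual score elements before adapting it:

# hypothetical component names; align validation scores to the training median
shift = median(xgobj$XGpred_score) - median(xgNew$XGpred_score)
xgNew2 = XGpred_predict(newdat = vdat, XGpredObj = xgobj, scoreShift = shift)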

If survival data is available for the validation dataset, we can employ the confirmVars function introduced earlier to assess the reasonableness of the risk classification.

vdat$XGpred_class = xgNew$XGpred_prob_class
risk_confirm = confirmVars(data = vdat, biomks = "XGpred_class",
                           time = "FFP..Years.", event = "Code.FFP",
                           outfile = "riskSurvival",
                           outcomeType = "time-to-event")
risk_confirm[[3]]

csmpv R package general information

Title: Biomarker confirmation, selection, modelling, prediction and validation

Version: 1.0.2

Author: Aixiang Jiang

Maintainer: Aixiang Jiang

Depends: R (>= 4.2.0)

Suggests: knitr

VignetteBuilder: knitr

Imports: survival, glmnet, Hmisc, rms, forestmodel, ggplot2, ggpubr, survminer, mclust, xgboost, cowplot

References

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31)
##  os       macOS Ventura 13.2.1
##  system   x86_64, darwin20
##  ui       X11
##  language (EN)
##  collate  C
##  ctype    en_US.UTF-8
##  tz       America/Vancouver
##  date     2024-01-10
##  pandoc   2.19.2 @ /Users/aijiang/Desktop/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  abind          1.4-5   2016-07-21 [3] CRAN (R 4.3.0)
##  backports      1.4.1   2021-12-13 [3] CRAN (R 4.3.0)
##  base64enc      0.1-3   2015-07-28 [3] CRAN (R 4.3.0)
##  broom          1.0.5   2023-06-09 [3] CRAN (R 4.3.0)
##  bslib          0.6.1   2023-11-28 [3] CRAN (R 4.3.0)
##  cachem         1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  car            3.1-2   2023-03-30 [3] CRAN (R 4.3.0)
##  carData        3.0-5   2022-01-06 [3] CRAN (R 4.3.0)
##  checkmate      2.3.1   2023-12-04 [3] CRAN (R 4.3.0)
##  cli            3.6.2   2023-12-11 [3] CRAN (R 4.3.0)
##  cluster        2.1.4   2022-08-22 [4] CRAN (R 4.3.2)
##  codetools      0.2-19  2023-02-01 [4] CRAN (R 4.3.2)
##  colorspace     2.1-0   2023-01-23 [3] CRAN (R 4.3.0)
##  commonmark     1.9.0   2023-03-17 [3] CRAN (R 4.3.0)
##  cowplot        1.1.2   2023-12-15 [3] CRAN (R 4.3.0)
##  csmpv        * 1.0.2   2024-01-10 [1] local
##  data.table     1.14.10 2023-12-08 [3] CRAN (R 4.3.0)
##  devtools       2.4.5   2022-10-11 [3] CRAN (R 4.3.0)
##  digest         0.6.33  2023-07-07 [3] CRAN (R 4.3.0)
##  dplyr          1.1.4   2023-11-17 [3] CRAN (R 4.3.0)
##  ellipsis       0.3.2   2021-04-29 [3] CRAN (R 4.3.0)
##  evaluate       0.23    2023-11-01 [3] CRAN (R 4.3.0)
##  fansi          1.0.6   2023-12-08 [3] CRAN (R 4.3.0)
##  farver         2.1.1   2022-07-06 [3] CRAN (R 4.3.0)
##  fastmap        1.1.1   2023-02-24 [3] CRAN (R 4.3.0)
##  foreach        1.5.2   2022-02-02 [3] CRAN (R 4.3.0)
##  foreign        0.8-86  2023-11-28 [3] CRAN (R 4.3.0)
##  forestmodel    0.6.2   2020-07-19 [3] CRAN (R 4.3.0)
##  Formula        1.2-5   2023-02-24 [3] CRAN (R 4.3.0)
##  fs             1.6.3   2023-07-20 [3] CRAN (R 4.3.0)
##  generics       0.1.3   2022-07-05 [3] CRAN (R 4.3.0)
##  ggplot2        3.4.4   2023-10-12 [3] CRAN (R 4.3.0)
##  ggpubr         0.6.0   2023-02-10 [3] CRAN (R 4.3.0)
##  ggsignif       0.6.4   2022-10-13 [3] CRAN (R 4.3.0)
##  ggtext         0.1.2   2022-09-16 [3] CRAN (R 4.3.0)
##  glmnet         4.1-8   2023-08-22 [3] CRAN (R 4.3.0)
##  glue           1.6.2   2022-02-24 [3] CRAN (R 4.3.0)
##  gridExtra      2.3     2017-09-09 [3] CRAN (R 4.3.0)
##  gridtext       0.1.5   2022-09-16 [3] CRAN (R 4.3.0)
##  gtable         0.3.4   2023-08-21 [3] CRAN (R 4.3.0)
##  highr          0.10    2022-12-22 [3] CRAN (R 4.3.0)
##  Hmisc          5.1-1   2023-09-12 [3] CRAN (R 4.3.0)
##  htmlTable      2.4.2   2023-10-29 [3] CRAN (R 4.3.0)
##  htmltools      0.5.7   2023-11-03 [3] CRAN (R 4.3.0)
##  htmlwidgets    1.6.4   2023-12-06 [3] CRAN (R 4.3.0)
##  httpuv         1.6.13  2023-12-06 [3] CRAN (R 4.3.0)
##  iterators      1.0.14  2022-02-05 [3] CRAN (R 4.3.0)
##  jquerylib      0.1.4   2021-04-26 [3] CRAN (R 4.3.0)
##  jsonlite       1.8.8   2023-12-04 [3] CRAN (R 4.3.0)
##  km.ci          0.5-6   2022-04-06 [3] CRAN (R 4.3.0)
##  KMsurv         0.1-5   2012-12-03 [3] CRAN (R 4.3.0)
##  knitr          1.45    2023-10-30 [3] CRAN (R 4.3.0)
##  labeling       0.4.3   2023-08-29 [3] CRAN (R 4.3.0)
##  later          1.3.2   2023-12-06 [3] CRAN (R 4.3.0)
##  lattice        0.22-5  2023-10-24 [3] CRAN (R 4.3.0)
##  lifecycle      1.0.4   2023-11-07 [3] CRAN (R 4.3.0)
##  magrittr       2.0.3   2022-03-30 [3] CRAN (R 4.3.0)
##  markdown       1.12    2023-12-06 [3] CRAN (R 4.3.0)
##  MASS           7.3-60  2023-05-04 [4] CRAN (R 4.3.2)
##  Matrix         1.6-4   2023-11-30 [3] CRAN (R 4.3.0)
##  MatrixModels   0.5-3   2023-11-06 [3] CRAN (R 4.3.0)
##  memoise        2.0.1   2021-11-26 [3] CRAN (R 4.3.0)
##  mgcv           1.9-1   2023-12-21 [3] CRAN (R 4.3.0)
##  mime           0.12    2021-09-28 [3] CRAN (R 4.3.0)
##  miniUI         0.1.1.1 2018-05-18 [3] CRAN (R 4.3.0)
##  multcomp       1.4-25  2023-06-20 [3] CRAN (R 4.3.0)
##  munsell        0.5.0   2018-06-12 [3] CRAN (R 4.3.0)
##  mvtnorm        1.2-4   2023-11-27 [3] CRAN (R 4.3.0)
##  nlme           3.1-164 2023-11-27 [3] CRAN (R 4.3.0)
##  nnet           7.3-19  2023-05-03 [4] CRAN (R 4.3.2)
##  pillar         1.9.0   2023-03-22 [3] CRAN (R 4.3.0)
##  pkgbuild       1.4.3   2023-12-10 [3] CRAN (R 4.3.0)
##  pkgconfig      2.0.3   2019-09-22 [3] CRAN (R 4.3.0)
##  pkgload        1.3.3   2023-09-22 [3] CRAN (R 4.3.0)
##  polspline      1.1.24  2023-10-26 [3] CRAN (R 4.3.0)
##  profvis        0.3.8   2023-05-02 [3] CRAN (R 4.3.0)
##  promises       1.2.1   2023-08-10 [3] CRAN (R 4.3.0)
##  purrr          1.0.2   2023-08-10 [3] CRAN (R 4.3.0)
##  quantreg       5.97    2023-08-19 [3] CRAN (R 4.3.0)
##  R6             2.5.1   2021-08-19 [3] CRAN (R 4.3.0)
##  Rcpp           1.0.11  2023-07-06 [3] CRAN (R 4.3.0)
##  remotes        2.4.2.1 2023-07-18 [3] CRAN (R 4.3.0)
##  rlang          1.1.2   2023-11-04 [3] CRAN (R 4.3.0)
##  rmarkdown      2.25    2023-09-18 [3] CRAN (R 4.3.0)
##  rms            6.7-1   2023-09-12 [3] CRAN (R 4.3.0)
##  rpart          4.1.23  2023-12-05 [3] CRAN (R 4.3.0)
##  rstatix        0.7.2   2023-02-01 [3] CRAN (R 4.3.0)
##  rstudioapi     0.15.0  2023-07-07 [3] CRAN (R 4.3.0)
##  sandwich       3.1-0   2023-12-11 [3] CRAN (R 4.3.0)
##  sass           0.4.8   2023-12-06 [3] CRAN (R 4.3.0)
##  scales         1.3.0   2023-11-28 [3] CRAN (R 4.3.0)
##  sessioninfo    1.2.2   2021-12-06 [3] CRAN (R 4.3.0)
##  shape          1.4.6   2021-05-19 [3] CRAN (R 4.3.0)
##  shiny          1.8.0   2023-11-17 [3] CRAN (R 4.3.0)
##  SparseM        1.81    2021-02-18 [3] CRAN (R 4.3.0)
##  stringi        1.8.3   2023-12-11 [3] CRAN (R 4.3.0)
##  stringr        1.5.1   2023-11-14 [3] CRAN (R 4.3.0)
##  survival       3.5-7   2023-08-14 [3] CRAN (R 4.3.0)
##  survminer      0.4.9   2021-03-09 [3] CRAN (R 4.3.0)
##  survMisc       0.5.6   2022-04-07 [3] CRAN (R 4.3.0)
##  TH.data        1.1-2   2023-04-17 [3] CRAN (R 4.3.0)
##  tibble         3.2.1   2023-03-20 [3] CRAN (R 4.3.0)
##  tidyr          1.3.0   2023-01-24 [3] CRAN (R 4.3.0)
##  tidyselect     1.2.0   2022-10-10 [3] CRAN (R 4.3.0)
##  urlchecker     1.0.1   2021-11-30 [3] CRAN (R 4.3.0)
##  usethis        2.2.2   2023-07-06 [3] CRAN (R 4.3.0)
##  utf8           1.2.4   2023-10-22 [3] CRAN (R 4.3.0)
##  vctrs          0.6.5   2023-12-01 [3] CRAN (R 4.3.0)
##  withr          2.5.2   2023-10-30 [3] CRAN (R 4.3.0)
##  xfun           0.41    2023-11-01 [3] CRAN (R 4.3.0)
##  xgboost        1.7.6.1 2023-12-06 [3] CRAN (R 4.3.0)
##  xml2           1.3.6   2023-12-04 [3] CRAN (R 4.3.0)
##  xtable         1.8-4   2019-04-21 [3] CRAN (R 4.3.0)
##  yaml           2.3.8   2023-12-11 [3] CRAN (R 4.3.0)
##  zoo            1.8-12  2023-04-13 [3] CRAN (R 4.3.0)
## 
##  [1] /private/var/folders/mw/nv2pnn4x0rz0t3rxgfl_l8twlkqc_t/T/RtmpQJN2JH/Rinst5f0ac4740a4
##  [2] /private/var/folders/mw/nv2pnn4x0rz0t3rxgfl_l8twlkqc_t/T/Rtmpmp1GHu/temp_libpath5ee535144f7f
##  [3] /Users/aijiang/Library/R/x86_64/4.3/library
##  [4] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Hastie et al. (1992). ISBN 0-534-16765-9.
Therneau et al. (2000). ISBN 0-387-98784-3.
Friedman et al. (2010). doi:10.18637/jss.v033.i01
Simon et al. (2011). doi:10.18637/jss.v039.i05
Chen and Guestrin (2016). arXiv:1603.02754
Aoki et al. (2023). doi:10.1200/JCO.23.01115