The csmpv R package offers a comprehensive array of functions covering biomarker confirmation, variable selection, modeling, predictive analysis, and validation. Its primary objectives encompass:
Biomarker Confirmation/Validation: This feature employs both single-variable and multivariable regression techniques to confirm and validate established biomarkers.
Biomarker Discovery: The package streamlines the identification of new biomarkers through variable selection methods like LASSO2, LASSO_plus, and LASSO2plus.
Predictive Model Development: By harnessing a fusion of machine learning and traditional statistical tools, this process facilitates the creation of predictive models focused on specific biomarkers.
Model Prediction: Developed models can predict outcomes when applied to new datasets.
Model Validation: Developed models can be validated on new datasets, provided an outcome variable is present.
To simplify the modeling process, we’ve designed an all-in-one function capable of managing predictive model development, prediction, and validation for all eight methods within this package across three distinct outcome types. This versatile function streamlines the process, allowing for a concise implementation with just a single function call. It can handle a single method with single or multiple outcome variables. Moreover, if a validation dataset is available, the prediction and validation processes can seamlessly integrate into a unified operation.
In addition to these core functionalities, the csmpv package introduces a unique approach allowing the creation of binary classification models based on survival models. This innovative feature enables predicting binary outcomes for new datasets using the developed model. Please note, the external validation of this model is limited due to the absence of binary classification variables in new datasets. Despite this limitation, the predicted binary classification can serve as a surrogate biomarker, and its correlation with survival outcomes in new datasets can be tested when survival outcome information is available.
The package handles three types of outcome variables: binary, continuous, and time-to-event.
To enhance user experience, the csmpv R package focuses on streamlining coding efforts. Each user-end function acts as a comprehensive wrapper condensing multiple analyses into a single function call. Additionally, result files are conveniently saved locally, further simplifying the analytical process.
The csmpv package is available on CRAN, and it can be directly installed in R using the following command:
install.packages("csmpv")
Alternatively, csmpv can be installed from GitHub using either the devtools or the remotes R package:
# Install devtools package if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")
# Install csmpv package from GitHub
devtools::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions
# Install remotes package if not already installed
install.packages("remotes")
# Install csmpv package from GitHub
remotes::install_github("ajiangsfu/csmpv",force = TRUE)
# Using force = TRUE will ensure the installation, overriding any existing versions
Both methods will download and install the csmpv package from the GitHub repository. Please ensure an active internet connection and the necessary dependencies for a successful installation.
In this section, we show some example code. Before that, we introduce the example data.
The example data was extracted from our in-house diffuse large B-cell lymphoma (DLBCL) dataset, specifically utilizing supplemental Table S1 from Alduaij et al. (2023, DOI: 10.1182/blood.2022018248).
The data contained a substantial amount of missing values, with only 38% complete cases. We therefore conducted Little’s MCAR test, which indicated that the values were not missing completely at random, so we chose to handle the missing values rather than exclude incomplete cases. Multiple imputation is a robust strategy that works across a variety of missing-data scenarios; for illustrative purposes, however, we generated only one imputation.
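As a minimal sketch of how the share of complete cases can be checked, the base R snippet below uses a small made-up data frame (toy and its columns are hypothetical, not part of datlist). It does not reproduce Little’s MCAR test or the imputation step, which require additional packages.

```r
# Hypothetical toy data with some missing values (not the DLBCL data)
toy = data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c("x", "y", NA, "x", "y"),
  c = c(0.2, 0.4, 0.1, NA, 0.9)
)

# Proportion of rows with no missing value in any column
# (rows 1 and 5 are complete, so 2 of 5 rows)
complete_frac = mean(complete.cases(toy))

# Proportion of missing values per variable
missing_per_var = colMeans(is.na(toy))

print(complete_frac)
print(missing_per_var)
```

A low complete-case fraction, as in the example data, is the situation in which dropping incomplete rows would discard most of the sample.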
Furthermore, to ensure compatibility with all eight modeling methods within csmpv, we transformed all categorical variables into binary format, overcoming limitations in XGBoost and LASSO when dealing with categorical variables with more than two levels.
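One common way to turn a multi-level categorical variable into binary indicator columns in base R is model.matrix(). The sketch below uses a made-up data frame (toy, stage, and male are hypothetical names); it illustrates the idea only and is not necessarily how datlist was prepared.

```r
# Hypothetical toy data with a three-level factor (not taken from datlist)
toy = data.frame(
  stage = factor(c("I", "II", "III", "II", "I")),
  male  = c(1, 0, 1, 1, 0)
)

# model.matrix() expands each factor into 0/1 indicator columns.
# With the default treatment contrasts, the first level ("I") is the
# reference, so only indicators for "II" and "III" are created.
# The first column is the intercept, which is dropped here.
binary_toy = model.matrix(~ stage + male, data = toy)[, -1, drop = FALSE]

print(binary_toy)
```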
Following these procedures, an object named datlist was generated and is included in csmpv, accessible straightforwardly after installing and loading csmpv, as demonstrated below.
library(csmpv)
data("datlist", package = "csmpv")
tdat = datlist$training
dim(tdat)
## [1] 216 22
vdat = datlist$validation
dim(vdat)
## [1] 217 22
Subsequently, we defined three outcome variables and their respective independent variables.
To illustrate different types of outcome variables, we’ll define examples for binary, continuous, and time-to-event categories:
- Binary: DZsig (dark zone signature)
- Continuous: Age
- Time-to-event: FFP (freedom from progression)
For binary and time-to-event variables, independent variables are defined as:
Xvars = c("highIPI","B.Symptoms","MYC.IHC","BCL2.IHC", "CD10.IHC","BCL6.IHC",
"MUM1.IHC","Male","AgeOver60", "stage3_4","PS1","LDH.Ratio1",
"Extranodal1","Bulk10cm","HANS_GCB", "DTI")
For the continuous variable, the corresponding independent variables align with those above, excluding AgeOver60 due to its correlation with the outcome variable Age:
AgeXvars = setdiff(Xvars, "AgeOver60")
To ensure reproducibility and minimize variability from random number generation, we set a fixed random seed:
set.seed(12345)
Users can define their own temporary directory to save all results. If not, tempdir() can be used to get the system’s temporary directory.
temp_dir = tempdir()
# setwd(temp_dir)  # this would affect only the current chunk, not the rest of the document
knitr::opts_knit$set(root.dir = temp_dir)
Whether this procedure is labeled as biomarker confirmation, validation, or testing, the fundamental aspect involves regular regression analyses on both single and multiple variables across three distinct outcome categories: binary, continuous, and time-to-event. In this context, our objective is to assess the presence of an association between outcomes and a set of independent variables. It’s important to note that this differs from model validation, which will be covered subsequently.
To confirm biomarkers for binary outcomes:
bconfirm = confirmVars(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "confirmBinary")
The confirmVars function acts as a wrapper, invoking various functions to perform regression analysis based on different outcome types. By default, the outcome type is binary, requiring no explicit specification when handling binary outcomes.
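To make the idea of single-variable versus multivariable regression concrete, here is a minimal base R sketch on simulated data. The data frame toy and the variables x1, x2, x3, and y are hypothetical; this only mirrors the kind of analysis that confirmVars wraps for a binary outcome and is not its actual implementation.

```r
set.seed(1)
n = 200

# Simulated predictors and a binary outcome (hypothetical data)
toy = data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.3),
  x3 = rnorm(n)
)
toy$y = rbinom(n, 1, plogis(-1 + 1.2 * toy$x1 + 0.8 * toy$x3))
vars = c("x1", "x2", "x3")

# Single-variable analysis: one logistic regression per variable
single = do.call(rbind, lapply(vars, function(v) {
  fit = glm(reformulate(v, response = "y"), family = binomial, data = toy)
  s = summary(fit)$coefficients[v, ]
  data.frame(variable = v,
             OR = exp(s[["Estimate"]]),
             p = s[["Pr(>|z|)"]])
}))

# Multivariable analysis: all variables in one logistic regression
multi = glm(reformulate(vars, response = "y"), family = binomial, data = toy)

print(single)
print(coef(multi))
```

Comparing the two tables shows whether a variable's association with the outcome persists after adjusting for the others.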
Upon execution, the bconfirm object comprises a multivariable model and a list of two forest plots. The first plot consolidates individual forest plots for each single variable, while the second represents the forest plot for the multivariable model. These outputs are locally saved, along with a combined table containing models for each single variable.
print(bconfirm$fit)
##
## Call: glm(formula = f1, family = "binomial", data = datain)
##
## Coefficients:
## (Intercept) highIPI B.Symptoms MYC.IHC BCL2.IHC CD10.IHC
## -27.89875 -2.61008 -1.69697 3.72794 1.26593 4.24328
## BCL6.IHC MUM1.IHC Male AgeOver60 stage3_4 PS1
## 1.61152 -3.03434 1.88499 0.82520 1.74159 3.78197
## LDH.Ratio1 Extranodal1 Bulk10cm HANS_GCB DTI
## 2.24558 2.05693 -1.03546 16.23813 -0.02331
##
## Degrees of Freedom: 215 Total (i.e. Null); 199 Residual
## Null Deviance: 154.8
## Residual Deviance: 63.14 AIC: 97.14
bconfirm$allplot[[2]]
For instance, the initial output showcases a multivariable model. In the subsequent section, single-variable models are presented with associated forest plots, all amalgamated into a comprehensive display.
To confirm biomarkers for continuous outcomes:
cconfirm = confirmVars(data = tdat, biomks = AgeXvars, Y = "Age",
outcomeType = "continuous",
outfile = "confirmContinuous")
The same confirmVars function is called; however, this time, we specify the outcome type as continuous.
In a similar fashion, the cconfirm object comprises two elements: a multivariable model and a list of two forest plots. The first plot consolidates all forest plots for each single variable, while the second represents the forest plot for the multivariable model. All these outputs are saved locally, accompanied by a combined table containing models for each single variable.
Below, you’ll find the multivariable model and a combined forest plot for each variable with raw p-values:
print(cconfirm$fit)
##
## Call: glm(formula = f1, data = datain)
##
## Coefficients:
## (Intercept) highIPI B.Symptoms MYC.IHC BCL2.IHC CD10.IHC
## 64.10855 9.52589 -3.74092 1.95808 1.58400 -0.35961
## BCL6.IHC MUM1.IHC Male stage3_4 PS1 LDH.Ratio1
## 1.78772 1.32447 -1.51572 -3.87195 3.31566 -1.03366
## Extranodal1 Bulk10cm HANS_GCB DTI
## -7.65469 -1.86334 -0.88036 -0.03459
##
## Degrees of Freedom: 215 Total (i.e. Null); 200 Residual
## Null Deviance: 41610
## Residual Deviance: 35760 AIC: 1751
cconfirm$allplot[[2]]
To confirm biomarkers for time-to-event outcomes:
tconfirm = confirmVars(data = tdat, biomks = Xvars,
time = "FFP..Years.", event = "Code.FFP",
outcomeType = "time-to-event",
outfile = "confirmSurvival")
The confirmVars function is called once again, this time with the outcome type specified as time-to-event, necessitating the inclusion of both time and event variable names.
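For readers less familiar with time-to-event models, the following sketch fits a multivariable Cox model directly with survival::coxph, the function that appears in the printed call in this section. The data are simulated and all names (toy, x1, x2, time, event) are hypothetical; the survival package is assumed to be installed.

```r
library(survival)

set.seed(2)
n = 150

# Simulated covariates, event times, and independent censoring times
toy = data.frame(x1 = rbinom(n, 1, 0.5), x2 = rnorm(n))
event_time  = rexp(n, rate = 0.1 * exp(0.7 * toy$x1))
censor_time = rexp(n, rate = 0.05)

# Observed time is the earlier of the two; event = 1 if the event was observed
toy$time  = pmin(event_time, censor_time)
toy$event = as.integer(event_time <= censor_time)

# Both the time and the event indicator are needed to define the outcome
fit = coxph(Surv(time, event) ~ x1 + x2, data = toy)

print(summary(fit)$coefficients)
```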
Similarly, two PDF and two table files are saved, accompanied by locally stored Kaplan-Meier plots. A single Kaplan-Meier plot is generated for each independent categorical variable with no more than four levels. In this example dataset, 15 Kaplan-Meier plots are produced.
The tconfirm object continues to store two elements: a multivariable model and a list of two forest plots. Below, you’ll find the multivariable model and a combined forest plot for each variable, including raw p-values:
print(tconfirm$fit)
## Call:
## survival::coxph(formula = as.formula(paste(survY, survX, sep = " ~ ")),
## data = datain)
##
## coef exp(coef) se(coef) z p
## highIPI -0.603018 0.547158 0.446953 -1.349 0.177281
## B.Symptoms 0.264292 1.302508 0.256034 1.032 0.301954
## MYC.IHC 0.321325 1.378954 0.240911 1.334 0.182273
## BCL2.IHC 0.580115 1.786243 0.308232 1.882 0.059826
## CD10.IHC -0.368733 0.691610 0.388518 -0.949 0.342583
## BCL6.IHC -0.061321 0.940521 0.304312 -0.202 0.840302
## MUM1.IHC 0.267188 1.306286 0.322775 0.828 0.407793
## Male 0.522793 1.686733 0.240032 2.178 0.029405
## AgeOver60 0.419517 1.521226 0.289702 1.448 0.147590
## stage3_4 1.032559 2.808244 0.309732 3.334 0.000857
## PS1 0.840254 2.316956 0.304235 2.762 0.005747
## LDH.Ratio1 1.387365 4.004285 0.338434 4.099 4.14e-05
## Extranodal1 0.191007 1.210468 0.305195 0.626 0.531411
## Bulk10cm -0.323524 0.723595 0.276848 -1.169 0.242567
## HANS_GCB 0.255505 1.291113 0.490277 0.521 0.602267
## DTI -0.003866 0.996142 0.007651 -0.505 0.613405
##
## Likelihood ratio test=81.78 on 16 df, p=7.953e-11
## n= 216, number of events= 85
tconfirm$allplot[[2]]
This section details the process of biomarker discovery through variable selection, utilizing three distinct methods: LASSO2, LASSO2plus, and LASSO_plus.
The variable selection process using our customized LASSO algorithm, LASSO2, employs a tailored approach distinct from the conventional LASSO (Least Absolute Shrinkage and Selection Operator) algorithm. This adjustment aims to address the randomness introduced by random splits and to guarantee the inclusion of at least two variables.
This process utilizes glmnet::cv.glmnet for cross-validation-based variable selection. It determines the largest lambda value where the error remains within 1 standard error of the minimum. However, as indicated in the cv.glmnet’s help file, variability in results can arise due to the randomness inherent in cross-validation splits.
To counteract this variability, our new function, LASSO2, conducts 10 runs of 10-fold cv.glmnet. The resulting average lambda value from these iterations becomes the final lambda used for regularization regression on the complete dataset.
It’s important to note that since LASSO2 selects the largest lambda within 1 standard error of the minimum, following the default behavior of cv.glmnet, it may yield a smaller number of selected variables compared to the lambda that minimizes the mean cross-validated error. This more conservative approach could potentially result in only one or no selected variables.
To address this potential issue, when LASSO2 identifies only one or no variables, it defaults to selecting the first lambda that results in at least two variables being chosen from the full dataset. This strategy ensures the inclusion of at least two variables, striking a balance between model complexity and the necessity for meaningful variable inclusion.
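The procedure described above can be sketched with glmnet as follows. This is an illustration under stated assumptions, not the csmpv source code: the data are simulated, the object names (lambdas, lambda_avg, path, selected) are hypothetical, and the fallback is implemented as "first lambda on the path with at least two non-zero coefficients", which is one reading of the description.

```r
library(glmnet)

set.seed(12345)
n = 200
p = 10

# Simulated predictors and a binary outcome driven by v1 and v2
x = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y = rbinom(n, 1, plogis(1.5 * x[, 1] - 1.2 * x[, 2]))

# 10 runs of 10-fold cross-validation; keep lambda.1se from each run
lambdas = replicate(10,
  cv.glmnet(x, y, family = "binomial", nfolds = 10)$lambda.1se)

# The average lambda is used for the final fit on the complete data
lambda_avg = mean(lambdas)

# Fit the regularization path once on the full data
path = glmnet(x, y, family = "binomial")

# Non-zero coefficients at the averaged lambda (intercept dropped)
b = as.matrix(coef(path, s = lambda_avg))[-1, 1]
selected = b[b != 0]

# Fallback: if fewer than two variables survive, move along the path to
# the first lambda at which at least two coefficients are non-zero
if (length(selected) < 2) {
  idx = which(path$df >= 2)[1]
  b = as.matrix(coef(path, s = path$lambda[idx]))[-1, 1]
  selected = b[b != 0]
}

print(lambda_avg)
print(selected)
```

Averaging lambda.1se over repeated runs reduces the dependence of the selected variables on any single random fold assignment.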
For binary outcomes, no additional specification is needed for outcomeType, as it is the default value.
bl = LASSO2(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binaryLASSO2")
One figure and one text file are saved locally.
bl$coefs
## MYC.IHC CD10.IHC MUM1.IHC
## 0.8923274 1.5137059 -0.7274479
This displays the selected variables and their corresponding shrunken coefficients.
For variable selection involving a continuous outcome variable, specify outcomeType = "continuous":
cl = LASSO2(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuousLASSO2")
Similar to before, one figure and one text file are saved locally.
cl$coefs
## highIPI PS1
## 0.02137912 1.07621511
This shows the selected variables and their associated shrunken coefficients for the continuous outcome.
For variable selection with a time-to-event outcome, set outcomeType = "time-to-event", and ensure you provide the variable names for both time and event:
tl = LASSO2(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survivalLASSO2")
In a similar fashion, one figure and one text file are saved locally.
tl$coefs
## highIPI stage3_4 PS1 LDH.Ratio1
## 0.16770489 0.04166427 0.02757391 0.43226052
This shows the selected variables and their associated shrunken coefficients for the time-to-event outcome.
LASSO2plus is an innovative approach that combines LASSO2, a modified LASSO algorithm, with other techniques. It selects variables in three steps:
- applying LASSO2, which is slightly different from the standard LASSO as discussed in Section 3.1;
- fitting a simple regression model for each variable and adjusting the p-values using the Benjamini-Hochberg method (1995);
- performing a stepwise variable selection procedure on the combined list of variables from the previous steps.

Therefore, LASSO2plus incorporates both the regularization and the significance testing aspects of variable selection.
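The second and third steps can be sketched in base R as shown below. Everything here is illustrative: the data are simulated, lasso_vars is a hypothetical stand-in for the output of the first step, and the 0.05 cutoff on adjusted p-values and the backward direction of the stepwise search are assumptions rather than documented csmpv settings.

```r
set.seed(3)
n = 200

# Simulated continuous outcome that depends on v1 and v2 only
toy = data.frame(matrix(rnorm(n * 5), n, 5))
names(toy) = paste0("v", 1:5)
toy$y = 2 * toy$v1 + 1 * toy$v2 + rnorm(n)
vars = paste0("v", 1:5)

# Step 1 (stand-in): pretend the LASSO step returned these variables
lasso_vars = c("v1", "v3")

# Step 2: one simple regression per variable, then BH adjustment
raw_p = sapply(vars, function(v) {
  fit = lm(reformulate(v, response = "y"), data = toy)
  summary(fit)$coefficients[v, "Pr(>|t|)"]
})
adj_p = p.adjust(raw_p, method = "BH")
sig_vars = names(adj_p)[adj_p < 0.05]  # assumed cutoff

# Step 3: stepwise selection on the combined list from steps 1 and 2
candidates = union(lasso_vars, sig_vars)
full = lm(reformulate(candidates, response = "y"), data = toy)
final = step(full, direction = "backward", trace = 0)

print(round(adj_p, 4))
print(coef(final))
```

Taking the union before the stepwise search lets a variable enter the candidate list either through regularization or through its individual significance.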
All parameter settings for LASSO2plus are the same as for LASSO2.
For binary outcomes, no additional specification is needed for outcomeType, as it is the default value.
b2fit = LASSO2plus(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binaryLASSO2plus")
## Start: AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
##
## Df Deviance AIC
## <none> 93.72 101.72
## - MUM1.IHC 1 107.09 113.09
## - MYC.IHC 1 115.41 121.41
## - CD10.IHC 1 119.33 125.33
## file saved to binaryLASSO2plusLASSO2plus_varaibleSelection.pdf
b2fit$fit$coefficients
## (Intercept) MYC.IHC CD10.IHC MUM1.IHC
## -4.778565 2.503030 3.188996 -2.553409
The coefficients are shown above. Two figures and two tables are stored locally.
For variable selection involving a continuous outcome variable, specify outcomeType = "continuous":
c2fit = LASSO2plus(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuousLASSO2plus")
## Start: AIC=1745.15
## Age ~ highIPI + PS1
##
## Df Deviance AIC
## - highIPI 1 39626 1744.8
## <none> 39331 1745.2
## - PS1 1 39848 1746.0
##
## Step: AIC=1744.76
## Age ~ PS1
##
## Df Deviance AIC
## <none> 39626 1744.8
## - PS1 1 41606 1753.3
## file saved to continuousLASSO2plusLASSO2plus_varaibleSelection.pdf
c2fit$fit$coefficients
## (Intercept) highIPI PS1
## 62.004372 3.134816 4.311964
Again, the coefficients are shown above, and two figures and two tables are stored locally.
For variable selection with a time-to-event outcome, set outcomeType = "time-to-event", and ensure you provide the variable names for both time and event:
t2fit = LASSO2plus(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survivalLASSO2plus")
## Start: AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC +
## MUM1.IHC
##
## Df AIC
## - HANS_GCB 1 813.18
## - B.Symptoms 1 813.36
## - highIPI 1 813.41
## - DTI 1 813.73
## - CD10.IHC 1 813.89
## - MUM1.IHC 1 814.17
## <none> 815.14
## - PS1 1 818.22
## - stage3_4 1 822.90
## - LDH.Ratio1 1 824.45
##
## Step: AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - B.Symptoms 1 811.39
## - highIPI 1 811.43
## - DTI 1 811.78
## - CD10.IHC 1 812.30
## - MUM1.IHC 1 812.38
## <none> 813.18
## - PS1 1 816.24
## - stage3_4 1 821.44
## - LDH.Ratio1 1 822.45
##
## Step: AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - highIPI 1 809.66
## - DTI 1 810.17
## - MUM1.IHC 1 810.59
## - CD10.IHC 1 810.62
## <none> 811.39
## - PS1 1 815.20
## - stage3_4 1 819.75
## - LDH.Ratio1 1 820.80
##
## Step: AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - DTI 1 808.46
## - MUM1.IHC 1 808.86
## - CD10.IHC 1 808.94
## <none> 809.66
## - PS1 1 814.13
## - stage3_4 1 819.20
## - LDH.Ratio1 1 820.45
##
## Step: AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## CD10.IHC + MUM1.IHC
##
## Df AIC
## - CD10.IHC 1 807.66
## - MUM1.IHC 1 807.79
## <none> 808.46
## - PS1 1 813.12
## - stage3_4 1 818.06
## - LDH.Ratio1 1 824.55
##
## Step: AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## MUM1.IHC
##
## Df AIC
## <none> 807.66
## - MUM1.IHC 1 808.22
## - PS1 1 813.79
## - stage3_4 1 817.87
## - LDH.Ratio1 1 824.33
## file saved to survivalLASSO2plusLASSO2plus_varaibleSelection.pdf
t2fit$fit$coefficients
## stage3_4 PS1 LDH.Ratio1 MUM1.IHC
## 0.8231937 0.6543237 1.0529572 0.3508003
Similar to the other types of outcomes, the coefficients are displayed above, and two figures along with two tables are stored locally.
LASSO_plus is another innovative approach that builds on the LASSO algorithm and adds more techniques. However, it differs from LASSO2plus, described in Section 3.2, in its initial step. It selects variables in three steps:
In LASSO_plus, all parameters from LASSO2 and LASSO2plus are retained, with the addition of the unique parameter topN. Please be aware that the topN parameter in LASSO_plus serves as a guide for variable selection.
Setting the topN parameter to 5 aims to include the top 5 variables in the final model. However, it’s important to note that the resulting model may not always precisely consist of 5 variables. The LASSO_plus method’s selection criteria involve considering variables that appear at least twice across different lambda values. Consequently, even when using the same topN value for different datasets, the number of selected variables may vary.
For binary outcomes, outcome type specification is unnecessary, as it defaults to this type.
bfit = LASSO_plus(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binaryLASSO_plus", topN = 5)
## Start: AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
##
## Df Deviance AIC
## <none> 93.72 101.72
## - MUM1.IHC 1 107.09 113.09
## - MYC.IHC 1 115.41 121.41
## - CD10.IHC 1 119.33 125.33
## file saved to binaryLASSO_plus_LASSO_plus_varaibleSelection.pdf
bfit$fit$coefficients
## (Intercept) MYC.IHC CD10.IHC MUM1.IHC
## -4.778565 2.503030 3.188996 -2.553409
The identified variables and their corresponding coefficients are displayed above. A figure and a table are locally stored.
For continuous outcome variables, ensure you specify outcomeType = "continuous":
cfit = LASSO_plus(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuousLASSO_plus", topN = 5)
## Start: AIC=1738.58
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + PS1 + Extranodal1
##
## Df Deviance AIC
## - PS1 1 36851 1737.1
## - Male 1 36881 1737.3
## - MUM1.IHC 1 37040 1738.2
## <none> 36766 1738.6
## - stage3_4 1 37491 1740.8
## - Extranodal1 1 37999 1743.7
## - highIPI 1 38311 1745.5
##
## Step: AIC=1737.09
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + Extranodal1
##
## Df Deviance AIC
## - Male 1 36975 1735.8
## - MUM1.IHC 1 37160 1736.9
## <none> 36851 1737.1
## - stage3_4 1 37696 1740.0
## - Extranodal1 1 38198 1742.8
## - highIPI 1 40369 1754.8
##
## Step: AIC=1735.81
## Age ~ highIPI + MUM1.IHC + stage3_4 + Extranodal1
##
## Df Deviance AIC
## <none> 36975 1735.8
## - MUM1.IHC 1 37336 1735.9
## - stage3_4 1 37902 1739.2
## - Extranodal1 1 38335 1741.6
## - highIPI 1 40706 1754.6
## file saved to continuousLASSO_plus_LASSO_plus_varaibleSelection.pdf
cfit$fit$coefficients
## (Intercept) highIPI MUM1.IHC stage3_4 Extranodal1
## 63.259326 10.360268 2.601408 -4.854750 -6.927381
The identified variables and their corresponding coefficients are displayed above. A figure and a table are locally stored.
When dealing with time-to-event outcomes, set outcomeType = "time-to-event", and ensure you provide the names of variables for both time and event:
tfit = LASSO_plus(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survivalLASSO_plus", topN = 5)
## Start: AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC +
## MUM1.IHC
##
## Df AIC
## - HANS_GCB 1 813.18
## - B.Symptoms 1 813.36
## - highIPI 1 813.41
## - DTI 1 813.73
## - CD10.IHC 1 813.89
## - MUM1.IHC 1 814.17
## <none> 815.14
## - PS1 1 818.22
## - stage3_4 1 822.90
## - LDH.Ratio1 1 824.45
##
## Step: AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - B.Symptoms 1 811.39
## - highIPI 1 811.43
## - DTI 1 811.78
## - CD10.IHC 1 812.30
## - MUM1.IHC 1 812.38
## <none> 813.18
## - PS1 1 816.24
## - stage3_4 1 821.44
## - LDH.Ratio1 1 822.45
##
## Step: AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - highIPI 1 809.66
## - DTI 1 810.17
## - MUM1.IHC 1 810.59
## - CD10.IHC 1 810.62
## <none> 811.39
## - PS1 1 815.20
## - stage3_4 1 819.75
## - LDH.Ratio1 1 820.80
##
## Step: AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - DTI 1 808.46
## - MUM1.IHC 1 808.86
## - CD10.IHC 1 808.94
## <none> 809.66
## - PS1 1 814.13
## - stage3_4 1 819.20
## - LDH.Ratio1 1 820.45
##
## Step: AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## CD10.IHC + MUM1.IHC
##
## Df AIC
## - CD10.IHC 1 807.66
## - MUM1.IHC 1 807.79
## <none> 808.46
## - PS1 1 813.12
## - stage3_4 1 818.06
## - LDH.Ratio1 1 824.55
##
## Step: AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## MUM1.IHC
##
## Df AIC
## <none> 807.66
## - MUM1.IHC 1 808.22
## - PS1 1 813.79
## - stage3_4 1 817.87
## - LDH.Ratio1 1 824.33
## file saved to survivalLASSO_plus_LASSO_plus_varaibleSelection.pdf
tfit$fit$coefficients
## stage3_4 PS1 LDH.Ratio1 MUM1.IHC
## 0.8231937 0.6543237 1.0529572 0.3508003
Displayed above are the identified variables and their corresponding coefficients. A figure and a table are locally stored.
Predictive model development is a crucial aspect of the csmpv R package, involving eight distinct approaches:
Directly use the shrunken coefficients from the LASSO2 output, as shown in Section 3.1.
The approach involves utilizing the variables selected by LASSO2 to conduct a standard regression model. Rather than relying on the shrunken coefficients obtained from LASSO2, this method opts for a conventional regression analysis with the chosen variables.
While it’s feasible to manually extract variables from a LASSO2 object for regular regression based on the outcome type, the LASSO2_reg function is introduced to simplify this process for coding convenience and efficiency.
All parameter settings are the same as for LASSO2.
blr = LASSO2_reg(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binaryLASSO2_reg")
blr$fit$coefficients
## (Intercept) MYC.IHC CD10.IHC MUM1.IHC PS1
## -5.814411 2.918828 3.593155 -2.798703 1.583795
clr = LASSO2_reg(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuousLASSO2_reg")
clr$fit$coefficients
## (Intercept) highIPI PS1
## 62.004372 3.134816 4.311964
tlr = LASSO2_reg(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survivalLASSO2_reg")
tlr$fit$coefficients
## highIPI LDH.Ratio1
## 0.6925077 1.0186194
The selected variables and their coefficients are shown above. For each outcome type, three figure files, one text file, and two tables are saved locally. Additionally, for time-to-event outcome variables, Kaplan-Meier plots are generated and saved locally. A single Kaplan-Meier plot is generated for each independent categorical variable with no more than four levels. In this example dataset, 15 Kaplan-Meier plots are generated.
Directly use coefficients from the LASSO_plus output, as shown in Section 3.3.
Directly use coefficients from the LASSO2plus output, as described in Section 3.2.
XGBoost is a powerful machine learning algorithm recognized for its boosting capabilities. The XGBtraining function within the csmpv package leverages the strengths of XGBoost for model training. As XGBoost doesn’t inherently feature a dedicated variable selection procedure, you’ll need to manually define or select a set of variables using other methods. Once you have a predefined set of variables for constructing an XGBoost model, the XGBtraining function in the csmpv package streamlines this process.
bxfit = XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binary_XGBoost")
## [1] train-logloss:0.511255
## [2] train-logloss:0.408615
## [3] train-logloss:0.343507
## [4] train-logloss:0.298065
## [5] train-logloss:0.264358
head(bxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 0.2255560 0.1195362 0.7725156 0.1140642 0.1708665 0.1634810
The output from the above code consists of training log-loss values for specific iterations of the model. Log-loss, a widely used loss function in classification tasks, assesses the alignment between the model’s predicted probabilities and the actual class labels. By default, XGBtraining runs for 5 iterations, and the output is saved locally as a text file.
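For reference, binary log-loss can be computed by hand as in the short base R sketch below. The function name log_loss and the example vectors are hypothetical, and the clipping constant eps is an assumption added to avoid log(0); this is the textbook formula rather than code taken from XGBoost.

```r
# Binary log-loss: average negative log-probability assigned to the true class
log_loss = function(y, p, eps = 1e-15) {
  p = pmin(pmax(p, eps), 1 - eps)  # keep probabilities away from 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Hypothetical labels and predicted probabilities of the positive class
y = c(1, 0, 1, 0)
p = c(0.9, 0.2, 0.6, 0.4)

print(log_loss(y, p))
```

Confident correct predictions contribute values near zero, while confident wrong predictions are penalized heavily, which is why the metric drops as the predicted probabilities move toward the observed labels.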
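The bxfit object contains four components: the XGBoost model object, the XGBoost scores for all samples in tdat (for binary outcomes, the predicted probability of the positive class), the observed outcome, and the outcome type.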
cxfit = XGBtraining(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuous_XGBoost")
## [1] train-rmse:47.112278
## [2] train-rmse:34.492776
## [3] train-rmse:26.071191
## [4] train-rmse:20.554692
## [5] train-rmse:17.100220
head(cxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 55.28930 51.14508 55.28930 52.75171 52.75171 54.72178
The reported values, train-rmse, are the root mean squared error (RMSE) computed on the training set at each iteration of the XGBoost model. RMSE is the square root of the mean squared difference between predicted and actual values, so lower values indicate a closer fit to the training data. The output is saved locally as a text file.
These metrics illustrate the iterative nature of training the XGBoost model, where each iteration aims to minimize the RMSE on the training set. The diminishing RMSE values signify the model’s learning process, showcasing its progressive improvement in predictive accuracy during training.
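As a small worked example of the metric, RMSE can be computed in base R as follows. The function name rmse and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Root mean squared error between observed and predicted values
rmse = function(obs, pred) sqrt(mean((obs - pred)^2))

# Hypothetical observed and predicted ages
obs  = c(60, 72, 55, 48)
pred = c(58, 70, 59, 50)

# Differences are 2, 2, -4, -2; squared: 4, 4, 16, 4; mean: 7
print(rmse(obs, pred))  # square root of 7
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.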
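Within cxfit, there are four elements: the XGBoost model object, the XGBoost scores for all samples in tdat, the observed outcome, and the outcome type.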
txfit = XGBtraining(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survival_XGBoost")
## [1] train-cox-nloglik:4.859160
## [2] train-cox-nloglik:4.736148
## [3] train-cox-nloglik:4.648801
## [4] train-cox-nloglik:4.563267
## [5] train-cox-nloglik:4.507056
head(txfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 1.1142874 0.2993234 1.8398050 0.2993234 0.1824380 0.4074278
The negative log-likelihood displayed in the output is the standard loss function for Cox proportional hazards models in survival analysis. It measures how poorly the predicted risk scores agree with the observed survival times and events in the training data, so lower values indicate a better fit of the model to the training data. The resulting output is saved locally as a text file.
By monitoring the negative log-likelihood throughout the training process, you can evaluate the model’s learning progress and its convergence toward an optimal solution. Ideally, a decreasing trend in the negative log-likelihood indicates the model’s improved fit to the training data across iterations.
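The quantity being minimized can be illustrated with a hand-written negative log partial likelihood for the Cox model. The sketch below assumes there are no tied event times, uses hypothetical names (neg_log_partial_lik, time, event, score), and returns the unscaled sum; the value that XGBoost prints may be scaled differently.

```r
# Negative log partial likelihood of a Cox model, assuming no tied times.
# score is the linear risk score (higher means higher hazard).
neg_log_partial_lik = function(time, event, score) {
  ord = order(time)
  event = event[ord]
  score = score[ord]
  # After sorting by time, the risk set of subject i is subjects i..n,
  # so the risk-set sums are reversed cumulative sums of exp(score)
  risk_sum = rev(cumsum(rev(exp(score))))
  -sum((score - log(risk_sum))[event == 1])
}

# Three subjects, all with observed events and identical scores:
# risk-set sizes are 3, 2, 1, so the result is log(3) + log(2) + log(1)
print(neg_log_partial_lik(c(1, 2, 3), c(1, 1, 1), c(0, 0, 0)))

# Assigning the higher score to the subject who fails first lowers the loss
print(neg_log_partial_lik(c(1, 2), c(1, 1), c(1, 0)))
print(neg_log_partial_lik(c(1, 2), c(1, 1), c(0, 1)))
```

The loss depends on the ordering of risk scores relative to the ordering of event times, which is why it falls as the model learns to rank early failures as higher risk.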
In txfit, there are six components:
Combine LASSO2 variable selection with XGBoost modeling using the LASSO2_XGBtraining function, which selects variables via LASSO2 but constructs an XGBoost model without relying on shrunk coefficients. The resulting objects maintain the output format of the XGBtraining function.
blxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binary_LASSO2_XGBoost")
## [1] train-logloss:0.511725
## [2] train-logloss:0.410850
## [3] train-logloss:0.348831
## [4] train-logloss:0.308959
## [5] train-logloss:0.283560
head(blxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558
clxfit = LASSO2_XGBtraining(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuous_LASSO2_XGBoost")
## [1] train-rmse:47.112278
## [2] train-rmse:34.492776
## [3] train-rmse:26.089994
## [4] train-rmse:20.707653
## [5] train-rmse:17.442132
head(clxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 56.47782 52.21841 56.47782 52.21841 52.21841 53.58232
tlxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.",event = "Code.FFP",
outfile = "survival_LASSO2_XGBoost")
## [1] train-cox-nloglik:4.940429
## [2] train-cox-nloglik:4.870696
## [3] train-cox-nloglik:4.833368
## [4] train-cox-nloglik:4.811892
## [5] train-cox-nloglik:4.803468
head(tlxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 0.9911962 0.2521406 0.9911962 0.2521406 0.2521406 0.9911962
To combine LASSO_plus variable selection with XGBoost modeling, the LASSO_plus_XGBtraining R function is employed. This approach selects variables using LASSO_plus but does not utilize the coefficients from LASSO_plus to construct the model; instead, it generates an XGBoost model.
The output and format of the returned objects are identical to those of the XGBtraining function. Furthermore, for each outcome type, one figure, one text, and one table file are saved locally.
blpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
topN = 5, outfile = "binary_LASSO_plus_XGBoost")
## Start: AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
##
## Df Deviance AIC
## <none> 93.72 101.72
## - MUM1.IHC 1 107.09 113.09
## - MYC.IHC 1 115.41 121.41
## - CD10.IHC 1 119.33 125.33
## file saved to binary_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1] train-logloss:0.511725
## [2] train-logloss:0.410850
## [3] train-logloss:0.348831
## [4] train-logloss:0.308959
## [5] train-logloss:0.283560
head(blpxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558
Most of this output comes from the LASSO_plus variable selection step; the last five lines come from XGBoost. Each XGBoost line reports the training log-loss after one boosting iteration. Log-loss, a widely used loss function for classification, measures how well the predicted probabilities agree with the observed class labels, with lower values indicating better agreement. The default number of iterations in XGBtraining is 5.
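The stepwise table printed before the XGBoost lines can be checked by hand. The sketch below assumes that the table comes from a step()-style backward elimination on a logistic regression with a 0/1 outcome, in which case AIC equals the residual deviance plus twice the number of estimated coefficients. That reading is our assumption, although it reproduces the printed values. The object names are illustrative only.

```r
# Values copied from the binary LASSO_plus output above.
dev_full <- 93.72        # deviance of DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
k_full   <- 4            # intercept + 3 predictors (assumed parameter count)
aic_full <- dev_full + 2 * k_full
aic_full                 # compare with "Start: AIC=101.72"

dev_drop <- 107.09       # deviance after dropping MUM1.IHC
k_drop   <- 3            # intercept + 2 predictors
aic_drop <- dev_drop + 2 * k_drop
aic_drop                 # compare with the "- MUM1.IHC" row

# Dropping a variable raises the AIC here, so no variable is removed.
aic_drop > aic_full
```

The same relation does not hold in this simple form for the continuous and time-to-event tables, where the AIC is computed from a different likelihood.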
The blpxfit output comprises four items. The first is the XGBoost object, and the second holds the XGBoost scores for all entries in the tdat dataset. XGBoost is a black-box model that does not yield coefficients, but it does provide model scores; for binary outcomes, the score is the predicted probability of the positive class. The remaining two items are the observed outcome and the outcome type.
clpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
topN = 5, outfile = "continuous_LASSO_plus_XGBoost")
## Start: AIC=1738.58
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + PS1 + Extranodal1
##
## Df Deviance AIC
## - PS1 1 36851 1737.1
## - Male 1 36881 1737.3
## - MUM1.IHC 1 37040 1738.2
## <none> 36766 1738.6
## - stage3_4 1 37491 1740.8
## - Extranodal1 1 37999 1743.7
## - highIPI 1 38311 1745.5
##
## Step: AIC=1737.09
## Age ~ highIPI + MUM1.IHC + Male + stage3_4 + Extranodal1
##
## Df Deviance AIC
## - Male 1 36975 1735.8
## - MUM1.IHC 1 37160 1736.9
## <none> 36851 1737.1
## - stage3_4 1 37696 1740.0
## - Extranodal1 1 38198 1742.8
## - highIPI 1 40369 1754.8
##
## Step: AIC=1735.81
## Age ~ highIPI + MUM1.IHC + stage3_4 + Extranodal1
##
## Df Deviance AIC
## <none> 36975 1735.8
## - MUM1.IHC 1 37336 1735.9
## - stage3_4 1 37902 1739.2
## - Extranodal1 1 38335 1741.6
## - highIPI 1 40706 1754.6
## file saved to continuous_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1] train-rmse:47.112278
## [2] train-rmse:34.492776
## [3] train-rmse:26.064544
## [4] train-rmse:20.629215
## [5] train-rmse:17.273109
head(clpxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 54.92717 51.91544 53.46028 51.91544 52.28172 56.04781
As in the binary case, most of the output comes from LASSO_plus, and the last lines come from XGBoost. The train-rmse values are the root mean squared error (RMSE) on the training set after each boosting iteration. RMSE measures the typical size of the difference between predicted and observed values, with lower values indicating a closer fit.
Each boosting iteration aims to reduce the RMSE on the training set, so the declining values show that the model fits the training data progressively better. Because the metric is computed on the training set, it does not by itself measure predictive performance on new data.
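To make the train-rmse metric concrete, here is a minimal base R sketch of the RMSE calculation. The ages and predictions are hypothetical and are not taken from tdat; the function name rmse is ours.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical ages and two sets of predictions.
age_obs    <- c(62, 48, 71, 55)
pred_early <- c(30, 30, 30, 30)   # crude predictions, as in an early iteration
pred_late  <- c(58, 50, 66, 57)   # closer predictions, as in a later iteration

rmse(age_obs, pred_early)
rmse(age_obs, pred_late)          # smaller, because predictions are closer
```

Because RMSE is expressed in the units of the outcome, the values in the log above can be read as a typical error in years of age.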
The clpxfit output includes four components. The first is the XGBoost object, and the second holds the XGBoost scores for all entries in the tdat dataset. As before, XGBoost does not yield coefficients but does provide model scores; for continuous outcomes, the score is the predicted value of the outcome. The remaining two components are the observed outcome and the outcome type.
tlpxfit = LASSO_plus_XGBtraining(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.", event = "Code.FFP",
topN = 5, outfile = "survival_LASSO_plus_XGBoost")
## Start: AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC +
## MUM1.IHC
##
## Df AIC
## - HANS_GCB 1 813.18
## - B.Symptoms 1 813.36
## - highIPI 1 813.41
## - DTI 1 813.73
## - CD10.IHC 1 813.89
## - MUM1.IHC 1 814.17
## <none> 815.14
## - PS1 1 818.22
## - stage3_4 1 822.90
## - LDH.Ratio1 1 824.45
##
## Step: AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - B.Symptoms 1 811.39
## - highIPI 1 811.43
## - DTI 1 811.78
## - CD10.IHC 1 812.30
## - MUM1.IHC 1 812.38
## <none> 813.18
## - PS1 1 816.24
## - stage3_4 1 821.44
## - LDH.Ratio1 1 822.45
##
## Step: AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - highIPI 1 809.66
## - DTI 1 810.17
## - MUM1.IHC 1 810.59
## - CD10.IHC 1 810.62
## <none> 811.39
## - PS1 1 815.20
## - stage3_4 1 819.75
## - LDH.Ratio1 1 820.80
##
## Step: AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - DTI 1 808.46
## - MUM1.IHC 1 808.86
## - CD10.IHC 1 808.94
## <none> 809.66
## - PS1 1 814.13
## - stage3_4 1 819.20
## - LDH.Ratio1 1 820.45
##
## Step: AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## CD10.IHC + MUM1.IHC
##
## Df AIC
## - CD10.IHC 1 807.66
## - MUM1.IHC 1 807.79
## <none> 808.46
## - PS1 1 813.12
## - stage3_4 1 818.06
## - LDH.Ratio1 1 824.55
##
## Step: AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## MUM1.IHC
##
## Df AIC
## <none> 807.66
## - MUM1.IHC 1 808.22
## - PS1 1 813.79
## - stage3_4 1 817.87
## - LDH.Ratio1 1 824.33
## file saved to survival_LASSO_plus_XGBoost_LASSO_plus_varaibleSelection.pdf
## [1] train-cox-nloglik:4.873247
## [2] train-cox-nloglik:4.789556
## [3] train-cox-nloglik:4.743813
## [4] train-cox-nloglik:4.718117
## [5] train-cox-nloglik:4.695957
head(tlpxfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 1.5001757 0.2840731 1.5001757 0.3679212 0.1763741 0.7963076
As in the previous cases, most of the output comes from LASSO_plus, and the last lines come from XGBoost. The train-cox-nloglik values report the negative partial log-likelihood of a Cox proportional hazards model, a standard loss function in survival analysis. It measures how poorly the model's risk scores agree with the observed survival times and events in the training data, so lower values indicate a better fit.
Tracking this value across iterations shows whether the model is still learning from the data. A steadily decreasing trend indicates that the fit to the training data improves with each iteration.
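The quantity behind train-cox-nloglik can be sketched in a few lines of base R. This is a minimal illustration that assumes no tied event times; the function name and the toy data are hypothetical, and the values XGBoost prints may be scaled differently (for example, averaged over samples), so only the direction of change should be compared.

```r
# Negative Cox partial log-likelihood for given risk scores.
# lp is the linear predictor (log hazard ratio) of each subject.
cox_nloglik <- function(time, event, lp) {
  ord   <- order(time, decreasing = TRUE)   # longest follow-up first
  event <- event[ord]
  lp    <- lp[ord]
  # After sorting, cumsum() at position i sums exp(lp) over every subject
  # still at risk at subject i's time (all subjects with time >= time[i]).
  log_risk_set <- log(cumsum(exp(lp)))
  -sum((lp - log_risk_set)[event == 1])
}

# Hypothetical data: three subjects, all with observed events.
time  <- c(1, 2, 3)
event <- c(1, 1, 1)

cox_nloglik(time, event, lp = c(0, 0, 0))   # uninformative scores
cox_nloglik(time, event, lp = c(2, 1, 0))   # higher risk fails earlier: smaller loss
cox_nloglik(time, event, lp = c(0, 1, 2))   # reversed ordering: larger loss
```

Only subjects with an observed event contribute a term, while censored subjects still appear in the risk sets of earlier events.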
The tlpxfit output comprises six components. The first is the XGBoost object, and the second holds the XGBoost scores for all entries in the tdat dataset. As before, XGBoost does not yield coefficients but does provide model scores; for time-to-event outcomes, the score is a risk score. The remaining four components are the baseline hazard table, the observed time, the event, and the outcome type.
To combine LASSO2plus variable selection with XGBoost modeling, use the LASSO2plus_XGBtraining function. It selects variables with LASSO2plus but does not use the LASSO2plus coefficients to construct the model; instead, it fits an XGBoost model on the selected variables.
The content and format of the returned objects mirror those of the XGBtraining function. In addition, for each outcome type, two figures, two text files, and one table are saved locally.
bl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = "binary_LASSO2plus_XGBoost")
## Start: AIC=101.72
## DZsig ~ MYC.IHC + CD10.IHC + MUM1.IHC
##
## Df Deviance AIC
## <none> 93.72 101.72
## - MUM1.IHC 1 107.09 113.09
## - MYC.IHC 1 115.41 121.41
## - CD10.IHC 1 119.33 125.33
## file saved to binary_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1] train-logloss:0.511725
## [2] train-logloss:0.410850
## [3] train-logloss:0.348831
## [4] train-logloss:0.308959
## [5] train-logloss:0.283560
head(bl2xfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 0.2034558 0.1193058 0.6517244 0.1193058 0.2034558 0.2034558
Most of the output comes from LASSO2plus, and the last lines come from XGBoost. In the XGBoost output, each line reports the training log-loss after one boosting iteration. Log-loss, a widely used classification loss function, measures how well the predicted probabilities agree with the observed class labels. By default, XGBtraining runs for 5 iterations.
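For readers who want to see what the train-logloss numbers measure, the following base R sketch computes the mean binary log-loss. The labels and probabilities are hypothetical, and the function name log_loss is ours.

```r
# Mean binary log-loss: -mean(y * log(p) + (1 - y) * log(1 - p)).
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # keep log() away from 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Hypothetical labels and two sets of predicted probabilities.
y_obs  <- c(1, 0, 1, 0)
p_good <- c(0.9, 0.1, 0.8, 0.2)   # confident and correct
p_poor <- c(0.6, 0.4, 0.5, 0.5)   # barely better than a coin flip

log_loss(y_obs, p_good)
log_loss(y_obs, p_poor)           # larger, because probabilities are less informative
```

A model that always predicts 0.5 has a log-loss of log(2), about 0.693, which gives a reference point for reading the values in the training log.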
The bl2xfit output comprises four components: the first being the XGBoost object, followed by the XGBoost scores for all entries in the tdat dataset. Notably, XGBoost, being a black box model, doesn’t yield coefficients but provides model scores. For binary outcomes, these scores represent the probability of the positive class. The remaining two items include the observed outcome and the outcome type.
cl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = AgeXvars,
outcomeType = "continuous", Y = "Age",
outfile = "continuous_LASSO2plus_XGBoost")
## Start: AIC=1745.15
## Age ~ highIPI + PS1
##
## Df Deviance AIC
## - highIPI 1 39626 1744.8
## <none> 39331 1745.2
## - PS1 1 39848 1746.0
##
## Step: AIC=1744.76
## Age ~ PS1
##
## Df Deviance AIC
## <none> 39626 1744.8
## - PS1 1 41606 1753.3
## file saved to continuous_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1] train-rmse:47.112278
## [2] train-rmse:34.492776
## [3] train-rmse:26.089994
## [4] train-rmse:20.707653
## [5] train-rmse:17.442132
head(cl2xfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 56.47782 52.21841 56.47782 52.21841 52.21841 53.58232
As in the previous case, most of the output comes from LASSO2plus, and the last lines come from XGBoost. In the XGBoost output, the train-rmse values are the root mean squared error (RMSE) on the training set after each iteration. RMSE measures the typical size of the difference between predicted and observed values, with lower values indicating a closer fit.
Each iteration aims to reduce the RMSE on the training set, so the declining values show that the model fits the training data progressively better.
The cl2xfit output also includes four components: the XGBoost object, XGBoost scores for all entries in tdat, observed outcome, and outcome type.
tl2xfit = LASSO2plus_XGBtraining(data = tdat, biomks = Xvars,
outcomeType = "time-to-event",
time = "FFP..Years.", event = "Code.FFP",
outfile = "survival_LASSO2plus_XGBoost")
## Start: AIC=815.14
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + HANS_GCB + B.Symptoms + DTI + CD10.IHC +
## MUM1.IHC
##
## Df AIC
## - HANS_GCB 1 813.18
## - B.Symptoms 1 813.36
## - highIPI 1 813.41
## - DTI 1 813.73
## - CD10.IHC 1 813.89
## - MUM1.IHC 1 814.17
## <none> 815.14
## - PS1 1 818.22
## - stage3_4 1 822.90
## - LDH.Ratio1 1 824.45
##
## Step: AIC=813.18
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + B.Symptoms + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - B.Symptoms 1 811.39
## - highIPI 1 811.43
## - DTI 1 811.78
## - CD10.IHC 1 812.30
## - MUM1.IHC 1 812.38
## <none> 813.18
## - PS1 1 816.24
## - stage3_4 1 821.44
## - LDH.Ratio1 1 822.45
##
## Step: AIC=811.39
## survival::Surv(FFP..Years., Code.FFP) ~ highIPI + stage3_4 +
## PS1 + LDH.Ratio1 + DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - highIPI 1 809.66
## - DTI 1 810.17
## - MUM1.IHC 1 810.59
## - CD10.IHC 1 810.62
## <none> 811.39
## - PS1 1 815.20
## - stage3_4 1 819.75
## - LDH.Ratio1 1 820.80
##
## Step: AIC=809.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## DTI + CD10.IHC + MUM1.IHC
##
## Df AIC
## - DTI 1 808.46
## - MUM1.IHC 1 808.86
## - CD10.IHC 1 808.94
## <none> 809.66
## - PS1 1 814.13
## - stage3_4 1 819.20
## - LDH.Ratio1 1 820.45
##
## Step: AIC=808.46
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## CD10.IHC + MUM1.IHC
##
## Df AIC
## - CD10.IHC 1 807.66
## - MUM1.IHC 1 807.79
## <none> 808.46
## - PS1 1 813.12
## - stage3_4 1 818.06
## - LDH.Ratio1 1 824.55
##
## Step: AIC=807.66
## survival::Surv(FFP..Years., Code.FFP) ~ stage3_4 + PS1 + LDH.Ratio1 +
## MUM1.IHC
##
## Df AIC
## <none> 807.66
## - MUM1.IHC 1 808.22
## - PS1 1 813.79
## - stage3_4 1 817.87
## - LDH.Ratio1 1 824.33
## file saved to survival_LASSO2plus_XGBoostLASSO2plus_varaibleSelection.pdf
## [1] train-cox-nloglik:4.873247
## [2] train-cox-nloglik:4.789556
## [3] train-cox-nloglik:4.743813
## [4] train-cox-nloglik:4.718117
## [5] train-cox-nloglik:4.695957
head(tl2xfit$XGBoost_score)
## pt103 pt246 pt874 pt219 pt138 pt328
## 1.5001757 0.2840731 1.5001757 0.3679212 0.1763741 0.7963076
Similarly, most of the output comes from LASSO2plus, and the last lines come from XGBoost. In the XGBoost output, train-cox-nloglik is the negative partial log-likelihood of a Cox proportional hazards model, a standard loss function in survival analysis. It measures how poorly the model's risk scores agree with the observed survival times and events in the training data, so lower values indicate a better fit.
Tracking this value across iterations shows how well the model learns from the data; a decreasing trend indicates an improving fit to the training data.
The tl2xfit output contains six components: XGBoost object, XGBoost scores for tdat, baseline hazard table, observed time, event, and outcome type.
In this section, we outline the prediction process for the eight modeling approaches included in this package when given the input variables (X) in a new dataset.
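Before calling any prediction function, it can help to confirm that the new dataset actually contains the predictors the model was trained on. The helper below is a hypothetical convenience written in base R, not a csmpv function, and the column names are only examples; the package may perform its own checks.

```r
# Stop with an informative message if newdata lacks any required predictor.
check_predictors <- function(newdata, required) {
  missing_vars <- setdiff(required, colnames(newdata))
  if (length(missing_vars) > 0) {
    stop("newdata is missing predictors: ",
         paste(missing_vars, collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical new dataset and predictor list.
new_df <- data.frame(MYC.IHC = c(0, 1), CD10.IHC = c(1, 0))
needed <- c("MYC.IHC", "CD10.IHC")

check_predictors(new_df, needed)   # passes silently

# Asking for a column that is absent produces an error message.
msg <- tryCatch(check_predictors(new_df, c(needed, "MUM1.IHC")),
                error = function(e) conditionMessage(e))
msg
```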
We begin by discussing predictions for LASSO2 model outcomes.
To predict binary outcomes using LASSO2, we use the following code snippet:
pbl = LASSO2_predict(bl, newdata = vdat, outfile = "pred_LASSO2_binary")
head(pbl)
## pt3 pt10 pt20 pt25 pt30 pt52
## 0.20970033 0.18367987 0.04718620 0.04718620 0.10784060 0.02336746
The pbl object holds the predicted probabilities for the positive group for each entry/sample.
For continuous outcomes prediction, the code snippet is as follows:
pcl = LASSO2_predict(cl, newdata = vdat, outfile = "pred_LASSO2_cont")
head(pcl)
The pcl object holds the predicted Y values for each entry/sample.
When predicting time-to-event outcomes, we use the code:
ptl = LASSO2_predict(tl, newdata = vdat,
outfile = "pred_LASSO2_time_to_event")
head(ptl)
The ptl object holds predicted risk scores for each entry/sample.
Next, we turn to predictions from the combined LASSO2 and regular regression models. The rms_model function handles prediction for regular modeling objects such as those produced by LASSO2_reg. For binary and continuous outcomes, this step generates one figure and five tables; for time-to-event outcomes, one additional table is generated. All of these files are saved locally.
To predict binary outcomes using the LASSO2 + regular regression model:
pblr = rms_model(blr$fit, newdata = vdat, outfile = "pred_LASSO2reg_binary")
## index.orig training test optimism index.corrected
## Dxy 0.842931937 0.851528862 0.82351204 0.028016820 0.81491512
## R2 0.527718998 0.552879991 0.48756661 0.065313386 0.46240561
## Intercept 0.000000000 0.000000000 -0.15365199 0.153651991 -0.15365199
## Slope 1.000000000 1.000000000 0.80921368 0.190786316 0.80921368
## Emax 0.000000000 0.000000000 0.07527519 0.075275194 0.07527519
## D 0.310084555 0.322462320 0.28277788 0.039684442 0.27040011
## U -0.009259259 -0.009259259 0.01821248 -0.027471744 0.01821248
## Q 0.319343814 0.331721579 0.26456539 0.067156186 0.25218763
## B 0.060659654 0.056392481 0.06413595 -0.007743470 0.06840312
## g 3.075229015 4.158616075 3.02608229 1.132533785 1.94269523
## gp 0.173738980 0.170714350 0.16773921 0.002975144 0.17076384
## Cindex 0.921465969 0.925764431 0.91175602 0.014008410 0.90745756
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
head(pblr)
## pt3 pt10 pt20 pt25 pt30 pt52
## -2.101131 -2.221256 -5.814410 -4.230616 -2.895583 -8.613113
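The values returned for the binary model are negative, so they cannot be probabilities; they appear to be linear predictors on the log-odds scale. Under that assumption, which the text does not state, they can be converted to probabilities with the inverse logit. The sketch uses the six values printed above.

```r
# Values copied from head(pblr); assumed to be log-odds.
lp <- c(pt3 = -2.101131, pt10 = -2.221256, pt20 = -5.814410,
        pt25 = -4.230616, pt30 = -2.895583, pt52 = -8.613113)

# Inverse logit: plogis(x) is 1 / (1 + exp(-x)).
prob <- plogis(lp)
round(prob, 3)
```

Larger (less negative) linear predictors map to higher probabilities of the positive class, so the ordering of samples is the same on either scale.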
For continuous outcomes prediction:
pclr = rms_model(clr$fit, newdata = vdat,
outfile = "pred_LASSO2reg_continuous")
## index.orig training test optimism index.corrected n
## R-square 0.0547 0.0632 0.0433 0.0199 0.0348 200
## MSE 182.0879 176.5652 184.2864 -7.7212 189.8091 200
## g 3.3604 3.4145 3.2358 0.1787 3.1817 200
## Intercept 0.0000 0.0000 -2.0414 2.0414 -2.0414 200
## Slope 1.0000 1.0000 1.0298 -0.0298 1.0298 200
head(pclr)
## pt3 pt10 pt20 pt25 pt30 pt52
## 62.00437 62.00437 62.00437 69.45115 65.13919 65.13919
To predict time-to-event outcomes:
ptlr = rms_model(tlr$fit, data = tdat, newdata = vdat,
outfile = "pred_LASSO2reg_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.426967889 0.433890466 0.426121768 0.007768699 0.419199190 200
## R2 0.200249909 0.210292163 0.195782136 0.014510028 0.185739882 200
## Slope 1.000000000 1.000000000 0.972636972 0.027363028 0.972636972 200
## D 0.053514663 0.056885491 0.052163776 0.004721715 0.048792948 200
## U -0.002312546 -0.002318816 0.001148229 -0.003467045 0.001154499 200
## Q 0.055827209 0.059204307 0.051015547 0.008188760 0.047638450 200
## g 0.813999013 0.832952948 0.797517544 0.035435404 0.778563609 200
## Cindex 0.713483944 0.716945233 0.713060884 0.003884349 0.709599595 200
head(ptlr)
## pt3 pt10 pt20 pt25 pt30 pt52
## 0.28074825 0.28074825 -0.73787179 -0.04536548 0.97325456 0.97325456
For time-to-event outcomes, rms_model additionally requires the training dataset, which is supplied through the data argument.
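The tables printed by rms_model have the layout produced by the validate function of the rms package. Assuming that reading is right, the columns are linked by simple arithmetic: optimism is the training index minus the test index, and the corrected index is the original index minus the optimism. The sketch below checks this on the Dxy row of the time-to-event table above and also shows the usual relation between Dxy and the C-index.

```r
# Dxy row copied from the time-to-event table above.
dxy <- c(index.orig = 0.426967889, training = 0.433890466, test = 0.426121768)

optimism  <- dxy[["training"]] - dxy[["test"]]
corrected <- dxy[["index.orig"]] - optimism
cindex    <- (dxy[["index.orig"]] + 1) / 2   # C-index = (Dxy + 1) / 2

round(c(optimism = optimism, corrected = corrected, cindex = cindex), 6)
```

The results agree with the optimism, index.corrected, and Cindex entries of the printed table up to rounding. Since the training and test columns come from resampling, they vary from run to run, which explains why the prediction and validation tables differ slightly for the same model.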
We also use rms_model to predict LASSO_plus model outcomes.
To predict binary outcomes using the LASSO_plus model:
pbfit = rms_model(bfit$fit, newdata = vdat,
outfile = "pred_LASSOplus_binary")
## index.orig training test optimism index.corrected
## Dxy 0.790157068 0.800075272 0.77798534 0.022089931 0.76806714
## R2 0.481472357 0.497669844 0.45071403 0.046955810 0.43451655
## Intercept 0.000000000 0.000000000 -0.15094005 0.150940045 -0.15094005
## Slope 1.000000000 1.000000000 0.86220502 0.137794977 0.86220502
## Emax 0.000000000 0.000000000 0.06093911 0.060939114 0.06093911
## D 0.278185437 0.290630349 0.25787890 0.032751452 0.24543398
## U -0.009259259 -0.009259259 0.01501785 -0.024277111 0.01501785
## Q 0.287444696 0.299889609 0.24286105 0.057028562 0.23041613
## B 0.064166124 0.062790107 0.06655954 -0.003769434 0.06793556
## g 2.702828842 3.576024982 2.70732762 0.868697361 1.83413148
## gp 0.162611026 0.164791936 0.15920238 0.005589553 0.15702147
## Cindex 0.895078534 0.900037636 0.88899267 0.011044966 0.88403357
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
For continuous outcomes prediction:
pcfit = rms_model(cfit$fit, newdata = vdat,
outfile = "pred_LASSOplus_continuous")
## index.orig training test optimism index.corrected n
## R-square 0.1113 0.1240 0.0915 0.0325 0.0788 200
## MSE 171.1808 168.0692 175.0057 -6.9365 178.1174 200
## g 5.1667 5.2993 4.8105 0.4888 4.6780 200
## Intercept 0.0000 0.0000 4.9015 -4.9015 4.9015 200
## Slope 1.0000 1.0000 0.9248 0.0752 0.9248 200
To predict time-to-event outcomes:
ptfit = rms_model(tfit$fit, data = tdat, newdata = vdat,
outfile = "pred_LASSOplus_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.513853367 0.515499635 0.500640097 0.014859538 0.498993829 200
## R2 0.265362936 0.274046535 0.254619687 0.019426848 0.245936089 200
## Slope 1.000000000 1.000000000 0.954016306 0.045983694 0.954016306 200
## D 0.074222355 0.078065651 0.070698255 0.007367396 0.066854959 200
## U -0.002312546 -0.002325827 0.001801053 -0.004126880 0.001814334 200
## Q 0.076534901 0.080391477 0.068897201 0.011494276 0.065040625 200
## g 1.078420120 1.107354866 1.044136718 0.063218149 1.015201972 200
## Cindex 0.756926684 0.757749818 0.750320048 0.007429769 0.749496915 200
Similarly, we use rms_model to predict LASSO2plus model outcomes.
To predict binary outcomes using the LASSO2plus model:
p2bfit = rms_model(b2fit$fit, newdata = vdat,
outfile = "pred_LASSO2plus_binary")
## index.orig training test optimism index.corrected
## Dxy 0.790157068 0.803563241 0.78029110 0.023272142 0.76688493
## R2 0.481472357 0.499574865 0.45148724 0.048087624 0.43338473
## Intercept 0.000000000 0.000000000 -0.10915293 0.109152931 -0.10915293
## Slope 1.000000000 1.000000000 0.85642426 0.143575743 0.85642426
## Emax 0.000000000 0.000000000 0.05395268 0.053952682 0.05395268
## D 0.278185437 0.289302762 0.25842693 0.030875828 0.24730961
## U -0.009259259 -0.009259259 0.01815098 -0.027410241 0.01815098
## Q 0.287444696 0.298562022 0.24027595 0.058286069 0.22915863
## B 0.064166124 0.061466464 0.06658390 -0.005117440 0.06928356
## g 2.702828842 3.675632786 2.72701452 0.948618263 1.75421058
## gp 0.162611026 0.162674650 0.15950938 0.003165273 0.15944575
## Cindex 0.895078534 0.901781621 0.89014555 0.011636071 0.88344246
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
For continuous outcomes prediction:
p2cfit = rms_model(c2fit$fit, newdata = vdat,
outfile = "pred_LASSO2plus_continuous")
## index.orig training test optimism index.corrected n
## R-square 0.0547 0.0654 0.0415 0.0239 0.0308 200
## MSE 182.0879 181.0783 184.6321 -3.5538 185.6417 200
## g 3.3604 3.5298 3.2081 0.3217 3.0387 200
## Intercept 0.0000 0.0000 1.6415 -1.6415 1.6415 200
## Slope 1.0000 1.0000 0.9755 0.0245 0.9755 200
To predict time-to-event outcomes:
p2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat,
outfile = "pred_LASSO2plus_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.513853367 0.517035127 0.501410912 0.015624214 0.49822915 200
## R2 0.265362936 0.275890472 0.254297677 0.021592794 0.24377014 200
## Slope 1.000000000 1.000000000 0.950393088 0.049606912 0.95039309 200
## D 0.074222355 0.078564634 0.070597812 0.007966822 0.06625553 200
## U -0.002312546 -0.002332982 0.001318394 -0.003651376 0.00133883 200
## Q 0.076534901 0.080897616 0.069279418 0.011618198 0.06491670 200
## g 1.078420120 1.102956596 1.039869324 0.063087272 1.01533285 200
## Cindex 0.756926684 0.758517563 0.750705456 0.007812107 0.74911458 200
Continuing, we discuss predictions for the XGBoost model outcomes.
To predict binary outcomes using the XGBoost model:
pbxfit = XGBtraining_predict(bxfit, newdata = vdat,
outfile = "pred_XGBoost_binary")
For continuous outcomes prediction:
pcxfit = XGBtraining_predict(cxfit, newdata = vdat,
outfile = "pred_XGBoost_cont")
To predict time-to-event outcomes:
ptxfit = XGBtraining_predict(txfit, newdata = vdat,
outfile = "pred_XGBoost_time_to_event")
Next, we explore predictions for the combined LASSO2 and XGBoost model outcomes.
To predict binary outcomes:
pblxfit = XGBtraining_predict(blxfit, newdata = vdat,
outfile = "pred_LXGBoost_binary")
To predict continuous outcomes:
pclxfit = XGBtraining_predict(clxfit, newdata = vdat,
outfile = "pred_LXGBoost_cont")
To predict time-to-event outcomes:
ptlxfit = XGBtraining_predict(tlxfit, newdata = vdat,
outfile = "pred_LXGBoost_time_to_event")
Next, we discuss predictions for the combined LASSO_plus and XGBoost model outcomes.
To predict binary outcomes:
pblpxfit = XGBtraining_predict(blpxfit, newdata = vdat,
outfile = "pred_LpXGBoost_binary")
For continuous outcomes prediction:
pclpxfit = XGBtraining_predict(clpxfit, newdata = vdat,
outfile = "pred_LpXGBoost_cont")
To predict time-to-event outcomes:
ptlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat,
outfile = "pred_LpXGBoost_time_to_event")
To predict binary outcomes using the LASSO2plus + XGBoost model:
pbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat,
outfile = "pred_L2XGBoost_binary")
For continuous outcomes prediction:
pcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat,
outfile = "pred_L2XGBoost_cont")
To predict time-to-event outcomes:
ptl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat,
outfile = "pred_L2XGBoost_time_to_event")
In the validation phase, we assess model performance on a new dataset that includes the outcome variable. Because this validation dataset is distinct from the one used for training, the procedure is called external validation. This sets it apart from internal validation methods such as sampling, cross-validation, leave-one-out, and bootstrapping, which reuse the training data.
The same functions are used for both prediction and validation, but validation requires the outcome variable to be present in the new dataset, which is indicated by setting newY = TRUE in the calls that follow. With the outcome available, the functions run additional analyses that compare the predictions with the observed outcomes.
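The difference between prediction and validation can be illustrated without csmpv. The sketch below uses simulated data and an ordinary linear model; the function predict_or_validate and its outcome argument are hypothetical stand-ins for the package's newY switch, not its actual implementation.

```r
set.seed(1)

# Simulated training data and a separate "external" dataset.
train   <- data.frame(x = rnorm(50))
train$y <- 2 * train$x + rnorm(50)
newdat   <- data.frame(x = rnorm(20))
newdat$y <- 2 * newdat$x + rnorm(20)

fit <- lm(y ~ x, data = train)

# Always predict; add performance measures only when an outcome is named.
predict_or_validate <- function(fit, newdata, outcome = NULL) {
  pred <- predict(fit, newdata = newdata)
  if (is.null(outcome)) {
    return(list(pred = pred))
  }
  obs <- newdata[[outcome]]
  list(pred = pred,
       rmse = sqrt(mean((obs - pred)^2)),
       cor  = cor(obs, pred))
}

pred_only <- predict_or_validate(fit, newdat[, "x", drop = FALSE])
valid     <- predict_or_validate(fit, newdat, outcome = "y")
names(pred_only)
names(valid)
```

The predictions are identical in both calls; validation only adds the comparison with the observed outcome.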
All generated validation plots and associated result files are stored locally for easy reference.
We conduct validation for the LASSO2 model with different types of outcome variables.
vbl = LASSO2_predict(bl, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2_binary")
The returned object vbl again holds the predicted probabilities for the ’DZsig’ positive group. In addition, a validation performance figure is saved locally.
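Once predicted probabilities and observed classes are both available, discrimination can be summarized with the area under the ROC curve (AUC). The base R sketch below uses the rank-based formula on hypothetical data; it is one common summary, and we do not claim that it is the statistic shown in the saved figure.

```r
# AUC from ranks: the share of positive-negative pairs in which the
# positive sample receives the higher predicted probability.
auc_rank <- function(y, p) {
  r  <- rank(p)                  # average ranks handle tied predictions
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical observed classes and predicted probabilities.
y_obs  <- c(1, 0, 1, 0, 0)
p_pred <- c(0.8, 0.3, 0.6, 0.7, 0.1)

auc_rank(y_obs, p_pred)   # 5 of the 6 positive-negative pairs are ordered correctly
```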
vcl = LASSO2_predict(cl, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Similarly, the returned object vcl holds the predicted values, and a validation performance plot is saved locally.
vtl = LASSO2_predict(tl, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2_time_to_event")
The returned object vtl holds the predicted risk scores, and the locally saved validation results include a calibration plot and a table of performance statistics.
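A widely used performance statistic for risk scores is Harrell's concordance index (C-index). The following base R sketch computes it on hypothetical data; whether the saved table reports exactly this statistic is not stated here, so the code is meant only to show what such a measure captures.

```r
# Harrell's C: among comparable pairs, the share in which the subject
# who fails earlier has the higher risk score (ties in risk count 0.5).
c_index <- function(time, event, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when subject i has an observed event
      # strictly before subject j's follow-up time.
      if (event[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / comparable
}

# Hypothetical follow-up times, event indicators (0 = censored), and risk scores.
time  <- c(1, 2, 3, 4)
event <- c(1, 1, 0, 1)

c_index(time, event, risk = c(4, 3, 2, 1))   # perfect ordering
c_index(time, event, risk = c(1, 2, 3, 4))   # fully reversed ordering
```

A value of 0.5 corresponds to random ordering, and values closer to 1 indicate that higher risk scores go with earlier events.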
As in the prediction step, we use rms_model to validate the combined LASSO2 and regular regression model.
vblr = rms_model(blr$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2reg_binary")
## index.orig training test optimism index.corrected
## Dxy 0.842931937 0.852864366 0.82443141 0.028432952 0.81449898
## R2 0.527718998 0.556745099 0.48959034 0.067154755 0.46056424
## Intercept 0.000000000 0.000000000 -0.16861874 0.168618736 -0.16861874
## Slope 1.000000000 1.000000000 0.80183609 0.198163913 0.80183609
## Emax 0.000000000 0.000000000 0.08034152 0.080341516 0.08034152
## D 0.310084555 0.329349113 0.28405687 0.045292238 0.26479232
## U -0.009259259 -0.009259259 0.02034229 -0.029601553 0.02034229
## Q 0.319343814 0.338608372 0.26371458 0.074893791 0.24445002
## B 0.060659654 0.056924343 0.06396689 -0.007042547 0.06770220
## g 3.075229015 4.241118946 3.01611986 1.224999086 1.85022993
## gp 0.173738980 0.173862233 0.16806814 0.005794088 0.16794489
## Cindex 0.921465969 0.926432183 0.91221571 0.014216476 0.90724949
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
The above code generates and saves two figures and five tables, some of which duplicate files produced in the prediction step.
vclr = rms_model(clr$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2reg_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## index.orig training test optimism index.corrected n
## R-square 0.0547 0.0634 0.0421 0.0213 0.0333 200
## MSE 182.0879 181.9932 184.5155 -2.5223 184.6102 200
## g 3.3604 3.4535 3.2035 0.2500 3.1105 200
## Intercept 0.0000 0.0000 -0.1968 0.1968 -0.1968 200
## Slope 1.0000 1.0000 1.0027 -0.0027 1.0027 200
The above code also generates and saves two figures and five tables, some of which duplicate files produced in the prediction step.
vtlr = rms_model(tlr$fit, data = tdat, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2reg_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.426967889 0.437033445 0.426916738 0.010116707 0.416851182 200
## R2 0.200249909 0.212654732 0.195929693 0.016725039 0.183524870 200
## Slope 1.000000000 1.000000000 0.964331482 0.035668518 0.964331482 200
## D 0.053514663 0.057353587 0.052206694 0.005146893 0.048367771 200
## U -0.002312546 -0.002306487 0.001012522 -0.003319010 0.001006464 200
## Q 0.055827209 0.059660074 0.051194172 0.008465902 0.047361307 200
## g 0.813999013 0.838751394 0.798348552 0.040402842 0.773596172 200
## Cindex 0.713483944 0.718516722 0.713458369 0.005058353 0.708425591 200
As in the prediction step, validation of a time-to-event outcome also requires the training data. The above code generates and saves two figures and six tables, some of which duplicate files produced in the prediction step.
Next, we use the same rms_model function to validate the LASSO_plus models. The parameter settings and outputs mirror those described for the validation of the combined LASSO2 and regular regression models in Section 6.2.
vbfit = rms_model(bfit$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSOplus_binary")
## index.orig training test optimism index.corrected
## Dxy 0.790157068 0.803388279 0.77933822 0.024050059 0.76610701
## R2 0.481472357 0.499006824 0.45086750 0.048139323 0.43333303
## Intercept 0.000000000 0.000000000 -0.15539422 0.155394220 -0.15539422
## Slope 1.000000000 1.000000000 0.85343547 0.146564526 0.85343547
## Emax 0.000000000 0.000000000 0.06404446 0.064044461 0.06404446
## D 0.278185437 0.289255131 0.25796669 0.031288436 0.24689700
## U -0.009259259 -0.009259259 0.01900001 -0.028259273 0.01900001
## Q 0.287444696 0.298514390 0.23896668 0.059547709 0.22789699
## B 0.064166124 0.062653488 0.06658645 -0.003932959 0.06809908
## g 2.702828842 3.697474633 2.71932938 0.978145248 1.72468359
## gp 0.162611026 0.163854502 0.15944139 0.004413114 0.15819791
## Cindex 0.895078534 0.901694140 0.88966911 0.012025030 0.88305350
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
vcfit = rms_model(cfit$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSOplus_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## index.orig training test optimism index.corrected n
## R-square 0.1113 0.1256 0.0923 0.0332 0.0781 200
## MSE 171.1808 166.7987 174.8346 -8.0360 179.2168 200
## g 5.1667 5.3244 4.8371 0.4872 4.6795 200
## Intercept 0.0000 0.0000 5.0827 -5.0827 5.0827 200
## Slope 1.0000 1.0000 0.9191 0.0809 0.9191 200
vtfit = rms_model(tfit$fit, data = tdat, newdata = vdat, newY = TRUE,
outfile = "valid_LASSOplus_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.513853367 0.520177786 0.500051861 0.020125925 0.493727442 200
## R2 0.265362936 0.278929001 0.254352540 0.024576461 0.240786475 200
## Slope 1.000000000 1.000000000 0.942296628 0.057703372 0.942296628 200
## D 0.074222355 0.079377064 0.070615670 0.008761394 0.065460961 200
## U -0.002312546 -0.002325518 0.001531321 -0.003856839 0.001544293 200
## Q 0.076534901 0.081702582 0.069084349 0.012618233 0.063916668 200
## g 1.078420120 1.119889538 1.042314482 0.077575056 1.000845064 200
## Cindex 0.756926684 0.760088893 0.750025931 0.010062963 0.746863721 200
Additionally, we use the same rms_model function to validate the LASSO2plus models. The parameter settings and outputs match those described for the combined LASSO2 and regular regression validation in Section 6.2.
v2bfit = rms_model(b2fit$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2plus_binary")
## index.orig training test optimism index.corrected
## Dxy 0.790157068 0.808066446 0.77752042 0.030546028 0.75961104
## R2 0.481472357 0.508765669 0.44657735 0.062188322 0.41928404
## Intercept 0.000000000 0.000000000 -0.19134450 0.191344497 -0.19134450
## Slope 1.000000000 1.000000000 0.81785604 0.182143959 0.81785604
## Emax 0.000000000 0.000000000 0.08073939 0.080739386 0.08073939
## D 0.278185437 0.298129804 0.25516929 0.042960514 0.23522492
## U -0.009259259 -0.009259259 0.01631215 -0.025571409 0.01631215
## Q 0.287444696 0.307389063 0.23885714 0.068531924 0.21891277
## B 0.064166124 0.061327218 0.06655706 -0.005229840 0.06939596
## g 2.702828842 3.794266494 2.74176003 1.052506467 1.65032238
## gp 0.162611026 0.166459228 0.15876477 0.007694462 0.15491656
## Cindex 0.895078534 0.904033223 0.88876021 0.015273014 0.87980552
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
v2cfit = rms_model(c2fit$fit, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2plus_continuous")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## index.orig training test optimism index.corrected n
## R-square 0.0547 0.0639 0.0415 0.0224 0.0323 200
## MSE 182.0879 178.9352 184.6205 -5.6852 187.7732 200
## g 3.3604 3.4703 3.1967 0.2736 3.0868 200
## Intercept 0.0000 0.0000 0.1922 -0.1922 0.1922 200
## Slope 1.0000 1.0000 0.9974 0.0026 0.9974 200
v2tfit = rms_model(t2fit$fit, data = tdat, newdata = vdat, newY = TRUE,
outfile = "valid_LASSO2plus_time_to_event")
## index.orig training test optimism index.corrected n
## Dxy 0.513853367 0.521392165 0.499849389 0.021542776 0.492310591 200
## R2 0.265362936 0.280418373 0.254040389 0.026377985 0.238984951 200
## Slope 1.000000000 1.000000000 0.936835025 0.063164975 0.936835025 200
## D 0.074222355 0.080365460 0.070513329 0.009852131 0.064370224 200
## U -0.002312546 -0.002338996 0.001542323 -0.003881319 0.001568773 200
## Q 0.076534901 0.082704456 0.068971006 0.013733449 0.062801452 200
## g 1.078420120 1.122557100 1.042425107 0.080131993 0.998288127 200
## Cindex 0.756926684 0.760696083 0.749924695 0.010771388 0.746155295 200
The XGBtraining_predict function introduced in Section 5.5, as indicated by its name, also serves for model validation when the outcome variable is present in the validation cohort. The parameter settings and outputs are the same as those for the LASSO2_prediction function detailed in Section 6.1.
vbxfit = XGBtraining_predict(bxfit, newdata = vdat, newY = TRUE,
outfile = "valid_XGBoost_binary")
Predicted probability for the positive group is given for each entry/sample.
vcxfit = XGBtraining_predict(cxfit, newdata = vdat, newY = TRUE,
outfile = "valid_XGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
vtxfit = XGBtraining_predict(txfit, newdata = vdat, newY = TRUE,
outfile = "valid_XGBoost_time_to_event")
The same XGBtraining_predict function is used for LASSO2 + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with the same parameter settings and the same types of output.
vblxfit = XGBtraining_predict(blxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LXGBoost_binary")
vclxfit = XGBtraining_predict(clxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LXGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
vtlxfit = XGBtraining_predict(tlxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LXGBoost_time_to_event")
The same XGBtraining_predict function is used for LASSO_plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with the same parameter settings and the same types of output.
vblpxfit = XGBtraining_predict(blpxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LpXGBoost_binary")
vclpxfit = XGBtraining_predict(clpxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LpXGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
vtlpxfit = XGBtraining_predict(tlpxfit, newdata = vdat, newY = TRUE,
outfile = "valid_LpXGBoost_time_to_event")
The same XGBtraining_predict function is used for LASSO2plus + XGBoost model validation as for the standalone XGBoost model shown in Section 6.5, with the same parameter settings and the same types of output.
vbl2xfit = XGBtraining_predict(bl2xfit, newdata = vdat, newY = TRUE,
outfile = "valid_L2XGBoost_binary")
vcl2xfit = XGBtraining_predict(cl2xfit, newdata = vdat, newY = TRUE,
outfile = "valid_L2XGBoost_cont")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
vtl2xfit = XGBtraining_predict(tl2xfit, newdata = vdat, newY = TRUE,
outfile = "valid_L2XGBoost_time_to_event")
If calling the individual functions separately is cumbersome, the all-in-one function offers a simpler route. With a single call, it handles predictive model development and validation for all eight methods in this package across three distinct outcome types. It can also be used for a single method with one or more outcome variables. If a validation dataset is available, validation is carried out within the same call.
modelout = csmpvModelling(tdat = tdat, vdat = vdat,
                          Ybinary = "DZsig", varsBinary = Xvars,
                          Ycont = "Age", varsCont = AgeXvars,
                          time = "FFP..Years.", event = "Code.FFP",
                          varsSurvival = Xvars,
                          outfileName = "all_in_one")
This single function call generates all models and provides predictions and validations for each of them. To save space, the running results are hidden. In other words, this one call can replace the separate steps covered in Sections 4, 5, and 6. The models are returned, and with our example training data (tdat) and validation data (vdat), all 179 result files are saved locally.
We can also use this all-in-one function for one outcome variable and one model at a time, for example:
DZlassoreg = csmpvModelling(tdat = tdat, vdat = vdat,
                            Ybinary = "DZsig", varsBinary = Xvars,
                            methods = "LASSO2_reg",
                            outfileName = "just_one")
## Resized limits to included dashed line in forest panel
## Resized limits to included dashed line in forest panel
## Resized limits to included dashed line in forest panel
## file saved to just_one_binary_LASSO2reg_LASSO_reg.pdf
## file saved to just_one_binary_LASSO2reg_LASSO_regallMarks.pdf
## index.orig training test optimism index.corrected
## Dxy 0.790157068 0.809430853 0.77916440 0.030266455 0.75989061
## R2 0.481472357 0.513673005 0.45482537 0.058847637 0.42262472
## Intercept 0.000000000 0.000000000 -0.15184124 0.151841242 -0.15184124
## Slope 1.000000000 1.000000000 0.83708041 0.162919589 0.83708041
## Emax 0.000000000 0.000000000 0.06748667 0.067486675 0.06748667
## D 0.278185437 0.302503596 0.26051003 0.041993564 0.23619187
## U -0.009259259 -0.009259259 0.01651497 -0.025774229 0.01651497
## Q 0.287444696 0.311762855 0.24399506 0.067767793 0.21967690
## B 0.064166124 0.060846460 0.06618349 -0.005337033 0.06950316
## g 2.702828842 3.711119190 2.71891485 0.992204335 1.71062451
## gp 0.162611026 0.167573743 0.16005494 0.007518802 0.15509222
## Cindex 0.895078534 0.904715427 0.88958220 0.015133228 0.87994531
## n
## Dxy 200
## R2 200
## Intercept 200
## Slope 200
## Emax 200
## D 200
## U 200
## Q 200
## B 200
## g 200
## gp 200
## Cindex 200
This is equivalent to using LASSO2_reg for modeling, followed by prediction and validation with rms_model for the classification task “DZsig”. Six result files are then saved locally.
In the preceding sections, the target model type always matched the outcome type provided. However, there are scenarios in which the two do not correspond.
For instance, we may want to build a risk classification model when the training cohort has survival information but no risk classification data.
To undertake this specialized modeling, let’s assume that we possess a set of variables associated with survival outcomes. This variable list could stem from other research and be validated within the given training dataset, or it could be established through variable selection techniques such as LASSO2, LASSO_plus and LASSO2plus.
Using the same variable list, denoted as Xvars, we can call the XGpred function with the option to perform variable selection with LASSO2. This wrapper function applies XGBoost and Cox modeling to the survival data to define high and low risk groups. These groups are then filtered and used to build both an XGpred (linear prediction score) model and an empirical Bayesian-based binary risk classification model.
Build the XGpred object for the training cohort:
xgobj = XGpred(data = tdat, varsIn = Xvars,
selection = TRUE,
time = "FFP..Years.",
event = "Code.FFP", outfile = "XGpred")
The XGpred output object, xgobj, contains all the necessary information for risk classification, including that of the training cohort.
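The empirical Bayesian classification step mentioned above can be illustrated with a small base-R sketch. This is not the XGpred implementation: the scores are simulated, the two risk groups are assumed to have approximately normal score distributions with equal priors, and the 0.5 cutoff and all object names are hypothetical.

```r
set.seed(1)

# Simulated linear prediction scores for two already-defined risk groups
# (hypothetical data; a real analysis would use model scores)
low_scores  <- rnorm(100, mean = -1, sd = 0.8)
high_scores <- rnorm(100, mean =  1, sd = 0.8)

# Posterior probability of the high-risk group for a score x,
# assuming normal score distributions and equal group priors
prob_high <- function(x) {
  d_high <- dnorm(x, mean(high_scores), sd(high_scores))
  d_low  <- dnorm(x, mean(low_scores),  sd(low_scores))
  d_high / (d_high + d_low)
}

# Classify new scores with a hypothetical 0.5 probability cutoff
new_scores <- c(-2, 0.1, 2.5)
new_prob   <- prob_high(new_scores)
new_class  <- ifelse(new_prob >= 0.5, "High", "Low")

data.frame(score = new_scores, prob_high = round(new_prob, 3), class = new_class)
```

Scores far from the overlap of the two distributions receive probabilities close to 0 or 1, while scores near the overlap are the ones most sensitive to the choice of cutoff.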
To observe the performance of the risk group in the training set, we can generate a KM plot using the confirmVars function:
tdat$XGpred_class = xgobj$XGpred_prob_class
training_risk_confirm = confirmVars(data = tdat, biomks = "XGpred_class",
time = "FFP..Years.", event = "Code.FFP",
outfile = "training_riskSurvival",
outcomeType = "time-to-event")
training_risk_confirm[[3]]
Then we can predict the risk classification for a validation cohort:
xgNew = XGpred_predict(newdat = vdat, XGpredObj = xgobj)
While the default calibration shift (scoreShift) is set to 0, you can adjust it based on model scores if there’s a platform/batch difference between the training and validation cohorts.
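One possible way to pick such a shift, offered only as a sketch under stated assumptions, is to compare the score distributions of the two cohorts. The example below uses simulated scores with a known offset and takes the difference in medians as a candidate value. Whether this matches how scoreShift is applied inside XGpred_predict, including its sign convention, should be checked against the package documentation; all object names here are hypothetical.

```r
set.seed(42)

# Simulated model scores (hypothetical): the validation platform reads
# systematically 0.5 units higher than the training platform
train_scores <- rnorm(300, mean = 0, sd = 1)
valid_scores <- rnorm(300, mean = 0, sd = 1) + 0.5

# Candidate shift: difference between cohort medians
shift_est <- median(valid_scores) - median(train_scores)

# Remove the estimated offset so both cohorts share the same median
valid_aligned <- valid_scores - shift_est

round(c(shift_est = shift_est,
        median_train = median(train_scores),
        median_aligned = median(valid_aligned)), 3)
```

A median-based shift is only sensible when the two cohorts are expected to have a similar mix of low and high risk samples; otherwise a real difference in risk would be removed along with the batch effect.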
If survival data are available for the validation dataset, we can use the confirmVars function introduced earlier to assess whether the risk classification is reasonable.
vdat$XGpred_class = xgNew$XGpred_prob_class
risk_confirm = confirmVars(data = vdat, biomks = "XGpred_class",
time = "FFP..Years.", event = "Code.FFP",
outfile = "riskSurvival",
outcomeType = "time-to-event")
risk_confirm[[3]]
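The association between a predicted risk class and survival can also be cross-checked with the survival package directly. The sketch below is not part of csmpv; it uses simulated follow-up data with hypothetical column names (time, event, risk_class) to show a log-rank test and a Cox hazard ratio for a binary risk class.

```r
library(survival)

set.seed(123)
n <- 200

# Simulated cohort (hypothetical): high-risk samples have a higher event rate
risk_class <- factor(rep(c("Low", "High"), each = n / 2),
                     levels = c("Low", "High"))
event_time <- rexp(n, rate = ifelse(risk_class == "High", 0.30, 0.10))
cens_time  <- runif(n, min = 2, max = 10)

sim <- data.frame(
  time  = pmin(event_time, cens_time),          # observed follow-up time
  event = as.integer(event_time <= cens_time),  # 1 = event, 0 = censored
  risk_class = risk_class
)

# Log-rank test for a difference between the two survival curves
lr <- survdiff(Surv(time, event) ~ risk_class, data = sim)
lr_p <- 1 - pchisq(lr$chisq, df = 1)

# Cox model: hazard ratio of High versus Low
cox <- coxph(Surv(time, event) ~ risk_class, data = sim)
hr <- unname(exp(coef(cox)))

c(logrank_p = lr_p, hazard_ratio = hr)
```

With real data, the simulated data frame would be replaced by the validation cohort and its predicted class column; a hazard ratio above 1 with a small log-rank p-value would support the predicted class as a surrogate biomarker.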
Title: Biomarker confirmation, selection, modelling, prediction and validation
Version: 1.0.2
Author: Aixiang Jiang
Maintainer: Aixiang Jiang <aijiang@bccrc.ca>
Depends: R (>= 4.2.0)
Suggests: knitr
VignetteBuilder: knitr
Imports: survival, glmnet, Hmisc, rms, forestmodel, ggplot2, ggpubr, survminer, mclust, xgboost, cowplot
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.2 (2023-10-31)
## os macOS Ventura 13.2.1
## system x86_64, darwin20
## ui X11
## language (EN)
## collate C
## ctype en_US.UTF-8
## tz America/Vancouver
## date 2024-01-10
## pandoc 2.19.2 @ /Users/aijiang/Desktop/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [3] CRAN (R 4.3.0)
## backports 1.4.1 2021-12-13 [3] CRAN (R 4.3.0)
## base64enc 0.1-3 2015-07-28 [3] CRAN (R 4.3.0)
## broom 1.0.5 2023-06-09 [3] CRAN (R 4.3.0)
## bslib 0.6.1 2023-11-28 [3] CRAN (R 4.3.0)
## cachem 1.0.8 2023-05-01 [3] CRAN (R 4.3.0)
## car 3.1-2 2023-03-30 [3] CRAN (R 4.3.0)
## carData 3.0-5 2022-01-06 [3] CRAN (R 4.3.0)
## checkmate 2.3.1 2023-12-04 [3] CRAN (R 4.3.0)
## cli 3.6.2 2023-12-11 [3] CRAN (R 4.3.0)
## cluster 2.1.4 2022-08-22 [4] CRAN (R 4.3.2)
## codetools 0.2-19 2023-02-01 [4] CRAN (R 4.3.2)
## colorspace 2.1-0 2023-01-23 [3] CRAN (R 4.3.0)
## commonmark 1.9.0 2023-03-17 [3] CRAN (R 4.3.0)
## cowplot 1.1.2 2023-12-15 [3] CRAN (R 4.3.0)
## csmpv * 1.0.2 2024-01-10 [1] local
## data.table 1.14.10 2023-12-08 [3] CRAN (R 4.3.0)
## devtools 2.4.5 2022-10-11 [3] CRAN (R 4.3.0)
## digest 0.6.33 2023-07-07 [3] CRAN (R 4.3.0)
## dplyr 1.1.4 2023-11-17 [3] CRAN (R 4.3.0)
## ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.3.0)
## evaluate 0.23 2023-11-01 [3] CRAN (R 4.3.0)
## fansi 1.0.6 2023-12-08 [3] CRAN (R 4.3.0)
## farver 2.1.1 2022-07-06 [3] CRAN (R 4.3.0)
## fastmap 1.1.1 2023-02-24 [3] CRAN (R 4.3.0)
## foreach 1.5.2 2022-02-02 [3] CRAN (R 4.3.0)
## foreign 0.8-86 2023-11-28 [3] CRAN (R 4.3.0)
## forestmodel 0.6.2 2020-07-19 [3] CRAN (R 4.3.0)
## Formula 1.2-5 2023-02-24 [3] CRAN (R 4.3.0)
## fs 1.6.3 2023-07-20 [3] CRAN (R 4.3.0)
## generics 0.1.3 2022-07-05 [3] CRAN (R 4.3.0)
## ggplot2 3.4.4 2023-10-12 [3] CRAN (R 4.3.0)
## ggpubr 0.6.0 2023-02-10 [3] CRAN (R 4.3.0)
## ggsignif 0.6.4 2022-10-13 [3] CRAN (R 4.3.0)
## ggtext 0.1.2 2022-09-16 [3] CRAN (R 4.3.0)
## glmnet 4.1-8 2023-08-22 [3] CRAN (R 4.3.0)
## glue 1.6.2 2022-02-24 [3] CRAN (R 4.3.0)
## gridExtra 2.3 2017-09-09 [3] CRAN (R 4.3.0)
## gridtext 0.1.5 2022-09-16 [3] CRAN (R 4.3.0)
## gtable 0.3.4 2023-08-21 [3] CRAN (R 4.3.0)
## highr 0.10 2022-12-22 [3] CRAN (R 4.3.0)
## Hmisc 5.1-1 2023-09-12 [3] CRAN (R 4.3.0)
## htmlTable 2.4.2 2023-10-29 [3] CRAN (R 4.3.0)
## htmltools 0.5.7 2023-11-03 [3] CRAN (R 4.3.0)
## htmlwidgets 1.6.4 2023-12-06 [3] CRAN (R 4.3.0)
## httpuv 1.6.13 2023-12-06 [3] CRAN (R 4.3.0)
## iterators 1.0.14 2022-02-05 [3] CRAN (R 4.3.0)
## jquerylib 0.1.4 2021-04-26 [3] CRAN (R 4.3.0)
## jsonlite 1.8.8 2023-12-04 [3] CRAN (R 4.3.0)
## km.ci 0.5-6 2022-04-06 [3] CRAN (R 4.3.0)
## KMsurv 0.1-5 2012-12-03 [3] CRAN (R 4.3.0)
## knitr 1.45 2023-10-30 [3] CRAN (R 4.3.0)
## labeling 0.4.3 2023-08-29 [3] CRAN (R 4.3.0)
## later 1.3.2 2023-12-06 [3] CRAN (R 4.3.0)
## lattice 0.22-5 2023-10-24 [3] CRAN (R 4.3.0)
## lifecycle 1.0.4 2023-11-07 [3] CRAN (R 4.3.0)
## magrittr 2.0.3 2022-03-30 [3] CRAN (R 4.3.0)
## markdown 1.12 2023-12-06 [3] CRAN (R 4.3.0)
## MASS 7.3-60 2023-05-04 [4] CRAN (R 4.3.2)
## Matrix 1.6-4 2023-11-30 [3] CRAN (R 4.3.0)
## MatrixModels 0.5-3 2023-11-06 [3] CRAN (R 4.3.0)
## memoise 2.0.1 2021-11-26 [3] CRAN (R 4.3.0)
## mgcv 1.9-1 2023-12-21 [3] CRAN (R 4.3.0)
## mime 0.12 2021-09-28 [3] CRAN (R 4.3.0)
## miniUI 0.1.1.1 2018-05-18 [3] CRAN (R 4.3.0)
## multcomp 1.4-25 2023-06-20 [3] CRAN (R 4.3.0)
## munsell 0.5.0 2018-06-12 [3] CRAN (R 4.3.0)
## mvtnorm 1.2-4 2023-11-27 [3] CRAN (R 4.3.0)
## nlme 3.1-164 2023-11-27 [3] CRAN (R 4.3.0)
## nnet 7.3-19 2023-05-03 [4] CRAN (R 4.3.2)
## pillar 1.9.0 2023-03-22 [3] CRAN (R 4.3.0)
## pkgbuild 1.4.3 2023-12-10 [3] CRAN (R 4.3.0)
## pkgconfig 2.0.3 2019-09-22 [3] CRAN (R 4.3.0)
## pkgload 1.3.3 2023-09-22 [3] CRAN (R 4.3.0)
## polspline 1.1.24 2023-10-26 [3] CRAN (R 4.3.0)
## profvis 0.3.8 2023-05-02 [3] CRAN (R 4.3.0)
## promises 1.2.1 2023-08-10 [3] CRAN (R 4.3.0)
## purrr 1.0.2 2023-08-10 [3] CRAN (R 4.3.0)
## quantreg 5.97 2023-08-19 [3] CRAN (R 4.3.0)
## R6 2.5.1 2021-08-19 [3] CRAN (R 4.3.0)
## Rcpp 1.0.11 2023-07-06 [3] CRAN (R 4.3.0)
## remotes 2.4.2.1 2023-07-18 [3] CRAN (R 4.3.0)
## rlang 1.1.2 2023-11-04 [3] CRAN (R 4.3.0)
## rmarkdown 2.25 2023-09-18 [3] CRAN (R 4.3.0)
## rms 6.7-1 2023-09-12 [3] CRAN (R 4.3.0)
## rpart 4.1.23 2023-12-05 [3] CRAN (R 4.3.0)
## rstatix 0.7.2 2023-02-01 [3] CRAN (R 4.3.0)
## rstudioapi 0.15.0 2023-07-07 [3] CRAN (R 4.3.0)
## sandwich 3.1-0 2023-12-11 [3] CRAN (R 4.3.0)
## sass 0.4.8 2023-12-06 [3] CRAN (R 4.3.0)
## scales 1.3.0 2023-11-28 [3] CRAN (R 4.3.0)
## sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.3.0)
## shape 1.4.6 2021-05-19 [3] CRAN (R 4.3.0)
## shiny 1.8.0 2023-11-17 [3] CRAN (R 4.3.0)
## SparseM 1.81 2021-02-18 [3] CRAN (R 4.3.0)
## stringi 1.8.3 2023-12-11 [3] CRAN (R 4.3.0)
## stringr 1.5.1 2023-11-14 [3] CRAN (R 4.3.0)
## survival 3.5-7 2023-08-14 [3] CRAN (R 4.3.0)
## survminer 0.4.9 2021-03-09 [3] CRAN (R 4.3.0)
## survMisc 0.5.6 2022-04-07 [3] CRAN (R 4.3.0)
## TH.data 1.1-2 2023-04-17 [3] CRAN (R 4.3.0)
## tibble 3.2.1 2023-03-20 [3] CRAN (R 4.3.0)
## tidyr 1.3.0 2023-01-24 [3] CRAN (R 4.3.0)
## tidyselect 1.2.0 2022-10-10 [3] CRAN (R 4.3.0)
## urlchecker 1.0.1 2021-11-30 [3] CRAN (R 4.3.0)
## usethis 2.2.2 2023-07-06 [3] CRAN (R 4.3.0)
## utf8 1.2.4 2023-10-22 [3] CRAN (R 4.3.0)
## vctrs 0.6.5 2023-12-01 [3] CRAN (R 4.3.0)
## withr 2.5.2 2023-10-30 [3] CRAN (R 4.3.0)
## xfun 0.41 2023-11-01 [3] CRAN (R 4.3.0)
## xgboost 1.7.6.1 2023-12-06 [3] CRAN (R 4.3.0)
## xml2 1.3.6 2023-12-04 [3] CRAN (R 4.3.0)
## xtable 1.8-4 2019-04-21 [3] CRAN (R 4.3.0)
## yaml 2.3.8 2023-12-11 [3] CRAN (R 4.3.0)
## zoo 1.8-12 2023-04-13 [3] CRAN (R 4.3.0)
##
## [1] /private/var/folders/mw/nv2pnn4x0rz0t3rxgfl_l8twlkqc_t/T/RtmpQJN2JH/Rinst5f0ac4740a4
## [2] /private/var/folders/mw/nv2pnn4x0rz0t3rxgfl_l8twlkqc_t/T/Rtmpmp1GHu/temp_libpath5ee535144f7f
## [3] /Users/aijiang/Library/R/x86_64/4.3/library
## [4] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────
Hastie et al. (1992, ISBN 0 534 16765-9), Therneau et al. (2000, ISBN 0-387-98784-3), Friedman et al. (2010) doi:10.18637/jss.v033.i01, Simon et al. (2011) doi:10.18637/jss.v039.i05, Chen and Guestrin (2016) <arXiv:1603.02754>, Aoki et al. (2023) doi:10.1200/JCO.23.01115