
SBMTrees: Introduction and Usage

Introduction

The R package SBMTrees (Sequential Imputation with Bayesian Trees Mixed-Effects Models) provides a Bayesian non-parametric framework for predicting outcomes and imputing missing covariates and outcomes in longitudinal data under the Missing at Random (MAR) assumption. The package uses centralized Dirichlet Process (CDP) Normal Mixture priors to model non-normal random effects and errors, which makes it robust to model misspecification and able to capture complex relationships in longitudinal data.

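This vignette introduces the key functionalities of the package: predicting longitudinal outcomes with BMTrees_prediction and multiply imputing missing covariates and outcomes with sequential_imputation.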

Install and load the package

library(SBMTrees)
library(mitml)
#> *** This is beta software. Please report any bugs!
#> *** See the NEWS file for recent changes.
library(lme4)
#> Warning: package 'lme4' was built under R version 4.5.2
#> Loading required package: Matrix

Prediction

The BMTrees_prediction function is used to predict longitudinal outcomes based on Bayesian Mixed-Effects Models. Below is an example of how to generate data, split it into training and testing datasets, and run predictions.

# Simulate data
data <- simulation_prediction_conti(
   train_prop = 0.5,
   n_subject = 20,
   n_obs_per_sub = 5,
   nonlinear = TRUE,
   residual = "normal",
   randeff = "skewed_MVN",
   seed = 123)

We then fit the BMTrees prediction model with 1 burn-in iteration and 1 posterior sample. These small values only keep the example fast to run; in practice, increase both the number of burn-in iterations and the number of posterior iterations to 4000.

# Fit the predictive model
model <- BMTrees_prediction(
   X_train = data$X_train,
   Y_train = data$Y_train,
   Z_train = data$Z_train,
   subject_id_train = data$subject_id_train,
   X_test = data$X_test,
   Z_test = data$Z_test,
   subject_id_test = data$subject_id_test,
   model = "BMTrees",
   binary = FALSE,
   nburn = 1L, npost = 1L, skip = 1L, verbose = FALSE, seed = 1234
 )
#> 2123

# Posterior expectation for the testing dataset
posterior_predictions <- model$post_predictive_y_test
head(colMeans(posterior_predictions))
#> [1] -3.06391285  8.10941305  1.94976209 -5.67648045  0.07866686 -1.43042918

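To evaluate the model's predictive performance, we compute the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). We also calculate 95% posterior predictive intervals to check coverage, and visualize the results with a scatterplot of true versus predicted values. Note that with a single posterior sample, as in this quick demonstration, the 2.5% and 97.5% quantiles coincide, so the reported coverage of 0% is not meaningful.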

# Posterior mean prediction for each test observation
point_predictions <- colMeans(posterior_predictions)

# Compute MAE
mae <- mean(abs(point_predictions - data$Y_test))
cat("Mean Absolute Error (MAE):", mae, "\n")
#> Mean Absolute Error (MAE): 5.485407

# Compute MSE
mse <- mean((point_predictions - data$Y_test)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
#> Mean Squared Error (MSE): 59.17032

# Compute 95% posterior predictive intervals
lower_bounds <- apply(posterior_predictions, 2, quantile, probs = 0.025)
upper_bounds <- apply(posterior_predictions, 2, quantile, probs = 0.975)

# Check if true values fall within the intervals
coverage <- mean(data$Y_test >= lower_bounds & data$Y_test <= upper_bounds)
cat("95% Posterior Predictive Interval Coverage:", coverage * 100, "%\n")
#> 95% Posterior Predictive Interval Coverage: 0 %



plot(data$Y_test, point_predictions, 
     xlab = "True Values", 
     ylab = "Predicted Values", 
     main = "True vs Predicted Values")
abline(0, 1, col = "red") # Add a 45-degree reference line
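
The coverage calculation above depends heavily on how many posterior draws are available. The following is a minimal base-R sketch on simulated draws, not on SBMTrees output: the helper interval_coverage and all simulated quantities are hypothetical and only illustrate the arithmetic for a draws-by-observations matrix.

```r
set.seed(1)
n_obs <- 200
mu <- rnorm(n_obs)            # hypothetical true means
y_test <- mu + rnorm(n_obs)   # hypothetical observed test outcomes

# Hypothetical helper: share of outcomes inside the central interval,
# given a (draws x observations) matrix of posterior predictive draws
interval_coverage <- function(draws, y, level = 0.95) {
  alpha <- (1 - level) / 2
  lower <- apply(draws, 2, quantile, probs = alpha)
  upper <- apply(draws, 2, quantile, probs = 1 - alpha)
  mean(y >= lower & y <= upper)
}

# A single draw per observation: lower and upper bounds are identical
one_draw <- matrix(rnorm(n_obs, mean = mu), nrow = 1)

# Many draws per observation: column j holds draws centred on mu[j]
n_draws <- 1000
many_draws <- matrix(rnorm(n_draws * n_obs, mean = rep(mu, each = n_draws)),
                     nrow = n_draws)

coverage_one <- interval_coverage(one_draw, y_test)
coverage_many <- interval_coverage(many_draws, y_test)
cat("Coverage with 1 draw:", coverage_one, "\n")
cat("Coverage with", n_draws, "draws:", coverage_many, "\n")
```

With one draw the interval has zero width, so continuous outcomes essentially never fall inside it; with many draws from a well-calibrated predictive distribution the coverage should approach the nominal level.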

Multiple Imputation

The sequential_imputation function imputes missing covariates and outcomes in longitudinal data. Below is an example of how to generate longitudinal data with values missing at random (MAR) and run the imputation.

# Simulate data with missing values
data <- simulation_imputation(NNY = TRUE, NNX = TRUE, 
                                  n_subject = 20, seed = 123)
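
To make the MAR assumption concrete, here is a small base-R sketch that is independent of simulation_imputation (whose internal mechanism is not described in this vignette). The variable names and the logistic missingness model are illustrative assumptions: the probability that y is missing depends only on a fully observed covariate, never on y itself.

```r
set.seed(42)
n <- 1000
x_obs <- rnorm(n)                 # fully observed covariate
y <- 2 + 1.5 * x_obs + rnorm(n)   # outcome that will be partly deleted

# MAR mechanism: missingness probability is a function of x_obs only
p_miss <- plogis(-1 + x_obs)
y_mar <- ifelse(runif(n) < p_miss, NA, y)

# Missingness rate by sign of the observed covariate
miss_by_half <- tapply(is.na(y_mar), x_obs > 0, mean)
print(miss_by_half)
```

Because the missingness can be explained by observed data, an imputation model that conditions on x_obs can, in principle, recover the distribution of the missing y values.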

We then run sequential imputation with BMTrees, using 2 posterior iterations and keeping one draw per posterior iteration, so that 2 multiply-imputed datasets are generated. These small values (with no burn-in) only keep the example fast to run; in practice, increase both the number of burn-in iterations and the number of posterior iterations to 4000.

imputed_model <- sequential_imputation(X = data$data_M[,3:14], Y = data$data_M$Y, Z = data$Z,
   subject_id = data$data_M$subject_id, type = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1),
   outcome_model = "BMLM", binary_outcome = FALSE, model = "BMTrees", nburn = 0,
   npost = 2, skip = 1, verbose = FALSE, seed = 123)
#> reordering: new covariates order is intercept X_4 X_5 X_6 X_10 X_11 X_12 X_1 X_2 X_3 X_8 X_9 X_7
#> Start to initialize imputed missing data by LOCF and NOCB.
#> Completed.
#> Start to impute using Longitudinal Sequential Imputation with:
#> BMTrees
#> Outcome variable has missing values
#> Start initializing models
#> 
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 2123
#> 
#> Finish imputation with 2 imputed sets

# Extract imputed data
imputed_data <- imputed_model$imputed_data
dim(imputed_data) # Dimensions: posterior samples x observations x variables
#> [1]   2 100  13
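
An array laid out as imputations x observations x variables can be summarized cell by cell with apply. The sketch below uses a small mock array with the same layout rather than real SBMTrees output; the dimension sizes and variable names are assumptions for illustration.

```r
set.seed(1)
# Mock array: 2 imputations x 5 observations x 3 variables
mock <- array(rnorm(2 * 5 * 3), dim = c(2, 5, 3),
              dimnames = list(NULL, NULL, c("X_1", "X_2", "Y")))

# Average over imputations (dimension 1) for every observation/variable cell
cell_mean <- apply(mock, c(2, 3), mean)

# Between-imputation standard deviation for every cell;
# cells that were observed (not imputed) would show a spread of zero
cell_sd <- apply(mock, c(2, 3), sd)

print(dim(cell_mean))
print(head(cell_sd, 2))
```

Looking at the between-imputation spread is a quick way to see which cells were imputed and how uncertain those imputations are.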

To evaluate the imputation, we fit a linear mixed-effects model to each of the multiply-imputed datasets and pool the estimates using Rubin's rules.

# Build a list of completed datasets in the structure expected by mitml
MI_data <- list()
for (i in seq_len(dim(imputed_data)[1])) {
  MI_data[[i]] <- cbind(as.data.frame(imputed_data[i, , ]), data$Z, data$data_M$subject_id)
  colnames(MI_data[[i]]) <- c(colnames(data$data_M[, 3:14]), "Y", "Z1", "Z2", "subject_id")
}
MI_data <- as.mitml.list(MI_data)
# Fit the model on each imputed dataset
lmm_results <- with(MI_data, lmer(Y ~ X_1 + X_2 + X_3 + X_4 + X_5 + X_6
                                  + X_7 + X_8 + X_9 + X_10 + X_11 + X_12
                                  + (0 + Z1 + Z2 | subject_id)))

# Pool fixed effects using Rubin's Rules
pooled_results <- testEstimates(lmm_results)

# Print pooled results
print(pooled_results)
#> 
#> Call:
#> 
#> testEstimates(model = lmm_results)
#> 
#> Final parameter estimates and inferences obtained from 2 imputed data sets.
#> 
#>              Estimate Std.Error   t.value        df   P(>|t|)       RIV       FMI 
#> (Intercept)    11.318     0.811    13.954 2.847e+02     0.000     0.063     0.066 
#> X_1             1.168     0.414     2.818 1.103e+02     0.006     0.105     0.111 
#> X_2             1.351     0.345     3.921 1.061e+06     0.000     0.001     0.001 
#> X_3             0.936     0.394     2.378 4.293e+02     0.018     0.051     0.053 
#> X_4            -0.425     0.822    -0.517 5.544e+00     0.625     0.738     0.559 
#> X_5             0.346     0.661     0.523 2.248e+08     0.601     0.000     0.000 
#> X_6             0.928     0.643     1.444 1.587e+06     0.149     0.001     0.001 
#> X_7             0.806     0.130     6.217 1.682e+11     0.000     0.000     0.000 
#> X_8             0.931     0.103     9.042 1.933e+02     0.000     0.077     0.081 
#> X_9             0.947     0.096     9.883 6.142e+01     0.000     0.146     0.155 
#> X_10           -1.735     0.829    -2.093 1.556e+01     0.053     0.340     0.334 
#> X_11            1.762     0.872     2.021 5.440e+00     0.095     0.751     0.564 
#> X_12            1.790     0.707     2.530 1.759e+02     0.012     0.082     0.086 
#> 
#> Unadjusted hypothesis test as appropriate in larger samples.
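
For readers who want to see what the pooling step does, the following base-R sketch applies the standard large-sample form of Rubin's rules to a single coefficient. The function pool_rubin is a hypothetical helper and the input numbers are made up for illustration; mitml's internal implementation may differ in detail.

```r
# Pool one coefficient across m imputed datasets with Rubin's rules.
# `est` holds the m point estimates, `se` the m standard errors.
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                        # pooled point estimate
  u_bar <- mean(se^2)                       # within-imputation variance
  b <- var(est)                             # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b      # total variance
  riv <- (1 + 1 / m) * b / u_bar            # relative increase in variance
  df <- (m - 1) * (1 + 1 / riv)^2           # large-sample degrees of freedom
  fmi <- (riv + 2 / (df + 3)) / (riv + 1)   # fraction of missing information
  c(estimate = q_bar, std_error = sqrt(total_var),
    df = df, riv = riv, fmi = fmi)
}

# Illustrative inputs only (not taken from the output above)
pooled <- pool_rubin(est = c(1.0, 1.2), se = c(0.4, 0.4))
print(round(pooled, 3))
```

The RIV and FMI columns in the table above can be read the same way: values near zero mean the imputations barely disagree for that coefficient, while large values (for example X_4 and X_11) indicate that missing data contribute substantially to its uncertainty.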

Summary

The SBMTrees package provides flexible tools for handling missing values and making predictions in longitudinal data. By leveraging Bayesian non-parametric methods, it effectively addresses challenges associated with model misspecification, non-normal random effects, and non-normal errors.

For further details, please refer to the package documentation and the paper: Nonparametric Bayesian Additive Regression Trees for Predicting and Handling Missing Covariates and Outcomes in Longitudinal Data.

License

This vignette is part of the SBMTrees R package and is distributed under the terms of the GNU General Public License (GPL-2).
