Imputation Method based on xgboost

Birgit Karlhuber

2024-07-08

This vignette showcases the function xgboostImpute(), which can be used to impute missing values based on a gradient-boosted tree model fitted with xgboost::xgboost().

Data

The following example demonstrates the functionality of xgboostImpute() using a subset of the sleep dataset. The columns have been selected deliberately to include some interactions between the missing values.

library(VIM)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] # dataset with missings
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)

str(dataset)
#> 'data.frame':    62 obs. of  4 variables:
#>  $ Dream  : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#>  $ NonD   : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#>  $ BodyWgt: num  8.803 0 1.2194 -0.0834 7.8427 ...
#>  $ Span   : num  3.65 1.5 2.64 NA 4.23 ...

Imputation

To invoke the imputation method, a formula specifies which variables are to be imputed (left-hand side) and which variables are used as regressors (right-hand side). First, Dream will be imputed based on BodyWgt.

imp_xgboost <- xgboostImpute(formula = Dream ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")

The plot shows that all missing values of the variable Dream were imputed by the xgboostImpute() function.
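The same check can be made numerically rather than visually. The sketch below assumes the default behaviour of xgboostImpute(), in which a logical indicator column named Dream_imp is added to the result (the marginplot() call in the next section relies on the same column). The object name imp_check is chosen here for illustration only.

```r
library(VIM)

# Rebuild the example data so this block runs on its own
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)

imp_check <- xgboostImpute(Dream ~ BodyWgt, data = dataset)

# Missing values in Dream before and after imputation
n_missing_before <- sum(is.na(dataset$Dream))
n_missing_after  <- sum(is.na(imp_check$Dream))

# Number of cells flagged as imputed (assumes the default indicator column)
n_flagged <- sum(imp_check$Dream_imp)

c(before = n_missing_before, after = n_missing_after, flagged = n_flagged)
```

If the imputation covered every gap, the "after" count should be zero and the "flagged" count should match the "before" count.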

Diagnosing the result

As we can see in the next plot, the correlation structure of Dream and BodyWgt is preserved by the imputation method.

imp_xgboost[, c("Dream", "BodyWgt", "Dream_imp")] |> 
  marginplot(delimiter = "_imp")

Imputing multiple variables

To impute several variables at once, the formula can be specified with more than one column name on the left-hand side, separated by +.

imp_xgboost <- xgboostImpute(Dream + NonD + Span ~ BodyWgt, data = dataset)
aggr(imp_xgboost, delimiter = "_imp")
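To make the roles of the two sides of such a formula concrete, the following sketch splits it with base R tools. This is an illustration of the formula notation only, not necessarily how xgboostImpute() parses the formula internally. The names f, targets and regressors are hypothetical.

```r
# Illustration only: split a multi-response formula with base R tools
f <- Dream + NonD + Span ~ BodyWgt

targets    <- all.vars(f[[2]])  # left-hand side: variables to impute
regressors <- all.vars(f[[3]])  # right-hand side: predictors

targets
regressors
```

Each variable named on the left-hand side is treated as a separate imputation target, while the right-hand side lists the regressors shared by all of them.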

Performance of method

To assess the performance of xgboostImpute(), the iris dataset is used. First, some values are randomly set to NA.

library(reactable)

data(iris)
df <- iris
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 48
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = FALSE),
                col = sample(rep(1:4, 12)))
df[as.matrix(y)] <- NA

aggr(df)

sapply(df, function(x)sum(is.na(x)))
#> S.Length  S.Width P.Length  P.Width  Species 
#>       12       12       12       12        0
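The masking step above relies on indexing a data frame with a two-column matrix, where each row of the matrix is read as a (row, column) pair. The sketch below repeats that step on a fresh copy of iris and counts the missing values per observation. The name na_per_row is introduced here for illustration.

```r
df <- iris
set.seed(1)
y <- data.frame(row = sample(nrow(iris), size = 48, replace = FALSE),
                col = sample(rep(1:4, 12)))

# A two-column matrix is read as (row, column) pairs, one cell per pair
df[as.matrix(y)] <- NA

# The sampled rows are distinct, so no row loses more than one value
na_per_row <- rowSums(is.na(df))
table(na_per_row)
```

Because the rows are drawn without replacement and the column indices are a shuffled repetition of 1:4, the 48 missing cells are spread evenly over the four numeric columns and over 48 different observations.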

We can see that there are missing values in all four numeric variables, 12 in each. Because the rows are sampled without replacement, each affected observation is missing exactly one value. In the next step we impute all four variables at once, with Species serving as the regressor.

imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_xgboost, delimiter = "_imp")

The plot indicates that all missing values have been imputed by the xgboostImpute() algorithm. The following table displays the rounded first five results of the imputation for all variables.
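Imputation quality can also be summarized with a single error measure per variable. The following sketch defines a hypothetical helper, imputation_rmse(), that computes the root mean squared error over the masked cells only. So that the block runs without fitting a model, it is demonstrated with a column-mean imputation as a stand-in; in practice the imp_xgboost result would be passed in place of baseline. All names introduced here (imputation_rmse, truth, masked, baseline, num_cols) are assumptions for illustration.

```r
# Hypothetical helper: RMSE per column, computed only on the masked cells
imputation_rmse <- function(truth, masked, imputed, cols) {
  sapply(cols, function(v) {
    idx <- is.na(masked[[v]])  # cells that were set to NA
    sqrt(mean((imputed[[v]][idx] - truth[[v]][idx])^2))
  })
}

# Recreate the masked data from the example above
truth <- iris
colnames(truth) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
masked <- truth
set.seed(1)
y <- data.frame(row = sample(nrow(iris), size = 48, replace = FALSE),
                col = sample(rep(1:4, 12)))
masked[as.matrix(y)] <- NA

# Stand-in imputation (column means), used here only so the block runs
# without fitting a model; replace `baseline` with imp_xgboost in practice
num_cols <- c("S.Length", "S.Width", "P.Length", "P.Width")
baseline <- masked
for (v in num_cols) {
  baseline[[v]][is.na(baseline[[v]])] <- mean(masked[[v]], na.rm = TRUE)
}

rmse_baseline <- imputation_rmse(truth, masked, baseline, num_cols)
rmse_baseline
```

A mean-imputation baseline of this kind gives a reference point: an imputation that uses Species as a regressor would be expected to achieve a lower error on variables that differ strongly between species, although this should be checked rather than assumed.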
