To deal with missing data, multiple imputation is the gold standard (Schafer & Graham, 2002). With GLMs, the models fitted on each imputed dataset can then be pooled. For non-parametric methods like prediction rule ensembles, such pooling is more difficult. Little research has been performed on how best to deal with missing data when fitting prediction rule ensembles, but there are currently three options:
Listwise deletion. Although the default in pre(), this is certainly the least favorable option.
Single imputation: Perform only a single imputation and fit a prediction rule ensemble on this single dataset. This is easy to implement and likely better than listwise deletion, but likely inferior to multiple imputation.
Multiple imputation approach by aggregating ensembles: Create multiple imputed datasets; fit a separate prediction rule ensemble to each of the imputed datasets; aggregate the ensembles into a single final ensemble. In terms of predictive accuracy, this approach should work well. It will, however, yield more complex ensembles than the former two approaches and the approach described next.
Combining mean imputation with the Missing-In-Attributes approach. According to Josse et al. (2019), mean imputation and the Missing-In-Attributes approach are not so bad from a prediction perspective. Furthermore, they are computationally inexpensive.
Below, we provide examples for the first three approaches described above. The approach combining mean imputation with Missing-In-Attributes will be implemented in future versions of package pre; a rough sketch of the idea is given below.
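Although this combined approach is not yet available in pre, the general idea can be approximated by hand. The following is only an illustrative sketch, not the implementation planned for pre: we mean-impute the incomplete predictors and add missingness-indicator variables (the names airq.mia, Ozone_mis and Solar.R_mis are chosen here for illustration only), so that rules can still split on whether a value was observed. It uses the airquality data that is also used in the examples below.
library("pre")
airq.mia <- airquality
## Add missingness indicators before imputing (hypothetical variable names):
airq.mia$Ozone_mis <- factor(is.na(airq.mia$Ozone))
airq.mia$Solar.R_mis <- factor(is.na(airq.mia$Solar.R))
## Replace missing values by the mean of the observed values:
airq.mia$Ozone[is.na(airq.mia$Ozone)] <- mean(airq.mia$Ozone, na.rm = TRUE)
airq.mia$Solar.R[is.na(airq.mia$Solar.R)] <- mean(airq.mia$Solar.R, na.rm = TRUE)
set.seed(43)
airq.ens.mia <- pre(Wind ~ ., data = airq.mia)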
For the examples, we will be predicting wind speed using the airquality dataset (we focus on predicting the Wind variable, because it does not have missing values, while the variables Ozone and Solar.R do):
head(airquality)
nrow(airquality)
library("mice")
md.pattern(airquality, rotate.names = TRUE)
Listwise deletion is the default of function pre():
library("pre")
set.seed(43)
airq.ens <- pre(Wind ~ ., data = airquality)
## Warning in pre(Wind ~ ., data = airquality): Some observations have missing values and have been removed from the data. New sample size is 111.
airq.ens
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.5076221
## number of terms = 2
## mean cv error (se) = 9.955139 (1.553628)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 9.034190233 1
## rule8 1.743533723 Ozone <= 45
## Ozone -0.006180118 6.75 <= Ozone <= 119
With listwise deletion, only 111 out of 153 observations are retained. We obtain a rather sparse ensemble.
Here we apply a single imputation by filling in the mean:
imp0 <- airquality
imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm = TRUE)
imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm = TRUE)
set.seed(43)
airq.ens.imp0 <- pre(Wind ~ ., data = imp0)
airq.ens.imp0
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.2751573
## number of terms = 5
## mean cv error (se) = 9.757455 (0.7622478)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 10.48717592 1
## rule2 1.18133248 Ozone <= 45
## rule48 -0.56453456 Temp > 72 & Solar.R <= 258
## rule28 -0.51357673 Temp > 73 & Ozone > 22
## Ozone -0.01910646 7 <= Ozone <= 115.6
## rule40 0.01440472 Temp <= 81
We obtain a larger number of rules. The cross-validated mean squared errors of this model and the former cannot really be compared, because they are computed over different sets of observations.
We perform multiple imputation by chained equations, using the predictive mean matching method. We generate 5 imputed datasets:
set.seed(42)
imp <- mice(airquality, m = 5)
We create a list with the imputed datasets:
imp1 <- complete(imp, action = "all", include = FALSE)
We load the pre library:
library("pre")
We create a custom function that fits PREs to several datasets contained in a list:
pre.agg <- function(datasets, ...) {
  result <- list()
  for (i in 1:length(datasets)) {
    result[[i]] <- pre(datasets[[i]], ...)
  }
  result
}
We apply the new function:
set.seed(43)
airq.agg <- pre.agg(imp1, formula = Wind ~ .)
Note that we can use the ellipsis (...) to pass arguments to pre() (see ?pre for an overview of the arguments that can be specified).
We now define print, summary, predict and coef methods to extract results from the fitted ensemble. Again, we can use the ellipsis (...) to pass arguments to the print, summary, predict and coef methods of function pre (see, e.g., ?pre:::print.pre for more info):
print.agg <- function(object, ...) {
  result <- list()
  sink("NULL")
  for (i in 1:length(object)) {
    result[[i]] <- print(object[[i]], ...)
  }
  sink()
  print(result)
}
print.agg(airq.agg) ## results suppressed for space considerations
summary.agg <- function(object, ...) {
  for (i in 1:length(object)) summary(object[[i]], ...)
}
summary.agg(airq.agg) ## results suppressed for space considerations
For averaging over predictions, there is only one sensible option for continuous outcomes. For non-continuous outcomes, we could average either over the linear predictor or over the predicted values on the scale of the response. It is not obvious which would be more appropriate, but the resulting predicted values will likely be highly correlated (a sketch for a binary outcome is given after the example below).
predict.agg <- function(object, newdata, ...) {
  rowMeans(sapply(object, predict, newdata = newdata, ...))
}
agg_preds <- predict.agg(airq.agg, newdata = airquality[1:4, ])
agg_preds
## 1 2 3 4
## 10.42757 10.59272 10.93324 11.00302
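As noted above, for a non-continuous outcome we could average either over the linear predictor or over predictions on the scale of the response. A minimal sketch for a binary outcome could look as follows; the names predict.agg.bin, ens.list and newdat are hypothetical, and we assume the type argument of pre's predict method is used to switch between the two scales (the sketch is not run here, since our example outcome is continuous):
predict.agg.bin <- function(object, newdata, type = "response", ...) {
  ## Assumption: type = "response" averages predicted probabilities,
  ## type = "link" averages the linear predictor.
  rowMeans(sapply(object, predict, newdata = newdata, type = type, ...))
}
## e.g.: predict.agg.bin(ens.list, newdata = newdat, type = "link")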
Finally, the coef method should return the averaged / aggregated final PRE. That is, it returns:
One averaged intercept;
All rules and linear terms, with their coefficients scaled by the number of datasets;
In the presence of identical rules and linear terms, it aggregates those rules and their coefficients into a single rule / term, adding together the scaled coefficients.
Note that linear terms that do not have the same winsorizing points will not be aggregated. Note also that the labels of rules and variables may overlap between different datasets (e.g., the label rule 12 may appear multiple times in the aggregated ensemble, but each rule 12 may have different conditions).
coef.agg <- function(object, ...) {
  coefs <- coef(object[[1]], ...)
  coefs <- coefs[coefs$coefficient != 0, ]
  for (i in 2:length(object)) {
    coefs_tmp <- coef(object[[i]], ...)
    coefs_tmp <- coefs_tmp[coefs_tmp$coefficient != 0, ]
    ## Add intercepts:
    coefs[coefs$rule == "(Intercept)", "coefficient"] <-
      coefs[coefs$rule == "(Intercept)", "coefficient"] +
      coefs_tmp[coefs_tmp$rule == "(Intercept)", "coefficient"]
    ## Append other terms to coefs:
    coefs <- rbind(coefs, coefs_tmp[coefs_tmp$rule != "(Intercept)", ])
  }
  ## Divide coefficients by the number of datasets:
  coefs$coefficient <- coefs$coefficient / length(object)
  ## Identify identical rules:
  duplicates <- which(duplicated(coefs$description))
  for (i in duplicates) {
    first_match <- which(coefs$description == coefs$description[i])[1]
    ## Add the coefficients:
    coefs$coefficient[first_match] <-
      coefs$coefficient[first_match] + coefs$coefficient[i]
  }
  ## Remove duplicates:
  coefs <- coefs[-duplicates, ]
  ## Return results:
  coefs
}
coef.agg(airq.agg)
## rule coefficient description
## 65 (Intercept) 9.4335719839 1
## 29 rule32 -0.1428273137 Temp > 73 & Ozone > 23
## 3 rule3 0.9687460334 Ozone <= 45
## 7 rule7 -0.0941839173 Ozone > 14 & Solar.R <= 238
## 66 Ozone -0.0057739123 7 <= Ozone <= 115.6
## 6 rule6 0.0748353530 Ozone <= 59
## 25 rule26 -0.0677719742 Temp > 63 & Ozone > 14
## 17 rule17 -0.0380923160 Temp > 75 & Ozone > 47
## 1 rule1 0.0907492728 Ozone <= 21
## 71 rule7 0.1346917292 Ozone <= 37
## 291 rule32 0.2401553457 Ozone <= 47
## 58 rule62 -0.2775102443 Ozone > 45 & Solar.R <= 275
## 36 rule39 -0.0469380324 Ozone > 14 & Solar.R <= 201
## 40 rule43 -0.0352134315 Temp > 72 & Solar.R <= 255
## 51 rule5 0.0318214526 Ozone <= 52
## 10 rule10 0.0700563669 Temp <= 73
## 19 rule22 0.0669623309 Ozone <= 63
## 14 rule15 0.0010440734 Ozone <= 45 & Solar.R > 212
## 55 Ozone -0.0003158451 7 <= Ozone <= 118.8
We have obtained a final ensemble of 18 terms (excluding the intercept).
We compare the performance of the three approaches using 10-fold cross-validation. We evaluate predictive accuracy and the number of selected terms. We evaluate accuracy only for observations that have no missing values.
k <- 10
set.seed(43)
fold_ids <- sample(1:k, size = nrow(airquality), replace = TRUE)
observed <- c()
for (i in 1:k) {
  ## Separate training and test data
  test <- airquality[fold_ids == i, ]
  test <- test[!is.na(test$Ozone), ]
  test <- test[!is.na(test$Solar.R), ]
  observed <- c(observed, test$Wind)
}
preds <- data.frame(observed)
preds$LW <- preds$SI <- preds$MI <- preds$observed
nterms <- matrix(nrow = k, ncol = 3)
colnames(nterms) <- c("LW", "SI", "MI")
row <- 1
for (i in 1:k) {
  if (i > 1) row <- row + nrow(test)
  ## Separate training and test data
  train <- airquality[fold_ids != i, ]
  test <- airquality[fold_ids == i, ]
  test <- test[!is.na(test$Ozone), ]
  test <- test[!is.na(test$Solar.R), ]

  ## Fit and evaluate listwise deletion
  premod <- pre(Wind ~ ., data = train)
  preds$LW[row:(row + nrow(test) - 1)] <- predict(premod, newdata = test)
  tmp <- print(premod)
  nterms[i, "LW"] <- nrow(tmp) - 1

  ## Fit and evaluate single imputation
  imp0 <- train
  imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm = TRUE)
  imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm = TRUE)
  premod.imp0 <- pre(Wind ~ ., data = imp0)
  imp1 <- test
  imp1$Solar.R[is.na(imp1$Solar.R)] <- mean(imp0$Solar.R, na.rm = TRUE)
  imp1$Ozone[is.na(imp1$Ozone)] <- mean(imp0$Ozone, na.rm = TRUE)
  preds$SI[row:(row + nrow(test) - 1)] <- predict(premod.imp0, newdata = imp1)
  tmp <- print(premod.imp0)
  nterms[i, "SI"] <- nrow(tmp) - 1

  ## Perform multiple imputation
  imp <- mice(train, m = 5)
  imp1 <- complete(imp, action = "all", include = FALSE)
  airq.agg <- pre.agg(imp1, formula = Wind ~ .)
  preds$MI[row:(row + nrow(test) - 1)] <- predict.agg(airq.agg, newdata = test)
  nterms[i, "MI"] <- nrow(coef.agg(airq.agg)) - 1
}
sapply(preds, function(x) mean((preds$observed - x)^2)) ## MSE
## observed MI SI LW
## 0.000000 9.729284 9.727134 10.037726
sapply(preds, function(x) sd((preds$observed - x)^2)/sqrt(nrow(preds))) ## SE of MSE
## observed MI SI LW
## 0.000000 1.493207 1.484849 1.531708
var(preds$observed) ## benchmark: Predict mean for all
## [1] 12.65732
plot(preds, main = "Observed against predicted", cex.main = .8)
boxplot(nterms, main = "Number of selected terms per missing-data method",
cex.main = .8)
Interestingly, we see that all three methods yield similar predictions and accuracy. Multiple imputation did not outperform single imputation on this dataset, which is somewhat in line with the findings of Josse et al. (2019). Single imputation performed best in terms of predictive accuracy, at the cost of only a relatively small increase in complexity. Note, however, that this evaluation on a single dataset should not be taken too seriously. On other datasets, multiple imputation may well outperform single imputation; this would be in line with more general expectations and results on missing-data treatment. However, multiple imputation will always come with increased computational burden and increased complexity of the final ensemble.
Josse, J., Prost, N., Scornet, E., & Varoquaux, G. (2019). On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931. https://arxiv.org/abs/1902.06931
Miles, A. (2016). Obtaining predictions from models fit to multiply imputed data. Sociological Methods & Research, 45(1), 175-185. https://doi.org/10.1177/0049124115610345
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147. https://doi.org/10.1037/1082-989X.7.2.147