The UCI Adult dataset is a binary classification
task: predict whether an adult’s income exceeds $50K/year from census
features. We use it here to show BRM with a probabilistic learner
(learner_glm_binomial()) on data with simulated blockwise
missingness.
library(blockwise)
data(adult)
# Drop `native.country`: it has ~41 levels with a long tail of rare
# countries, which can leave a per-block logistic-regression model
# without an example of a level that later appears at predict time.
# Tree-based learners (`learner_rpart`, `learner_gbm`) tolerate this;
# `learner_glm_binomial` does not. Either drop the column or coarsen
# its levels before fitting.
adult <- adult[, setdiff(names(adult), "native.country")]
str(adult, list.len = 20)
#> 'data.frame': 32561 obs. of 11 variables:
#> $ age : int 49 44 38 38 42 20 49 37 46 36 ...
#> $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 6 7 5 5 5 5 6 ...
#> $ education : Factor w/ 16 levels " 10th"," 11th",..: 8 13 12 15 6 12 16 2 12 12 ...
#> $ education.num : num 12 14 NA 15 NA 9 10 7 9 NA ...
#> $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 1 3 3 5 1 3 3 3 ...
#> $ occupation : Factor w/ 16 levels ""," ?"," Adm-clerical",..: 1 6 1 12 10 8 1 1 5 1 ...
#> $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 6 2 5 1 6 4 3 1 1 1 ...
#> $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 2 3 5 5 5 5 5 ...
#> $ sex : Factor w/ 2 levels " Female"," Male": 1 2 1 2 1 2 2 2 2 2 ...
#> $ hours.per.week: int 40 45 32 40 50 15 35 40 40 50 ...
#> $ salary : int 1 1 0 1 0 0 0 0 1 1 ...
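The comment about `native.country` above describes a general property of logistic regression in R: `predict()` on a `glm` fit stops with an error when the new data contain a factor level the model never saw. The sketch below reproduces that on a toy data frame and shows one way to coarsen rare levels. It uses base R only; `toy_train`, `toy_fit`, and `lump_rare()` are hypothetical names for illustration and are not part of the blockwise package.

```r
# Toy data: the factor has levels "A" and "B" at fit time.
toy_train <- data.frame(
  y = c(0, 1, 0, 1, 1, 0),
  country = factor(c("A", "B", "A", "B", "A", "B"))
)
toy_fit <- glm(y ~ country, data = toy_train, family = binomial())

# A row with level "C", which was absent from the training data.
new_row <- data.frame(country = factor("C"))

# Capture the error message instead of stopping the script.
res <- tryCatch(
  predict(toy_fit, newdata = new_row, type = "response"),
  error = function(e) conditionMessage(e)
)
res

# Hypothetical helper: merge levels seen fewer than `min_n` times into "Other".
lump_rare <- function(f, min_n) {
  f <- as.character(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  f[f %in% rare] <- "Other"
  factor(f)
}

country <- factor(c("A", "A", "A", "B", "B", "B", "C"))
lumped <- lump_rare(country, min_n = 2)
levels(lumped)
```

In practice the set of retained levels would be decided once on the training data and reused at predict time, so that both sides share the same factor levels.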
table(adult$salary)
#>
#> 0 1
#> 24720 7841
Mirroring the design in the paper, we jointly mask two column groups, a “demographics” block and a “work history” block, and add a small column-wise noise rate.
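To make "jointly mask" concrete, here is a minimal base-R sketch of blockwise masking on toy data. It is an illustration of the idea only, not the package's implementation: `mask_blocks()`, `toy`, and the column names are hypothetical, and the column-wise noise step is left out.

```r
set.seed(1)
n <- 1000
toy <- data.frame(
  a1 = rnorm(n), a2 = rnorm(n),  # block A
  b1 = rnorm(n), b2 = rnorm(n),  # block B
  z  = rnorm(n)                  # never masked
)
blocks <- list(A = c("a1", "a2"), B = c("b1", "b2"))

# Hypothetical helper: for each block, draw one set of rows and blank
# every column of the block in those rows, so that columns belonging to
# the same block are always missing together.
mask_blocks <- function(df, blocks, prop_missing) {
  for (cols in blocks) {
    rows <- sample(nrow(df), size = round(prop_missing * nrow(df)))
    df[rows, cols] <- NA
  }
  df
}

toy_miss <- mask_blocks(toy, blocks, prop_missing = 0.30)

# Share of missing values per column.
colMeans(is.na(toy_miss))

# Distinct row-wise missingness patterns (1 = missing), one digit per column.
pattern <- apply(is.na(toy_miss), 1,
                 function(r) paste(as.integer(r), collapse = ""))
table(pattern)
```

Because whole blocks are blanked at once, only a handful of row patterns can occur (none, block A, block B, or both), which is the structure a blockwise approach exploits.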
bike_style_groups <- list(
c("age", "workclass", "education"),
c("marital.status", "occupation", "relationship")
)
adult_miss <- simulate_blockwise_missing(
adult,
blocks = bike_style_groups,
prop_missing = 0.30,
noise = 0.02
)
round(colMeans(is.na(adult_miss)) * 100, 1)
#> age workclass education education.num marital.status
#> 31.4 31.4 31.4 1.5 31.4
#> occupation relationship race sex hours.per.week
#> 31.4 31.4 0.0 0.0 0.0
#> salary
#> 0.0
prob <- predict(fit, X_test)
pred_class <- as.integer(prob >= 0.5)
acc <- mean(pred_class == y_test)
cat("Accuracy:", round(acc, 3), "\n")
#> Accuracy: 0.828
# Confusion matrix
table(truth = y_test, predicted = pred_class)
#> predicted
#> truth 0 1
#> 0 5695 460
#> 1 939 1047
For tree-based classification, swap in learner_rpart(method = "class") or learner_gbm(distribution = "bernoulli").
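Returning to the confusion matrix reported above: since the two classes are imbalanced, accuracy alone can hide how well the positive class is recovered. The sketch below recomputes accuracy and derives per-class rates from the four reported counts. The counts are copied from the output above; the object names are illustrative.

```r
# Counts copied from the confusion matrix (rows = truth, columns = predicted).
# matrix() fills column by column: first the "predicted 0" column, then "predicted 1".
cm <- matrix(
  c(5695, 939, 460, 1047),
  nrow = 2,
  dimnames = list(truth = c("0", "1"), predicted = c("0", "1"))
)

accuracy    <- sum(diag(cm)) / sum(cm)          # share of correct predictions
recall      <- cm["1", "1"] / sum(cm["1", ])    # true positives / actual positives
precision   <- cm["1", "1"] / sum(cm[, "1"])    # true positives / predicted positives
specificity <- cm["0", "0"] / sum(cm["0", ])    # true negatives / actual negatives
baseline    <- max(rowSums(cm)) / sum(cm)       # always predicting the majority class

round(c(accuracy = accuracy, recall = recall, precision = precision,
        specificity = specificity, baseline = baseline), 3)
```

On these counts, accuracy matches the 0.828 printed above, while recall for the positive class is roughly 0.53 and the majority-class baseline is roughly 0.76.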
Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.