BRM on the adult dataset (binary classification)

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Overview

The UCI Adult dataset is a binary classification task: predict whether an adult’s income exceeds $50K/year from census features. We use it here to show BRM with a probabilistic learner (learner_glm_binomial()) on data with simulated blockwise missingness.

library(blockwise)
data(adult)

# Drop `native.country`: it has ~41 levels with a long tail of rare
# countries, which can leave a per-block logistic-regression model
# without an example of a level that later appears at predict time.
# Tree-based learners (`learner_rpart`, `learner_gbm`) tolerate this;
# `learner_glm_binomial` does not. Either drop the column or coarsen
# its levels before fitting.
adult <- adult[, setdiff(names(adult), "native.country")]

str(adult, list.len = 20)
#> 'data.frame':    32561 obs. of  11 variables:
#>  $ age           : int  49 44 38 38 42 20 49 37 46 36 ...
#>  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 5 6 7 5 5 5 5 6 ...
#>  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 8 13 12 15 6 12 16 2 12 12 ...
#>  $ education.num : num  12 14 NA 15 NA 9 10 7 9 NA ...
#>  $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 1 3 3 5 1 3 3 3 ...
#>  $ occupation    : Factor w/ 16 levels ""," ?"," Adm-clerical",..: 1 6 1 12 10 8 1 1 5 1 ...
#>  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 6 2 5 1 6 4 3 1 1 1 ...
#>  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 2 3 5 5 5 5 5 ...
#>  $ sex           : Factor w/ 2 levels " Female"," Male": 1 2 1 2 1 2 2 2 2 2 ...
#>  $ hours.per.week: int  40 45 32 40 50 15 35 40 40 50 ...
#>  $ salary        : int  1 1 0 1 0 0 0 0 1 1 ...
table(adult$salary)
#> 
#>     0     1 
#> 24720  7841

Inducing blockwise missingness

Mirroring the design in the paper, we jointly mask two column groups — a “demographics” block and a “work history” block — plus a small column-wise noise rate.

bike_style_groups <- list(
  c("age", "workclass", "education"),
  c("marital.status", "occupation", "relationship")
)

adult_miss <- simulate_blockwise_missing(
  adult,
  blocks       = bike_style_groups,
  prop_missing = 0.30,
  noise        = 0.02
)
round(colMeans(is.na(adult_miss)) * 100, 1)
#>            age      workclass      education  education.num marital.status 
#>           31.4           31.4           31.4            1.5           31.4 
#>     occupation   relationship           race            sex hours.per.week 
#>           31.4           31.4            0.0            0.0            0.0 
#>         salary 
#>            0.0

Train / test split

set.seed(1234)
idx <- sample(nrow(adult_miss), floor(0.75 * nrow(adult_miss)))
train <- adult_miss[idx, ]
test  <- adult_miss[-idx, ]

X_train <- train[, setdiff(names(train), "salary")]
y_train <- train$salary
X_test  <- test[,  setdiff(names(test),  "salary")]
y_test  <- test$salary

Fit BRM with a logistic-regression learner

set.seed(1234)
fit <- brm(X_train, y_train,
           learner = learner_glm_binomial())
fit
#> Blockwise Reduced Model (BRM)
#>   blocks        : 4 
#>   overlap       : TRUE 
#>   learner type  : classification 
#>   features      : 10 
#>   cols / block  : 4, 10, 7, 7

Evaluate

prob <- predict(fit, X_test)
pred_class <- as.integer(prob >= 0.5)

acc <- mean(pred_class == y_test)
cat("Accuracy:", round(acc, 3), "\n")
#> Accuracy: 0.828

# Confusion matrix
table(truth = y_test, predicted = pred_class)
#>      predicted
#> truth    0    1
#>     0 5695  460
#>     1  939 1047

For tree-based classification, swap in learner_rpart(method = "class") or learner_gbm(distribution = "bernoulli").

Citation

Srinivasan, K., Currim, F., and Ram, S. (2025). A Reduced Modeling Approach for Making Predictions With Incomplete Data Having Blockwise Missing Patterns. INFORMS Journal on Data Science.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.