---
title: "BRM on the adult dataset (binary classification)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{BRM on the adult dataset (binary classification)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
set.seed(1234)
```

## Overview

The UCI **Adult** dataset is a binary classification task: predict whether
an adult's income exceeds \$50K/year from census features. We use it here to
show BRM with a probabilistic learner (`learner_glm_binomial()`) on data
with simulated blockwise missingness.

```{r load}
library(blockwise)
data(adult)

# Drop `native.country`: it has ~41 levels with a long tail of rare
# countries, which can leave a per-block logistic-regression model
# without an example of a level that later appears at predict time.
# Tree-based learners (`learner_rpart`, `learner_gbm`) tolerate this;
# `learner_glm_binomial` does not. Either drop the column or coarsen
# its levels before fitting.
adult <- adult[, setdiff(names(adult), "native.country")]

str(adult, list.len = 20)
table(adult$salary)
```

## Inducing blockwise missingness

Mirroring the design in the paper, we jointly mask two column groups — a
"demographics" block and a "work history" block — plus a small column-wise
noise rate.

```{r mask}
bike_style_groups <- list(
  c("age", "workclass", "education"),
  c("marital.status", "occupation", "relationship")
)

adult_miss <- simulate_blockwise_missing(
  adult,
  blocks       = bike_style_groups,
  prop_missing = 0.30,
  noise        = 0.02
)
round(colMeans(is.na(adult_miss)) * 100, 1)
```

## Train / test split

```{r split}
set.seed(1234)
idx <- sample(nrow(adult_miss), floor(0.75 * nrow(adult_miss)))
train <- adult_miss[idx, ]
test  <- adult_miss[-idx, ]

X_train <- train[, setdiff(names(train), "salary")]
y_train <- train$salary
X_test  <- test[,  setdiff(names(test),  "salary")]
y_test  <- test$salary
```

## Fit BRM with a logistic-regression learner

```{r fit}
set.seed(1234)
fit <- brm(X_train, y_train,
           learner = learner_glm_binomial())
fit
```

## Evaluate

```{r eval}
prob <- predict(fit, X_test)
pred_class <- as.integer(prob >= 0.5)

acc <- mean(pred_class == y_test)
cat("Accuracy:", round(acc, 3), "\n")

# Confusion matrix
table(truth = y_test, predicted = pred_class)
```

For tree-based classification, swap in `learner_rpart(method = "class")` or
`learner_gbm(distribution = "bernoulli")`.

## Citation

Srinivasan, K., Currim, F., and Ram, S. (2025). *A Reduced Modeling Approach
for Making Predictions With Incomplete Data Having Blockwise Missing
Patterns.* INFORMS Journal on Data Science.
