Validating predictInterval() against brms

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Jared Knowles

Written 2026-05-29 (last updated 2026-05-29)

Why compare to brms?

merTools::predictInterval() was written to put prediction intervals on mixed-effect models that are too large or too slow to bootstrap. It does this by simulating from the estimated distribution of the fixed and random effects – fast, but an approximation. Since then, brms has become a standard for fully Bayesian multilevel modeling, and its posterior predictive distribution is about as close to a gold standard as we have. This vignette asks a simple question: how close are predictInterval()’s intervals to the ones brms produces?

We look at three things: a Gaussian random-slopes model, what happens for an entirely unobserved group, and a binomial GLMM. This vignette is precompiled – reproducing it requires the brms package and a working Stan toolchain.

A random-slopes model

We use the bundled hsb data (7,185 students in 160 schools) and model math achievement from student SES, school-mean SES, and a school-varying SES slope. We hold out 20% of students for testing, plus six whole schools to stand in for “new groups” later.

data(hsb)
hsb$schid <- factor(hsb$schid)
new_schools <- sample(levels(hsb$schid), 6)
hsb$.set <- "train"
hsb$.set[hsb$schid %in% new_schools] <- "test_new"
seen <- which(hsb$.set == "train")
hsb$.set[sample(seen, round(0.2 * length(seen)))] <- "test_seen"
train     <- droplevels(hsb[hsb$.set == "train", ])
test_seen <- hsb[hsb$.set == "test_seen" & hsb$schid %in% levels(train$schid), ]
test_new  <- hsb[hsb$.set == "test_new", ]

f1 <- mathach ~ ses + meanses + (ses | schid)
m_lme <- lmer(f1, data = train)
brms_fit_seconds <- system.time(
  m_brm <- fit_brm(f1, train, gaussian(), "brms_hsb_meanses"))[["elapsed"]]

We draw a predictive distribution for each held-out student from each method – predictInterval(returnSims = TRUE) and brms::posterior_predict() – and compute everything (point estimates, intervals, coverage) from those draws.

mt_seen <- mt_draws(m_lme, test_seen)
pp_seen <- posterior_predict(m_brm, newdata = test_seen, allow_new_levels = TRUE)

Point estimates are essentially identical

Both methods condition on the same estimated fixed effects and school BLUPs, so the conditional-mean predictions should agree almost exactly – and they do.

lme_mean <- predict(m_lme, newdata = test_seen)
brm_mean <- colMeans(posterior_epred(m_brm, newdata = test_seen, allow_new_levels = TRUE))
c(correlation = cor(lme_mean, brm_mean),
  mean_abs_diff = mean(abs(lme_mean - brm_mean)),
  response_sd = sd(test_seen$mathach))

#>   correlation mean_abs_diff   response_sd 
#>    0.99984045    0.03670134    6.71745425

ggplot(data.frame(merTools = lme_mean, brms = brm_mean), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8, color = "#1B9E77") + coord_equal() +
  labs(x = "lme4 predict()", y = "brms posterior_epred()") +
  theme_minimal(base_size = 12)

The prediction intervals agree

b_mt <- qband(mt_seen, 0.90); b_brm <- qband(pp_seen, 0.90)
c(width_merTools = median(b_mt$upr - b_mt$lwr),
  width_brms = median(b_brm$upr - b_brm$lwr),
  cor_lower = cor(b_mt$lwr, b_brm$lwr),
  cor_upper = cor(b_mt$upr, b_brm$upr))

#> width_merTools     width_brms      cor_lower      cor_upper 
#>     20.1789145     20.2344075      0.9901934      0.9899334

endpts <- rbind(data.frame(bound = "lower", merTools = b_mt$lwr, brms = b_brm$lwr),
                data.frame(bound = "upper", merTools = b_mt$upr, brms = b_brm$upr))
ggplot(endpts, aes(merTools, brms, color = bound)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8) + coord_equal() +
  labs(title = "90% prediction-interval endpoints", x = "merTools", y = "brms") +
  theme_minimal(base_size = 12)

And they are equally well calibrated

The real test is out-of-sample: do nominal intervals cover held-out scores at the nominal rate? Both methods track the diagonal, and each other.

data.frame(nominal = NOMINAL,
           merTools = coverage(mt_seen, test_seen$mathach, NOMINAL),
           brms     = coverage(pp_seen, test_seen$mathach, NOMINAL))

#>   nominal  merTools      brms
#> 1    0.50 0.4569776 0.4598698
#> 2    0.80 0.7895879 0.7932032
#> 3    0.90 0.9168474 0.9168474
#> 4    0.95 0.9660159 0.9703543

At a fraction of the cost

t_lme <- system.time(lmer(f1, data = train))[["elapsed"]]
t_mt  <- system.time(mt_draws(m_lme, test_seen))[["elapsed"]]
t_pp  <- system.time(posterior_predict(m_brm, newdata = test_seen,
                                       allow_new_levels = TRUE))[["elapsed"]]
data.frame(  # brms_fit_seconds was captured when the model was fit, above
  step = c("lme4 fit", "predictInterval", "brms fit (compile+sample)", "posterior_predict"),
  seconds = round(c(t_lme, t_mt, brms_fit_seconds, t_pp), 2))

#>                        step seconds
#> 1                  lme4 fit    0.05
#> 2           predictInterval    0.53
#> 3 brms fit (compile+sample)  398.97
#> 4         posterior_predict    1.52

For this model, the entire lme4 + predictInterval() path runs in about a second, while a single brms fit takes several minutes – a couple of orders of magnitude difference, before the (much larger) models predictInterval() was really built for.

What about an entirely new group?

For a school that was never seen during fitting there is a genuine modeling choice: do you drop the school’s effect and predict from the fixed effects only, or sample a plausible effect for it from the estimated school-level distribution? Both packages support either, and the comparison is only fair when the choices match:

treatment of an unseen group	`predictInterval()`	brms
drop the effect (population prediction)	`new.levels = "zero"` (default)	`re_formula = NA`
sample a new effect	`new.levels = "draw"`	`allow_new_levels = TRUE` (default)

Matched up, the two tools agree on both the width and the coverage of the new-school intervals:

nd <- test_new
pp_pop  <- posterior_predict(m_brm, newdata = nd, re_formula = NA)
pp_draw <- posterior_predict(m_brm, newdata = nd, allow_new_levels = TRUE)
mt_zero <- mt_draws(m_lme, nd, new.levels = "zero")
mt_draw <- mt_draws(m_lme, nd, new.levels = "draw")
w <- function(d) median(qband(d, 0.90)$upr - qband(d, 0.90)$lwr)
data.frame(
  unseen_group   = c("drop effect", "sample effect"),
  merTools_width = c(w(mt_zero), w(mt_draw)),
  brms_width     = c(w(pp_pop),  w(pp_draw)),
  merTools_cov90 = c(coverage(mt_zero, nd$mathach, 0.90), coverage(mt_draw, nd$mathach, 0.90)),
  brms_cov90     = c(coverage(pp_pop,  nd$mathach, 0.90), coverage(pp_draw, nd$mathach, 0.90)))

#>    unseen_group merTools_width brms_width merTools_cov90 brms_cov90
#> 1   drop effect       19.96801   20.02855      0.8708487  0.8745387
#> 2 sample effect       20.67910   20.76924      0.8892989  0.8856089

So the interval width for an unseen group is set by the choice, not the method. Earlier I compared predictInterval()’s default against brms’s default, which is an apples-to-oranges pairing – the apparent “gap” was the two packages making different default assumptions, not a deficiency in either.

How much the choice matters depends on how much group variance is left after the fixed effects. Here meanses already absorbs most of the between-school variation, so dropping vs. sampling the school effect barely changes the width. Remove meanses and the school SD roughly doubles, so the two options diverge:

m0 <- lmer(mathach ~ ses + (ses | schid), data = train)
data.frame(
  model = c("with meanses", "without meanses"),
  school_SD = c(attr(VarCorr(m_lme)$schid, "stddev")[["(Intercept)"]],
                attr(VarCorr(m0)$schid,    "stddev")[["(Intercept)"]]),
  width_zero = c(w(mt_draws(m_lme, nd, new.levels = "zero")),
                 w(mt_draws(m0,    nd, new.levels = "zero"))),
  width_draw = c(w(mt_draws(m_lme, nd, new.levels = "draw")),
                 w(mt_draws(m0,    nd, new.levels = "draw"))))

#>             model school_SD width_zero width_draw
#> 1    with meanses  1.598503   19.96801   20.67910
#> 2 without meanses  2.125770   19.94993   21.22576

The practical guidance: predictInterval()’s default (new.levels = "zero") answers “what is the expected outcome for a typical member of an unseen group,” while new.levels = "draw" answers “what is the outcome for an unseen group drawn from the population” – the same quantity brms returns by default. Pick the one that matches your question.

A generalized linear mixed model

The story carries over to GLMMs. We fit a binomial model for whether grouse had ticks and compare predicted probabilities – predictInterval(type = "probability") against brms::posterior_epred().

data(grouseticks)
grouseticks$TICKS_BIN <- as.integer(grouseticks$TICKS >= 1)
grouseticks$cHEIGHT   <- as.numeric(scale(grouseticks$HEIGHT))
gk <- grouseticks
gk$.set <- ifelse(runif(nrow(gk)) < 0.2, "test", "train")
gk_tr <- droplevels(gk[gk$.set == "train", ])
gk_te <- gk[gk$.set == "test" & gk$BROOD %in% levels(gk_tr$BROOD) &
            gk$LOCATION %in% unique(gk_tr$LOCATION), ]

gf <- TICKS_BIN ~ YEAR + cHEIGHT + (1 | BROOD) + (1 | LOCATION)
g_lme <- glmer(gf, family = binomial, data = gk_tr,
               control = glmerControl(optimizer = "bobyqa"))
g_brm <- fit_brm(gf, gk_tr, bernoulli(), "brms_grouseticks")

mt_pr <- mt_draws(g_lme, gk_te, type = "probability", include.resid.var = FALSE)
if (max(mt_pr) > 1 || min(mt_pr) < 0) mt_pr <- plogis(mt_pr)
brm_pr <- posterior_epred(g_brm, newdata = gk_te, allow_new_levels = TRUE)
mt_p <- apply(mt_pr, 2, median); brm_p <- apply(brm_pr, 2, median)
c(correlation = cor(mt_p, brm_p), mean_abs_diff = mean(abs(mt_p - brm_p)))

#>   correlation mean_abs_diff 
#>    0.99488587    0.02704796

ggplot(data.frame(merTools = mt_p, brms = brm_p), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.4, color = "#7570B3") + coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Predicted P(tick): merTools vs brms",
       x = "predictInterval(type = 'probability')", y = "posterior_epred()") +
  theme_minimal(base_size = 12)

Takeaways

For observed groups, predictInterval() reproduces brms’s point estimates and prediction intervals almost exactly, with the same out-of-sample coverage, at a tiny fraction of the computational cost.
For new groups, you choose how to treat the unseen effect: new.levels = "zero" (the default) drops it and predicts from the fixed effects, while new.levels = "draw" samples it from the estimated covariance – matching brms’s re_formula = NA and allow_new_levels = TRUE respectively. Either way the two packages agree; pick the convention that fits your question.
The same agreement holds for GLMMs on the probability scale.

In short, when bootstrapping or full MCMC is impractical, predictInterval() is a fast, well-calibrated stand-in for the cases it was designed to handle.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.