The forest balance estimator uses the data twice: once to fit the random forest that defines the kernel, and again to estimate the treatment effect using that kernel. This creates a subtle overfitting bias that persists even at large sample sizes.
To see this, we compare the standard (no cross-fitting) estimator with small leaf size against the cross-fitted estimator with adaptive leaf size and against an oracle that uses the true propensity scores. We use \(n = 5{,}000\) and \(p = 50\) covariates:
library(forestBalance)
set.seed(123)

nreps <- 2
res <- matrix(NA, nreps, 3,
              dimnames = list(NULL, c("No CF (mns=10)", "CF (default)", "Oracle IPW")))

for (r in seq_len(nreps)) {
  dat <- simulate_data(n = 5000, p = 50, ate = 0)

  # Standard: no cross-fitting, small min.node.size
  fit_nocf <- forest_balance(dat$X, dat$A, dat$Y,
                             cross.fitting = FALSE, min.node.size = 10,
                             num.trees = 500)
  res[r, "No CF (mns=10)"] <- fit_nocf$ate

  # Cross-fitted with adaptive leaf size (package default)
  fit_cf <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500)
  res[r, "CF (default)"] <- fit_cf$ate

  # Oracle IPW (true propensity scores)
  ps <- dat$propensity
  w_ipw <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
  res[r, "Oracle IPW"] <- weighted.mean(dat$Y[dat$A == 1], w_ipw[dat$A == 1]) -
    weighted.mean(dat$Y[dat$A == 0], w_ipw[dat$A == 0])
}

| Method | Bias | SD | RMSE |
|---|---|---|---|
| No CF (mns=10) | 0.1912 | 0.0288 | 0.1934 |
| CF (default) | 0.0021 | 0.0314 | 0.0315 |
| Oracle IPW | 0.0228 | 0.0149 | 0.0272 |
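The bias, SD, and RMSE reported in the table can be computed directly from the `res` matrix (the true ATE is 0 in this simulation). A minimal sketch, using the same rmse = sqrt(bias² + variance) convention as the other simulations in this vignette; `summarise_res` is an illustrative helper, not a package function:

```r
# Summarise simulation results column-by-column: bias, SD, and RMSE per method.
summarise_res <- function(res, true_ate = 0) {
  bias <- colMeans(res) - true_ate      # mean error per method
  sds  <- apply(res, 2, sd)             # spread across replications
  data.frame(bias = bias, sd = sds, rmse = sqrt(bias^2 + sds^2))
}
```

Applied to `res` after the loop above, this reproduces the three rows of the table.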
The no-cross-fitting estimator with small leaves has substantial bias. The default cross-fitted estimator with adaptive leaf size achieves much lower bias and RMSE.
The idea follows the double/debiased machine learning (DML) framework of Chernozhukov et al. (2018), adapted to kernel energy balancing.
Let \(K\) denote the \(n \times n\) proximity kernel built from a random forest trained on the full sample \((X, A, Y)\). The kernel captures which observations are “similar” in terms of confounding structure. However, because \(K\) was estimated from the same data used to compute the ATE, the kernel overfits: it encodes information about the specific realisation of outcomes, not just the confounding structure. This creates a bias that does not vanish as \(n \to \infty\).
Given \(K\) folds, the cross-fitted estimator proceeds as follows:
1. Randomly partition the data into \(K\) roughly equal folds \(F_1, \ldots, F_K\).
2. For each fold \(k = 1, \ldots, K\), fit the random forest (via multi_regression_forest) on the data in folds \(F_{-k}\) (all folds except \(k\)), then use the resulting kernel to balance and compute the fold-level estimate \(\hat\tau_k\) on the held-out fold \(F_k\).
3. Average the fold-level estimates (DML1): \[\hat\tau_{\mathrm{CF}} = \frac{1}{K} \sum_{k=1}^{K} \hat\tau_k.\]
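The recipe above can be sketched generically. In this sketch, `fit_fold()` is a hypothetical stand-in for the per-fold fit-and-balance step (it is not part of the forestBalance API); only the fold bookkeeping and the DML1 average are shown:

```r
# Generic DML1 cross-fitting skeleton. `fit_fold` is a hypothetical
# function(train_idx, eval_idx) returning a fold-level ATE estimate.
cross_fit_ate <- function(n, K, fit_fold) {
  # Randomly partition 1..n into K roughly equal folds
  folds <- sample(rep(seq_len(K), length.out = n))
  tau_k <- vapply(seq_len(K), function(k) {
    fit_fold(train_idx = which(folds != k),  # F_{-k}: fit the forest here
             eval_idx  = which(folds == k))  # F_k: estimate tau_k here
  }, numeric(1))
  mean(tau_k)  # DML1: average the fold-level estimates
}
```

Because the forest never sees the fold on which \(\hat\tau_k\) is computed, the kernel cannot encode that fold's outcome realisations.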
Cross-fitting alone is not sufficient to eliminate bias. The minimum leaf size (min.node.size) plays a crucial role:

- Small leaves (e.g., min.node.size = 5): the kernel is very granular, distinguishing observations at a fine scale. But with out-of-sample predictions, small leaves lead to noisy similarity estimates: two observations may share a tiny leaf by chance rather than through true similarity.
- Large leaves (e.g., min.node.size = 50–100): the kernel captures broader confounding structure. Out-of-sample predictions are more stable because large leaves better represent population-level similarity.
The optimal leaf size scales with both \(n\) (more data supports finer leaves) and \(p\) (more covariates require coarser leaves to avoid the curse of dimensionality). forestBalance uses an adaptive heuristic:
\[\mathrm{min.node.size} = \max\!\Big(20,\;\min\!\big(\lfloor n/200 \rfloor + p,\;\lfloor n/50 \rfloor\big)\Big).\]
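The heuristic is straightforward to compute directly. A small sketch (`adaptive_mns` is an illustrative name, not an exported function):

```r
# Adaptive leaf-size heuristic from the formula above.
adaptive_mns <- function(n, p) {
  max(20, min(floor(n / 200) + p, floor(n / 50)))
}
adaptive_mns(5000, 50)  # 75: floor(5000/200) + 50 = 75, below the cap floor(5000/50) = 100
adaptive_mns(2000, 10)  # 20: min(10 + 10, 40) = 20, at the floor of 20
```

The floor of 20 guards against overly granular kernels at small \(n\), while the \(\lfloor n/50 \rfloor\) cap keeps leaves from growing without bound.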
set.seed(123)
nreps <- 2
node_sizes <- c(5, 10, 20, 50, 75, 100)
n <- 5000; p <- 50

leaf_res <- do.call(rbind, lapply(node_sizes, function(mns) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = p, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   min.node.size = mns)$ate
  })
  data.frame(mns = mns, bias = mean(ates), sd = sd(ates),
             rmse = sqrt(mean(ates)^2 + var(ates)))
}))
heuristic <- max(20, min(floor(n / 200) + p, floor(n / 50)))

| min.node.size | Bias | SD | RMSE | |
|---|---|---|---|---|
| 5 | 0.1716 | 0.0134 | 0.1721 | |
| 10 | 0.1017 | 0.0015 | 0.1017 | |
| 20 | 0.0153 | 0.0447 | 0.0472 | |
| 50 | 0.0481 | 0.0063 | 0.0485 | |
| 75 | -0.0099 | 0.0351 | 0.0364 | ← adaptive default |
| 100 | 0.0332 | 0.0007 | 0.0332 | |
Bias decreases with larger leaves until variance begins to dominate. The adaptive default balances bias reduction against variance.
The default call uses cross-fitting with the adaptive leaf size:
dat <- simulate_data(n = 2000, p = 10, ate = 0)
fit <- forest_balance(dat$X, dat$A, dat$Y)
fit
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0315
#> Fold ATEs: -0.0808, 0.0177
#> ESS: treated = 529/745 (71%) control = 878/1255 (70%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.

To disable cross-fitting (e.g., for speed or to inspect the kernel):
fit_nocf <- forest_balance(dat$X, dat$A, dat$Y, cross.fitting = FALSE)
fit_nocf
#> Forest Kernel Energy Balancing
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Solver: direct
#> ATE estimate: -0.0215
#> ESS: treated = 459/745 (62%) control = 759/1255 (60%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.

To manually set the leaf size:
fit_custom <- forest_balance(dat$X, dat$A, dat$Y, min.node.size = 50)
fit_custom
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0784
#> Fold ATEs: -0.0918, -0.065
#> ESS: treated = 424/745 (57%) control = 739/1255 (59%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.

The default is num.folds = 2 (sample splitting). With two folds, each fold retains half the observations, producing high-quality per-fold kernels. More folds train the forest on more data but produce smaller per-fold kernels. Our experiments show that the number of folds has a modest effect compared to the leaf size. Values of 2–5 all work well:
set.seed(123)
nreps <- 2
n <- 5000
fold_res <- do.call(rbind, lapply(c(2, 3, 5, 10), function(nfolds) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = 10, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y, num.trees = 500,
                   num.folds = nfolds)$ate
  })
  data.frame(folds = nfolds, bias = round(mean(ates), 4),
             sd = round(sd(ates), 4),
             rmse = round(sqrt(mean(ates)^2 + var(ates)), 4))
}))

| Folds | Bias | SD | RMSE |
|---|---|---|---|
| 2 | 0.0907 | 0.0456 | 0.1015 |
| 3 | 0.0246 | 0.0275 | 0.0369 |
| 5 | 0.1025 | 0.0096 | 0.1030 |
| 10 | 0.2065 | 0.1269 | 0.2424 |
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
De, S. and Huling, J.D. (2025). Data adaptive covariate balancing for causal effect estimation for high dimensional data. arXiv preprint arXiv:2512.18069.