This vignette benchmarks the forestBalance pipeline to
characterize how speed and memory scale with sample size (\(n\)) and number of trees (\(B\)).
The pipeline has four stages, each optimized:

1. Forest fitting (grf::multi_regression_forest) – C++ via grf.
2. Leaf extraction (get_leaf_node_matrix) – C++ via Rcpp.
3. Z construction (leaf_node_kernel_Z) – C++ via Rcpp, building CSC sparse format directly without sorting.
4. Weight computation (kernel_balance) – adaptive solver.
Given a kernel matrix \(K \in \mathbb{R}^{n \times n}\) and a binary treatment vector \(A \in \{0,1\}^n\) with \(n_1\) treated and \(n_0\) control units, the kernel energy balancing weights \(w\) are obtained by solving the linear system \[ K_q \, w = z, \] where \(K_q\) is the modified kernel and \(z\) is a right-hand side vector determined by a linear constraint.
The modified kernel is defined element-wise as \[ K_q(i,j) = \frac{A_i \, A_j \, K(i,j)}{n_1^2} + \frac{(1-A_i)(1-A_j) \, K(i,j)}{n_0^2}. \]
A critical structural observation is that \(K_q(i,j) = 0\) whenever \(A_i \neq A_j\): the treated–control cross-blocks are identically zero. Therefore \(K_q\) is block-diagonal: \[ K_q = \begin{pmatrix} K_{tt} / n_1^2 & 0 \\ 0 & K_{cc} / n_0^2 \end{pmatrix}, \] where \(K_{tt} = K[A{=}1,\; A{=}1]\) is the \(n_1 \times n_1\) treated sub-block and \(K_{cc} = K[A{=}0,\; A{=}0]\) is the \(n_0 \times n_0\) control sub-block.
The right-hand side has a similarly separable structure. Writing \(\mathbf{r} = K \mathbf{1}\) (the row sums of \(K\)), the unadjusted right-hand side \(b\) is \[ b_i = \frac{A_i \, r_i}{n_1 \, n} + \frac{(1-A_i) \, r_i}{n_0 \, n}; \] the constraint-adjusted \(z\) in \(K_q w = z\) is obtained from \(b\) by enforcing the linear constraint.
The full system decomposes into two independent sub-problems, \[ \frac{K_{tt}}{n_1^2} \, w_t = z_t, \qquad \frac{K_{cc}}{n_0^2} \, w_c = z_c, \] where \(z_t\) and \(z_c\) are the constraint-adjusted right-hand sides for each group (each requiring two preliminary solves to determine the Lagrange multiplier).
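As a quick sanity check, the block-diagonal structure can be verified with a toy kernel in base R (a sketch, not package code; the ridge term only keeps the toy kernel invertible). Solving the full system is equivalent to the two independent sub-block solves:

```r
# Toy check: K_q has zero treated-control cross-blocks, so solving
# K_q w = z reduces to independent treated and control solves.
set.seed(1)
n <- 8
A <- rep(c(1, 0), each = 4)
X <- matrix(rnorm(n * 3), n, 3)
K <- tcrossprod(X) + diag(n)             # PSD toy kernel (ridge for invertibility)
n1 <- sum(A); n0 <- n - n1
Kq <- outer(A, A) * K / n1^2 + outer(1 - A, 1 - A) * K / n0^2
stopifnot(all(Kq[A == 1, A == 0] == 0))  # cross-blocks are identically zero
z <- rnorm(n)
w_full <- solve(Kq, z)
w_t <- solve(K[A == 1, A == 1] / n1^2, z[A == 1])
w_c <- solve(K[A == 0, A == 0] / n0^2, z[A == 0])
stopifnot(max(abs(w_full[A == 1] - w_t)) < 1e-8,
          max(abs(w_full[A == 0] - w_c)) < 1e-8)
```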
The proximity kernel is \(K = Z Z^\top / B\), where \(Z\) is a sparse \(n \times L\) indicator matrix (\(L = \sum_{b=1}^B L_b\), total leaves across all trees). Each row of \(Z\) has exactly \(B\) nonzero entries (one per tree), so \(Z\) has \(nB\) nonzeros total. The \(Z\) matrix is constructed in C++ directly in compressed sparse column (CSC) format, avoiding the overhead of R’s triplet sorting.
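The co-leaf interpretation of \(K = Z Z^\top / B\) can be illustrated with the Matrix package (a sketch with a made-up leaf-membership matrix; leaf_mat here is illustrative, not the package's internal format):

```r
library(Matrix)
set.seed(1)
n <- 5; B <- 4
leaf_mat <- matrix(sample(1:3, n * B, replace = TRUE), n, B)  # leaf id of unit i in tree b
# Unique column index per (tree, leaf) pair, one nonzero per (unit, tree):
j <- as.integer(factor(paste(col(leaf_mat), leaf_mat)))
Z <- sparseMatrix(i = rep(1:n, B), j = j, x = 1)              # CSC (dgCMatrix)
K <- tcrossprod(Z) / B
# K[i, j] = fraction of trees in which units i and j share a leaf:
stopifnot(abs(K[1, 2] - mean(leaf_mat[1, ] == leaf_mat[2, ])) < 1e-12)
stopifnot(all(diag(K) == 1))  # every unit shares all B leaves with itself
```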
For moderate \(n\), we form the sub-block kernels explicitly: \[ K_{tt} = Z_t Z_t^\top / B, \qquad K_{cc} = Z_c Z_c^\top / B, \] where \(Z_t = Z[A{=}1,\,\cdot\,]\) and \(Z_c = Z[A{=}0,\,\cdot\,]\). Each sub-block is computed via a sparse tcrossprod. The linear systems are then solved by sparse Cholesky factorization.
Computational Cost: \(O(nB \cdot \bar{s})\) for the two sub-block cross-products (where \(\bar{s}\) is the average leaf size), plus \(O(n_1^{3/2} + n_0^{3/2})\) for sparse Cholesky (in the best case).
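A minimal sketch of this direct path with the Matrix package (a random sparse Z_t stands in for the real indicator matrix; the tiny ridge is only for the toy example's numerical stability):

```r
library(Matrix)
set.seed(1)
n1 <- 50; L <- 40; B <- 10
# One leaf per tree per unit, trees occupying disjoint column ranges:
j <- unlist(lapply(1:B, function(b) (b - 1) * L + sample(L, n1, replace = TRUE)))
Zt <- sparseMatrix(i = rep(1:n1, B), j = j, x = 1, dims = c(n1, B * L))
Ktt <- tcrossprod(Zt) / B + Diagonal(n1, 1e-8)  # sparse tcrossprod, tiny ridge
r <- rnorm(n1)
x <- solve(Cholesky(Ktt), r)                    # sparse Cholesky solve
stopifnot(max(abs(Ktt %*% x - r)) < 1e-6)
```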
For large \(n\), forming \(K_{tt}\) and \(K_{cc}\) becomes expensive. The conjugate gradient (CG) solver avoids forming any kernel matrix by operating on the factored representation. To solve \[ K_{tt} \, x = r \quad \Longleftrightarrow \quad Z_t Z_t^\top x = B \, r, \] CG only needs matrix–vector products of the form \[ v \;\mapsto\; Z_t \bigl(Z_t^\top v\bigr), \] each costing \(O(n_1 B)\) via two sparse matrix–vector multiplies. The same applies to the control block with \(Z_c\).
Here, the memory use is \(O(nB)\) for \(Z\) alone, versus \(O(n^2)\) for the kernel. Each CG iteration costs \(O(n_g B)\) (where \(n_g\) is the group size). Convergence typically requires 100–200 iterations, independent of \(n\), so the total cost is \(O(n_g B \cdot T_{\text{iter}})\). Only four solves are needed (two per block), since the third right-hand side is a linear combination of the first two.
Computational Cost: \(O\bigl((n_1 + n_0) \cdot B \cdot T_{\text{iter}}\bigr)\) where \(T_{\text{iter}} \approx 100\text{--}200\). This scales linearly in both \(n\) and \(B\).
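The matrix-free product at the heart of the CG path can be sketched as follows (an illustrative random Z_t, not package internals):

```r
library(Matrix)
set.seed(1)
n1 <- 100; L <- 30; B <- 5
j <- unlist(lapply(1:B, function(b) (b - 1) * L + sample(L, n1, replace = TRUE)))
Zt <- sparseMatrix(i = rep(1:n1, B), j = j, x = 1, dims = c(n1, B * L))
# v -> Z_t (Z_t' v) / B: two sparse mat-vecs; the n1 x n1 kernel is never formed
matvec <- function(v) as.numeric(Zt %*% crossprod(Zt, v)) / B
v <- rnorm(n1)
stopifnot(max(abs(matvec(v) - as.numeric(tcrossprod(Zt) %*% v) / B)) < 1e-10)
```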
Plain CG can require many iterations at large \(n\). The Block Jacobi solver uses the first tree's leaf partition to build a block-diagonal preconditioner. Each leaf block is a small dense system (roughly min.node.size \(\times\) min.node.size) that is cheap to factor. Preconditioned CG then converges in far fewer iterations, typically 5–10\(\times\) fewer than plain CG, giving a large overall speedup.
Note: only 2 linear solves per treatment-group block are needed (not 3), because the third right-hand side is a linear combination of the first two.
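The block-Jacobi idea can be sketched on toy data (the block extraction and ridge here are illustrative, not the package's exact construction): the first tree's leaf partition defines small dense diagonal blocks of \(K\), each cheap to factor, and applying their inverses gives the preconditioner.

```r
library(Matrix)
set.seed(1)
n <- 60; B <- 5; L <- 6
j <- unlist(lapply(1:B, function(b) (b - 1) * L + sample(L, n, replace = TRUE)))
Z <- sparseMatrix(i = rep(1:n, B), j = j, x = 1, dims = c(n, B * L))
K <- as.matrix(tcrossprod(Z)) / B
blocks <- split(seq_len(n), j[1:n])  # partition units by the FIRST tree's leaves
ridge <- 1e-8
precond <- function(r) {             # apply M^{-1} r block by block
  out <- numeric(n)
  for (idx in blocks)
    out[idx] <- solve(K[idx, idx, drop = FALSE] + ridge * diag(length(idx)), r[idx])
  out
}
# M is the block-diagonal part of K; the preconditioner solves it exactly:
M <- matrix(0, n, n)
for (idx in blocks) M[idx, idx] <- K[idx, idx] + ridge * diag(length(idx))
r <- rnorm(n)
stopifnot(max(abs(M %*% precond(r) - r)) < 1e-6)
```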
We compare all three solvers at \(n = 10{,}000\) and \(n = 25{,}000\):
# Fit one forest per n, then time each solver on the SAME kernel system.
solver_bench <- do.call(rbind, lapply(c(10000, 25000), function(nn) {
  set.seed(123)
  dat <- simulate_data(n = nn, p = 10)
  B_val <- 1000
  mns <- max(20L, min(floor(nn / 200) + 10, floor(nn / 50)))
  forest <- grf::multi_regression_forest(
    dat$X, scale(cbind(dat$A, dat$Y)),
    num.trees = B_val, min.node.size = mns
  )
  leaf_mat <- get_leaf_node_matrix(forest, dat$X)
  Z <- leaf_node_kernel_Z(leaf_mat)
  solvers <- if (nn <= 10000) c("bj", "cg", "direct") else c("bj", "cg")
  do.call(rbind, lapply(solvers, function(s) {
    if (s == "direct") {
      K <- leaf_node_kernel(leaf_mat)
      t <- system.time(
        w <- kernel_balance(dat$A, kern = K, solver = "direct")
      )["elapsed"]
    } else {
      t <- system.time(
        w <- kernel_balance(dat$A, Z = Z, leaf_matrix = leaf_mat,
                            num.trees = B_val, solver = s)
      )["elapsed"]
    }
    ate <- weighted.mean(dat$Y[dat$A == 1], w$weights[dat$A == 1]) -
      weighted.mean(dat$Y[dat$A == 0], w$weights[dat$A == 0])
    data.frame(n = nn, solver = s, time = t, ate = ate)
  }))
}))

| n | Solver | Time (s) | ATE |
|---|---|---|---|
| 10000 | bj | 5.54 | 0.015260 |
| 10000 | cg | 4.26 | 0.015261 |
| 10000 | direct | 17.39 | 0.015260 |
| 25000 | bj | 30.47 | 0.006337 |
| 25000 | cg | 40.63 | 0.006344 |
All solvers produce the same ATE to high precision, confirming they
solve the same system. CG (Rcpp) is the robust default; Block Jacobi can
be faster in specific regimes (moderate leaf size, many trees) and is
available via solver = "bj".
We benchmark the full forest_balance() pipeline (forest
fitting, leaf extraction, Z construction, and weight computation) across
a range of sample sizes up to \(n =
50{,}000\) with \(B = 1{,}000\)
trees:
n_vals <- c(500, 1000, 2500, 5000, 10000, 25000, 50000)
B <- 1000
p <- 10
bench <- do.call(rbind, lapply(n_vals, function(nn) {
  set.seed(123)
  dat <- simulate_data(n = nn, p = p)
  # Auto (default)
  t_auto <- system.time(
    fit_auto <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
  )["elapsed"]
  data.frame(n = nn, trees = B, auto = t_auto,
             auto_solver = fit_auto$solver)
}))

| n | Trees | Time (s) | Solver |
|---|---|---|---|
| 500 | 1000 | 0.15 | cg |
| 1000 | 1000 | 0.18 | direct |
| 2500 | 1000 | 0.62 | direct |
| 5000 | 1000 | 2.20 | direct |
| 10000 | 1000 | 9.51 | direct |
| 25000 | 1000 | 38.16 | cg |
| 50000 | 1000 | 115.98 | cg |
The solver is selected automatically based on the per-fold sample size. The Block Jacobi preconditioned CG activates for large \(n\) and provides the best scaling.
tree_vals <- c(200, 500, 1000, 2000)
n_test <- c(1000, 5000, 25000)
tree_bench <- do.call(rbind, lapply(n_test, function(nn) {
  do.call(rbind, lapply(tree_vals, function(B) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    t <- system.time(
      fit <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
    )["elapsed"]
    data.frame(n = nn, trees = B, time = t)
  }))
}))

| n | Trees | Time (s) |
|---|---|---|
| 1000 | 200 | 0.05 |
| 1000 | 500 | 0.10 |
| 1000 | 1000 | 0.17 |
| 1000 | 2000 | 0.34 |
| 5000 | 200 | 0.93 |
| 5000 | 500 | 1.30 |
| 5000 | 1000 | 2.09 |
| 5000 | 2000 | 3.97 |
| 25000 | 200 | 10.00 |
| 25000 | 500 | 23.66 |
| 25000 | 1000 | 37.66 |
| 25000 | 2000 | 66.03 |
The pipeline scales approximately linearly in both \(n\) and \(B\).
The following experiment profiles each stage of the pipeline individually to show where time is spent at different scales:
breakdown <- do.call(rbind, lapply(
  list(c(1000, 500), c(5000, 500), c(5000, 2000),
       c(25000, 500), c(25000, 1000)),
  function(cfg) {
    nn <- cfg[1]; B_val <- cfg[2]
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    mns <- max(20L, min(floor(nn / 200) + 10, floor(nn / 50)))
    # 1. Forest fitting
    t_fit <- system.time({
      resp <- scale(cbind(dat$A, dat$Y))
      forest <- grf::multi_regression_forest(dat$X, Y = resp,
                                             num.trees = B_val,
                                             min.node.size = mns)
    })["elapsed"]
    # 2. Leaf extraction (Rcpp)
    t_leaf <- system.time(
      leaf_mat <- get_leaf_node_matrix(forest, dat$X)
    )["elapsed"]
    # 3. Z construction (Rcpp)
    t_z <- system.time(
      Z <- leaf_node_kernel_Z(leaf_mat)
    )["elapsed"]
    # 4. Weight computation (kernel_balance)
    t_bal <- system.time(
      w <- kernel_balance(dat$A, Z = Z, num.trees = B_val, solver = "cg")
    )["elapsed"]
    data.frame(n = nn, trees = B_val, mns = mns,
               fit = t_fit, leaf = t_leaf, Z = t_z, balance = t_bal,
               total = t_fit + t_leaf + t_z + t_bal)
  }
))

| n | Trees | mns | Forest fit (s) | Leaf extract (s) | Z build (s) | Balance (s) | Total (s) |
|---|---|---|---|---|---|---|---|
| 1000 | 500 | 20 | 0.03 | 0.02 | 0.004 | 0.10 | 0.15 |
| 5000 | 500 | 35 | 0.17 | 0.11 | 0.012 | 0.65 | 0.94 |
| 5000 | 2000 | 35 | 0.66 | 0.42 | 0.102 | 2.33 | 3.52 |
| 25000 | 500 | 135 | 0.94 | 0.53 | 0.069 | 26.14 | 27.68 |
| 25000 | 1000 | 135 | 1.87 | 1.06 | 0.181 | 41.14 | 44.26 |
At small \(n\), forest fitting and the balance solver take similar time. At large \(n\), the CG solver dominates — its cost grows with the number of iterations and the sparse matrix–vector product size. The C++ stages (leaf extraction and \(Z\) construction) remain a small fraction throughout.
When the direct solver is used, the kernel is formed as a sparse matrix. For the CG solver, only the sparse indicator matrix \(Z\) is stored. The table below shows the actual memory usage compared to the theoretical dense kernel:
mem_data <- do.call(rbind, lapply(
  c(500, 1000, 2500, 5000, 10000, 25000, 50000),
  function(nn) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    B_val <- 1000
    mns <- max(20L, min(floor(nn / 200) + ncol(dat$X), floor(nn / 50)))
    fit_forest <- grf::multi_regression_forest(
      dat$X, scale(cbind(dat$A, dat$Y)),
      num.trees = B_val, min.node.size = mns
    )
    leaf_mat <- get_leaf_node_matrix(fit_forest, dat$X)
    Z <- leaf_node_kernel_Z(leaf_mat)
    z_mb <- as.numeric(object.size(Z)) / 1e6
    # Kernel sparsity (skip forming K for very large n)
    if (nn <= 10000) {
      K <- leaf_node_kernel(leaf_mat)
      pct_nz <- round(100 * length(K@x) / as.numeric(nn)^2, 1)
      k_mb <- round(as.numeric(object.size(K)) / 1e6, 1)
    } else {
      pct_nz <- NA
      k_mb <- NA
    }
    n_fold <- nn %/% 2
    auto_solver <- if (n_fold > 5000 || mns > n_fold / 20) "cg" else "direct"
    dense_mb <- round(8 * as.numeric(nn)^2 / 1e6, 0)
    data.frame(n = nn, mns = mns, pct_nz = pct_nz,
               sparse_MB = k_mb, Z_MB = round(z_mb, 1),
               dense_MB = dense_mb, solver = auto_solver)
  }
))

| n | mns | Solver | K % nonzero | Stored | Actual (MB) | Dense K (MB) | Actual / Dense |
|---|---|---|---|---|---|---|---|
| 500 | 20 | cg | 39.4% | Z (factor) | 6.0 | 2 | 300% |
| 1000 | 20 | direct | 28.8% | K (sparse) | 3.5 | 8 | 43.8% |
| 2500 | 22 | direct | 20.1% | K (sparse) | 15.1 | 50 | 30.2% |
| 5000 | 35 | direct | 16% | K (sparse) | 48.1 | 200 | 24% |
| 10000 | 60 | direct | 12.6% | K (sparse) | 151.7 | 800 | 19% |
| 25000 | 135 | cg | – | Z (factor) | 300.4 | 5000 | 6% |
| 50000 | 260 | cg | – | Z (factor) | 600.3 | 20000 | 3% |
At \(n = 50{,}000\), a dense kernel would require 20 GB. The CG solver stores only the \(Z\) matrix, using a small fraction of this.
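The \(n = 50{,}000\) row can be checked with back-of-envelope arithmetic (8-byte doubles; the 12 bytes per nonzero, 8 for the value plus 4 for the row index, approximate a dgCMatrix and ignore the small column-pointer array):

```r
n <- 50000; B <- 1000
dense_gb <- 8 * n^2 / 1e9  # dense double-precision kernel: 20 GB
z_mb <- 12 * n * B / 1e6   # ~nB nonzeros in Z at ~12 bytes each: 600 MB
stopifnot(dense_gb == 20, z_mb == 600)
```

Both figures agree with the 50,000-row entries of the memory table above.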
- The forest_balance() pipeline scales to \(n = 50{,}000\) with 1,000 trees.
- The min.node.size heuristic adjusts leaf size to both \(n\) and \(p\), creating denser kernels than the old default of 10.
- The "auto" setting handles this transparently.