``` r
library(crossfit)
set.seed(1)
```
Many modern estimators (double/debiased ML, meta-learners, etc.) share the same pattern:
If we fit the nuisances and evaluate the target on the same observations, we usually:
Cross-fitting fixes this by:
The crossfit package generalizes this logic to:
- dependency graphs of nuisances (per-nuisance `train_fold`, per-target `eval_fold`),
- two modes: `"estimate"` (numeric target) and `"predict"` (cross-fitted predictor).

A nuisance is defined via `create_nuisance()`:

- `fit(data, ...)` → trains a model on (a subset of) the data,
- `predict(model, data, ...)` → returns predictions on (a subset of) the data,
- `train_fold` → how many folds the nuisance trains on,
- `fit_deps`, `pred_deps` → which other nuisances it depends on.

Example: regression \(m(x) = E[Y \mid X]\):
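The code chunk for this example is missing from this extraction. A minimal sketch, assuming the `create_nuisance()` fields described above (the name `nuis_y` is the one the MSE example in this vignette refers to, and the body mirrors the `nuis_lin` definition that appears later):

``` r
# Sketch: a nuisance estimating m(x) = E[Y | X] with a linear model.
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2  # number of folds this nuisance trains on
)
```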
The target is just a function of:

- the evaluation `data`,
- the cross-fitted nuisance predictions it depends on.

Example: cross-fitted mean squared error (MSE) of \(m(x)\):
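The defining chunk is not visible in this extraction. A plausible sketch, based on the call shape `target_mse(data_eval, nuis_y = predicted_values_on_eval)` that the text itself describes:

``` r
# Sketch: a target computing the cross-fitted MSE of m(x).
# It receives the evaluation data and the held-out predictions of nuis_y.
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}
```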
During cross-fitting, the engine will:

- call `nuis_y`'s `predict()` on held-out folds,
- call `target_mse(data_eval, nuis_y = predicted_values_on_eval)`.

You don't have to manage folds manually in the target.
A method bundles:

- a target,
- its nuisances,
- the fold geometry and aggregation rules.

``` r
mse_method <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,       # total number of folds K
  repeats = 3,     # how many times to re-draw fold splits
  eval_fold = 1,   # evaluation window width (in folds)
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)
```

Conceptually:
- `folds` and `repeats` define K-fold cross-fitting repeated R times,
- `eval_fold` tells how many folds to reserve for evaluating the target,
- `mode` controls whether we return a numeric estimate (`"estimate"`) or a prediction function (`"predict"`),
- `fold_allocation` controls how training windows are laid out across folds,
- `aggregate_panels` combines panel-wise results (within one repetition),
- `aggregate_repeats` combines repetition-wise results.

Let's walk through a full workflow on a toy regression problem.
We reuse the nuisance and target defined above (`nuis_y`, `target_mse`), and the method `mse_method`.
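The chunk that actually runs the engine is missing from this extraction. A minimal sketch, assuming `crossfit()` accepts `data` and `method` arguments (the same call shape as the grouped-folds example later in this vignette):

``` r
# Sketch: run repeated K-fold cross-fitting with the method defined above.
res <- crossfit(
  data = data,
  method = mse_method
)
```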
The result of `crossfit()` is a list with elements:

- `estimates` – one entry per method (here only one),
- `per_method` – panel-wise and repetition-wise values and errors,
- `repeats_done` – how many repetitions successfully ran,
- `K`, `K_required`, `methods`, `plan` – extra diagnostics.

We can inspect the per-repetition values:
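The inspection chunk itself is not shown. A sketch, assuming the `crossfit()` result is stored in a variable, say `res` (hypothetical name), and that `per_method` entries carry a `values` field, as the surrounding text describes:

``` r
# Sketch: panel- and repetition-wise diagnostics for the first method.
str(res$per_method, max.level = 2)
res$per_method[[1]]$values  # one aggregated value per repetition
```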
Each element in `values` is the aggregated MSE over panels for that repetition.
In "predict" mode, the engine returns a
prediction function instead of a numeric estimate. This
is useful when you want:
Here we build a cross-fitted ensemble predictor that averages a linear and a quadratic regression for \(E[Y \mid X]\).
We simulate a slightly nonlinear regression problem:
``` r
n2 <- 300
x2 <- runif(n2, -2, 2)
y2 <- sin(x2) + rnorm(n2, sd = 0.3)
data2 <- data.frame(x = x2, y = y2)
```

Two nuisances:
- `nuis_lin`: linear regression,
- `nuis_quad`: quadratic regression via `poly(x, 2)`.

``` r
nuis_lin <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)
```
``` r
nuis_quad <- create_nuisance(
  fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)
```

Now define a target in predict mode that combines the two nuisance predictions into an ensemble prediction:
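The defining chunk is absent from this extraction. A plausible sketch, assuming that in `"predict"` mode each nuisance reaches the target as a cross-fitted predictor function and the target returns a new predictor (this API shape is an assumption, inferred from the surrounding description):

``` r
# Sketch: ensemble target averaging the linear and quadratic predictors.
# Assumes m_lin and m_quad arrive as functions newdata -> predictions.
target_ensemble <- function(data, m_lin, m_quad, ...) {
  function(newdata) {
    (m_lin(newdata) + m_quad(newdata)) / 2
  }
}
```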
We build a method in "predict" mode:
- `eval_fold = 0L` (no dedicated evaluation window),
- two nuisances named `m_lin` and `m_quad`.

``` r
m_ens <- create_method(
  target = target_ensemble,
  list_nuisance = list(
    m_lin = nuis_lin,
    m_quad = nuis_quad
  ),
  folds = 4,
  repeats = 3,
  eval_fold = 0,  # no eval window in predict mode
  mode = "predict",
  fold_allocation = "independence"
)
```

Run cross-fitting in predict mode, using `mean_predictor()` to aggregate panel-level and repetition-level predictors:
``` r
res_pred <- crossfit_multi(
  data = data2,
  methods = list(ensemble = m_ens),
  aggregate_panels = mean_predictor,
  aggregate_repeats = mean_predictor
)

# estimates$ensemble is now a prediction function
f_hat <- res_pred$estimates$ensemble

newdata <- data.frame(x = seq(-2, 2, length.out = 7))
cbind(x = newdata$x, y_hat = f_hat(newdata))
```

Here:
- the nuisances `m_lin`, `m_quad` feed the ensemble target `target_ensemble`,
- `mean_predictor()` aggregates predictors over panels and repetitions,
- `f_hat(newdata)` gives cross-fitted ensemble predictions on new data.

This is the typical pattern in `"predict"` mode: your target combines one or several nuisance predictors into a derived predictor (pseudo-outcome, CATE, ensemble, …), and the engine returns a cross-fitted version of that predictor.
The `fold_allocation` argument controls how training blocks are placed relative to the evaluation window. For each method:

- `eval_fold` folds are reserved for evaluating the target,
- each nuisance trains on a block of `train_fold` width,
- `fold_allocation` decides how the training blocks for nuisances occupy the K folds.

The engine supports three strategies:
"independence"
"overlap"
"estimate" mode."disjoint"
independence and
overlap.You choose the strategy per method:
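The accompanying chunk is not shown here. A sketch, reusing `create_method()`'s `fold_allocation` argument and the `target_mse`/`nuis_y` pair from earlier in this vignette:

``` r
# Sketch: same MSE method as before, but with a different fold allocation.
mse_disjoint <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "disjoint"
)
```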
By default, fold assignments are:
You can override this in crossfit() or
crossfit_multi() if you need:
Example: simple grouped folds by an integer id:
``` r
# toy group variable
group_id <- sample(1:10, size = nrow(data), replace = TRUE)

fold_split_grouped <- function(data, K) {
  # assign folds at group level, then expand to rows
  groups <- unique(group_id)
  gfolds <- sample(rep_len(1:K, length(groups)))
  g2f <- setNames(gfolds, groups)
  # index by name so each row gets the fold of its own group
  unname(g2f[as.character(group_id)])
}

res_grouped <- crossfit(
  data = data,
  method = mse_method,
  fold_split = fold_split_grouped
)
res_grouped$estimates[[1]]
```

The only requirement is that `fold_split(data, K)` returns a vector of length `nrow(data)` with integer labels in {1, …, K}, and that all folds are non-empty.
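A quick way to sanity-check that requirement (a sketch, reusing `data` and `fold_split_grouped` from the example above):

``` r
# Sketch: one label per row, labels in 1..K, and no empty fold.
f <- fold_split_grouped(data, K = 4)
stopifnot(
  length(f) == nrow(data),
  all(f %in% 1:4),
  all(tabulate(f, nbins = 4) > 0)
)
```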
You can plug in any aggregation you like:
For example, a simple trimmed mean over panels:
``` r
trimmed_mean_estimate <- function(xs, trim = 0.1) {
  x <- unlist(xs)
  mean(x, trim = trim)
}

m_trim <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 5,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = trimmed_mean_estimate,
  aggregate_repeats = trimmed_mean_estimate
)

res_trim <- crossfit(data, m_trim)
res_trim$estimates[[1]]
```

Use `?crossfit`, `?crossfit_multi`, `?create_method`, `?create_nuisance` for detailed argument reference.
Explore the `per_method` and `plan` components in the result if you need to:
crossfit is meant to be a small, flexible engine: you
define the nuisances and targets; it takes care of the cross-fitting
schedule, reuse of models, and basic safety checks (cycles, coverage of
dependencies, fold geometry).
If you encounter edge cases or have ideas for higher-level helpers (e.g., ready-made DML ATE wrappers), they can be built conveniently on top of this core.