
ml

A grammar of machine learning workflows for R.


Split, fit, evaluate, assess — four verbs that encode the workflow from Hastie, Tibshirani & Friedman (The Elements of Statistical Learning, Ch. 7). The evaluate/assess boundary makes data leakage inexpressible: ml_evaluate() runs on validation data and can be called freely; ml_assess() runs on held-out test data and locks after one use.

Installation

# Install from GitHub (current)
remotes::install_github("epagogy/ml", subdir = "r")

# Once on CRAN (submission is under review):
# install.packages("ml")

R >= 4.1.0. Optional backends: ‘xgboost’, ‘ranger’, ‘glmnet’, ‘kknn’, ‘e1071’, ‘naivebayes’, ‘rpart’.

Usage

library(ml)

s <- ml_split(iris, "Species", seed = 42)

model <- ml_fit(s$train, "Species", seed = 42)
ml_evaluate(model, s$valid)       # check performance, tweak, repeat

final <- ml_fit(s$dev, "Species", seed = 42)
ml_assess(final, test = s$test)   # final exam — second call errors

s$dev is train + valid combined, used for the final refit before assessment. This three-way split (train 60 / valid 20 / test 20) with a $dev convenience accessor follows the textbook protocol exactly.
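A quick sanity check of the split shapes, using only the accessors and the 60/20/20 proportions described above:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)

# 60/20/20 stratified split, per the protocol above
nrow(s$train) / nrow(iris)   # ~0.60
nrow(s$valid) / nrow(iris)   # ~0.20
nrow(s$test)  / nrow(iris)   # ~0.20

# dev = train + valid, used for the final refit
nrow(s$dev) == nrow(s$train) + nrow(s$valid)
```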

Core verbs

ml_split() Stratified three-way split → $train, $valid, $test, $dev
ml_fit() Train a model (per-fold preprocessing, deterministic seeding)
ml_evaluate() Validation metrics — repeat freely
ml_assess() Test metrics — once, final, locks after use

These four are the grammar. Everything else extends it:

ml_screen() Algorithm leaderboard
ml_tune() Hyperparameter search
ml_stack() OOF ensemble stacking
ml_predict() Class labels or probabilities
ml_explain() Feature importance
ml_compare() Side-by-side model comparison
ml_validate() Pass/fail deployment gate
ml_drift() Distribution shift detection (KS, chi-squared)
ml_calibrate() Probability calibration (Platt, isotonic)
ml_profile() Dataset summary
ml_save() / ml_load() Serialize to .mlr
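A persistence round-trip sketch. Only the function names and the .mlr extension appear above; the exact argument signatures of ml_save()/ml_load() are assumed here:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)
model <- ml_fit(s$train, "Species", seed = 42)

path <- tempfile(fileext = ".mlr")
ml_save(model, path)       # signature assumed: (model, path)
restored <- ml_load(path)  # signature assumed: (path)
ml_evaluate(restored, s$valid)
```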

Algorithms

13 families. engine = "auto" uses the Rust backend when available; engine = "r" forces the R package backend.
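For example, forcing the R package backend for a random forest. Only `engine` and the algorithm strings appear in this README; the name of the algorithm-selection argument is an assumption:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)

# Force the ranger-based R backend instead of letting "auto"
# pick the Rust backend when it is available.
model <- ml_fit(s$train, "Species",
                algorithm = "random_forest",  # argument name assumed
                engine = "r")
ml_evaluate(model, s$valid)
```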

Algorithm                String               Clf  Reg  Backend
Logistic                 "logistic"           Y    -    nnet
Decision Tree            "decision_tree"      Y    Y    rpart
Random Forest            "random_forest"      Y    Y    ranger
Extra Trees              "extra_trees"        Y    Y    Rust
Gradient Boosting        "gradient_boosting"  Y    Y    Rust
XGBoost                  "xgboost"            Y    Y    xgboost
Ridge                    "linear"             -    Y    glmnet
Elastic Net              "elastic_net"        -    Y    glmnet
SVM                      "svm"                Y    Y    e1071
KNN                      "knn"                Y    Y    kknn
Naive Bayes              "naive_bayes"        Y    -    naivebayes
AdaBoost                 "adaboost"           Y    -    Rust
Hist. Gradient Boosting  "histgradient"       Y    Y    Rust

Design notes

Seeds. seed = NULL auto-generates a seed and stores it on the result for reproducibility. seed = 42 gives full deterministic control.

Per-fold preprocessing. Scaling and encoding fit on training folds only, never on validation or test. No information leaks across the split boundary.
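The principle can be illustrated in base R (this sketches the leakage rule, not the package's internals): scaling parameters are estimated on the training rows only, then applied unchanged to held-out rows.

```r
X <- as.matrix(iris[, 1:4])
train_idx <- 1:100  # illustrative split

# Fit scaling parameters on the training rows only
mu    <- colMeans(X[train_idx, ])
sigma <- apply(X[train_idx, ], 2, sd)

# Apply the SAME mu/sigma to held-out rows — no peeking at their statistics
X_train_scaled <- scale(X[train_idx, ],  center = mu, scale = sigma)
X_held_scaled  <- scale(X[-train_idx, ], center = mu, scale = sigma)
```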

Error messages. Wrong column name? ml_fit() tells you what columns exist. Wrong algorithm string? It lists the valid ones. Errors aim to tell you how to fix them.

Citation

Roth, S. (2026). A Grammar of Machine Learning Workflows.
doi:10.5281/zenodo.19023838

License

MIT. Simon Roth, 2026.
