
ml

A grammar of machine learning workflows for R.


Split, fit, evaluate, assess — four verbs that encode the workflow from Hastie, Tibshirani & Friedman (The Elements of Statistical Learning, Ch. 7). The evaluate/assess boundary makes data leakage inexpressible: ml_evaluate() runs on validation data and can be called freely; ml_assess() runs on held-out test data and locks after one use.

Installation

# Install from GitHub (current)
remotes::install_github("epagogy/ml", subdir = "r")

# Once on CRAN (submission is under review):
# install.packages("ml")

R >= 4.1.0. Optional backends: ‘xgboost’, ‘ranger’, ‘glmnet’, ‘kknn’, ‘e1071’, ‘naivebayes’, ‘rpart’.

Usage

library(ml)

s <- ml_split(iris, "Species", seed = 42)

model <- ml_fit(s$train, "Species", seed = 42)
ml_evaluate(model, s$valid)       # check performance, tweak, repeat

final <- ml_fit(s$dev, "Species", seed = 42)
ml_assess(final, test = s$test)   # final exam — second call errors

s$dev is train + valid combined, used for the final refit before assessment. This three-way split (train 60 / valid 20 / test 20) with a $dev convenience accessor follows the textbook protocol exactly.
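A quick sanity check of the split shapes, using only the accessors and the 60/20/20 proportions described above:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)

# 60/20/20 stratified split, per the protocol above
nrow(s$train) / nrow(iris)   # ~0.60
nrow(s$valid) / nrow(iris)   # ~0.20
nrow(s$test)  / nrow(iris)   # ~0.20

# dev = train + valid, used for the final refit
nrow(s$dev) == nrow(s$train) + nrow(s$valid)
```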

Core verbs

ml_split() Stratified three-way split → $train, $valid, $test, $dev
ml_fit() Train a model (per-fold preprocessing, deterministic seeding)
ml_evaluate() Validation metrics — repeat freely
ml_assess() Test metrics — once, final, locks after use

These four are the grammar. Everything else extends it:

ml_screen() Algorithm leaderboard
ml_tune() Hyperparameter search
ml_stack() OOF ensemble stacking
ml_predict() Class labels or probabilities
ml_explain() Feature importance
ml_compare() Side-by-side model comparison
ml_validate() Pass/fail deployment gate
ml_drift() Distribution shift detection (KS, chi-squared)
ml_calibrate() Probability calibration (Platt, isotonic)
ml_profile() Dataset summary
ml_save() / ml_load() Serialize to .mlr
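A persistence round-trip sketch. Only the function names and the .mlr extension appear above; the exact argument signatures of ml_save()/ml_load() are assumed here:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)
model <- ml_fit(s$train, "Species", seed = 42)

path <- tempfile(fileext = ".mlr")
ml_save(model, path)       # signature assumed: (model, path)
restored <- ml_load(path)  # signature assumed: (path)
ml_evaluate(restored, s$valid)
```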

Algorithms

13 families. engine = "auto" uses the Rust backend when available; engine = "r" forces the R package backend.
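For example, forcing the R package backend for a random forest. Only `engine` and the algorithm strings appear in this README; the name of the algorithm-selection argument is an assumption:

```r
library(ml)

s <- ml_split(iris, "Species", seed = 42)

# Force the ranger-based R backend instead of letting "auto"
# pick the Rust backend when it is available.
model <- ml_fit(s$train, "Species",
                algorithm = "random_forest",  # argument name assumed
                engine = "r")
ml_evaluate(model, s$valid)
```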

Algorithm                String               Clf  Reg  Backend
Logistic                 "logistic"           Y    -    nnet
Decision Tree            "decision_tree"      Y    Y    rpart
Random Forest            "random_forest"      Y    Y    ranger
Extra Trees              "extra_trees"        Y    Y    Rust
Gradient Boosting        "gradient_boosting"  Y    Y    Rust
XGBoost                  "xgboost"            Y    Y    xgboost
Ridge                    "linear"             -    Y    glmnet
Elastic Net              "elastic_net"        -    Y    glmnet
SVM                      "svm"                Y    Y    e1071
KNN                      "knn"                Y    Y    kknn
Naive Bayes              "naive_bayes"        Y    -    naivebayes
AdaBoost                 "adaboost"           Y    -    Rust
Hist. Gradient Boosting  "histgradient"       Y    Y    Rust

Design notes

Seeds. seed = NULL auto-generates a seed and stores it on the result for reproducibility. seed = 42 gives full deterministic control.

Per-fold preprocessing. Scaling and encoding fit on training folds only, never on validation or test. No information leaks across the split boundary.
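The principle can be illustrated in base R (this sketches the leakage rule, not the package's internals): scaling parameters are estimated on the training rows only, then applied unchanged to held-out rows.

```r
X <- as.matrix(iris[, 1:4])
train_idx <- 1:100  # illustrative split

# Fit scaling parameters on the training rows only
mu    <- colMeans(X[train_idx, ])
sigma <- apply(X[train_idx, ], 2, sd)

# Apply the SAME mu/sigma to held-out rows — no peeking at their statistics
X_train_scaled <- scale(X[train_idx, ],  center = mu, scale = sigma)
X_held_scaled  <- scale(X[-train_idx, ], center = mu, scale = sigma)
```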

Error messages. Wrong column name? ml_fit() tells you what columns exist. Wrong algorithm string? It lists the valid ones. Errors aim to tell you how to fix them.

Citation

Roth, S. (2026). A Grammar of Machine Learning Workflows.
doi:10.5281/zenodo.19023838

License

MIT. Simon Roth, 2026.
