NumericEnsembles

Welcome to the NumericEnsembles package! This package does one thing, and it does it extremely well: it automatically builds the most accurate model for numeric data.

How does NumericEnsembles automatically build the most accurate model?

The NumericEnsembles package builds 40 models: 23 individual models and 17 ensembles of those models. Only one of the 40 will have the lowest RMSE on the holdout data, and in our experience that result virtually always beats the best published results, in both research papers and data science contests. In many cases several NumericEnsembles models have beaten the best previously published result.

Let’s walk through the steps to see how this is done (example 1 of 3).

Our first example uses one of the best-known data sets of all time: Boston Housing. We will work toward the same goal as the data science contests and professionally published articles: minimizing the RMSE of the predicted median home price. Let’s start with a baseline.

First example: The Boston Housing data set

As we will see, the NumericEnsembles package automatically returns an RMSE that is more than 90% lower than the winning entry in the Kaggle competition on this data set.

The Boston Housing data set is in the MASS package.
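A quick way to confirm the data is available (assuming the MASS package is installed):

# The Boston Housing data: 506 rows, 14 columns; column 14 is medv, the median home value
library(MASS)
dim(Boston)
head(Boston)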

The NumericEnsembles package has only one function, Numeric (with a capital N), and it automatically does everything for the user.

Set up the NumericEnsembles function

Here is what the Numeric function looks like with comments about each part:

library(NumericEnsembles)
library(MASS)

df <- Boston # the Boston Housing data set, though you may use any numeric data you wish

Numeric(data = df,
  colnum = 14, # this is medv (MEDV), the median sales price
  numresamples = 2, # this will randomly resample the data two times
  how_to_handle_strings = 0, # there are no strings in the Boston Housing data set
  do_you_have_new_data = "N", # we do not have new data in this example; that is coming
  save_all_trained_models = "N", # we are not saving trained models in this example; that is coming
  remove_ensemble_correlations_greater_than = 1.00, # we are not doing correlation reduction in this example; that is coming
  train_amount = 0.50, # the Kaggle contest used 0.50 for training, so we will, too
  test_amount = 0.25, # we will use 0.25 for the test set
  validation_amount = 0.25 # we will use 0.25 for the validation set
)

The function will do the following, all completed automatically:

Complete exploratory data analysis (a small base-R illustration follows this list):

• Pairwise scatter plots of the numeric features/columns in the data

• Print a correlation table of the numeric data

• Print a correlation table of the data as circles and colors

• Print a correlation table of the data as numbers and colors

• Print boxplots of the numeric data

• Print histograms of the numeric data

• Print a table of the target (MEDV) vs each column in the data
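As a rough illustration of the same ideas in base R (this is not the package's plotting code, just a sketch of similar output):

# Base-R equivalents of some of the EDA the package produces automatically
library(MASS)
pairs(Boston)           # pairwise scatter plots of the numeric columns
round(cor(Boston), 2)   # correlation table of the numeric data
boxplot(Boston)         # boxplots of the numeric data
hist(Boston$medv)       # histogram of one column (medv, the target)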

Randomly resample the data, then split the data into train, test and validation sets

For example, I frequently do 25 random resamplings, as this gives a more accurate result than a single sampling of the data.
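The package handles the resampling and splitting internally; as a rough illustration of the idea (not the package's actual code), a single random train/test/validation split might look like this:

# Illustrative sketch of one resample -- not the package's internal code
set.seed(42)
df <- MASS::Boston

# Randomly assign each row to train / test / validation (0.50 / 0.25 / 0.25 here)
idx <- sample(c("train", "test", "validation"), size = nrow(df),
              replace = TRUE, prob = c(0.50, 0.25, 0.25))

train      <- df[idx == "train", ]
test       <- df[idx == "test", ]
validation <- df[idx == "validation", ]

# Repeating this assignment numresamples times gives the resampled splits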

Automatically builds 23 individual numeric models, as follows:

• Bagged Random Forest (tuned with optimized hyperparameters)

• Bagging

• Bayes Generalized Linear Models (called BayesGLM in the package)

• Bayes Regularized Neural Networks (called BayesRNN in the package)

• Boosted Random Forest (tuned with optimized hyperparameters)

• Cubist

• Earth

• Elastic net (optimized via cross validation)

• Generalized Additive Models including smoothing splines (tuned with optimized hyperparameters)

• Gradient Boosted

• K-Nearest Neighbors (tuned with optimized hyperparameters)

• Lasso (optimized via cross validation)

• Linear (optimized hyperparameters)

• Neuralnet (optimized)

• Partial Least Squares

• Principal Components Regression

• Random Forest (tuned with optimized hyperparameters)

• Ridge (optimized via cross validation)

• RPart

• Support Vector Machines (tuned with optimized hyperparameters)

• Tree

• XGBoost

For each of the 23 models, the function fits the model and completes the following steps, all done automatically for each resampling (a minimal sketch of one such fit follows this list):

• Starts a timer to measure the amount of time for each function (such as Random Forest)

• Fits the model to the training data

• Makes predictions and calculates accuracy on the train, test and validation data sets

• Calculates the mean of the results for test and validation, labels those as holdout

• Calculates overfitting as mean of the holdout RMSE / mean of the train RMSE

• Calculates Bias, MAE, MSE

• Creates a vector of predictions, called y_hat (such as y_hat_bag_rf), which will be used to build the ensembles of model results
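Continuing the split sketched above, here is a rough illustration (not the package's internal code) of what one of those fits looks like for a single model and a single resample, using randomForest as the stand-in:

# Illustrative sketch of one model on one resample -- not the package's code
library(randomForest)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

start_time <- Sys.time()                              # start the timer

fit <- randomForest(medv ~ ., data = train)           # fit the model to the training data

pred_train      <- predict(fit, newdata = train)
pred_test       <- predict(fit, newdata = test)
pred_validation <- predict(fit, newdata = validation)

train_RMSE   <- rmse(train$medv, pred_train)
holdout_RMSE <- mean(c(rmse(test$medv, pred_test),    # holdout = mean of test and validation
                       rmse(validation$medv, pred_validation)))

overfitting <- holdout_RMSE / train_RMSE              # holdout RMSE / train RMSE

y_hat_rf <- c(pred_test, pred_validation)             # predictions kept for the ensembles

duration <- Sys.time() - start_time                   # stop the timer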

Automatically build weighted ensembles

Next the function automatically builds weighted ensembles. For example, the first result is:

“BagRF” = y_hat_bag_rf * 1 / bag_rf_holdout_RMSE_mean

which takes the predictions from the Bagged Random Forest model (y_hat_bag_rf) and multiplies them by 1 / holdout RMSE. This gives more weight to more accurate models and less weight to less accurate ones. For example, if the mean RMSE on the holdout data is 5, the predictions are divided by 5; if the mean RMSE is 20, they are divided by 20, contributing a smaller value to the weighted ensemble.
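Continuing the earlier sketch, the ensemble data is built from columns of this weighted form (the column and object names, other than those in the formula above, are illustrative, not the package's internal names):

# Sketch of building the weighted ensemble data -- names are illustrative
ensemble <- data.frame(
  y_hat = c(test$medv, validation$medv),   # the true holdout values
  RF    = y_hat_rf * 1 / holdout_RMSE      # one weighted column per model, e.g. "BagRF" above
)
head(ensemble)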

Next the function uses the user's input to remove ensemble columns whose correlation is above a given threshold (the remove_ensemble_correlations_greater_than argument), such as 0.90.
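One common way to do this kind of correlation filtering (a sketch of the idea, not necessarily the package's exact method) uses caret::findCorrelation:

# Sketch of correlation-based column removal -- not necessarily how the package does it
library(caret)

cutoff   <- 0.90
cor_mat  <- cor(ensemble)
too_high <- findCorrelation(cor_mat, cutoff = cutoff)   # columns correlated above the cutoff

if (length(too_high) > 0) {
  ensemble <- ensemble[, -too_high]
}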

Exploratory data analysis for the weighted ensembles

Next the function completes several plots for exploratory data analysis for the ensembles:

• Head of the ensemble

• Correlation table of the ensemble

Split the ensemble into train, test and validation sets

The function splits the ensemble data using the same train, test and validation proportions as the original data.

Randomly resample the ensemble data

Automatically build 17 models from the ensemble data

It does the same procedures for the ensemble as it did for the individual models:

Fit each model on the training data, make predictions and check accuracy on the test and validation data, calculate Bias, MAE, MSE and SSE for each model, and record the elapsed time.

• Ensemble Bagged Random Forest (tuned with optimized hyperparameters)

• Ensemble Bagging

• Ensemble Bayes GLM

• Ensemble BayesRNN

• Ensemble Boosted Random Forest (tuned with optimized hyperparameters)

• Ensemble Cubist

• Ensemble Earth

• Ensemble Elastic (optimized via cross validation)

• Ensemble Gradient Boosted

• Ensemble K-Nearest Neighbors (tuned with optimized hyperparameters)

• Ensemble Lasso (optimized via cross validation)

• Ensemble Linear (tuned with optimized hyperparameters)

• Ensemble Random Forest (tuned with optimized hyperparameters)

• Ensemble Ridge (optimized via cross validation)

• Ensemble RPart

• Ensemble Trees

• Ensemble XGBoost

Automatic summary table

The function automatically creates a summary table with the following results for each of the 40 models, sorted by holdout RMSE (a small sketch of how one row is assembled follows this list):

• Model name

• Holdout RMSE (accuracy)

• Standard deviation of the holdout RMSE

• Mean bias

• Mean MAE

• Mean MSE

• Mean SSE

• Mean of the data

• Standard deviation of the data

• Mean train RMSE

• Mean test RMSE

• Mean validation RMSE (the holdout RMSE is the mean of test and validation)

• Overfitting Min

• Overfitting Mean

• Overfitting Max

• Duration
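Conceptually, each model contributes one row built from the quantities computed in the earlier sketch, and the full table is then sorted by holdout RMSE (a sketch only, not the package's code):

# Sketch of one summary row, using the objects from the earlier Random Forest sketch
summary_row <- data.frame(
  Model        = "Random Forest",
  Holdout_RMSE = holdout_RMSE,
  Train_RMSE   = train_RMSE,
  Overfitting  = overfitting,
  Duration     = as.numeric(duration)
)

# With one such row per model, the table is sorted by holdout RMSE:
# summary_table[order(summary_table$Holdout_RMSE), ]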

Here is an example of the head of the summary report when we ran the function on Boston Housing. Note that all the most accurate models are ensembles except one (BayesRNN):

Model              RMSE
EnsembleEarth      0.1135
BayesRNN           0.1216
EnsembleBayesGLM   0.1274
EnsembleCubist     0.1459
EnsembleBayesRNN   0.2393
EnsembleLasso      0.2815

Automatic summary plots

The function automatically returns 25 plots. Here is the predicted vs. actual plot for the best model:

Best model, actual (x-axis) vs predicted (y-axis)

As a note, the overfitting-by-resample plot is potentially very useful. Clearly some of the models overfit more than others. This plot gives the results for each resample (25 in this case), and it is very easy to see that some of the models are more consistent than others. Here is a closeup of a few of the results, so the difference between train and holdout can be seen easily:

Train vs holdout



Model RMSE barchart, lower is better



Mean bias by model, closer to 0 is better



Duration by model, shorter is better

Summary: NumericEnsembles accomplishes all of these tasks in one code chunk, and typically the best model here beats the best models from previously published results.

Lowest RMSE from the Kaggle competition: 2.09684

Lowest RMSE from the NumericEnsembles package: 0.1135

Decrease in error rate: a 94.5871% decrease using NumericEnsembles compared to the best result from the student Kaggle competition. NumericEnsembles also returns tables, charts, and much more, in only a few minutes.

Calculate the percentage change from V1 = 2.09684 to V2 = 0.1135:

percentage change = (V2 − V1) / |V1| × 100
                  = (0.1135 − 2.09684) / |2.09684| × 100
                  = −1.98334 / 2.09684 × 100
                  = −0.945871 × 100
                  = −94.5871% change
                  = a 94.5871% decrease
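The same calculation in R:

V1 <- 2.09684   # lowest RMSE from the Kaggle competition
V2 <- 0.1135    # lowest RMSE from the NumericEnsembles package

(V2 - V1) / abs(V1) * 100   # about -94.5871, i.e. a 94.5871% decrease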

In fact, the 21st most accurate model from NumericEnsembles came very close to the #1 result in the Kaggle competition (21st most accurate NumericEnsembles result = 2.0973; best result in the Kaggle competition = 2.09684).

Example #2: Including categorical data in NumericEnsembles

For this example we are going to model the price of car seats using the Carseats data set. The issue is that there are three non-numeric columns in the data. Here is the head of the data:

Head of Carseats data

The NumericEnsembles function gives the user some options when there is categorical data. The options are:

0: No strings

1: Factor Values

2: One-Hot Encoding

3: One-Hot Encoding with jitter

Otherwise everything runs exactly the same as in the first example.
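For example, a call on the Carseats data might look like the sketch below (Carseats is in the ISLR package; the column number for Price and the choice of option 2 are assumptions for illustration):

# Sketch only -- the column number and option choice are illustrative assumptions
library(ISLR)
library(NumericEnsembles)

Numeric(data = Carseats,
  colnum = 6, # Price is column 6 of ISLR::Carseats (assumed target here)
  numresamples = 2,
  how_to_handle_strings = 2, # option 2: one-hot encode ShelveLoc, Urban and US
  do_you_have_new_data = "N",
  save_all_trained_models = "N",
  remove_ensemble_correlations_greater_than = 1.00,
  train_amount = 0.50,
  test_amount = 0.25,
  validation_amount = 0.25
)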

Example #3: Predicting on totally new data, saving all trained models

We will look at a subset of the Boston Housing data set: rows 6-506, that is, the data with the first five rows removed. All the models will be built as in the first example. However (and this is a huge difference), we will then use those pre-trained models to make predictions on totally new data.

A key requirement when testing whether models replicate on totally new data is using the exact same trained models on that new data. Thus we will not just run (for example) a new Bagged Random Forest on the new data; we will use the exact same trained model. The NumericEnsembles package makes this very easy to do.

We will run the analysis in a manner very similar to our first analysis, but with two very important changes:

First, the data will be the Boston Housing data with the first five rows removed.

Second, our new data will be the first five rows of Boston Housing, which none of the models have been trained on. Here is what that function call looks like in real life:

When the function asks for the location of the new data, enter https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv
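A hedged sketch of that call (the exact argument values are assumptions based on the first example; only do_you_have_new_data and save_all_trained_models change):

# Sketch only -- assumes the reduced data (Boston minus its first five rows), medv in column 14
library(MASS)
library(NumericEnsembles)

df <- Boston[-(1:5), ]

Numeric(data = df,
  colnum = 14, # medv, the median home value
  numresamples = 2,
  how_to_handle_strings = 0,
  do_you_have_new_data = "Y", # the function will ask for the location of the new data
  save_all_trained_models = "Y", # keep the trained models in the Environment
  remove_ensemble_correlations_greater_than = 1.00,
  train_amount = 0.50,
  test_amount = 0.25,
  validation_amount = 0.25
)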

The most accurate model this time is BayesRNN. Here are the actual vs predicted values for BayesRNN as given in the table:

Model            House 1    House 2    House 3    House 4    House 5
Actual data      24         21.6       34.7       33.4       36.2
Best predicted   23.8884    21.549     34.8931    33.5463    36.414
Difference       -0.1116    -0.051     0.1931     0.1563     0.214

The mean difference is 0.08016. Since these values are measured in thousands of dollars, the mean error is approximately $80.16 on homes with a mean value of $29,980.
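The arithmetic behind those figures:

actual    <- c(24, 21.6, 34.7, 33.4, 36.2)
predicted <- c(23.8884, 21.549, 34.8931, 33.5463, 36.414)

mean(predicted - actual)   # 0.08016, i.e. about $80.16 since the units are $1000s
mean(actual) * 1000        # 29980, the mean home value in dollars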

It is a valid conclusion that the best trained model (BayesRNN) worked successfully on totally new data.

Working with all the trained models

The NumericEnsembles package will also save the trained models in the Environment (but not on the user's hard drive). The trained models can provide a wealth of insight. For example, simply type the name of any trained model followed by a $, and all of that model's components become available. For example:

BayesRNN Options

One that might prove very useful is the trained Random Forest model, which provides the variable importance for the given model and data. For example:

Random Forest variable importance
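The standard randomForest tools apply to the saved model. Using the fit from the earlier sketch as a stand-in for the saved object:

# Illustrative only -- substitute the name of the saved Random Forest model
library(randomForest)

importance(fit)    # variable importance scores for each predictor
varImpPlot(fit)    # dotchart of variable importance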

Grand summary

The NumericEnsembles package automatically completes all of the model building, table creation, exploratory data analysis, summary tables, plots for the best model, summary barcharts, and much more. It requires only one block of code; the package completes everything else for the user.
