Getting Started with Modeltime H2O

Forecasting with modeltime.h2o made easy! This short tutorial shows how you can use H2O AutoML for time series forecasting and how to save and reload your fitted models.

Libraries

Load the following libraries:

library(tidymodels)
library(modeltime.h2o)
library(tidyverse)
library(timetk)

Collect data and split into training and test sets

Next, we load the walmart_sales_weekly data containing 7 time series and visualize them using the timetk::plot_time_series() function.

data_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales)

data_tbl %>% 
  group_by(id) %>% 
  plot_time_series(
      .date_var    = Date,
      .value       = Weekly_Sales,
      .facet_ncol  = 2,
      .smooth      = FALSE,
      .interactive = F
  )

Then, we split the data with the time_series_split() function, holding out the last three months as the testing set. We also define a preprocessing recipe that adds a calendar-based time series signature with step_timeseries_signature(), then bake it to produce the final training and testing data.

splits <- time_series_split(data_tbl, assess = "3 month", cumulative = TRUE)

recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date) 

train_tbl <- training(splits) %>% bake(prep(recipe_spec), .)
test_tbl  <- testing(splits) %>% bake(prep(recipe_spec), .)
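
If you want to see the calendar features the recipe adds, a quick glimpse() of the baked training data does the trick:

# Inspect the engineered time series signature features
train_tbl %>% glimpse()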

Model specification, training and prediction

To use modeltime.h2o, we first need to connect to an H2O cluster with the h2o.init() function. You can find more information on configuring the cluster by typing ?h2o.init or by visiting the official site.

h2o.init(
    nthreads = -1,
    ip       = 'localhost',
    port     = 54321
)
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         2 days 23 hours 
#>     H2O cluster timezone:       America/New_York 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.32.0.1 
#>     H2O cluster version age:    5 months and 6 days !!! 
#>     H2O cluster name:           H2O_started_from_R_mdancho_rfu672 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   7.66 GB 
#>     H2O cluster total cores:    12 
#>     H2O cluster allowed cores:  12 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
#>     R Version:                  R version 4.0.2 (2020-06-22)

Now comes the fun part! We define our model specification with the automl_reg() function and pass the arguments through the engine:

model_spec <- automl_reg(mode = 'regression') %>%
    set_engine(
         engine                     = 'h2o',
         max_runtime_secs           = 5, 
         max_runtime_secs_per_model = 3,
         max_models                 = 3,
         nfolds                     = 5,
         exclude_algos              = c("DeepLearning"),
         verbosity                  = NULL,
         seed                       = 786
    ) 

model_spec
#> H2O AutoML Model Specification (regression)
#> 
#> Engine-Specific Arguments:
#>   max_runtime_secs = 5
#>   max_runtime_secs_per_model = 3
#>   max_models = 3
#>   nfolds = 5
#>   exclude_algos = c("DeepLearning")
#>   verbosity = NULL
#>   seed = 786
#> 
#> Computational engine: h2o

Next, let’s train the model with fit()!

model_fitted <- model_spec %>%
    fit(Weekly_Sales ~ ., data = train_tbl)
#>                                           model_id mean_residual_deviance
#> 1 StackedEnsemble_AllModels_AutoML_20210315_105707               36064483
#> 2                 XGBoost_3_AutoML_20210315_105707               37869757
#> 3                 XGBoost_2_AutoML_20210315_105707               39097555
#> 4                 XGBoost_1_AutoML_20210315_105707               40118649
#>       rmse      mse      mae     rmsle
#> 1 6005.371 36064483 3679.126 0.1465321
#> 2 6153.841 37869757 3774.116 0.1485089
#> 3 6252.804 39097555 3983.517 0.1645246
#> 4 6333.928 40118649 4156.995 0.1716963
#> 
#> [4 rows x 6 columns]
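
The leaderboard printed above can also be pulled into your session as a tibble. A minimal sketch, assuming your version of modeltime.h2o exports the automl_leaderboard() helper:

# Extract the H2O AutoML leaderboard from the fitted model
automl_leaderboard(model_fitted)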

model_fitted
#> parsnip model object
#> 
#> Fit time:  7.8s 
#> 
#> H2O AutoML - Stackedensemble
#> --------
#> Model: Model Details:
#> ==============
#> 
#> H2ORegressionModel: stackedensemble
#> Model ID:  StackedEnsemble_AllModels_AutoML_20210315_105707 
#> Number of Base Models: 3
#> 
#> Base Models (count by algorithm type):
#> 
#> xgboost 
#>       3 
#> 
#> Metalearner:
#> 
#> Metalearner algorithm: glm
#> Metalearner cross-validation fold assignment:
#>   Fold assignment scheme: AUTO
#>   Number of folds: 5
#>   Fold column: NULL
#> Metalearner hyperparameters: 
#> 
#> 
#> H2ORegressionMetrics: stackedensemble
#> ** Reported on training data. **
#> 
#> MSE:  11532948
#> RMSE:  3396.019
#> MAE:  2151.187
#> RMSLE:  0.07858776
#> Mean Residual Deviance :  11532948
#> 
#> 
#> 
#> H2ORegressionMetrics: stackedensemble
#> ** Reported on cross-validation data. **
#> ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
#> 
#> MSE:  36064483
#> RMSE:  6005.371
#> MAE:  3679.126
#> RMSLE:  0.1465321
#> Mean Residual Deviance :  36064483

Finally, we predict() on the test dataset:

predict(model_fitted, test_tbl)
#> # A tibble: 84 x 1
#>      .pred
#>      <dbl>
#>  1  18396.
#>  2  31732.
#>  3  38593.
#>  4  40830.
#>  5  74870.
#>  6  82097.
#>  7 135766.
#>  8  17851.
#>  9  36977.
#> 10  37434.
#> # … with 74 more rows
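
If you want to eyeball these predictions against the actuals outside the Modeltime workflow, a quick dplyr sketch pairs them up:

# Bind predictions to the actual values for a quick comparison
test_tbl %>%
    select(id, Date, Weekly_Sales) %>%
    bind_cols(predict(model_fitted, test_tbl))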

Modeltime Workflow

Once we have our fitted model, we can follow the Modeltime Workflow:

Add fitted models to a Model Table

First, we create the model table:

modeltime_tbl <- modeltime_table(
    model_fitted
) 

modeltime_tbl
#> # Modeltime Table
#> # A tibble: 1 x 3
#>   .model_id .model   .model_desc                 
#>       <int> <list>   <chr>                       
#> 1         1 <fit[+]> H2O AUTOML - STACKEDENSEMBLE

Calibrate & Testing Set Forecast & Accuracy Evaluation

Next, we calibrate to the testing set and visualize the forecasts:

modeltime_tbl %>%
    modeltime_calibrate(test_tbl) %>%
    modeltime_forecast(
        new_data    = test_tbl,
        actual_data = data_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol = 2, 
        .interactive = FALSE
    )
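
The same calibration step also drives accuracy metrics. Here is a short sketch with modeltime_accuracy(), which reports MAE, RMSE, R-squared, and related measures on the testing set:

# Evaluate testing-set accuracy from the calibrated table
modeltime_tbl %>%
    modeltime_calibrate(test_tbl) %>%
    modeltime_accuracy()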

Refit to Full Dataset & Forecast Forward

Before refitting, let's prepare our data. We create data_prepared_tbl, the complete dataset (the union of the training and testing sets) with the features generated by the recipe_spec recipe. We then create future_prepared_tbl, which extends each series one year into the future and carries the same features.

data_prepared_tbl <- bind_rows(train_tbl, test_tbl)

future_tbl <- data_prepared_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()

future_prepared_tbl <- bake(prep(recipe_spec), future_tbl)
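
Before refitting, a quick sanity check on the horizon is worthwhile; with weekly data, each series should gain roughly 52 future rows:

# Confirm the future frame extends each series about one year
future_prepared_tbl %>%
    group_by(id) %>%
    summarise(n_future = n(), last_date = max(Date))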

Finally, we refit the model on the full dataset, then forecast over the future dataset and visualize the results.

refit_tbl <- modeltime_tbl %>%
    modeltime_refit(data_prepared_tbl)
#>                                           model_id mean_residual_deviance
#> 1 StackedEnsemble_AllModels_AutoML_20210315_105718               33718722
#> 2                 XGBoost_3_AutoML_20210315_105718               35982393
#> 3                 XGBoost_2_AutoML_20210315_105718               37693026
#> 4                 XGBoost_1_AutoML_20210315_105718               37841998
#>       rmse      mse      mae     rmsle
#> 1 5806.782 33718722 3547.778 0.1418645
#> 2 5998.533 35982393 3653.799 0.1446879
#> 3 6139.465 37693026 3945.148 0.1644903
#> 4 6151.585 37841998 4121.814 0.1804933
#> 
#> [4 rows x 6 columns]

refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2,
        .interactive = FALSE
    )

We could likely do better with a longer training budget, but this is a great result for a quick example!

Saving and Loading Models

H2O models need to be “serialized” (a fancy word for saved to a directory that contains the recipe for recreating the models). To save a model, use save_h2o_model().

model_fitted %>% 
  save_h2o_model(path = "../model_fitted", overwrite = TRUE)

You can reload the model into R using load_h2o_model().

model_h2o <- load_h2o_model(path = "../model_fitted/")
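
As a quick check, the reloaded model predicts just as the original did, provided an H2O cluster is still running:

# Predict with the reloaded model (requires an active H2O cluster)
predict(model_h2o, test_tbl)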

Take the High-Performance Forecasting Course

Become the forecasting expert for your organization


Time Series is Changing

Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.

High-Performance Forecasting Systems will save companies by improving accuracy and scalability. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).

How to Learn High-Performance Time Series Forecasting

I teach how to build an HPTSF System in my High-Performance Time Series Forecasting Course.

Become the Time Series Expert for your organization.


Take the High-Performance Time Series Forecasting Course