The goal of this document is to introduce kms (as in keras_model_sequential()), a regression-style function that allows users to call keras neural nets with R formula objects (hence, library(kerasformula)).

First, make sure that keras is properly configured:

install.packages("keras")
library(keras)
install_keras() # see https://keras.rstudio.com/ for details. 

kms splits training and test data into sparse matrices. kms also auto-detects whether the dependent variable is categorical, binary, or continuous. kms accepts the major parameters found in library(keras) as inputs (loss function, batch size, number of epochs, etc.) and allows users to customize basic neural nets (dense neural nets of various input shapes and dropout rates). The final example below also shows how to pass a compiled keras_model_sequential to kms (preferable for more complex models).
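
For example, a minimal call might look like the following sketch. The built-in mtcars data and the particular settings are placeholders for illustration; Nepochs and seed appear in the examples below, while batch_size and loss are assumed here to be pass-throughs to keras (check ?kms for the exact argument names in your installed version).

library(kerasformula)

# kms() should detect that mpg is continuous and fit a small dense net;
# batch_size and loss are assumed pass-throughs to keras
fit <- kms(mpg ~ cyl + wt + hp, data = mtcars,
           Nepochs = 10, seed = 2017,
           batch_size = 32, loss = "mean_squared_error")
plot(fit$history)   # training vs. validation loss by epoch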

IMDB Movie Reviews

This example works with some of the imdb movie review data that comes with library(keras). Specifically, this example compares the default dense model that kms generates to the lstm model described here. To expedite package building and installation, the code below is not actually run, but it can be run in under six minutes on a 2017 MacBook Pro with 16 GB of RAM (most of that time is spent on the lstm).

max_features <- 5000 # 5,000 words (ranked by popularity) found in movie reviews
maxlen <- 50  # Cut texts after 50 words (among top max_features most common words) 
Nsample <- 1000 

cat('Loading data...\n')
imdb <- keras::dataset_imdb(num_words = max_features)
imdb_df <- as.data.frame(cbind(c(imdb$train$y, imdb$test$y),
                               pad_sequences(c(imdb$train$x, imdb$test$x), maxlen = maxlen)))

set.seed(2017)   # can also set kms(..., seed = 2017)

demo_sample <- sample(nrow(imdb_df), Nsample)
P <- ncol(imdb_df) - 1
colnames(imdb_df) <- c("y", paste0("x", 1:P))

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10, 
                 scale = NULL) # scale = NULL means leave data on original scale


plot(out_dense$history)  # incredibly useful 
# choose Nepochs to maximize out-of-sample accuracy

out_dense$confusion
    1
  0 107
  1 105
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.495283 

Pretty bad: that's a 'broken clock' model (it predicts the same class regardless of input). Suppose we want to add some more layers. Below are the default settings for layers, apart from an additional softmax layer. Notice that in layers, anything that appears only once (like use_bias = TRUE) is repeated for each layer as appropriate; a shorthand sketch follows the output below.

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10, seed = 123, scale = NULL,
                 layers = list(units = c(512, 256, 128, NA), 
                               activation = c("softmax", "relu", "relu", "softmax"),
                               dropout = c(0.75, 0.4, 0.3, NA),
                               use_bias = TRUE,
                               kernel_initializer = NULL,
                               kernel_regularizer = "regularizer_l1",
                               bias_regularizer = "regularizer_l1",
                               activity_regularizer = "regularizer_l1"
                               ))
out_dense$confusion
     1
  0 92
  1 106
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.4816514
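
Because of that recycling rule, the layers argument can often be abbreviated. A sketch (not run; it assumes dropout recycles across layers the same way use_bias does above):

# dropout appears once, so it is repeated for each layer;
# units and activation are still given per layer
out_short <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10, scale = NULL,
                 layers = list(units = c(256, 128, NA),
                               activation = c("relu", "relu", "softmax"),
                               dropout = 0.4))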

No progress. Suppose we want to build an lstm model and pass it to kms.

use_session_with_seed(12345)
k <- keras_model_sequential()
k %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>% 
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

k %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
out_lstm <- kms("y ~ .", imdb_df[demo_sample, ], 
                keras_model_seq = k, Nepochs = 10, seed = 12345, scale = NULL)
out_lstm$confusion
     0  1
  0 74 23
  1 23 79
cat('Test accuracy:', out_lstm$evaluations$acc, "\n")
Test accuracy: 0.7688442 

76.8% out-of-sample accuracy. That's a marked improvement!

If you're OK with -> (right assignment), the above is equivalent to:

use_session_with_seed(12345)

keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>% 
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid') %>% 
  compile(loss = 'binary_crossentropy', 
          optimizer = 'adam', metrics = c('accuracy')) %>%
  kms(input_formula = "y ~ .", data = imdb_df[demo_sample, ], 
      Nepochs = 10, seed = 12345, scale = NULL) -> out_lstm

plot(out_lstm$history)

For another worked example starting with raw data (from rtweet), visit here.