Introduction to SuperML

Manish Saraswat

2018-12-19

The SuperML R package is designed to unify the model training process in R, much as scikit-learn does in Python. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-like interface (fit, predict) to train models faster.

In addition to building machine learning models, it provides handy functionality for feature engineering.

This ambitious package is my ongoing effort to help the R community build ML models in R more easily and quickly.
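Every trainer in superml follows the same two-step pattern: instantiate with hyperparameters, then call fit and predict. Here is a minimal sketch using the random forest trainer covered later in this vignette (the tiny synthetic data set is purely illustrative):

library(superml)
library(data.table)

# a small synthetic regression data set, for illustration only
df <- data.table(x1 = rnorm(100), x2 = rnorm(100))
df[, y := 2 * x1 - x2 + rnorm(100)]

rf <- RFTrainer$new(n_estimators = 50, classification = 0) # instantiate a trainer
rf$fit(X = df, y = "y")     # fit() takes the data and the name of the target column
pred <- rf$predict(df = df) # predict() scores a data set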

Install

This package is not yet on CRAN. Currently, you can install it using:

devtools::install_github("saraswatmks/superml")
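If you don't have devtools, the lighter remotes package provides the same function:

# alternative, using the remotes package
remotes::install_github("saraswatmks/superml")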

Examples - Machine Learning Models

This package uses existing R packages to build machine learning models. In this tutorial, we’ll use the data.table package for all data manipulation tasks.

Regression Data

We’ll quickly prepare the data set so it’s ready to be served for model training.

load("../data/reg_train.rda")

library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(superml)
library(kableExtra)
library(Metrics)
#> 
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#> 
#>     precision, recall

kable(head(reg_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000
7 20 RL 75 10084 Pave NA Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story 8 5 2004 2005 Gable CompShg VinylSd VinylSd Stone 186 Gd TA PConc Ex TA Av GLQ 1369 Unf 0 317 1686 GasA Ex Y SBrkr 1694 0 0 1694 1 0 2 0 3 1 Gd 7 Typ 1 Gd Attchd 2004 RFn 2 636 TA TA Y 255 57 0 0 0 0 NA NA NA 0 8 2007 WD Normal 307000
8 60 RL NA 10382 Pave NA IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story 7 6 1973 1973 Gable CompShg HdBoard HdBoard Stone 240 TA TA CBlock Gd TA Mn ALQ 859 BLQ 32 216 1107 GasA Ex Y SBrkr 1107 983 0 2090 1 0 2 1 3 1 TA 7 Typ 2 TA Attchd 1973 RFn 2 484 TA TA Y 235 204 228 0 0 0 NA NA Shed 350 11 2009 WD Normal 200000
9 50 RM 51 6120 Pave NA Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin 7 5 1931 1950 Gable CompShg BrkFace Wd Shng None 0 TA TA BrkTil TA TA No Unf 0 Unf 0 952 952 GasA Gd Y FuseF 1022 752 0 1774 0 0 2 0 2 2 TA 8 Min1 2 TA Detchd 1931 Unf 2 468 Fa TA Y 90 0 205 0 0 0 NA NA NA 0 4 2008 WD Abnorml 129900
10 190 RL 50 7420 Pave NA Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf 5 6 1939 1950 Gable CompShg MetalSd MetalSd None 0 TA TA BrkTil TA TA No GLQ 851 Unf 0 140 991 GasA Ex Y SBrkr 1077 0 0 1077 1 0 1 0 2 2 TA 5 Typ 2 TA Attchd 1939 RFn 1 205 Gd TA Y 0 4 0 0 0 0 NA NA NA 0 1 2008 WD Normal 118000

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA' 
#> ... (message repeated for each encoded column)

# remove noisy columns
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1

KNN Regression

knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'reg')
knn$fit_predict(train = xtrain, test = xtest, y = 'SalePrice')
preds <- knn$get_predictions(type = 'raw')

# regression task: evaluate with RMSE against the actual SalePrice
rmse(actual = xtest$SalePrice, predicted = preds)

Simple Regression

lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -415628   -14565     -479    12425   257642  
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -1.829e+06  1.670e+06  -1.095 0.273653    
#> MSSubClass    -1.270e+01  5.975e+01  -0.212 0.831782    
#> MSZoning      -9.749e+02  1.581e+03  -0.617 0.537619    
#> LotFrontage   -4.076e+01  3.383e+01  -1.205 0.228594    
#> LotArea        1.347e-01  1.522e-01   0.885 0.376284    
#> Street        -1.874e+04  1.660e+04  -1.129 0.259297    
#> LotShape      -1.598e+03  2.102e+03  -0.760 0.447338    
#> LandContour   -1.640e+02  2.238e+03  -0.073 0.941611    
#> Utilities     -7.878e+04  3.549e+04  -2.220 0.026639 *  
#> LotConfig      4.127e+03  1.764e+03   2.339 0.019531 *  
#> LandSlope      7.256e+03  5.459e+03   1.329 0.184079    
#> Neighborhood  -2.911e+02  1.919e+02  -1.517 0.129479    
#> Condition1    -2.610e+03  1.131e+03  -2.308 0.021225 *  
#> Condition2    -8.498e+02  6.918e+03  -0.123 0.902257    
#> BldgType      -5.984e+03  2.275e+03  -2.630 0.008668 ** 
#> HouseStyle    -1.697e+02  9.126e+02  -0.186 0.852514    
#> OverallQual    1.199e+04  1.442e+03   8.315 3.15e-16 ***
#> OverallCond    7.063e+03  1.328e+03   5.321 1.29e-07 ***
#> YearBuilt      5.073e+02  8.363e+01   6.066 1.89e-09 ***
#> YearRemodAdd   2.758e+01  8.249e+01   0.334 0.738163    
#> RoofStyle      8.112e+03  2.131e+03   3.807 0.000150 ***
#> RoofMatl      -1.453e+04  2.652e+03  -5.479 5.48e-08 ***
#> Exterior1st   -7.066e+02  6.600e+02  -1.071 0.284576    
#> Exterior2nd    6.001e+02  6.169e+02   0.973 0.330926    
#> MasVnrType    -5.226e+03  1.886e+03  -2.771 0.005693 ** 
#> MasVnrArea     4.103e+01  7.666e+00   5.352 1.09e-07 ***
#> ExterQual      1.091e+04  2.638e+03   4.135 3.87e-05 ***
#> ExterCond     -6.188e+03  2.815e+03  -2.199 0.028152 *  
#> Foundation     5.324e+02  1.583e+03   0.336 0.736709    
#> BsmtQual       8.106e+03  1.544e+03   5.251 1.87e-07 ***
#> BsmtCond      -1.507e+03  2.007e+03  -0.751 0.452824    
#> BsmtExposure  -5.512e+03  1.452e+03  -3.797 0.000156 ***
#> BsmtFinType1   2.960e+02  7.240e+02   0.409 0.682806    
#> BsmtFinSF1     7.887e+00  6.201e+00   1.272 0.203688    
#> BsmtFinType2   6.599e+02  1.244e+03   0.530 0.595935    
#> BsmtFinSF2     3.285e+00  1.077e+01   0.305 0.760400    
#> BsmtUnfSF      2.395e+00  5.745e+00   0.417 0.676859    
#> Heating       -3.122e+03  3.675e+03  -0.849 0.395850    
#> HeatingQC      3.928e+02  1.516e+03   0.259 0.795648    
#> CentralAir    -9.254e+02  5.849e+03  -0.158 0.874325    
#> Electrical     2.488e+03  2.256e+03   1.103 0.270482    
#> `1stFlrSF`     6.393e+01  7.624e+00   8.385  < 2e-16 ***
#> `2ndFlrSF`     5.449e+01  6.218e+00   8.763  < 2e-16 ***
#> LowQualFinSF   7.197e+00  2.737e+01   0.263 0.792641    
#> BsmtFullBath   1.251e+04  3.080e+03   4.062 5.28e-05 ***
#> BsmtHalfBath   9.122e+03  4.860e+03   1.877 0.060864 .  
#> FullBath       5.949e+03  3.459e+03   1.720 0.085786 .  
#> HalfBath      -1.727e+03  3.225e+03  -0.535 0.592477    
#> BedroomAbvGr  -7.529e+03  2.101e+03  -3.584 0.000355 ***
#> KitchenAbvGr  -2.488e+04  6.289e+03  -3.956 8.17e-05 ***
#> KitchenQual    6.016e+03  1.875e+03   3.208 0.001381 ** 
#> TotRmsAbvGrd   3.240e+03  1.509e+03   2.147 0.032044 *  
#> Functional    -4.777e+03  1.538e+03  -3.106 0.001954 ** 
#> Fireplaces     5.470e+03  2.338e+03   2.340 0.019505 *  
#> FireplaceQu    4.551e+03  1.314e+03   3.465 0.000555 ***
#> GarageType     9.724e+02  1.300e+03   0.748 0.454708    
#> GarageYrBlt   -2.487e+00  5.872e+00  -0.424 0.672022    
#> GarageFinish   3.832e+03  1.520e+03   2.522 0.011840 *  
#> GarageCars     1.589e+04  3.461e+03   4.592 4.98e-06 ***
#> GarageArea    -1.466e+01  1.155e+01  -1.269 0.204578    
#> GarageQual     3.385e+03  3.880e+03   0.872 0.383270    
#> GarageCond    -2.916e+03  3.624e+03  -0.805 0.421161    
#> PavedDrive    -1.469e+03  3.402e+03  -0.432 0.666018    
#> WoodDeckSF     2.063e+01  9.240e+00   2.233 0.025781 *  
#> OpenPorchSF    1.016e+01  1.879e+01   0.541 0.588791    
#> EnclosedPorch -1.427e+01  1.970e+01  -0.724 0.469073    
#> `3SsnPorch`   -5.051e+00  3.548e+01  -0.142 0.886827    
#> ScreenPorch    5.189e+01  2.123e+01   2.444 0.014690 *  
#> PoolArea      -1.318e+01  2.625e+01  -0.502 0.615790    
#> Fence         -1.954e+03  1.476e+03  -1.324 0.185811    
#> MiscVal        7.190e-02  2.024e+00   0.036 0.971672    
#> MoSold         2.985e+01  3.942e+02   0.076 0.939655    
#> YrSold         3.591e+02  8.274e+02   0.434 0.664385    
#> SaleType      -6.432e+02  1.374e+03  -0.468 0.639833    
#> SaleCondition  4.637e+03  1.493e+03   3.105 0.001961 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1082276968)
#> 
#>     Null deviance: 6.7433e+12  on 1023  degrees of freedom
#> Residual deviance: 1.0271e+12  on  949  degrees of freedom
#> AIC: 24282
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 34942.05

Lasso Regression

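# alpha = 1 applies the L1 (lasso) penalty; lambda fixes the penalty strength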
lf <- LMTrainer$new(family = "gaussian", alpha=1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37247.37

Ridge Regression

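# alpha = 0 applies the L2 (ridge) penalty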
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37168.12

Linear Regression with CV

lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 36197.44

Random Forest

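# classification = 0 grows a regression forest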
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         783165647201
#> GarageCars                          493327424965
#> 1stFlrSF                            444256416285
#> ExterQual                           431473669083
#> GarageArea                          426584499742
#> YearBuilt                           361457805289
#> FullBath                            303023775651
#> BsmtFinSF1                          275239028295
#> GarageYrBlt                         265762810472
#> TotRmsAbvGrd                        264823768807
#> 2ndFlrSF                            235962008304
#> LotArea                             212894703393
#> KitchenQual                         202757765931
#> YearRemodAdd                        180257031584
#> MasVnrArea                          176661524468
#> Fireplaces                          152440938414
#> BsmtQual                            130604044591
#> FireplaceQu                         113777282000
#> LotFrontage                          96974873182
#> OpenPorchSF                          87151480885
#> WoodDeckSF                           80361681881
#> BsmtUnfSF                            69675014973
#> BedroomAbvGr                         53092044734
#> Neighborhood                         52278334327
#> GarageType                           46264741988
#> RoofStyle                            46145773703
#> HeatingQC                            42386340452
#> BsmtExposure                         38352314961
#> MSSubClass                           37502912726
#> GarageFinish                         35226019092
#> Exterior2nd                          33151824597
#> MoSold                               33067821801
#> MasVnrType                           31702846742
#> OverallCond                          31418954660
#> Exterior1st                          31254759265
#> HalfBath                             28300758839
#> BsmtFullBath                         23160900492
#> BsmtFinType1                         22985050944
#> SaleCondition                        22689914367
#> YrSold                               22649917921
#> LotShape                             22600345472
#> MSZoning                             18997236282
#> HouseStyle                           18962190305
#> Foundation                           16788195549
#> LandContour                          15894245763
#> LotConfig                            15454153115
#> SaleType                             15015837653
#> ScreenPorch                          14565973729
#> BldgType                             14114654438
#> GarageCond                           13357462803
#> GarageQual                           12596283176
#> CentralAir                           11295194841
#> PoolArea                             10780586438
#> BsmtCond                             10713639659
#> RoofMatl                              9523627865
#> EnclosedPorch                         9520608357
#> LandSlope                             9104553289
#> Fence                                 8697996384
#> Condition1                            8105823860
#> BsmtHalfBath                          7079007307
#> BsmtFinSF2                            6815789768
#> Functional                            6720942152
#> ExterCond                             6451384706
#> PavedDrive                            6144814768
#> BsmtFinType2                          6046633582
#> KitchenAbvGr                          5739780112
#> Electrical                            2851017800
#> 3SsnPorch                             2156392798
#> MiscVal                               1919882177
#> Heating                               1101675918
#> Street                                 687858599
#> LowQualFinSF                           633587353
#> Condition2                             177295073
#> Utilities                               33334805
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 28637.62

XGBoost

xgb <- XGBTrainer$new(objective = "reg:linear",
                      n_estimators = 500,
                      eval_metric = "rmse",
                      maximize = FALSE,
                      learning_rate = 0.1,
                      max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:179849.765625    val-rmse:176584.281250 
#> [51] train-rmse:8126.902344  val-rmse:33075.578125 
#> [101]    train-rmse:4940.106445  val-rmse:33232.164062 
#> [151]    train-rmse:3332.927490  val-rmse:33217.136719 
#> [201]    train-rmse:2242.137207  val-rmse:33165.082031 
#> [251]    train-rmse:1568.891602  val-rmse:33145.296875 
#> [301]    train-rmse:1111.639160  val-rmse:33153.835938 
#> [351]    train-rmse:804.120667   val-rmse:33160.457031 
#> [401]    train-rmse:591.843811   val-rmse:33156.765625 
#> [451]    train-rmse:431.189850   val-rmse:33152.750000 
#> [500]    train-rmse:332.463287   val-rmse:33149.964844
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 33149.96

Grid Search

xgb <- XGBTrainer$new(objective = "reg:linear")

gst <- GridSearchTrainer$new(trainer = xgb,
                             parameters = list(n_estimators = c(10, 50),
                                               max_depth = c(5, 2)),
                             n_folds = 3,
                             scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:140703.296875 
#> [10] train-rmse:16649.193359
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143633.156250 
#> [10] train-rmse:16114.295898
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144703.062500 
#> [10] train-rmse:16503.988281
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:140703.296875 
#> [50] train-rmse:4600.740234
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143633.156250 
#> [50] train-rmse:3559.837402
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144703.062500 
#> [50] train-rmse:4065.962158
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:141586.750000 
#> [10] train-rmse:29210.416016
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144483.890625 
#> [10] train-rmse:29556.125000
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:145774.046875 
#> [10] train-rmse:32064.306641
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:141586.750000 
#> [50] train-rmse:16623.572266
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144483.890625 
#> [50] train-rmse:16999.929688
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:145774.046875 
#> [50] train-rmse:18379.587891
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Note that accuracy and auc are classification metrics, which is why they come out as 0/NaN on this regression target; the grid search above (and the random search below) is shown only to demonstrate the tuning API.

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchTrainer$new(trainer = rf,
                               parameters = list(n_estimators = c(10, 50),
                                                 max_depth = c(5, 2)),
                               n_folds = 3,
                               scoring = c('accuracy', 'auc'),
                               n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 2
#> 
#> $accuracy_avg
#> [1] 0.001955051
#> 
#> $accuracy_sd
#> [1] 0.001693146
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Binary Classification Data

Here, we will solve a simple binary classification problem: predicting which passengers survived the Titanic disaster. The idea is to demonstrate how to use this package to solve classification problems.

Data Preparation

# load class
load('../data/cla_train.rda')

kable(head(cla_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 C

# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA' 
#> ... (message repeated for each encoded column)

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now, our data is ready to be served for model training. Let’s do it.

KNN Classification

knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'class')
knn$fit_predict(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$get_predictions(type = 'prob')
labels <- knn$get_predictions(type='raw')
auc(actual = xtest$Survived, predicted=labels)
#> [1] 0.6776491

Logistic Regression

lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0647  -0.5139  -0.3550   0.5659   2.5979  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.882835   0.667638   2.820  0.00480 ** 
#> Pclass      -0.991285   0.198153  -5.003 5.66e-07 ***
#> Sex          3.014839   0.250533  12.034  < 2e-16 ***
#> Age         -0.050270   0.010402  -4.833 1.35e-06 ***
#> SibSp       -0.376242   0.132598  -2.837  0.00455 ** 
#> Parch       -0.137521   0.146524  -0.939  0.34796    
#> Fare         0.001671   0.002794   0.598  0.54981    
#> Cabin        0.017868   0.005923   3.017  0.00256 ** 
#> Embarked     0.076637   0.148818   0.515  0.60657    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 823.56  on 623  degrees of freedom
#> Residual deviance: 495.21  on 615  degrees of freedom
#> AIC: 513.21
#> 
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.7930805

Lasso Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7960625

Ridge Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7948466

Random Forest

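# classification = 1 grows a classification forest;
# max_features is the number of features considered at each split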
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")

pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               83.179331
#> Fare                              49.530245
#> Age                               44.026986
#> Cabin                             27.806777
#> Pclass                            22.210427
#> SibSp                             13.742906
#> Parch                              9.837351
#> Embarked                           6.777296

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739

XGBoost

xgb <- XGBTrainer$new(objective = "binary:logistic",
                      n_estimators = 500,
                      eval_metric = "auc",
                      maximize = TRUE,
                      learning_rate = 0.1,
                      max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-auc:0.910654  val-auc:0.815229 
#> [51] train-auc:0.977882  val-auc:0.803243 
#> [101]    train-auc:0.990142  val-auc:0.808280 
#> [151]    train-auc:0.994508  val-auc:0.807354 
#> [201]    train-auc:0.996520  val-auc:0.809352 
#> [251]    train-auc:0.997454  val-auc:0.809496 
#> [301]    train-auc:0.998147  val-auc:0.808309 
#> [351]    train-auc:0.998554  val-auc:0.808136 
#> [401]    train-auc:0.998796  val-auc:0.809120 
#> [451]    train-auc:0.999060  val-auc:0.809699 
#> [500]    train-auc:0.999104  val-auc:0.809699

pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8096989

Grid Search

xgb <- XGBTrainer$new(objective = "binary:logistic")
gst <- GridSearchTrainer$new(trainer = xgb,
                             parameters = list(n_estimators = c(10, 50),
                                               max_depth = c(5, 2)),
                             n_folds = 3,
                             scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.129808 
#> [10] train-error:0.098558
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.105769 
#> [10] train-error:0.088942
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.098558 
#> [10] train-error:0.069712
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.129808 
#> [50] train-error:0.038462
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.105769 
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.098558 
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.177885 
#> [10] train-error:0.153846
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.201923 
#> [10] train-error:0.137019
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.182692 
#> [10] train-error:0.115385
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.177885 
#> [50] train-error:0.110577
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.201923 
#> [50] train-error:0.096154
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.182692 
#> [50] train-error:0.081731
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] 0.883034
#> 
#> $auc_sd
#> [1] 0.0242347

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchTrainer$new(trainer = rf,
                               parameters = list(n_estimators = c(10, 50),
                                                 max_depth = c(5, 2)),
                               n_folds = 3,
                               scoring = c('accuracy', 'auc'),
                               n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.849359
#> 
#> $accuracy_sd
#> [1] 0.0264787
#> 
#> $auc_avg
#> [1] 0.8279856
#> 
#> $auc_sd
#> [1] 0.02242134

Let’s create a new feature based on the target variable using target encoding, and test a model.

# add target encoding features
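# smoothMean() maps each level of 'Embarked' to a smoothed mean of 'Survived',
# returning encoded columns for both train and test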
xtrain[, feat_01 := smoothMean(train_df = xtrain,
                        test_df = xtest,
                        colname = "Embarked",
                        target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
                               test_df = xtest,
                               colname = "Embarked",
                               target = "Survived")$test[[2]]]

# train a random forest with the new feature included
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               85.213569
#> Fare                              51.676287
#> Age                               47.071256
#> Cabin                             28.804936
#> Pclass                            22.431287
#> SibSp                             13.735815
#> Parch                              9.643044
#> feat_01                            4.449812
#> Embarked                           4.385365

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739