The superml R package is designed to unify the model training process in R, much like Python does. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this behaviour is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-style interface (fit, predict) to train models faster.
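To make that concrete, here is a minimal sketch of the interface, using the RFTrainer class demonstrated later in this tutorial (train_data, test_data and "target" are placeholder names):
model <- RFTrainer$new(n_estimators = 100)  # instantiate a trainer
model$fit(X = train_data, y = "target")     # fit; y is the name of the target column
preds <- model$predict(df = test_data)      # predict on new data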
In addition to building machine learning models, it offers handy functionalities for feature engineering.
This ambitious package is my ongoing effort to help the R community build ML models easily and quickly.
This package is not yet on CRAN. Currently, you can install it using:
devtools::install_github("saraswatmks/superml")
This package builds on existing R packages to train models. In this tutorial, we'll use the data.table package for all data manipulation tasks.
We'll quickly prepare the data set so it is ready for model training.
load("../data/reg_train.rda")
library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(superml)
library(kableExtra)
library(Metrics)
#>
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#>
#> precision, recall
kable(head(reg_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
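Note that createDataPartition samples rows at random, so for a reproducible split you may want to fix a seed first (my addition, not part of the original run):
set.seed(42)  # any fixed seed makes the 70/30 split reproducible
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)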
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])
xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]
# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]
for(c in cat_cols){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
(the message above repeats once for each encoded column; the duplicates are omitted here)
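As a side note, LabelEncoder also works outside a loop on a single vector; a toy sketch (fit_transform, which combines the fit and transform steps above, should be available per the superml documentation):
lbl <- LabelEncoder$new()
lbl$fit_transform(c("red", "green", "red", "blue"))  # returns integer codes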
# remove noisy features
noise <- c('GrLivArea','TotalBsmtSF')
xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]
# fill remaining missing values with -1 (a sentinel that tree-based models handle well; it can distort linear models)
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
KNN Regression
knn <- KNNTrainer$new(k = 2, prob = T, type = 'reg')
knn$fit_predict(train = xtrain, test = xtest, y = 'SalePrice')
labels <- knn$get_predictions(type = 'raw')
rmse(actual = xtest$SalePrice, predicted = labels)
Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -415628 -14565 -479 12425 257642
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.829e+06 1.670e+06 -1.095 0.273653
#> MSSubClass -1.270e+01 5.975e+01 -0.212 0.831782
#> MSZoning -9.749e+02 1.581e+03 -0.617 0.537619
#> LotFrontage -4.076e+01 3.383e+01 -1.205 0.228594
#> LotArea 1.347e-01 1.522e-01 0.885 0.376284
#> Street -1.874e+04 1.660e+04 -1.129 0.259297
#> LotShape -1.598e+03 2.102e+03 -0.760 0.447338
#> LandContour -1.640e+02 2.238e+03 -0.073 0.941611
#> Utilities -7.878e+04 3.549e+04 -2.220 0.026639 *
#> LotConfig 4.127e+03 1.764e+03 2.339 0.019531 *
#> LandSlope 7.256e+03 5.459e+03 1.329 0.184079
#> Neighborhood -2.911e+02 1.919e+02 -1.517 0.129479
#> Condition1 -2.610e+03 1.131e+03 -2.308 0.021225 *
#> Condition2 -8.498e+02 6.918e+03 -0.123 0.902257
#> BldgType -5.984e+03 2.275e+03 -2.630 0.008668 **
#> HouseStyle -1.697e+02 9.126e+02 -0.186 0.852514
#> OverallQual 1.199e+04 1.442e+03 8.315 3.15e-16 ***
#> OverallCond 7.063e+03 1.328e+03 5.321 1.29e-07 ***
#> YearBuilt 5.073e+02 8.363e+01 6.066 1.89e-09 ***
#> YearRemodAdd 2.758e+01 8.249e+01 0.334 0.738163
#> RoofStyle 8.112e+03 2.131e+03 3.807 0.000150 ***
#> RoofMatl -1.453e+04 2.652e+03 -5.479 5.48e-08 ***
#> Exterior1st -7.066e+02 6.600e+02 -1.071 0.284576
#> Exterior2nd 6.001e+02 6.169e+02 0.973 0.330926
#> MasVnrType -5.226e+03 1.886e+03 -2.771 0.005693 **
#> MasVnrArea 4.103e+01 7.666e+00 5.352 1.09e-07 ***
#> ExterQual 1.091e+04 2.638e+03 4.135 3.87e-05 ***
#> ExterCond -6.188e+03 2.815e+03 -2.199 0.028152 *
#> Foundation 5.324e+02 1.583e+03 0.336 0.736709
#> BsmtQual 8.106e+03 1.544e+03 5.251 1.87e-07 ***
#> BsmtCond -1.507e+03 2.007e+03 -0.751 0.452824
#> BsmtExposure -5.512e+03 1.452e+03 -3.797 0.000156 ***
#> BsmtFinType1 2.960e+02 7.240e+02 0.409 0.682806
#> BsmtFinSF1 7.887e+00 6.201e+00 1.272 0.203688
#> BsmtFinType2 6.599e+02 1.244e+03 0.530 0.595935
#> BsmtFinSF2 3.285e+00 1.077e+01 0.305 0.760400
#> BsmtUnfSF 2.395e+00 5.745e+00 0.417 0.676859
#> Heating -3.122e+03 3.675e+03 -0.849 0.395850
#> HeatingQC 3.928e+02 1.516e+03 0.259 0.795648
#> CentralAir -9.254e+02 5.849e+03 -0.158 0.874325
#> Electrical 2.488e+03 2.256e+03 1.103 0.270482
#> `1stFlrSF` 6.393e+01 7.624e+00 8.385 < 2e-16 ***
#> `2ndFlrSF` 5.449e+01 6.218e+00 8.763 < 2e-16 ***
#> LowQualFinSF 7.197e+00 2.737e+01 0.263 0.792641
#> BsmtFullBath 1.251e+04 3.080e+03 4.062 5.28e-05 ***
#> BsmtHalfBath 9.122e+03 4.860e+03 1.877 0.060864 .
#> FullBath 5.949e+03 3.459e+03 1.720 0.085786 .
#> HalfBath -1.727e+03 3.225e+03 -0.535 0.592477
#> BedroomAbvGr -7.529e+03 2.101e+03 -3.584 0.000355 ***
#> KitchenAbvGr -2.488e+04 6.289e+03 -3.956 8.17e-05 ***
#> KitchenQual 6.016e+03 1.875e+03 3.208 0.001381 **
#> TotRmsAbvGrd 3.240e+03 1.509e+03 2.147 0.032044 *
#> Functional -4.777e+03 1.538e+03 -3.106 0.001954 **
#> Fireplaces 5.470e+03 2.338e+03 2.340 0.019505 *
#> FireplaceQu 4.551e+03 1.314e+03 3.465 0.000555 ***
#> GarageType 9.724e+02 1.300e+03 0.748 0.454708
#> GarageYrBlt -2.487e+00 5.872e+00 -0.424 0.672022
#> GarageFinish 3.832e+03 1.520e+03 2.522 0.011840 *
#> GarageCars 1.589e+04 3.461e+03 4.592 4.98e-06 ***
#> GarageArea -1.466e+01 1.155e+01 -1.269 0.204578
#> GarageQual 3.385e+03 3.880e+03 0.872 0.383270
#> GarageCond -2.916e+03 3.624e+03 -0.805 0.421161
#> PavedDrive -1.469e+03 3.402e+03 -0.432 0.666018
#> WoodDeckSF 2.063e+01 9.240e+00 2.233 0.025781 *
#> OpenPorchSF 1.016e+01 1.879e+01 0.541 0.588791
#> EnclosedPorch -1.427e+01 1.970e+01 -0.724 0.469073
#> `3SsnPorch` -5.051e+00 3.548e+01 -0.142 0.886827
#> ScreenPorch 5.189e+01 2.123e+01 2.444 0.014690 *
#> PoolArea -1.318e+01 2.625e+01 -0.502 0.615790
#> Fence -1.954e+03 1.476e+03 -1.324 0.185811
#> MiscVal 7.190e-02 2.024e+00 0.036 0.971672
#> MoSold 2.985e+01 3.942e+02 0.076 0.939655
#> YrSold 3.591e+02 8.274e+02 0.434 0.664385
#> SaleType -6.432e+02 1.374e+03 -0.468 0.639833
#> SaleCondition 4.637e+03 1.493e+03 3.105 0.001961 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1082276968)
#>
#> Null deviance: 6.7433e+12 on 1023 degrees of freedom
#> Residual deviance: 1.0271e+12 on 949 degrees of freedom
#> AIC: 24282
#>
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 34942.05
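For reference, rmse from the Metrics package is plain root mean squared error, equivalent to:
sqrt(mean((xtest$SalePrice - predictions)^2))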
Lasso Regression
lf <- LMTrainer$new(family = "gaussian", alpha=1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37247.37
Ridge Regression
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37168.12
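The alpha argument is the usual elastic-net mixing parameter (alpha = 1 gives the lasso above, alpha = 0 gives ridge); intermediate values blend the two penalties. A quick sketch with an arbitrary, untuned mix (my example, not from the original):
lf <- LMTrainer$new(family = "gaussian", alpha = 0.5, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)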
Linear Regression with Cross Validation
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 36197.44
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> OverallQual 783165647201
#> GarageCars 493327424965
#> 1stFlrSF 444256416285
#> ExterQual 431473669083
#> GarageArea 426584499742
#> YearBuilt 361457805289
#> FullBath 303023775651
#> BsmtFinSF1 275239028295
#> GarageYrBlt 265762810472
#> TotRmsAbvGrd 264823768807
#> 2ndFlrSF 235962008304
#> LotArea 212894703393
#> KitchenQual 202757765931
#> YearRemodAdd 180257031584
#> MasVnrArea 176661524468
#> Fireplaces 152440938414
#> BsmtQual 130604044591
#> FireplaceQu 113777282000
#> LotFrontage 96974873182
#> OpenPorchSF 87151480885
#> WoodDeckSF 80361681881
#> BsmtUnfSF 69675014973
#> BedroomAbvGr 53092044734
#> Neighborhood 52278334327
#> GarageType 46264741988
#> RoofStyle 46145773703
#> HeatingQC 42386340452
#> BsmtExposure 38352314961
#> MSSubClass 37502912726
#> GarageFinish 35226019092
#> Exterior2nd 33151824597
#> MoSold 33067821801
#> MasVnrType 31702846742
#> OverallCond 31418954660
#> Exterior1st 31254759265
#> HalfBath 28300758839
#> BsmtFullBath 23160900492
#> BsmtFinType1 22985050944
#> SaleCondition 22689914367
#> YrSold 22649917921
#> LotShape 22600345472
#> MSZoning 18997236282
#> HouseStyle 18962190305
#> Foundation 16788195549
#> LandContour 15894245763
#> LotConfig 15454153115
#> SaleType 15015837653
#> ScreenPorch 14565973729
#> BldgType 14114654438
#> GarageCond 13357462803
#> GarageQual 12596283176
#> CentralAir 11295194841
#> PoolArea 10780586438
#> BsmtCond 10713639659
#> RoofMatl 9523627865
#> EnclosedPorch 9520608357
#> LandSlope 9104553289
#> Fence 8697996384
#> Condition1 8105823860
#> BsmtHalfBath 7079007307
#> BsmtFinSF2 6815789768
#> Functional 6720942152
#> ExterCond 6451384706
#> PavedDrive 6144814768
#> BsmtFinType2 6046633582
#> KitchenAbvGr 5739780112
#> Electrical 2851017800
#> 3SsnPorch 2156392798
#> MiscVal 1919882177
#> Heating 1101675918
#> Street 687858599
#> LowQualFinSF 633587353
#> Condition2 177295073
#> Utilities 33334805
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 28637.62
XGBoost
xgb <- XGBTrainer$new(objective = "reg:linear"
                      , n_estimators = 500
                      , eval_metric = "rmse"
                      , maximize = F
                      , learning_rate = 0.1
                      , max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:179849.765625 val-rmse:176584.281250
#> [51] train-rmse:8126.902344 val-rmse:33075.578125
#> [101] train-rmse:4940.106445 val-rmse:33232.164062
#> [151] train-rmse:3332.927490 val-rmse:33217.136719
#> [201] train-rmse:2242.137207 val-rmse:33165.082031
#> [251] train-rmse:1568.891602 val-rmse:33145.296875
#> [301] train-rmse:1111.639160 val-rmse:33153.835938
#> [351] train-rmse:804.120667 val-rmse:33160.457031
#> [401] train-rmse:591.843811 val-rmse:33156.765625
#> [451] train-rmse:431.189850 val-rmse:33152.750000
#> [500] train-rmse:332.463287 val-rmse:33149.964844
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 33149.96
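Note how val-rmse bottoms out around iteration 51 while train-rmse keeps shrinking, a classic overfitting signature; a run with far fewer trees is worth trying (my suggestion, reusing only the parameters shown above):
xgb_small <- XGBTrainer$new(objective = "reg:linear"
                            , n_estimators = 60
                            , eval_metric = "rmse"
                            , maximize = F
                            , learning_rate = 0.1
                            , max_depth = 6)
xgb_small$fit(X = xtrain, y = "SalePrice", valid = xtest)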
Grid Search
Note that accuracy and auc are classification metrics; on this regression target they come back as 0 and NaN in the results below, so only the selected parameter combination itself is meaningful here.
xgb <- XGBTrainer$new(objective = "reg:linear")
gst <- GridSearchTrainer$new(trainer = xgb,
                             parameters = list(n_estimators = c(10, 50), max_depth = c(5, 2)),
                             n_folds = 3,
                             scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:140703.296875
#> [10] train-rmse:16649.193359
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143633.156250
#> [10] train-rmse:16114.295898
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144703.062500
#> [10] train-rmse:16503.988281
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:140703.296875
#> [50] train-rmse:4600.740234
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:143633.156250
#> [50] train-rmse:3559.837402
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144703.062500
#> [50] train-rmse:4065.962158
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:141586.750000
#> [10] train-rmse:29210.416016
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144483.890625
#> [10] train-rmse:29556.125000
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:145774.046875
#> [10] train-rmse:32064.306641
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:141586.750000
#> [50] train-rmse:16623.572266
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:144483.890625
#> [50] train-rmse:16999.929688
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:145774.046875
#> [50] train-rmse:18379.587891
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Random Search
As above, accuracy and auc are classification metrics, so the NaN auc and near-zero accuracy reported below are expected for this regression target.
rf <- RFTrainer$new()
rst <- RandomSearchTrainer$new(trainer = rf,
                               parameters = list(n_estimators = c(10, 50),
                                                 max_depth = c(5, 2)),
                               n_folds = 3,
                               scoring = c('accuracy', 'auc'),
                               n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 2
#>
#> $accuracy_avg
#> [1] 0.001955051
#>
#> $accuracy_sd
#> [1] 0.001693146
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package for classification tasks.
Data Preparation
# load the classification data
load('../data/cla_train.rda')
kable(head(cla_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | | C |
# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]
# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA'
(the message above repeats once for each encoded column; the duplicates are omitted here)
# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
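One caveat (my note, not part of the original): imputing the test set with its own median uses test-set information; a stricter approach reuses the training median:
train_median_age <- median(xtrain$Age, na.rm = TRUE)
xtest[, Age := replace(Age, is.na(Age), train_median_age)]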
# drop these features
to_drop <- c('PassengerId','Ticket','Name')
xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]
Now, our data is ready to be served for model training. Let’s do it.
KNN Classification
knn <- KNNTrainer$new(k = 2, prob = T, type = 'class')
knn$fit_predict(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$get_predictions(type = 'prob')
labels <- knn$get_predictions(type='raw')
auc(actual = xtest$Survived, predicted=labels)
#> [1] 0.6776491
Logistic Regression
lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0647 -0.5139 -0.3550 0.5659 2.5979
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.882835 0.667638 2.820 0.00480 **
#> Pclass -0.991285 0.198153 -5.003 5.66e-07 ***
#> Sex 3.014839 0.250533 12.034 < 2e-16 ***
#> Age -0.050270 0.010402 -4.833 1.35e-06 ***
#> SibSp -0.376242 0.132598 -2.837 0.00455 **
#> Parch -0.137521 0.146524 -0.939 0.34796
#> Fare 0.001671 0.002794 0.598 0.54981
#> Cabin 0.017868 0.005923 3.017 0.00256 **
#> Embarked 0.076637 0.148818 0.515 0.60657
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 823.56 on 623 degrees of freedom
#> Residual deviance: 495.21 on 615 degrees of freedom
#> AIC: 513.21
#>
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.7930805
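Here lf$predict returns predicted probabilities (which is what the auc call above expects); if you need hard class labels, apply a threshold (0.5 is a common default, not tuned):
pred_labels <- as.integer(predictions > 0.5)
mean(pred_labels == xtest$Survived)  # simple accuracy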
Lasso Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7960625
Ridge Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7948466
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 83.179331
#> Fare 49.530245
#> Age 44.026986
#> Cabin 27.806777
#> Pclass 22.210427
#> SibSp 13.742906
#> Parch 9.837351
#> Embarked 6.777296
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739
XGBoost
xgb <- XGBTrainer$new(objective = "binary:logistic"
                      , n_estimators = 500
                      , eval_metric = "auc"
                      , maximize = T
                      , learning_rate = 0.1
                      , max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-auc:0.910654 val-auc:0.815229
#> [51] train-auc:0.977882 val-auc:0.803243
#> [101] train-auc:0.990142 val-auc:0.808280
#> [151] train-auc:0.994508 val-auc:0.807354
#> [201] train-auc:0.996520 val-auc:0.809352
#> [251] train-auc:0.997454 val-auc:0.809496
#> [301] train-auc:0.998147 val-auc:0.808309
#> [351] train-auc:0.998554 val-auc:0.808136
#> [401] train-auc:0.998796 val-auc:0.809120
#> [451] train-auc:0.999060 val-auc:0.809699
#> [500] train-auc:0.999104 val-auc:0.809699
pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8096989
Grid Search
(accuracy_avg is reported as 0 in the results below even though auc looks healthy, so compare parameter combinations on auc_avg here.)
xgb <- XGBTrainer$new(objective = "binary:logistic")
gst <- GridSearchTrainer$new(trainer = xgb,
                             parameters = list(n_estimators = c(10, 50),
                                               max_depth = c(5, 2)),
                             n_folds = 3,
                             scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.129808
#> [10] train-error:0.098558
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.105769
#> [10] train-error:0.088942
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.098558
#> [10] train-error:0.069712
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.129808
#> [50] train-error:0.038462
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.105769
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.098558
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.177885
#> [10] train-error:0.153846
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.201923
#> [10] train-error:0.137019
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.182692
#> [10] train-error:0.115385
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.177885
#> [50] train-error:0.110577
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.201923
#> [50] train-error:0.096154
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-error:0.182692
#> [50] train-error:0.081731
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] 0.883034
#>
#> $auc_sd
#> [1] 0.0242347
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchTrainer$new(trainer = rf,
                               parameters = list(n_estimators = c(10, 50),
                                                 max_depth = c(5, 2)),
                               n_folds = 3,
                               scoring = c('accuracy', 'auc'),
                               n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0.849359
#>
#> $accuracy_sd
#> [1] 0.0264787
#>
#> $auc_avg
#> [1] 0.8279856
#>
#> $auc_sd
#> [1] 0.02242134
Let’s create a new feature based on the target variable using target encoding, and then test a model.
# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$test[[2]]]
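Conceptually, smoothed target encoding replaces each category level with a blend of the level's mean target and the global mean target, weighted by how often the level occurs. An illustrative sketch of the idea (smoothMean's exact weighting and defaults may differ):
# encoded value for a level seen n times, with level mean cat_mean,
# global mean global_mean and smoothing strength m:
smooth_encode <- function(n, cat_mean, global_mean, m = 1) {
  (n * cat_mean + m * global_mean) / (n + m)
}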
# train a random forest with the new feature included
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 85.213569
#> Fare 51.676287
#> Age 47.071256
#> Cabin 28.804936
#> Pclass 22.431287
#> SibSp 13.735815
#> Parch 9.643044
#> feat_01 4.449812
#> Embarked 4.385365
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739