The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
You fit a boosting tree model in R with your favourite package xgboost
, the validation results looks great, so next question comes up as how to deploy this model into production so that others could utilize it to help with the business. Incorporating the model into a shiny app would certainly be a good idea, but sometimes the model needs to be integrated into some other systems that the company is heavily relied on. Plus moving large amount of data between database and R could be time and memory consuming. So we propose R package xgb2sql
enabling in-database scoring of XGBoost models built in R by translating trained model objects into SQL query.
CRAN Task View: Model Deployment with R categorizes the process of deploying models to various environments for scoring or inferencing on new data into two categories. The first category is Deployment through Different Types of Artifacts, which basically means exporting the model as an object, then using supported software/platform to consume this object scoring out the model predictions. The other category is Deployment through Cloud/Server, which includes a). providing an R interface to third-party managed services such as Google Cloud Machine Learning Engine; b). turning R code into web API and opening service on the server. Our approach provides SQL query producing model predictions, which can be taken as a combination of the model itself plus the scoring process. The output SQL query can be treated as an artifact, but we can easily set up service for it on the database server.
The SQL query generated by this tool is basic enough to be compatible with all SQL-based database products and services. Other than this tool, there are two R packages providing modeling and predicting capability inside database:
ibmdbR
offers modeling and predicting capability for naive-Bayes, linear regression, decision tree, association rules, and K-means clustering. But the only supported database product and service is IBM DB2.tidypredict
leverages dplyr
and dbplyr
for SQL translation, supporting linear regression, generalized linear model, and random forest.Here is the outline for the rest of this vignette:
As we know, xgboost
only consumes numeric input for its model fitting function 1. So after transferring raw table in database to R as a data.frame/data.table, same one-hot encoding needs to be performed on both the table and the data.frame/data.table. Here we have function onehot2sql()
to perform one-hot encoding on the training data in R, producing at the same the SQL query performing the exact transformation for the raw table in database. Let’s start with loading the sample dataset from ggplot2
:
library(data.table)
library(xgboost)
library(xgb2sql)
df <- data.frame(ggplot2::diamonds)
head(df)
#> carat cut color clarity depth table price x y z
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Funtion onehot2sql()
is built upon base R functions model.frame()
and model.matrix()
. Other than consuming a data.frame as input, this function has been optimized to work with a data.table with greater efficiency. It outputs a matrix ready for model fitting following rules listed below:
The function treats any non-numeric columns, i.e., columns with class not being numeric
and integer
, as categorical and performs one-hot encoding for them.
The one-hot encoding doesn’t remove one feature-level combination for each categorical feature, as model.matrix
does in order to avoid issues caused by multicollinearity. Although the output matrix conveys same amount of information with and without one feature-level combination removed, thus producing similar model performance, system knowledge gained along this modeling practice is very different. If let’s say the “cut” of diamonds being “Ideal” has a huge impact on its price as the target/response, removing binary column “cut.Ideal” in the output matrix would result in the predictive power of “cut.Ideal” being scattered among other “cut” columns 2. The model performance would be comparative, but we would miss the information that “cut” being “Ideal” is the dominate factor of price, by studying the variable importance. So as multicollinearity wouldn’t be a problem for tree-based model 3, we believe performing full one-hot encoding is more appropriate for XGBoost modeling.
The function keeps NAs inside both categorical and numeric features preserved. As pointed by the author of xgboost
, the algorithm will automatically learn what is the best direction to go when a value is missing, which can be viewed as automatically “learn” what is the best imputation value for missing values based on reduction on training loss 4. This is one of the reasons of XGBoost being so powerful, so we are keeping all NAs in the output matrix.
The function outputs meta
information tracking all the levels for each categorical feature. If it is given to the function as an input, the exact feature-level combinations will be populated, even if the new data is missing one level for a particular categorical feature, or having a new level never seen before.
Available arguments of this function are, which will be explained with examples:
onehot2sql(data, meta=NULL, sep="_", ws_replace=TRUE, ws_replace_with="",
unique_id=NULL, output_file_name=NULL, input_table_name=NULL)
Output of this function is a list containing:
meta
data tracking the transformation.model.matrix
being the data after processing which is ready for XGBoost fitting.sql
performing the exact one-hot encoding in the database.So let’ take a look of its basic usage:
out <- onehot2sql(df)
print(out$meta)
#> $num.vec
#> [1] "carat" "depth" "table" "price" "x" "y" "z"
#>
#> $catg.vec
#> [1] "cut" "color" "clarity"
#>
#> $contrasts
#> $contrasts$cut_
#> Fair Good Very Good Premium Ideal
#> Fair 1 0 0 0 0
#> Good 0 1 0 0 0
#> Very Good 0 0 1 0 0
#> Premium 0 0 0 1 0
#> Ideal 0 0 0 0 1
#>
#> $contrasts$color_
#> D E F G H I J
#> D 1 0 0 0 0 0 0
#> E 0 1 0 0 0 0 0
#> F 0 0 1 0 0 0 0
#> G 0 0 0 1 0 0 0
#> H 0 0 0 0 1 0 0
#> I 0 0 0 0 0 1 0
#> J 0 0 0 0 0 0 1
#>
#> $contrasts$clarity_
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> I1 1 0 0 0 0 0 0 0
#> SI2 0 1 0 0 0 0 0 0
#> SI1 0 0 1 0 0 0 0 0
#> VS2 0 0 0 1 0 0 0 0
#> VS1 0 0 0 0 1 0 0 0
#> VVS2 0 0 0 0 0 1 0 0
#> VVS1 0 0 0 0 0 0 1 0
#> IF 0 0 0 0 0 0 0 1
head(out$model.matrix)
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1 1 0.23 0 0 0 1
#> 2 1 0.21 0 0 1 0
#> 3 1 0.23 0 0 0 0
#> 4 1 0.29 0 0 0 0
#> 5 1 0.31 0 0 0 1
#> 6 1 0.24 0 0 0 0
#> clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1 0 0 0 0 0 1
#> 2 0 0 0 0 0 1
#> 3 1 0 0 0 0 1
#> 4 0 1 0 0 0 0
#> 5 0 0 0 0 0 0
#> 6 0 0 0 1 0 0
#> color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1 0 0 0 0 0 0 0 1
#> 2 0 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 1 0
#> 4 0 0 0 1 0 0 0 0
#> 5 0 0 0 0 1 0 1 0
#> 6 0 0 0 0 1 0 0 0
#> cut_Premium cut_VeryGood depth price table x y z
#> 1 0 0 61.5 326 55 3.95 3.98 2.43
#> 2 1 0 59.8 326 61 3.89 3.84 2.31
#> 3 0 0 56.9 327 65 4.05 4.07 2.31
#> 4 1 0 62.4 334 58 4.20 4.23 2.63
#> 5 0 0 63.3 335 58 4.34 4.35 2.75
#> 6 0 1 62.8 336 57 3.94 3.96 2.48
It should be noted that level “Very Good” for feature “cut” has been replaced with “VeryGood”, with the white-space removed. This behaviour is controlled by function arguments ws_replace=TRUE
and ws_replace_with=""
, where other symbol can be specified to replace the white-space inside levels of categorical features. Such processing is very necessary as SQL database usually doesn’t allow white-space inside its table column names. And symbol separating the feature and its levels is controlled by sep="_"
. The output model.matrix would have all its columns reordered alphabetically.
The SQL query performing one-hot encoding for the raw table is:
cat(out$sql)
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
We want to emphasise here that an unique row identifier inside the raw table is crucial for in-database scoring of XGBoost model. Column name of the identifier can be specified by the function argument unique_id
, which will be passed along to the table after one-hot encoding. If it is not given, SQL query will be populated with column name “ROW_KEY” for the identifier. Similarly, “INPUT_TABLE” is used in the query if name of the raw table input_table_name
is NULL
. Given a valid value, the SQL query will be written to the file specified by output_file_name
.
Let’s have another example with NAs and a date column:
d2 <- data.table(ggplot2::diamonds)
# change column class
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
# create IDate column
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
# add NAs
d2[1, clarity:=NA]
d2[2, depth:=NA]
head(d2)
#> carat cut color clarity depth table price x y z
#> 1: 0.23 Ideal E <NA> 61.5 55 326 3.95 3.98 2.43
#> 2: 0.21 Premium E SI1 NA 61 326 3.89 3.84 2.31
#> 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
#> 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6: 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> tsdt
#> 1: 2017-01-04
#> 2: 2017-01-04
#> 3: 2017-01-04
#> 4: 2017-01-05
#> 5: 2017-01-05
#> 6: 2017-01-05
out2 <- onehot2sql(d2)
head(out2$model.matrix)
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1 1 0.23 NA NA NA NA
#> 2 1 0.21 0 0 1 0
#> 3 1 0.23 0 0 0 0
#> 4 1 0.29 0 0 0 0
#> 5 1 0.31 0 0 0 1
#> 6 1 0.24 0 0 0 0
#> clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1 NA NA NA NA 0 1
#> 2 0 0 0 0 0 1
#> 3 1 0 0 0 0 1
#> 4 0 1 0 0 0 0
#> 5 0 0 0 0 0 0
#> 6 0 0 0 1 0 0
#> color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1 0 0 0 0 0 0 0 1
#> 2 0 0 0 0 0 0 0 0
#> 3 0 0 0 0 0 0 1 0
#> 4 0 0 0 1 0 0 0 0
#> 5 0 0 0 0 1 0 1 0
#> 6 0 0 0 0 1 0 0 0
#> cut_Premium cut_VeryGood depth price table tsdt_20170104 tsdt_20170105
#> 1 0 0 61.5 326 55 1 0
#> 2 1 0 NA 326 61 1 0
#> 3 0 0 56.9 327 65 1 0
#> 4 1 0 62.4 334 58 0 1
#> 5 0 0 63.3 335 58 0 1
#> 6 0 1 62.8 336 57 0 1
#> x y z
#> 1 3.95 3.98 2.43
#> 2 3.89 3.84 2.31
#> 3 4.05 4.07 2.31
#> 4 4.20 4.23 2.63
#> 5 4.34 4.35 2.75
#> 6 3.94 3.96 2.48
Then let’s look at when meta
is given to data with new elements, whether onehot2sql()
will output model.matrix with identical columns as the training data, in order to apply predict()
to the trained model on the new data:
newdata <- d2[1:5,]
# newdata has columns with new elements
newdata[5, clarity:='NEW']; newdata[1,tsdt:=as.IDate('2018-05-01')]
# newdata has a new column
newdata[, new_col:=1]
# newdata is lacking a column
newdata[, cut:=NULL]
head(newdata)
#> carat color clarity depth table price x y z tsdt new_col
#> 1: 0.23 E <NA> 61.5 55 326 3.95 3.98 2.43 2018-05-01 1
#> 2: 0.21 E SI1 NA 61 326 3.89 3.84 2.31 2017-01-04 1
#> 3: 0.23 E VS1 56.9 65 327 4.05 4.07 2.31 2017-01-04 1
#> 4: 0.29 I VS2 62.4 58 334 4.20 4.23 2.63 2017-01-05 1
#> 5: 0.31 J NEW 63.3 58 335 4.34 4.35 2.75 2017-01-05 1
onehot2sql(newdata, meta=out2$meta)$model.matrix
#> Warning in onehot2sql(newdata, meta = out2$meta): Following columns are populated with NAs:
#> cut
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1 1 0.23 NA NA NA NA
#> 2 1 0.21 0 0 1 0
#> 3 1 0.23 0 0 0 0
#> 4 1 0.29 0 0 0 0
#> 5 1 0.31 0 0 0 0
#> clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1 NA NA NA NA 0 1
#> 2 0 0 0 0 0 1
#> 3 1 0 0 0 0 1
#> 4 0 1 0 0 0 0
#> 5 0 0 0 0 0 0
#> color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1 0 0 0 0 0 NA NA NA
#> 2 0 0 0 0 0 NA NA NA
#> 3 0 0 0 0 0 NA NA NA
#> 4 0 0 0 1 0 NA NA NA
#> 5 0 0 0 0 1 NA NA NA
#> cut_Premium cut_VeryGood depth price table tsdt_20170104 tsdt_20170105
#> 1 NA NA 61.5 326 55 0 0
#> 2 NA NA NA 326 61 1 0
#> 3 NA NA 56.9 327 65 1 0
#> 4 NA NA 62.4 334 58 0 1
#> 5 NA NA 63.3 335 58 0 1
#> x y z
#> 1 3.95 3.98 2.43
#> 2 3.89 3.84 2.31
#> 3 4.05 4.07 2.31
#> 4 4.20 4.23 2.63
#> 5 4.34 4.35 2.75
We can see from this example that
We recommend any feature engineering and/or missing imputation work to be done before applying function onehot2sql()
to the training data in R. It should be the last step before kicking off the model fitting. And SQL query for feature engineering and/or missing imputation can be placed as a sub-query inside the one-hot query. For example, replacing “INPUT_TABLE” inside out$sql
with following sub-query will do one-hot encoding together with missing imputation for feature “clarity”:
(SELECT ROW_KEY, [cut], [color],
(case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
[carat], [depth], [table], [price], [x], [y], [z]
FROM INPUT_TABLE) AS IMPUTED_TABLE
Before taking a close look at function booster2sql()
translating XGBoost model into SQL query, we want to illustrate the suggested work-flow for the whole process of model fitting and scoring with package xgb2sql
:
dplyr
and DBI
here.onehot2sql()
to the data.frame/data.table, obtaining the model.matrix and storing the one-hot query.booster2sql()
to the final model, producing the XGBoost query for its in-database scoring.Now let’s move back to function booster2sql()
. Available arguments are:
booster2sql(xgbModel, print_progress=FALSE, unique_id=NULL,
output_file_name=NULL, input_table_name=NULL, input_onehot_query=NULL)
The model input xgbModel
to this function should have a class of xgb.Booster
. And print_progress=FALSE
controls whether the translating progress should be printed to console. Similarly, unique_id
and input_table_name
should be given to generate the SQL query. It should be noted that there must be a valid file path for output_file_name
to write the query, otherwise the function will not run.
Let’s try to predict the “price” of diamonds using the other features. In order to demonstrate the generated XGBoost query, we will train the model with max.depth=2
and nround=2
.
x <- out$model.matrix[,colnames(out$model.matrix)!='price']
y <- out$model.matrix[,colnames(out$model.matrix)=='price']
bst <- xgboost(data = x,
label = y,
max.depth = 2,
eta = .3,
nround = 2,
objective = 'reg:linear')
#> [1] train-rmse:4095.421143
#> [2] train-rmse:3074.222656
booster2sql(bst, output_file_name='xgb.txt')
#> query is written to file with row unique id named as ROW_KEY
#> query is written to file with input table named as MODREADY_TABLE
cat(readChar('xgb.txt', file.info('xgb.txt')$size))
#> SELECT ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (
#> SELECT ROW_KEY ,
#> (CASE WHEN [carat] < 0.995000005 THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END)
#> WHEN [carat] >= 0.995000005 THEN
#> (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#> WHEN [y] >= 7.19499969 THEN 3696.24292
#> WHEN [y] IS NULL THEN 1841.06018 END)
#> WHEN [carat] IS NULL THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM MODREADY_TABLE
#> UNION ALL
#>
#> SELECT ROW_KEY ,
#> (CASE WHEN [y] < 6.69499969 THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END)
#> WHEN [y] >= 6.69499969 THEN
#> (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#> WHEN [y] >= 7.65499973 THEN 3217.57129
#> WHEN [y] IS NULL THEN 1814.65881 END)
#> WHEN [y] IS NULL THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM MODREADY_TABLE
#> ) AS TREES_TABLE GROUP BY ROW_KEY
We can see that each SELECT ... AS ONETREE
section inside the XGBoost query is composed of nested case when statement, providing scores along a tree structure. And each of these sections represents one round/iteration of the XGBoost model. Values for the splits and scores within the query are from the xgb.dump()
of the model without any rounding:
xgb.dump(bst)
#> [1] "booster[0]"
#> [2] "0:[f1<0.995000005] yes=1,no=2,missing=1"
#> [3] "1:[f25<5.53499985] yes=3,no=4,missing=3"
#> [4] "3:leaf=317.401001"
#> [5] "4:leaf=922.349731"
#> [6] "2:[f25<7.19499969] yes=5,no=6,missing=5"
#> [7] "5:leaf=1841.06018"
#> [8] "6:leaf=3696.24292"
#> [9] "booster[1]"
#> [10] "0:[f25<6.69499969] yes=1,no=2,missing=1"
#> [11] "1:[f1<0.824999988] yes=3,no=4,missing=3"
#> [12] "3:leaf=289.332123"
#> [13] "4:leaf=1056.4021"
#> [14] "2:[f25<7.65499973] yes=5,no=6,missing=5"
#> [15] "5:leaf=1814.65881"
#> [16] "6:leaf=3217.57129"
It should be noted that model prediction calculated by adding up the scores provided by xgb.dump()
, is different from that by applying predict()
to the model directly. It is a rounding difference thus extremely insignificant. But since the XGBoost query is generated with scores from xgb.dump
, this difference will still be there between the in-database scoring and the R’s predict()
of the model.
There is one last argument of booster2sql()
we haven’t talked about, i.e., input_onehot_query
. Here we can input the one-hot query generated by onehot2sql()
, which will be used as sub-query replacing “MODREADY_TABLE” within the XGBoost query. In this way, the XGBoost query can be executed on the raw table, producing the model predictions directly.
booster2sql(bst, output_file_name='onehot-xgb.txt', input_onehot_query=out$sql)
#> query is written to file with row unique id named as ROW_KEY
cat(readChar('onehot-xgb.txt', file.info('onehot-xgb.txt')$size))
#> SELECT ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (
#> SELECT ROW_KEY ,
#> (CASE WHEN [carat] < 0.995000005 THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END)
#> WHEN [carat] >= 0.995000005 THEN
#> (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#> WHEN [y] >= 7.19499969 THEN 3696.24292
#> WHEN [y] IS NULL THEN 1841.06018 END)
#> WHEN [carat] IS NULL THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM (
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
#> ) AS MODREADY_TABLE
#> UNION ALL
#>
#> SELECT ROW_KEY ,
#> (CASE WHEN [y] < 6.69499969 THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END)
#> WHEN [y] >= 6.69499969 THEN
#> (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#> WHEN [y] >= 7.65499973 THEN 3217.57129
#> WHEN [y] IS NULL THEN 1814.65881 END)
#> WHEN [y] IS NULL THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM (
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
#> ) AS MODREADY_TABLE
#> ) AS TREES_TABLE GROUP BY ROW_KEY
As processing time and query size grow exponentially with max.depth
, linearly with nround
, this approach of combining the one-hot query and the XGBoost query together should be used for only simple models.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.