Example showing safe “best practice” use of the ‘vtreat’ variable preparation library. For more on vtreat see the package documentation and the Win-Vector blog.
Build an example data frame with no relation between x and y. We are using a synthetic data set so we know what the “right answer” is (no signal). False fitting on no-signal variables is bad for at least two reasons: it creates a false impression of a good model fit on the training data, and the resulting model does not hold up on future data.
This example shows things we don’t want to happen, and then the additional precautions that help prevent them.
set.seed(22626)
d <- data.frame(x=sample(paste('level',1:1000,sep=''),2000,replace=TRUE)) # the independent variable: a categorical with many levels.
d$y <- runif(nrow(d))>0.5 # the quantity to be predicted, notice: independent of variables.
d$rgroup <- round(100*runif(nrow(d))) # the random group used for splitting the data set, not a variable.
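As a quick sanity check (our addition for illustration, not part of the original example), we can confirm the outcome rate is near chance by construction and that every level of x is rare:

mean(d$y)       # overall positive rate: near 0.5 by construction
max(table(d$x)) # even the most frequent level of x appears only a handful of times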
Bad practice: use the same set of data to prepare the variable encoding and to train a model. This leads to the false belief (derived from the training set) that the model had a good fit, largely because the treated variable appears to consume only one degree of freedom when it actually consumes many more. In many cases a reasonable setting of ‘pruneSig’ (say 0.01) will help keep a noise variable from being considered desirable, but selected variables may still be mis-used by downstream modeling.
dTrain <- d[d$rgroup<=80,,drop=FALSE]
dTest <- d[d$rgroup>80,,drop=FALSE]
library('vtreat')
treatments <- vtreat::designTreatmentsC(dTrain,'x','y',TRUE,
rareCount=0 # Note: usually want rareCount>0, setting to zero to illustrate problem
)
## [1] "desigining treatments Tue Nov 10 12:28:24 2015"
## [1] "design var x Tue Nov 10 12:28:24 2015"
## [1] "scoring treatments Tue Nov 10 12:28:25 2015"
## [1] "have treatment plan Tue Nov 10 12:28:26 2015"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: usually want pruneSig to be a small fraction, setting to null to illustrate problem
)
m1 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m1)) # notice low residual deviance
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2753 -1.2753 0.2849 1.0826 2.7703
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22721 0.05503 4.129 3.64e-05 ***
## x_catB 4.35004 0.72866 5.970 2.37e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2270.5 on 1637 degrees of freedom
## Residual deviance: 1863.4 on 1636 degrees of freedom
## AIC: 1867.4
##
## Number of Fisher Scoring iterations: 10
dTrain$predM1 <- predict(m1,newdata=dTrainTreated,type='response')
# devtools::install_github("WinVector/WVPlots")
# library('WVPlots')
plotRes <- function(d,predName,yName,title) {
  # Summarize a classifier: print the confusion matrix (at a 0.5 threshold)
  # and the resulting accuracy for the given prediction column.
  print(title)
  tab <- table(truth=d[[yName]],pred=d[[predName]]>0.5)
  print(tab)
  # sum the diagonal of the confusion matrix (the correct predictions)
  diagSum <- sum(vapply(seq_len(min(dim(tab))),
                        function(i) tab[i,i],numeric(1)))
  acc <- diagSum/sum(tab)
  # if(requireNamespace("WVPlots",quietly=TRUE)) {
  #   print(WVPlots::ROCPlot(d,predName,yName,title))
  # }
  print(paste('accuracy',acc))
}
plotRes(dTrain,'predM1','y','model1 on train')
## [1] "model1 on train"
## pred
## truth FALSE TRUE
## FALSE 211 597
## TRUE 3 827
## [1] "accuracy 0.633699633699634"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM1 <- predict(m1,newdata=dTestTreated,type='response')
plotRes(dTest,'predM1','y','model1 on test')
## [1] "model1 on test"
## pred
## truth FALSE TRUE
## FALSE 13 162
## TRUE 18 169
## [1] "accuracy 0.502762430939227"
The above is bad: we saw a “significant” model fit on training data (even though there is no relation to be found). This means the treated training data can be confusing to machine learning techniques and to the analyst. The issue is that the training data is no longer exchangeable with the test data, because the training data was used to build the variable encodings. One way to avoid this is to not use the training data for variable encoding construction, but instead to use a third data set for this task.
First, notice that vtreat did not think there was any usable signal, and did not want us to use the variables (we only got them by setting ‘pruneSig=c()’). Also notice we set rareCount=0, which allows the use of very rare levels (which help drive the problem).
print(treatments$scoreFrame)
## varName origName varMoves PRESSRsquared psig sig catPRSquared
## 1 x_catB x TRUE -0.00420741 1 0.1297236 0.004039834
## csig
## 1 0.1297236
‘vtreat’ estimates that a signal as strong as the one seen on the derived variable ‘x_catB’ happens roughly 13% of the time (the ‘csig’ of about 0.13 above) for a variable with that sort of distribution, even when there is no signal. But also notice the downstream machine learning (in this case a standard logistic regression) used the variable wrongly. It gave it a large coefficient (around 4.4), and thought it had a reliable estimate of that coefficient and a significant model that substantially reduced the deviance (from 2270.5 to 1863.4), when in fact it was given nothing. So any variables that do get through may have distributional issues (and misleadingly low apparent degrees of freedom).
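To read vtreat’s own assessment directly (a small illustration reusing the ‘treatments’ object from above): the ‘csig’ column of the score frame is the estimated probability that a pure-noise variable of this type would show a fit this good.

print(treatments$scoreFrame[,c('varName','csig')]) # csig near 0.13: quite plausible under pure noise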
The biggest contributors to this distributional issue tend to be rare levels of categorical variables. Since the individual levels are rare, we have unreliable estimates of their effects; but if there are very many of them, we may still see quite a large aggregate effect. To help combat this we have a control called ‘rareCount’. Any level that is observed no more than ‘rareCount’ times during training is re-mapped to a new special level called “rare” and not allowed to contribute directly (i.e., it cannot generate a unique indicator column, and does not have a direct effect on the ‘catB’ or ‘catN’ encodings). If the rare levels share a distinct behavior, the pooled “rare” level can capture it after grouping.
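A minimal sketch of this control, reusing the data from above (the threshold of 5 is an illustrative choice, not a recommendation from the original text):

treatmentsPooled <- vtreat::designTreatmentsC(dTrain,'x','y',TRUE,
   rareCount=5 # any level observed no more than 5 times is pooled into "rare"
)
dTrainPooled <- vtreat::prepare(treatmentsPooled,dTrain,pruneSig=c())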
Another undesirable effect is over-estimating the significance of the derived-variable fit for ‘catB’ and ‘catN’ impact-coded variables. To fight this, vtreat attempts to estimate out-of-sample or cross-validated effect significances (when it has enough data). So, with enough data, setting the ‘pruneSig’ parameter during prepare will help remove noise variables. One can set ‘pruneSig’ to something like 1/number-of-columns to ensure that with high probability only a constant number of truly useless variables make it through to later modeling. However, the significance of a given effect size for variables that actually carry some signal (i.e. non-noise variables) can still be sensitive to in-sample versus out-of-sample scoring, and to the hiding of degrees of freedom that occurs when a large categorical variable (which represents many degrees of freedom) is re-coded as an impact or effect (which appears to have only a single degree of freedom).
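As a sketch of that rule of thumb: with, say, 100 candidate variables one would demand significance at or below 1/100 (the count of 100 is hypothetical; this example has only the single variable x, whose estimated significance of about 0.13 fails the filter, so prepare stops with an error, just as in the tryCatch example later in this document):

tryCatch(
   vtreat::prepare(treatments,dTrain,pruneSig=1/100), # keep only variables with sig <= 0.01
   error=function(e) print(paste('caught',conditionMessage(e)))
)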
We next show how to avoid these undesirable illusory effects: better practice in partitioning and using training data. We are doing more with our data (essentially chaining models), so we have to take a bit more care with our data.
Correct practice: use different data sets to prepare the variable encoding and to train the model. This prevents the false belief (derived from the training set) that the model had a good fit, a belief largely driven by the treated variable appearing to consume only one degree of freedom when it actually consumes many more.
Remember, the goal isn’t good performance on training data; it is good performance on future data (simulated here by the test set). So doing well on training and badly on test is worse than doing badly on both.
Below is part of our suggested work pattern: coding/train/test split.
dCode <- d[d$rgroup<=20,,drop=FALSE]
dTrain <- d[(d$rgroup>20) & (d$rgroup<=80),,drop=FALSE]
treatments <- vtreat::designTreatmentsC(dCode,'x','y',TRUE,
rareCount=0, # Note: in a real application set this to something larger, like 5
rareSig=c() # Note: in a real application set this to something like 0.3
)
## [1] "desigining treatments Tue Nov 10 12:28:26 2015"
## [1] "design var x Tue Nov 10 12:28:26 2015"
## [1] "scoring treatments Tue Nov 10 12:28:26 2015"
## [1] "have treatment plan Tue Nov 10 12:28:26 2015"
dTrainTreated <- vtreat::prepare(treatments,dTrain,
pruneSig=c() # Note: set this to filter, like 0.05 or 1/nvars
)
m2 <- glm(y~x_catB,data=dTrainTreated,family=binomial(link='logit'))
print(summary(m2)) # notice high residual deviance
##
## Call:
## glm(formula = y ~ x_catB, family = binomial(link = "logit"),
## data = dTrainTreated)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.181 -1.181 1.174 1.174 1.174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0081150 0.0614833 0.132 0.895
## x_catB -0.0001303 0.0221666 -0.006 0.995
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1681.6 on 1212 degrees of freedom
## Residual deviance: 1681.6 on 1211 degrees of freedom
## AIC: 1685.6
##
## Number of Fisher Scoring iterations: 3
dTrain$predM2 <- predict(m2,newdata=dTrainTreated,type='response')
plotRes(dTrain,'predM2','y','model2 on train')
## [1] "model2 on train"
## pred
## truth TRUE
## FALSE 604
## TRUE 609
## [1] "accuracy 0.497938994229184"
# We do not advise creating dCodeTreated for any purpose other than
# diagnostic plotting. You should not use the treated coding data
# for anything (as that would undo the benefit of having a separate
# coding data subset).
dCodeTreated <- vtreat::prepare(treatments,dCode,pruneSig=c())
dCode$predM2 <- predict(m2,newdata=dCodeTreated,type='response')
plotRes(dCode,'predM2','y','model2 on coding set')
## [1] "model2 on coding set"
## pred
## truth TRUE
## FALSE 204
## TRUE 221
## [1] "accuracy 0.48"
dTestTreated <- vtreat::prepare(treatments,dTest,pruneSig=c())
dTest$predM2 <- predict(m2,newdata=dTestTreated,type='response')
plotRes(dTest,'predM2','y','model2 on test set')
## [1] "model2 on test set"
## pred
## truth TRUE
## FALSE 175
## TRUE 187
## [1] "accuracy 0.483425414364641"
In the above example we saw that training and test performance are similar (and both near chance, as they should be, since there is no signal). Notice the coding set can (falsely) show distorted performance; this is the bad behavior we wanted to isolate out of the training set.
Also be wary: on small data sets vtreat::designTreatmentsC cannot always get accurate out-of-sample estimates of variable performance (in these cases it falls back to untrustworthy in-sample estimates). This is something we will improve over time, but vtreat is intended mostly for production applications on large data sets.
Bad small example:
treatmentsBad <- vtreat::designTreatmentsC(d[d$rgroup<=0,,drop=FALSE],'x','y',TRUE,
rareCount=0 # Note: set this to something larger, like 5
)
## [1] "desigining treatments Tue Nov 10 12:28:26 2015"
## [1] "design var x Tue Nov 10 12:28:26 2015"
## [1] "scoring treatments Tue Nov 10 12:28:26 2015"
## [1] "have treatment plan Tue Nov 10 12:28:26 2015"
print(treatmentsBad$scoreFrame[treatmentsBad$scoreFrame$sig<=0.05,,drop=FALSE])
## [1] varName origName varMoves PRESSRsquared psig
## [6] sig catPRSquared csig
## <0 rows> (or 0-length row.names)
We would prefer that no variables receive a “good score,” and in this run none did (the filtered score frame is empty); but with this little data the estimates behind that decision are untrustworthy in-sample estimates.
So even better is to perform the three-way data split, set pruneSig to something reasonable, and not set rareCount to zero (leave it at the default, or use a reasonable count like 5 or 10).
dCode <- d[d$rgroup<=20,,drop=FALSE]
dTrain <- d[(d$rgroup>20) & (d$rgroup<=80),,drop=FALSE]
tryCatch(
{ treatments <- vtreat::designTreatmentsC(dCode,'x','y',TRUE)
dTrainTreated <- vtreat::prepare(treatments,dTrain,pruneSig=0.01)},
error=function(x) { print(paste('caught',x)); return(c()) }
)
## [1] "desigining treatments Tue Nov 10 12:28:26 2015"
## [1] "design var x Tue Nov 10 12:28:26 2015"
## [1] "scoring treatments Tue Nov 10 12:28:26 2015"
## [1] "have treatment plan Tue Nov 10 12:28:26 2015"
## [1] "caught Error in vtreat::prepare(treatments, dTrain, pruneSig = 0.01): no usable vars\n"
## NULL
And in this case we are (properly) told there are no variables to work with (and we are prevented from accidentally continuing a bad analysis).
There are, of course, other methods to avoid the bias introduced by using the same data to generate the variable encodings and then to train a model on those variables. vtreat incorporates a number of these (smoothing, controlled through ‘smFactor’; pruning of rare levels, controlled through ‘rareSig’ and ‘rareCount’; and cross-constructed training frames, accessed by setting ‘returnXFrame=TRUE’). But we feel that when you have a lot of data, the simplicity (and statistical soundness) of the three-way split is attractive.
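As a closing sketch (our illustration; the parameter values are arbitrary), the controls named above can be combined. On this no-signal example the pruning step removes every variable, so we again guard with tryCatch:

tryCatch(
   { treatmentsAll <- vtreat::designTreatmentsC(dCode,'x','y',TRUE,
        smFactor=0.5, # smooth per-level effect estimates toward the grand average
        rareCount=5,  # pool levels seen no more than 5 times into "rare"
        rareSig=0.3)  # significance threshold used to screen rare levels
     dTrainTreatedAll <- vtreat::prepare(treatmentsAll,dTrain,pruneSig=0.05) },
   error=function(e) print(paste('caught',conditionMessage(e)))
)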