‘vtreat’ is a package that prepares arbitrary data frames into clean data frames that are ready for analysis. A clean data frame:
To achieve this a number of techniques are used. Principally:
For more details see: the ‘vtreat’ article and update.
The main pattern is the use of ‘designTreatmentsC()’ or ‘designTreatmentsN()’ to design a treatment plan, followed by the use of the returned structure with ‘prepare()’ to apply the plan to data frames. The main feature of ‘vtreat’ is that all data preparation is “y-aware”: it uses the relations of effective variables to the dependent or outcome variable to encode the effective variables.
The structure returned from ‘designTreatmentsN()’ or ‘designTreatmentsC()’ includes informational fields. The main fields are mostly vectors with names (all with the same names in the same order):
In addition to these vectors ‘designTreatmentsC()’ and ‘designTreatmentsN()’ return a data frame named ‘scoreFrame’ which contains columns:
- ‘varName’: name of the new variable.
- ‘origName’: name of the original variable the new variable was derived from (can repeat).
- ‘varMoves’: logical, TRUE if the variable varied during training; only variables that move will be in the treated frame.
- ‘PRESSRsquared’: a PRESS-held out R-squared of a linear fit from each variable to the y-value. Scores of zero and below are very bad, scores near one are very good.
- ‘psig’: significance of the observed variable ‘PRESSRsquared’ value under an in-sample permutation test.
- ‘catPRSquared’: for categorical outcomes, a deviance based pseudo-Rsquared.
- ‘csig’: for categorical outcomes, significance of the observed variable ‘catPRSquared’ value under an in-sample permutation test.
- ‘sig’: ‘csig’ for categorical outcomes, ‘psig’ otherwise.
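One way to use ‘scoreFrame’ is to pick variables by their ‘sig’ value. The sketch below uses only base R and a hand-built stand-in for ‘scoreFrame’ that copies the ‘varName’, ‘origName’, ‘varMoves’ and ‘sig’ values printed in the categorical example later in this document. The name `sigThreshold` and its value are illustrative assumptions, not a recommendation from ‘vtreat’.

```r
# Hand-built stand-in for treatmentsC$scoreFrame; the values are copied
# from the categorical example output in this document.
scoreFrame <- data.frame(
  varName  = c("z_clean", "z_isBAD"),
  origName = c("z", "z"),
  varMoves = c(TRUE, TRUE),
  sig      = c(0.2601608, 1.0),
  stringsAsFactors = FALSE
)

# Arbitrary illustrative cut-off (an assumption, not a vtreat default).
sigThreshold <- 0.5

# Keep variables that moved during training and score below the cut-off.
keep <- scoreFrame$varName[scoreFrame$varMoves & scoreFrame$sig < sigThreshold]
print(keep)
```

With only six training rows the significances here carry little information; the point is only the shape of the selection step.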
In all cases we have two undesirable upward biases on the scores:
‘vtreat’ uses a number of cross-training and jackknife style procedures to try to mitigate these effects. The suggested best practice is (if you have enough data) to split your data randomly into at least the following disjoint data sets:
The idea is: taking the extra step to perform the ‘designTreatmentsC()’ or ‘designTreatmentsN()’ on data disjoint from training makes the training data more exchangeable with test and avoids the issue that ‘vtreat’ may be hiding a large number of degrees of freedom in variables it derives from large categoricals.
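A random split into disjoint calibration, training and test sets can be sketched in base R as follows. The data frame `d`, the names `dCal`, `dTrain` and `dTest`, the seed, and the 40/40/20 proportions are all assumptions made for illustration; none of them come from ‘vtreat’.

```r
set.seed(2015)  # arbitrary seed so the split is reproducible

# Hypothetical data standing in for your full data set.
d <- data.frame(x = rnorm(100), y = rnorm(100) > 0)

n <- nrow(d)
idx <- sample(n)              # random permutation of the row indices
nCal <- floor(0.4 * n)        # rows reserved for designing treatments
nTrain <- floor(0.4 * n)      # rows reserved for fitting the model

calIdx <- idx[seq_len(nCal)]
trainIdx <- idx[nCal + seq_len(nTrain)]
testIdx <- idx[-seq_len(nCal + nTrain)]   # everything that is left

dCal <- d[calIdx, , drop = FALSE]
dTrain <- d[trainIdx, , drop = FALSE]
dTest <- d[testIdx, , drop = FALSE]
```

Under this arrangement the treatment plan would be designed on `dCal` with ‘designTreatmentsC()’ or ‘designTreatmentsN()’, and ‘prepare()’ would then be applied to `dTrain` and `dTest`.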
A trivial execution example (not demonstrating any cal/train/test split) is given below. Variables that do not move during hold-out testing are considered “not to move.”
library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
head(dTrainC)
## x z y
## 1 a 1 FALSE
## 2 a 2 FALSE
## 3 a 3 TRUE
## 4 b 4 FALSE
## 5 b NA TRUE
## 6 <NA> 6 TRUE
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestC)
## x z
## 1 a 10
## 2 b 20
## 3 c 30
## 4 <NA> NA
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "desigining treatments Tue Nov 10 12:28:24 2015"
## [1] "design var x Tue Nov 10 12:28:24 2015"
## [1] "design var z Tue Nov 10 12:28:24 2015"
## [1] "scoring treatments Tue Nov 10 12:28:24 2015"
## [1] "WARNING skipped vars: x"
## [1] "have treatment plan Tue Nov 10 12:28:24 2015"
print(treatmentsC)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'->character->'x_lev_rare')"
##
## $treatments[[2]]
## [1] "vtreat 'Bayesian Impact Code'('x'->character->'x_catB')"
##
## $treatments[[3]]
## [1] "vtreat 'Scalable pass through'('z'->numeric->'z_clean')"
##
## $treatments[[4]]
## [1] "vtreat 'is.bad'('z'->numeric->'z_isBAD')"
##
##
## $vars
## [1] "z_clean" "z_isBAD"
##
## $varMoves
## z_clean z_isBAD
## TRUE TRUE
##
## $sig
## z_clean z_isBAD
## 0.2601608 1.0000000
##
## $scoreFrame
## varName origName varMoves PRESSRsquared psig sig catPRSquared
## 1 z_clean z TRUE -0.8237958 1 0.2601608 0.1524329
## 2 z_isBAD z TRUE 0.0000000 1 1.0000000 0.0000000
## csig
## 1 0.2601608
## 2 1.0000000
##
## $nmMap
## $nmMap[[1]]
## $nmMap[[1]]$new
## [1] "x_lev_rare"
##
## $nmMap[[1]]$orig
## [1] "x"
##
##
## $nmMap[[2]]
## $nmMap[[2]]$new
## [1] "x_catB"
##
## $nmMap[[2]]$orig
## [1] "x"
##
##
## $nmMap[[3]]
## $nmMap[[3]]$new
## [1] "z_clean"
##
## $nmMap[[3]]$orig
## [1] "z"
##
##
## $nmMap[[4]]
## $nmMap[[4]]$new
## [1] "z_isBAD"
##
## $nmMap[[4]]$orig
## [1] "z"
##
##
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 0.5
##
## $ndat
## [1] 6
##
## $skippedVars
## [1] "x"
##
## attr(,"class")
## [1] "treatmentplan"
print(treatmentsC$treatments[[1]])
## [1] "vtreat 'Categoric Indicators'('x'->character->'x_lev_rare')"
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=c(),scale=TRUE)
head(dTrainCTreated)
## z_clean z_isBAD y
## 1 -3.864865e-01 -0.1 FALSE
## 2 -2.108108e-01 -0.1 FALSE
## 3 -3.513514e-02 -0.1 TRUE
## 4 1.405405e-01 -0.1 FALSE
## 5 -2.220446e-16 0.5 TRUE
## 6 4.918919e-01 -0.1 TRUE
varsC <- setdiff(colnames(dTrainCTreated),'y')
# all input variables should be mean 0
sapply(dTrainCTreated[,varsC,drop=FALSE],mean)
## z_clean z_isBAD
## -1.942890e-16 -2.543922e-17
# all slopes should be 1
sapply(varsC,function(c) { lm(paste('y',c,sep='~'),
data=dTrainCTreated)$coefficients[[2]]})
## z_clean z_isBAD
## 1 1
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=c(),scale=TRUE)
head(dTestCTreated)
## z_clean z_isBAD
## 1 4.918919e-01 -0.1
## 2 4.918919e-01 -0.1
## 3 4.918919e-01 -0.1
## 4 -2.220446e-16 0.5
# numeric example
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA),
z=c(1,2,3,4,5,NA,7),y=c(0,0,0,1,0,1,1))
head(dTrainN)
## x z y
## 1 a 1 0
## 2 a 2 0
## 3 a 3 0
## 4 a 4 1
## 5 b 5 0
## 6 b NA 1
dTestN <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
head(dTestN)
## x z
## 1 a 10
## 2 b 20
## 3 c 30
## 4 <NA> NA
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "desigining treatments Tue Nov 10 12:28:24 2015"
## [1] "design var x Tue Nov 10 12:28:24 2015"
## [1] "design var z Tue Nov 10 12:28:24 2015"
## [1] "scoring treatments Tue Nov 10 12:28:24 2015"
## [1] "WARNING skipped vars: x"
## [1] "have treatment plan Tue Nov 10 12:28:24 2015"
print(treatmentsN)
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'->character->'x_lev_rare')"
##
## $treatments[[2]]
## [1] "vtreat 'Scalable Impact Code'('x'->character->'x_catN')"
##
## $treatments[[3]]
## [1] "vtreat 'Scalable pass through'('z'->numeric->'z_clean')"
##
## $treatments[[4]]
## [1] "vtreat 'is.bad'('z'->numeric->'z_isBAD')"
##
##
## $vars
## [1] "z_clean" "z_isBAD"
##
## $varMoves
## z_clean z_isBAD
## TRUE TRUE
##
## $sig
## z_clean z_isBAD
## 1 1
##
## $scoreFrame
## varName origName varMoves PRESSRsquared psig sig
## 1 z_clean z TRUE -0.4545128 1 1
## 2 z_isBAD z TRUE 0.0000000 1 1
##
## $nmMap
## $nmMap[[1]]
## $nmMap[[1]]$new
## [1] "x_lev_rare"
##
## $nmMap[[1]]$orig
## [1] "x"
##
##
## $nmMap[[2]]
## $nmMap[[2]]$new
## [1] "x_catN"
##
## $nmMap[[2]]$orig
## [1] "x"
##
##
## $nmMap[[3]]
## $nmMap[[3]]$new
## [1] "z_clean"
##
## $nmMap[[3]]$orig
## [1] "z"
##
##
## $nmMap[[4]]
## $nmMap[[4]]$new
## [1] "z_isBAD"
##
## $nmMap[[4]]$orig
## [1] "z"
##
##
##
## $outcomename
## [1] "y"
##
## $meanY
## [1] 0.4285714
##
## $ndat
## [1] 7
##
## $skippedVars
## [1] "x"
##
## attr(,"class")
## [1] "treatmentplan"
dTrainNTreated <- prepare(treatmentsN,dTrainN,
pruneSig=c(),scale=TRUE)
head(dTrainNTreated)
## z_clean z_isBAD y
## 1 -0.41904762 -0.0952381 0
## 2 -0.26190476 -0.0952381 0
## 3 -0.10476190 -0.0952381 0
## 4 0.05238095 -0.0952381 1
## 5 0.20952381 -0.0952381 0
## 6 0.00000000 0.5714286 1
varsN <- setdiff(colnames(dTrainNTreated),'y')
# all input variables should be mean 0
sapply(dTrainNTreated[,varsN,drop=FALSE],mean)
## z_clean z_isBAD
## 4.757324e-17 -7.929874e-17
# all slopes should be 1
sapply(varsN,function(c) { lm(paste('y',c,sep='~'),
data=dTrainNTreated)$coefficients[[2]]})
## z_clean z_isBAD
## 1 1
# prepared frame
dTestNTreated <- prepare(treatmentsN,dTestN,
pruneSig=c())
head(dTestNTreated)
## z_clean z_isBAD
## 1 7.000000 0
## 2 7.000000 0
## 3 7.000000 0
## 4 3.666667 1
# scaled prepared frame
dTestNTreatedS <- prepare(treatmentsN,dTestN,
pruneSig=c(),scale=TRUE)
head(dTestNTreatedS)
## z_clean z_isBAD
## 1 0.5238095 -0.0952381
## 2 0.5238095 -0.0952381
## 3 0.5238095 -0.0952381
## 4 0.0000000 0.5714286
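The “mean 0, slope 1” property checked above for ‘scale=TRUE’ can be reproduced by hand for the training rows of the categorical example. The sketch below is base R only and is not the ‘vtreat’ implementation: `y_aware_scale` is a hypothetical helper that centers a column and multiplies it by the slope of a one-variable linear fit of y on that column, and replacing the missing `z` by the mean of the observed values is an assumption inferred from the printed output.

```r
# Training data from the categorical example.
z <- c(1, 2, 3, 4, NA, 6)
y <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Hypothetical helper (not part of vtreat): center x, then multiply by the
# slope of the one-variable linear regression of y on x.
y_aware_scale <- function(x, y) {
  slope <- unname(coef(lm(as.numeric(y) ~ x))[2])
  slope * (x - mean(x))
}

# Unscaled stand-ins for the two derived columns.
z_is_bad <- as.numeric(is.na(z))                        # 1 where z is missing
z_clean <- ifelse(is.na(z), mean(z, na.rm = TRUE), z)   # NA replaced by the mean

scaled <- data.frame(
  z_clean = y_aware_scale(z_clean, y),
  z_isBAD = y_aware_scale(z_is_bad, y)
)
print(scaled)
```

For these six rows the result agrees, up to floating-point rounding, with the ‘z_clean’ and ‘z_isBAD’ columns of ‘dTrainCTreated’ shown earlier. It does not attempt to model how ‘prepare()’ handles new data such as ‘dTestC’.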