
RFlocalfdr A Simulated Data Example

Rob Dunne

Wednesday, January 29, 2025

Introduction

Variable importance measures are often used to perform variable selection. It is possible to identify two final aims (Genuer, Poggi, and Tuleau-Malot (2010)):

  1. find a small number of variables that give a maximal accuracy, or
  2. detect and rank all influential variables to focus on for interpretation and further exploration with domain experts.

It is the second objective that is more often our aim in biomolecular work. After the important variables are selected, they can be joined with other data from the literature or databases, such as gene annotations. They can also inform further rounds of experiments. In addition, the reduced data set can be used in other analyses, such as the extraction of structural causal models.

The problem is characterized by:

  1. a relatively small \(n\) and a \(p\) that is typically two to three orders of magnitude larger than \(n\);
  2. a large amount of correlation between the variables.

Discussion and Conclusion

We simulate a classification problem with 6350 variables, of which 127 are related to the classification, and with a strong correlation structure. We consider a number of variable ranking and selection methods. We give the results for variable selection, not for the performance of the classifier.

We consider two problems:

The results are given in the table below. Considering the second part of the table, where variables are selected on the basis of an RF impurity measure: RFlocalfdr has done reasonably well on the FDR. Recursive Feature Elimination has performed better, but only selected a small number of variables.

I could not get many of the SHAP implementations to converge within a 2.5-hour run-time limit. SHAFF converged and returned one variable. This is not surprising: the Shapley values are an “all subsets” calculation, which is NP-hard. All of the available methods use some heuristic to get around the hardness of the problem.

As this is quite interesting, I tried a number of SHAP implementations on the smaller data set that we used for visualizing the problem. This goes beyond the original intent of a vignette for the RFlocalfdr package but seems interesting to pursue.

The Simulation

We simulate a small data set with a covariance structure to explore the actions of RFlocalfdr. This package implements an empirical Bayes method of determining a significance level for the Random Forest MDI (Mean Decrease in Impurity) variable importance measure. See Dunne et al. (2022) for details.


data setup

This data set consists of bands, each containing blocks of \(\{1, 2, 4, 8, 16, 32, 64\}\) identical variables. The variables are \(\in \{0,1,2\}\), a common encoding for genomic data where the numbers represent the number of copies of the minor allele. Only band 1 is used to calculate the \(y\) vector: \(y\) is 1 if any of \(X[, c(1, 2, 4, 8, 16, 32, 64)]\) exceeds 1 (see the code below). As each of these seven variables equals 2 with probability 1/3, \(P(y=1) = 1 - (2/3)^7 \approx 0.94\), so \(y\) is unbalanced, containing many more 1’s than 0’s. We plot a small data set of \(300\times 508\) to explore the data generation method, see figure @ref(fig:simulation2).

For the problem we generate a data set with 50 bands and 300 observations so \(X\) is \(300 \times 6350\) with 127 non-null variables. We fit a standard RF to this dataset and record the resulting MDI importance score.

library(ranger)
library(RFlocalfdr)
library(caret)
library(pROC)
library(ROCR) # prediction() and performance() are used below
A small simulated data set. Each band contains blocks of sizes {1, 2, 4, 8, 16, 32, 64}, and each block consists of correlated (identical) variables.

# generate the data
set.seed(13)
num_samples <- 300
num_bands <- 50
band_rank <- 6
num_vars <- num_bands * (2 ** (band_rank+1) -1)
print(num_vars)

X <- matrix(NA, num_samples , num_vars)
set.seed(123)
  
var_index <- 1
for(band in 1:num_bands) {
    for (rank in 0:band_rank) {
#        cat("rank=",rank,"\n")
        var <- sample(0:2, num_samples, replace=TRUE)
        for (i in 1:2**rank) {
            X[,var_index] <- var
            var_index <- var_index +1 
        }
#        print(X[1:2,1:140])
#        system("sleep 3")
    }
}

y <- as.numeric(X[,1] > 1 |  X[,2] > 1  |  X[,4] > 1 |  X[,8] > 1 |  X[,16] > 1 |
                X[,32] > 1 | X[,64] > 1 )

Random forest

parameter settings for an RF

  • ranger sets mtry to \(\lfloor \sqrt{p} \rfloor\)
  • randomForest sets mtry to \(\lfloor \sqrt{p} \rfloor\) for classification and \(\lfloor p/3 \rfloor\) for regression (see the quick check after this list)
  • if there are only a few relevant variables out of many, which is the case in many genetic datasets, mtry should be set high, so that the algorithm can find the relevant variables (Goldstein, Polley, and Briggs (2011)). A large mtry ensures that there is (with high probability) at least one strong variable in the set of mtry candidate variables.
  • For high-dimensional data, lower error rates are observed for higher mtry values for both classification and regression, corroborating Goldstein, Polley, and Briggs (2011)
  • min.node.size specifies the minimum number of observations in a terminal node. Setting it lower leads to trees with a larger depth which means that more splits are performed until the terminal nodes. Both ranger and randomForest set the default value to 1 for classification and 5 for regression.
  • num.trees: we have set this to a large value (10000).
  • replace=FALSE: each tree is grown on a subsample drawn without replacement.
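
As a quick check of the defaults discussed above (a small sketch; the numbers follow directly from \(p = 6350\)):

p <- 6350
floor(sqrt(p)) # 79, the default mtry for classification in ranger and randomForest
floor(p/3)     # 2116, the randomForest default for regression
# we use mtry = 100, somewhat larger than the classification default of 79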

We fit a Random Forest (Breiman (2001)) model and recover the Mean Decrease in Impurity variable importance.

data <- cbind(y,X)
colnames(data) <- c("y",make.names(1:num_vars))

rfModel <- ranger(data=data,dependent.variable.name="y", importance="impurity",
             classification=TRUE,num.trees = 10000, replace=FALSE, mtry=100, seed  = 17 )


imp <-log(ranger::importance(rfModel))
t2 <-count_variables(rfModel)
plot(density(imp))

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) # 0.993

palette("default")
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:1016,imp[1:1016],type="p",col=rep(col,8),pch=16,cex=0.8,
     xlab="variable number",ylab="log importances")
plot(imp,type="p",col=rep(col,8),pch=16,cex=0.8,xlab="variable number",
     ylab="log importances")


predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
cost_perf = performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]
#     X22 
#-3.485894 

predicted_values <-rep(0,6350);predicted_values[ which(imp> -3.485894) ]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6217   72
#               1    6   55
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.09836066 FDR
sensitivity(conf_matrix) # 0.9990358 TP/(TP+FN)
specificity(conf_matrix) # 0.4330709 TN/(FP+TN)

We have set num.trees to a large value and mtry to a value larger than the default because of the large number of variables.

We plot the log importances for the first 8 bands (figure @ref(fig:simulation2zz2)); the plot for all 50 bands is too compressed. The blocks are shown in different colors. The bias described by Strobl et al. (2007) is clearly visible: blocks 1 to 7 of band 1 should have equal expected influence on \(y\), but the importance decreases as the number of variables in the block increases.
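
A quick numerical check of this bias (a sketch, using the imp vector computed above): band 1 occupies variables 1 to 127, so we can average the log importances over its blocks of sizes \(1, 2, \ldots, 64\).

# label each variable of band 1 with its block rank (0..6, block size 2^rank)
block <- rep(0:6, times = 2^(0:6))
# mean log importance per block: decreases as the block size grows
tapply(imp[1:127], block, mean)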

The log importances for the first 8 bands.

RFlocalfdr

determine cutoff

cutoffs <- c(0,1,4,10)
#png("./supp_figures/simulated_data_determine_cutoff.png")
res.con<- determine_cutoff(imp,t2,cutoff=cutoffs,plot=c(0,1,4,10))
#dev.off()

We plot the kernel density estimate of the histogram of the data \(y\), and the skew normal fit, \(t_1\), using the data up to the quantile \(Q\), shown in red.

For this data set, the selected cutoff value is 0.

plot(cutoffs,res.con[,3])
cutoffs[which.min(res.con[,3])]

By plotting \(\max(|y - t_1|)\) versus the cutoff values, we determine the appropriate cutoff. In this case it is just \(t2 > 0\).

recover variables

temp<-imp[t2 > 0]
palette("R3")

qq <- plotQ(temp, debug.flag = 1)
ppp <- run.it.importances(qq, temp, debug = 0)
aa <- significant.genes(ppp, temp, cutoff = 0.05, do.plot = 2)
length(aa$probabilities) # 95

tt1 <- as.numeric(gsub("X([0-9]*)", "\\1", names(aa$probabilities)))
tt2 <- table(ifelse((tt1 <= 127), 1, 2)) # variables 1..127 are the non-null ones
# 1  2 
# 59 36 
tt2[2]/(tt2[1]+tt2[2])
# 36/(36+59) = 0.3789474, the false discovery rate

predicted_values<-rep(0, 6350);predicted_values[tt1]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.3789474 FDR 
sensitivity(conf_matrix) #0.994215 TP/(TP+FN)
specificity(conf_matrix) #0.4645669 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.7294

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy # 0.983622

The FDR is 0.379. We can also calculate the sensitivity, specificity and accuracy.

To make the comparisons with other methods, the following packages may need to be installed.

install.packages("Boruta")
install.packages("locfdr")
install.packages("vita") 
install.packages("locfdr")
devtools::install_github("silkeszy/Pomona")

Boruta

We try the Boruta algorithm (Kursa and Rudnicki (2010)).

library(Boruta)
set.seed(120)  
system.time(boruta1 <- Boruta(X,as.factor(y), num.threads = 7,getImp=getImpRfGini,
                              classification=TRUE,num.trees = 10000, replace=FALSE, mtry=100, seed  = 17))
print(boruta1)
#Boruta performed 99 iterations in 19.54727 secs.
#4 attributes confirmed important: X4859, X58, X6132, X7;
# 6346 attributes confirmed unimportant: X1, X10, X100, X1000, X1001 and 6341 more;
plotImpHistory(boruta1)
aa <- which(boruta1$finalDecision=="Confirmed") 
bb <- which(boruta1$finalDecision=="Tentative") 
predicted_values <-rep(0,6350);predicted_values[c(aa,bb)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.5 FDR (2 of the 4 selected variables are null)
sensitivity(conf_matrix) #0.9996786 TP/(TP+FN)
specificity(conf_matrix) #0.01574803 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.5077

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #0.98

Recursive Feature Elimination

library(Pomona)
colnames(X) <- make.names(1:dim(X)[2])
set.seed(111)
res <- var.sel.rfe(X, y, prop.rm = 0.2,  recalculate = TRUE, tol = 10, 
    ntree = 500, mtry.prop = 0.2, nodesize.prop = 0.1, no.threads = 7, 
    method = "ranger", type = "classification", importance = "impurity", 
    case.weights = NULL) 
 res$var
#[1] "X1"    "X106"  "X11"   "X12"   "X127"  "X13"   "X15"   "X16"   "X2"   
#[10] "X23"   "X24"   "X3"    "X4"    "X44"   "X46"   "X4885" "X5"    "X54"  
#[19] "X5474" "X6"    "X7"    "X72"   "X9"    "X91"  
tt <- as.numeric(gsub("X([0-9]*)", "\\1", res$var))
table(ifelse((tt <= 127), 1, 2)) # variables 1..127 are the non-null ones
# 1  2 
#22  2      2/(22+2) = 0.0833

res<-c(1,106, 11, 12, 127, 13, 15, 16,  2, 23, 24,  3,  4, 44, 46,  4885,  5, 54, 5474,  6,
       7, 72,  9, 91)
predicted_values <-rep(0,6350);predicted_values[c(res)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6221  105
#               1    2   22

conf_matrix[2,1]/sum(conf_matrix[2,]) # 0.083 FDR 
sensitivity(conf_matrix) # 0.9996786 TP/(TP+FN)
specificity(conf_matrix) # 0.1732283 TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #0.5865
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9831496

AIR

See Nembrini et al. (2018).

rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="impurity_corrected",
       classification=TRUE,  mtry=100,num.trees = 10000, replace=TRUE, seed  = 17 )
ww<- importance_pvalues( rfModel2, method = "janitza")

p <- ww[,"pvalue"]
cc <- which(p< 0.05)  
predicted_values <-rep(0,6350);predicted_values[cc]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 5950    0
#               1  273  127
#FDR is 273/(127+273) = 0.6825

sensitivity(conf_matrix) #0.9561305 TP/(TP+FN)
specificity(conf_matrix) #1          TN/(FP+TN)
roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) # 0.9781
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9570079

As the null values, as modelled by the AIR procedure, are symmetric around 0, the question arises as to whether we can apply the locfdr (Efron, Turnbull, and Narasimhan (2015)) procedure. In this case, it reduces the FDR from 0.682 to 0.602.

plot(density(ww[,"importance"]))
imp <- ww[,"importance"]
#imp <-imp/sqrt(var(imp))
#plot(density(imp))
library(locfdr)

aa<-locfdr(imp,df=13)
which( (aa$fdr< 0.05) & (imp>0))
cc2 <-  which( (aa$fdr< 0.05) & (imp>0))
length(cc2) # [1] 309
length(intersect(cc,cc2)) #[1] 309

(length(cc2)  - length(which(cc2 <= 127)))/length(cc2) #[1] 0.6019417  fdr
predicted_values <-rep(0,6350);predicted_values[cc2]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #  0.6019417 FDR 
sensitivity(conf_matrix) # 0.9701109 TP/(TP+FN)
specificity(conf_matrix) #0.9685039         TN/(FP+TN)
roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) # 0.9693
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #   0.9700787

PIMP

PIMP (Altmann et al. (2010)) permutes the response \(y\) a number of times and recomputes the importances, giving a null distribution for each variable's importance and hence a p-value.

#vita: Variable Importance Testing Approaches
library(vita) 
y<-factor(y)
X<-data.frame(X)
set.seed(117)
cl.ranger <- ranger(y~. , X,mtry = 100,num.trees = 10000, classification=TRUE, replace=FALSE,
                    importance = 'impurity') 
system.time(pimp.varImp.cl<-my_ranger_PIMP(X,y,cl.ranger,S=10, parallel=TRUE, ncores=10)) 
pimp.t.cl <- vita::PimpTest(pimp.varImp.cl,para = FALSE)
aa <- summary(pimp.t.cl,pless = 0.1)

tt<-as.numeric(gsub("X([0-9]*)","\\1",  names(which(aa$cmat2[,"p-value"]< 0.05))))
table(ifelse((tt < 127),1,2))
# 1    2 
# 126 176      176 /( 126+ 176  ) = 0.582

predicted_values <-rep(0,6350);predicted_values[which(aa$cmat2[,"p-value"]< 0.05)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.5794 FDR
sensitivity(conf_matrix) #0.971 TP/(TP+FN)
specificity(conf_matrix) #1          TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #  0.9859
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy #  0.9724409

Other Methods

permutation importance

rfModel2 <- ranger(data=data,dependent.variable.name="y", importance="permutation",
             classification=TRUE,  mtry=100,num.trees = 10000, replace=FALSE, seed  = 17 )
imp <-ranger::importance(rfModel2)
plot(density(imp))

palette(topo.colors(n = 7))
plot(1:1016, imp[1:1016], type="p", col=rep(col,8), pch=16, cex=0.8,
     xlab="variable number", ylab="permutation importance")
lines(1:1016, imp[1:1016], col="gray62", lwd=0.5)
abline(v=127,col="green")

# could apply Breiman and Cutler's argument that the permutation importance is Gaussian,
# or could use empirical Bayes

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) #  0.9924


library(ROCR)
predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(imp> 0.0001063063  ,c(rep(1,127),rep(0,6350-127)))
#          0    1
#  FALSE 6220   66
#  TRUE     3   61

predicted_values <-rep(0,6350);predicted_values[which(imp> 0.0001063063 )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
#predicted_values    0    1
#               0 6220   66
#               1    3   61
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1]  0.046875 FDR
sensitivity(conf_matrix) # 0.9995179 TP/(TP+FN)
specificity(conf_matrix) # 0.480315         TN/(FP+TN)

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),predicted_values)
plot(roccurve)
auc(roccurve) #   0.7399
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy # (6220+61)/6350 = 0.9891339

impurity corrected

rfModel3 <- ranger(data=data,dependent.variable.name="y", importance="impurity_corrected",
             classification=TRUE,  mtry=100,num.trees = 10000, replace=FALSE, seed  = 17 )
imp <-ranger::importance(rfModel3)
plot(density(imp))

palette(topo.colors(n = 7))
plot(1:1016, imp[1:1016], type="p", col=rep(col,8), pch=16, cex=0.8,
     xlab="variable number", ylab="corrected impurity importance")
lines(1:1016,imp[1:1016],col = "gray62",lwd=0.5)
abline(v=127,col="green")

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),imp)
plot(roccurve)
auc(roccurve) # 0.9951

library(ROCR)
predictions <- imp
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
cost_perf = performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(imp> 0.0106588   ,c(rep(1,127),rep(0,6350-127)))
#          0    1
#  FALSE 6186   40
#  TRUE    37   87
#best FDR is
37/(37 + 87) #[1] 0.2983871
#
predicted_values <-rep(0,6350);predicted_values[ which(imp> 0.0106588)]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix
conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.2983871 FDR
sensitivity(conf_matrix) #0.9940543 TP/(TP+FN)
specificity(conf_matrix) #0.6850394          TN/(FP+TN)

cforest

We consider the cforest forest-fitting procedure (Hothorn, Hornik, and Zeileis (2006)) and the Conditional Permutation Importance (CPI) (Strobl et al. (2008)). The implementation of CPI provided by party::varimp is excessively slow, so we use the implementation provided by the permimp package.

library(party)
library(permimp)
data <- data.frame(y,X)

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                           control = party::cforest_unbiased(ntree = 10,  mtry = 100)))
# 12.848   0.000  12.885 
system.time(aa<-party::varimp(mod1.cf, conditional = TRUE))
#    user   system  elapsed 
#3089.301    9.484 3100.554
system.time(aa3<-permimp(mod1.cf, conditional = TRUE))
#   user  system elapsed 
#  4.685   0.000   4.707 

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                            control = party::cforest_unbiased(ntree = 100,  mtry = 100)))
#  27.233   0.464  27.703 
system.time(aa<-party::varimp(mod1.cf, conditional = TRUE))
#    user   system  elapsed 
#29380.980    81.511 29475.138 
save(aa,file="aa.Rdata")
system.time(aa3<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 6.084   0.000   6.089 


system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                       control = party::cforest_unbiased(ntree = 1000,  mtry = 100)))
#35.438   0.636  36.095 
system.time(aa4<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 47.135   0.429  47.578 

palette("default")
col<-c(1, rep(2,2), rep(3,4), rep(4, 8), rep(5,16 ), rep(6,32 ), rep(7,64) )
plot(1:1016, aa4$values[1:1016], type="p", col=rep(col,8), pch=16, cex=0.8,
     xlab="variable number", ylab="permutation importance")



system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                           control = party::cforest_unbiased(ntree = 10000, replace=FALSE, mtry = 100)))
#35.438   0.636  36.095 
system.time(aa4<-permimp(mod1.cf, conditional = FALSE))
#  user  system elapsed 
# 47.135   0.429  47.578


roccurve <- roc(c(rep(1,127),rep(0,6350-127)),aa4$values)
plot(roccurve)
auc(roccurve) #0.9486

system.time(mod1.cf <- party::cforest(y ~ ., data = data, 
                                    control = party::cforest_unbiased(ntree = 10000,  mtry = 100)))
#Process R killed at Mon Feb 26 07:18:23 2024 on robubuntu
#   user  system elapsed 
# 98.673   2.832 101.977  on petra, seems to have used 26GB of RAM

system.time(aa4<-permimp(mod1.cf, conditional = TRUE))
#  user  system elapsed 
#352.681   2.262 356.232 
head(aa4$values)
#          X1           X2           X3           X4           X5           X6 
#2.600188e-05 9.454988e-05 1.217855e-04 1.682538e-04 1.572708e-04 1.706311e-04 

roccurve <- roc(c(rep(1,127),rep(0,6350-127)),aa4$values)
plot(roccurve)
auc(roccurve) # 0.9978


library(ROCR)
predictions <- aa4$values
labels <- c(rep(1,127),rep(0,6350-127))
pred <- prediction(predictions, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf)

cost_perf <- performance(pred, "cost") 
pred@cutoffs[[1]][which.min(cost_perf@y.values[[1]])]

table(predictions> 3.264138e-05  ,c(rep(1,127),rep(0,6350-127)))
predicted_values <-rep(0,6350);predicted_values[ which(predictions>  3.264138e-05   )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

conf_matrix[2,1]/sum(conf_matrix[2,]) #[1] 0.1229508 FDR
sensitivity(conf_matrix) # 0.9975896 TP/(TP+FN)
specificity(conf_matrix) #  0.8425197         TN/(FP+TN)

#https://stats.stackexchange.com/questions/61521/cut-off-point-in-a-roc-curve-is-there-a-simple-function
library(pROC)
rocobj <- roc(labels,predictions)
plot(rocobj)
coords(rocobj, "best")
coords(rocobj, x="best", input="threshold", best.method="youden") # Sa
auc(rocobj) # 0.9978

predicted_values <-rep(0,6350);predicted_values[ which(predictions>  8.976665e-06   )]<-1
conf_matrix<-table(predicted_values,c(rep(1,127),rep(0,6350-127)))
conf_matrix

#see also
#https://stats.stackexchange.com/questions/132547/roc-curves-for-unbalanced-datasets
#https://cran.r-project.org/web/packages/qeML/vignettes/Unbalanced_Classes.html
#https://juandelacalle.medium.com/how-and-why-i-switched-from-the-roc-curve-to-the-precision-recall-curve-to-analyze-my-imbalanced-6171da91c6b8
#https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/

sobolMDA

Bénard, da Veiga, and Scornet (2022)

Shap values

SHapley Additive exPlanations (SHAP values) are used as a measure of variable importance. They are based on Shapley values, which use game theory to assign credit for a model’s prediction to each feature or feature value.

The Shapley value is defined for any value function; SHAP is the special case obtained by taking the value function to be the conditional expectation function of the model. A SHAP value is the average marginal contribution of a feature value across all possible subsets of features.
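
Concretely (a standard statement of the Shapley value, included here for reference), for feature \(j\) and value function \(v\),

\[
\phi_j \;=\; \sum_{S \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl(v(S \cup \{j\}) - v(S)\bigr),
\]

where SHAP takes \(v(S) = E[f(x) \mid x_S]\). The sum runs over all \(2^{p-1}\) subsets of the remaining features, which is the source of the NP-hardness noted in the Discussion.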

Global explanations: aside from explaining individual predictions (local explanations), it can be useful to aggregate the results of several (e.g., all of the training) predictions into an overall global summary of the model (a global explanation).

The interpretation of the Shapley value for feature \(j\) is: the value of the \(j\)-th feature contributed \(\phi_j\) to the prediction of this particular instance, compared to the average prediction for the dataset.

SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

Bénard et al. (2022)

  1. It improves the Monte-Carlo approach by using importance sampling to focus on the most relevant subsets of variables identified by the forest.
  2. It uses a projected random forest algorithm to compute fast and accurate estimates of the conditional expectations for any variable subset (the same algorithm as Sobol-MDA?).
  3. The initial forest sets mtry to \(p/3\), much larger than the value we use. SHAFF finds one variable.

iml: Interpretable Machine Learning
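
This section is a stub in the original vignette. The following is a minimal, hedged sketch (not the package's own example) of a local Shapley explanation with iml, assuming the X and y of the simulation above. A probability forest, rf_prob, is refit so that the prediction function returns \(P(y=1)\) rather than a class label.

library(iml)
library(ranger)
# refit as a probability forest so that predictions are probabilities, not labels
rf_prob <- ranger(y ~ ., data = data.frame(y = factor(y), X),
                  probability = TRUE, num.trees = 1000, mtry = 100, seed = 17)
# wrap the model; the predict function must return a numeric vector
pred <- Predictor$new(rf_prob, data = data.frame(X), y = y,
                      predict.function = function(model, newdata)
                          predict(model, data = newdata)$predictions[, "1"])
# Shapley values (a local explanation) for the first observation
sh <- Shapley$new(pred, x.interest = data.frame(X)[1, ])
plot(sh)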

mlr3

fastshap
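
Again a hedged sketch rather than code from the original vignette: fastshap's Monte-Carlo approximation, reusing the rf_prob probability forest from the iml sketch above, and aggregating the local explanations into a global importance measure. nsim is kept small here; larger values give more accurate Shapley estimates.

library(fastshap)
# prediction wrapper returning P(y = 1) as a numeric vector
pfun <- function(object, newdata) {
    predict(object, data = newdata)$predictions[, "1"]
}
set.seed(101)
shap <- explain(rf_prob, X = data.frame(X), pred_wrapper = pfun, nsim = 10)
# mean |SHAP| per feature as a global importance, cf. the MDI rankings above
global_imp <- colMeans(abs(shap))
head(sort(global_imp, decreasing = TRUE))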

Altmann, André, Laura Toloşi, Oliver Sander, and Thomas Lengauer. 2010. “Permutation Importance: A Corrected Feature Importance Measure.” Bioinformatics 26 (10): 1340. https://doi.org/10.1093/bioinformatics/btq134.
Bénard, Clément, Gérard Biau, Sébastien da Veiga, and Erwan Scornet. 2022. “SHAFF: Fast and Consistent SHApley eFfect Estimates via Random Forests.” February 2, 2022. https://doi.org/10.48550/arXiv.2105.11724.
Bénard, Clément, Sébastien da Veiga, and Erwan Scornet. 2022. “MDA for Random Forests: Inconsistency, and a Practical Solution via the Sobol-MDA.” March 1, 2022. https://doi.org/10.48550/arXiv.2102.13347.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Celik, Ender. 2015. Vita: Variable Importance Testing Approaches. https://CRAN.R-project.org/package=vita.
Dunne, Robert, Roc Reguant, Priya Ramarao-Milne, Piotr Szul, Letitia Sng, Mischa Lundberg, Natalie A. Twine, and Denis C. Bauer. 2022. “Thresholding Gini Variable Importance with a Single Trained Random Forest: An Empirical Bayes Approach.” bioRxiv. https://doi.org/10.1101/2022.04.06.487300.
Efron, Bradley, Brit Turnbull, and Balasubramanian Narasimhan. 2015. Locfdr: Computes Local False Discovery Rates. https://CRAN.R-project.org/package=locfdr.
Fouodo, Cesaire. 2022. Pomona: Identification of Relevant Variables in Omics Data Sets Using Random Forests.
Genuer, Robin, Jean-Michel Poggi, and Christine Tuleau-Malot. 2010. “Variable Selection Using Random Forests.” Pattern Recognition Letters 31 (14): 2225–36. https://doi.org/10.1016/j.patrec.2010.03.014.
Goldstein, Benjamin A., Eric C. Polley, and Farren B. S. Briggs. 2011. “Random Forests for Genetic Association Studies.” Statistical Applications in Genetics and Molecular Biology 10 (1): 32. https://doi.org/10.2202/1544-6115.1691.
Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3): 651–74. https://doi.org/10.1198/106186006X133933.
Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://www.jstatsoft.org/v36/i11/.
Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. 2019. “Consistent Individualized Feature Attribution for Tree Ensembles.” arXiv:1802.03888 [Cs, Stat], March. https://arxiv.org/abs/1802.03888.
Molnar, Christoph. 2021. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Leanpub. https://leanpub.com/interpretable-machine-learning.
Nembrini, Stefano, Inke R. König, Marvin N. Wright, and Alfonso Valencia. 2018. “The Revival of the Gini Importance?” Bioinformatics. https://doi.org/10.1093/bioinformatics/bty373.
Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. 2008. “Conditional Variable Importance for Random Forests.” BMC Bioinformatics 9 (1): 307. https://doi.org/10.1186/1471-2105-9-307.
Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8: 25. https://doi.org/10.1186/1471-2105-8-25.
