SMLE Tutorial

Qianxiang Zang

6/30/2024

Introduction

This vignette describes how to use the SMLE package to perform ultrahigh-dimensional feature screening.

Suppose the data \(\{(y_{i}, \boldsymbol{x}_{i}), i=1,\ldots,n \}\) are collected independently from \((Y, \boldsymbol{x})\), where \(Y\) is a response variable and \(\boldsymbol{x}=(x_{1}, \ldots, x_{p})\) is a \(p\)-dimensional covariate (feature) vector.

We assume a GLM with a canonical link for the relationship between \(Y\) and \(\boldsymbol{x}\): \[ f(y; \theta)=\exp(\theta y - b(\theta) + c(y) ), \] where the distribution of \(Y\) given \(\boldsymbol{x}\) is assumed to belong to an exponential family, \(\theta = \boldsymbol{x}\boldsymbol{\beta}\), and \(\boldsymbol{\beta}=(\beta_{1}, \ldots, \beta_{p})^{T}\) is a \(p\)-dimensional vector of regression coefficients. Linear, logistic, and Poisson regression are all examples of GLMs.
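
For orientation, these three cases correspond to the family argument ("gaussian", "binomial", "poisson") familiar from base R's glm(); the first two appear in the examples below. A quick toy illustration with glm(), not part of the SMLE workflow:

# The three GLMs mentioned above, fitted with their canonical links in base R
# (toy data, for illustration only)
set.seed(123)
x <- rnorm(50)
y_gauss <- 1 + 2*x + rnorm(50)                      # linear regression response
y_binom <- rbinom(50, 1, plogis(1 + 2*x))           # logistic regression response
y_pois  <- rpois(50, exp(0.5 + x))                  # Poisson regression response
fit_lin  <- glm(y_gauss ~ x, family = gaussian())   # identity (canonical) link
fit_log  <- glm(y_binom ~ x, family = binomial())   # logit (canonical) link
fit_pois <- glm(y_pois  ~ x, family = poisson())    # log (canonical) link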

SMLE uses an iterative approach to find the sparsity-restricted estimator: \[ \hat{\boldsymbol{\beta}}_{k}=\arg\max\limits_{\boldsymbol{\beta}} \sum_{i=1}^{n} [y_{i} \cdot \boldsymbol{x}_{i} \boldsymbol{\beta} - b( \boldsymbol{x}_{i} \boldsymbol{\beta}) ]\quad \text{subject to}\quad \|\boldsymbol{\beta}\|_0 \leq k. \]
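
The summary output shown later in this vignette refers to "Coefficients estimated by IHT", i.e. iterative hard thresholding (Xu and Chen 2014). Purely for intuition, and not the package's actual implementation, one hard-thresholding iteration for the Gaussian case might look like this:

# One iterative-hard-thresholding step for linear regression (sketch only):
# take a gradient ascent step on the log-likelihood, then keep the k largest
# coefficients in absolute value and zero out the rest.
iht_step <- function(beta, X, y, k, step = 1/max(colSums(X^2))) {
  grad <- t(X) %*% (y - X %*% beta)     # score of the Gaussian log-likelihood
  beta_new <- beta + step * grad        # ascent step
  keep <- order(abs(beta_new), decreasing = TRUE)[seq_len(k)]
  beta_new[-keep] <- 0                  # hard-threshold to sparsity k
  drop(beta_new)
}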

The theory and algorithms behind SMLE are described in Xu and Chen (2014). The implementation and R package are described in Zang, Xu, and Burkett (2022).

Example 1 - Linear regression with correlated, numerical features

First we show how to use SMLE to conduct feature screening and post-screening selection via a simulated example. We generate a dataset with \(n=400\) observations and \(p=1000\) features. We generate the feature matrix \(X\) from a multivariate normal distribution with an auto-regressive structure, where adjacent features have a correlation of \(\rho=0.5\). The response variable \(Y\) is generated from a linear model with linear predictor: \[ \mu = 0.5x_1 + x_3 - x_5 + 2x_7 - 2x_9. \] In this setup, the feature matrix contains only five features that are causally related to the response, as indicated in the model.

library(SMLE)

set.seed(1)

Data_eg <- Gen_Data(n = 400, p = 1000, family = "gaussian",
                    correlation = "AR", rho = 0.5, pos_truecoef = c(1,3,5,7,9),
                    effect_truecoef = c(0.5,1,-1,2,-2))

Data_eg
## Call:
##  Gen_Data(n = 400, p = 1000, pos_truecoef = c(1, 3, 5, 7, 9), 
##     effect_truecoef = c(0.5, 1, -1, 2, -2), correlation = "AR", 
##     rho = 0.5, family = "gaussian")
##  
## An object of class sdata
##  
## Simulated Dataset Properties:
##  Length of response: 400
##  Dim of features: 400 x 1000
##  Correlation: auto regressive
##  Rho: 0.5
##  Index of causal features: 1, 3, 5, 7, 9
##  Model type: gaussian
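
For intuition about what Gen_Data() just produced, the same design could be simulated by hand. The sketch below assumes an AR(1) correlation matrix and standard Gaussian errors; it is not Gen_Data's internal code, so the realized values will differ:

# Hand-rolled version of the same simulated design (sketch only)
n <- 400; p <- 1000; rho <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))                 # AR(1) correlation matrix
X_manual <- matrix(rnorm(n*p), n, p) %*% chol(Sigma)   # correlated Gaussian features
beta <- numeric(p); beta[c(1,3,5,7,9)] <- c(0.5, 1, -1, 2, -2)
Y_manual <- drop(X_manual %*% beta + rnorm(n))         # Gaussian response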

The following code shows the simplest function call to SMLE(), where we aim to retain only \(k=10\) important features out of \(p=1000\).

fit1 <- SMLE(Y = Data_eg$Y, X = Data_eg$X, k = 10, family = "gaussian")
summary(fit1)
## Call:
##   SMLE(X = Data_eg$X, Y = Data_eg$Y, k = 10, family = "gaussian")
##  
## An object of class summary.smle
##  
## Summary:
## 
##   Length of response: 400
##   Dim of features: 400 x 1000
##   Model type: gaussian
##   Model size: 10
##   Feature name: 1, 3, 5, 7, 9, 30, 163, 164, 191, 414
##   Feature index: 1, 3, 5, 7, 9, 30, 163, 164, 191, 414
##   Intercept: -0.12411
##   Coefficients estimated by IHT: 0.5356, 1.1176, -0.9214, 1.9024, -2.0977, 0.1764, -0.1009, 0.2027, -0.0619, 0.1735
##   Number of IHT iteration steps: 2

The function returns a 'smle' object, and the call to the summary() function confirms that a refined set of 10 features is selected. All five causal features are retained in the refined feature set after screening.

The coefficients of the retained features can be viewed with the coef() function.

coef(fit1)
##  (Intercept)          `1`          `3`          `5`          `7`          `9` 
##  0.001537283  0.566806689  1.019212595 -1.035717991  2.046733916 -2.072110440 
##         `30`        `163`        `164`        `191`        `414` 
##  0.072081447 -0.118538590  0.166783654 -0.093756159  0.094989655
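
A quick programmatic check, using only the names returned by coef(), confirms that all five causal features survived screening:

# Do the causal indices 1, 3, 5, 7, 9 appear among the retained features?
# (gsub strips the backticks that glm adds around non-syntactic names)
retained <- setdiff(gsub("`", "", names(coef(fit1))), "(Intercept)")
all(c("1", "3", "5", "7", "9") %in% retained)
## [1] TRUE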

The plot function can be used to view the solution path of coefficients as well as model convergence diagnostics.

plot(fit1)

The resulting plots are shown below. This example is a fairly trivial one, so SMLE converges quickly with only small changes to the coefficients after the first iteration.

Figure 1 - Plot of solution path for coefficients
Figure 2 - Plot of diagnostics to assess convergence

Note that even after screening, the set of retained features will contain some that are not relevant to the response. This is to be expected; k should be chosen to be larger than the actual number of causal features, as the goal of feature screening is merely to remove the most irrelevant features before conducting an in-depth analysis on a smaller dataset.
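
For instance, a more generous choice of k costs little at the screening stage and lowers the risk of losing a weakly related causal feature (output omitted; we continue with fit1 below):

# Screening with a larger k keeps more candidates for the selection stage.
fit1_k20 <- SMLE(Y = Data_eg$Y, X = Data_eg$X, k = 20, family = "gaussian")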

After screening out features, one may wish to conduct feature selection on the refined set to further identify relevant features. This is done with the smle_select() function. The fitted 'smle' object is passed to smle_select() and a criterion for selection is chosen; the default criterion is EBIC. smle_select() returns a 'selection' object, named "fit1_s" in the following example. Only the five causal features from the true model are selected.

fit1_s <- smle_select(fit1, criterion = "ebic")
summary(fit1_s)
## Call:
##   smle_select(object = fit1, criterion = "ebic")
##  
## An object of class summary.selection
##  
## Summary:
##  
##   Length of response: 400
##   Dim of features: 400 x 1000
##   Model type: gaussian
##   Selected model size: 5
##   Selected feature index: 1, 3, 5, 7, 9
##   Selection criterion: ebic
##   Gamma for ebic: 0.5

The plot function for an object of class 'selection' returns a plot showing the selection criterion scores for each of the candidate sub-models with varying numbers of features. In the example below, we see that EBIC is lowest at five features.

plot(fit1_s)
Figure 3 - Plot after SMLE selection

Example 2 - Logistic regression with categorical features

In this example, we simulate a dataset with five categorical features. The first three categorical features have three levels each, the fourth has four levels, and the fifth has five levels. The first three categorical features (columns 1, 3, and 5 of the data) are causally related to the response; the remaining causal features are numeric.

For this example, we separate the data into training and testing groups in order to assess the performance using classification.

set.seed(1)
Data_sim2 <- Gen_Data(n = 700, p = 2000, family = "binomial", num_ctgidx = 5, 
                      pos_ctgidx = c(1,3,5,7,9), effect_truecoef= c(2,2,2,-1,0.5),
                      pos_truecoef = c(1,3,5,8,12), level_ctgidx = c(3,3,3,4,5))
training=sample(c(rep(TRUE,500),rep(FALSE,200)))

Y=Data_sim2$Y
dat=data.frame(Y,Data_sim2$X)

traindat=subset(dat,training)
testdat=subset(dat,!training)

Users may specify whether the dummy covariates for a categorical feature are treated as a single grouped feature or as individual features, and which type of dummy coding is used, via the group and codingtype arguments. We recommend the defaults, under which each categorical variable is screened in or out as a group and dummy variable coding is used.

In the following call to SMLE, we make use of the formula option for specifying the model. As there are 2000 features, we cannot write out the full formula; instead, we use the . shorthand, which stands for all columns of the data other than the response.

fit_1 <- SMLE(Y ~ . , family = "binomial", k = 10, data=traindat)
fit_1
## Call:
##   SMLE(formula = Y ~ ., data = traindat, k = 10, family = "binomial")
##  
## An object of class smle
##  
## Subset:
##   Model size: 10
##   Feature name: C1, C3, C5, X8, X12, X744, X1261, X1351, X1671, X1746
##   Feature index: 2, 4, 6, 9, 13, 745, 1262, 1352, 1672, 1747
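
Had we wanted to screen each dummy column individually rather than as grouped features, the group argument mentioned earlier could be overridden. This is only a sketch: codingtype is left at its default, and ?SMLE documents the accepted values.

# Treat dummy columns individually instead of grouping them by categorical variable.
fit_ungrouped <- SMLE(Y ~ ., family = "binomial", k = 10, data = traindat,
                      group = FALSE)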

We now examine the quality of the classification using the testing dataset.

threshold=0.5
values=1*(predict(fit_1, newdata = testdat,type="response")>threshold)
(confusionmat=table(Actual=testdat$Y,Predicted=values))
##       Predicted
## Actual  0  1
##      0 66 29
##      1 21 84
(accuracy=sum(diag(confusionmat))/sum(confusionmat))
## [1] 0.75
(ppv=confusionmat[2,2]/colSums(confusionmat)[2])
##         1 
## 0.7433628
(npv=confusionmat[1,1]/colSums(confusionmat)[1])
##         0 
## 0.7586207
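
Sensitivity and specificity can be read off the same confusion matrix in the same way:

(sensitivity=confusionmat[2,2]/rowSums(confusionmat)[2])
##   1 
## 0.8
(specificity=confusionmat[1,1]/rowSums(confusionmat)[1])
##         0 
## 0.6947368
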
References

Xu, Chen, and Jiahua Chen. 2014. “The Sparse MLE for Ultrahigh-Dimensional Feature Screening.” Journal of the American Statistical Association 109 (507): 1257–69.
Zang, Qianxiang, Chen Xu, and Kelly Burkett. 2022. “SMLE: An R Package for Joint Feature Screening in Ultrahigh-Dimensional GLMs.” https://arxiv.org/abs/2201.03512.
