SMLE Tutorial
Qianxiang Zang
6/30/2024
Introduction
This vignette describes how to use the SMLE package to perform ultra-high dimensional feature screening. Suppose the data \(\{(y_{i}, \boldsymbol{x}_{i}), i=1,\ldots,n\}\) are collected independently from \((Y, \boldsymbol{x})\), where \(Y\) is a response variable and \(\boldsymbol{x}=(x_{1}, \ldots, x_{p})\) is a \(p\)-dimensional covariate (feature) vector.
We assume a GLM with a canonical link for the relationship between \(Y\) and \(\boldsymbol{x}\): \[ f(y; \theta)=\exp(\theta y - b(\theta) + c(y)), \] where the distribution of \(Y\) given \(\boldsymbol{x}\) belongs to an exponential family, \(\theta = \boldsymbol{x}\boldsymbol{\beta}\), and \(\boldsymbol{\beta}=(\beta_{1}, \ldots, \beta_{p})^{T}\) is a \(p\)-dimensional vector of regression coefficients.
Linear, logistic, and Poisson regression are all examples of GLMs.
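For example, logistic regression for a binary response \(y \in \{0,1\}\) corresponds to \(b(\theta)=\log(1+e^{\theta})\) and \(c(y)=0\), while Poisson regression corresponds to \(b(\theta)=e^{\theta}\) and \(c(y)=-\log(y!)\).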
SMLE uses an iterative approach to find the sparsity-restricted estimator: \[ \hat{\boldsymbol{\beta}}_{k}=\arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n} [y_{i} \cdot \boldsymbol{x}_{i} \boldsymbol{\beta} - b( \boldsymbol{x}_{i} \boldsymbol{\beta}) ]\quad \text{subject to}\quad ||\boldsymbol{\beta}||_0 \leq k. \]
The theory and algorithms behind SMLE are described in Xu and Chen (2014). The implementation and the R package are described in Zang, Xu, and Burkett (2022).
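To give a flavour of the iterative approach in the Gaussian case, the following is a minimal sketch of hard-thresholding iterations: each step moves along the gradient of the log-likelihood and then zeroes out all but the \(k\) largest coefficients in magnitude. This sketch is illustrative only and is not the algorithm implemented in SMLE(); the function name iht_sketch, the fixed step size u, and the simple stopping rule are assumptions made for illustration.
iht_sketch <- function(y, X, k, u = 1 / norm(X, type = "2")^2, maxit = 100, tol = 1e-6) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    grad <- as.vector(crossprod(X, y - X %*% beta))  # gradient of the Gaussian log-likelihood
    beta_new <- beta + u * grad                      # gradient ascent step
    keep <- order(abs(beta_new), decreasing = TRUE)[seq_len(k)]
    beta_new[-keep] <- 0                             # hard-threshold: retain the k largest coefficients
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  which(beta != 0)                                   # indices of the k retained features
}
With a numeric response y and feature matrix X, iht_sketch(y, X, k = 10) returns a candidate set of 10 feature indices.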
Example 1 - Linear regression with correlated, numerical features
First we show how to use SMLE to conduct feature screening and post-screening selection via a simulated example. We generate a dataset with \(n=400\) observations and \(p=1000\) features. We generate the feature matrix \(X\) from a multivariate normal distribution with an auto-regressive structure, where adjacent features have a correlation of \(\rho=0.5\). The response variable \(Y\) is generated from the following linear model with linear predictor: \[ \mu = 0.5x_1 + x_3 - x_5 + 2x_7 - 2x_9. \] In this setup, the feature matrix contains only five features that are causally related to the response, as indicated in the model.
library(SMLE)
set.seed(1)
Data_eg <- Gen_Data(n = 400, p = 1000, family = "gaussian",
                    correlation = "AR", rho = 0.5, pos_truecoef = c(1, 3, 5, 7, 9),
                    effect_truecoef = c(0.5, 1, -1, 2, -2))
Data_eg
## Call:
## Gen_Data(n = 400, p = 1000, pos_truecoef = c(1, 3, 5, 7, 9),
## effect_truecoef = c(0.5, 1, -1, 2, -2), correlation = "AR",
## rho = 0.5, family = "gaussian")
##
## An object of class sdata
##
## Simulated Dataset Properties:
## Length of response: 400
## Dim of features: 400 x 1000
## Correlation: auto regressive
## Rho: 0.5
## Index of causal features: 1, 3, 5, 7, 9
## Model type: gaussian
The following code shows the simplest function call to SMLE(), where we aim to retain only \(k=10\) important features out of \(p=1000\).
fit1 <- SMLE(Y = Data_eg$Y, X = Data_eg$X, k = 10, family = "gaussian")
summary(fit1)
## Call:
## SMLE(X = Data_eg$X, Y = Data_eg$Y, k = 10, family = "gaussian")
##
## An object of class summary.smle
##
## Summary:
##
## Length of response: 400
## Dim of features: 400 x 1000
## Model type: gaussian
## Model size: 10
## Feature name: 1, 3, 5, 7, 9, 30, 163, 164, 191, 414
## Feature index: 1, 3, 5, 7, 9, 30, 163, 164, 191, 414
## Intercept: -0.12411
## Coefficients estimated by IHT: 0.5356, 1.1176, -0.9214, 1.9024, -2.0977, 0.1764, -0.1009, 0.2027, -0.0619, 0.1735
## Number of IHT iteration steps: 2
The function returns a 'smle' object, and the call to the summary() function confirms that a refined set of 10 features is selected. All five causal features are retained in the refined feature set after screening.
The coefficients of the retained features can be viewed with the coef() function:
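coef(fit1)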
## (Intercept) `1` `3` `5` `7` `9`
## 0.001537283 0.566806689 1.019212595 -1.035717991 2.046733916 -2.072110440
## `30` `163` `164` `191` `414`
## 0.072081447 -0.118538590 0.166783654 -0.093756159 0.094989655
The plot function can be used to view the solution path of
coefficients as well as model convergence diagnostics.
The resulting plots are shown below. This example is a fairly trivial
one, so SMLE converges quickly with only small changes to the
coefficients after the first iteration.
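For example, the plots can be produced by calling the plot method on the fitted object:
plot(fit1)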
Figure 1 - Plot of solution path for coefficients
Figure 2 - Plot of diagnostics to assess convergence
Note that even after screening, the set of retained features will contain some that are not relevant to the response. This is to be expected; k should be chosen to be larger than the actual number of causal features, as the goal of feature screening is merely to remove the most irrelevant features before conducting an in-depth analysis on a smaller dataset.
After screening out features, one may wish to conduct feature selection on the refined set to further identify relevant features. This is done with the smle_select() function. The fitted 'smle' object is passed to smle_select() and a criterion for selection is chosen; the default criterion is EBIC. smle_select() returns a 'selection' object, named "fit1_s" in the following example. Only the five causal features from the true model are selected.
fit1_s <- smle_select(fit1, criterion = "ebic")
summary(fit1_s)
## Call:
## smle_select(object = fit1, criterion = "ebic")
##
## An object of class summary.selection
##
## Summary:
##
## Length of response: 400
## Dim of features: 400 x 1000
## Model type: gaussian
## Selected model size: 5
## Selected feature index: 1, 3, 5, 7, 9
## Selection criterion: ebic
## Gamma for ebic: 0.5
The plot function for an object of class 'selection' returns a plot showing the selection criterion scores for each of the candidate sub-models with varying numbers of features. In the example below, we see that EBIC is lowest at five features.
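For example:
plot(fit1_s)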
Figure 3 - Plot after SMLE selection
Example 2 - Logistic regression with categorical features
In this example, we simulate a dataset with five categorical features. The first three categorical features have three levels each, the fourth has four levels, and the fifth has five levels. The first three categorical features (columns 1, 3, and 5 of the data) are causally related to the response; the remaining causal features are numeric.
For this example, we separate the data into training and testing groups in order to assess the classification performance.
set.seed(1)
Data_sim2 <- Gen_Data(n = 700, p = 2000, family = "binomial", num_ctgidx = 5,
                      pos_ctgidx = c(1, 3, 5, 7, 9), effect_truecoef = c(2, 2, 2, -1, 0.5),
                      pos_truecoef = c(1, 3, 5, 8, 12), level_ctgidx = c(3, 3, 3, 4, 5))
training <- sample(c(rep(TRUE, 500), rep(FALSE, 200)))
Y <- Data_sim2$Y
dat <- data.frame(Y, Data_sim2$X)
traindat <- subset(dat, training)
testdat <- subset(dat, !training)
Users may specify whether to treat the dummy covariates of a categorical feature as a single grouped feature or as individual features, and which type of dummy coding is used, via the arguments group and codingtype. We recommend using the default values for these, under which categorical variables are screened out as a group and dummy variable coding is used.
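As a purely illustrative sketch (using the same formula interface as the fit shown next), a call that sets the grouping behaviour explicitly might look like the following; group = TRUE is an assumption about the argument's logical form, and codingtype is left at its package default (see ?SMLE for the accepted options).
# Illustrative only: group = TRUE is assumed to be the logical form of this argument;
# codingtype is left at its default. See ?SMLE for details.
fit_grouped <- SMLE(Y ~ ., data = traindat, k = 10, family = "binomial", group = TRUE)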
In the following call to SMLE, we make use of the formula option for specifying the model. As there are 2000 features, we cannot write out the full formula; instead, we use the . shorthand to include all columns of the data frame (other than the response) as predictors.
fit_1 <- SMLE(Y ~ ., family = "binomial", k = 10, data = traindat)
fit_1
## Call:
## SMLE(formula = Y ~ ., data = traindat, k = 10, family = "binomial")
##
## An object of class smle
##
## Subset:
## Model size: 10
## Feature name: C1, C3, C5, X8, X12, X744, X1261, X1351, X1671, X1746
## Feature index: 2, 4, 6, 9, 13, 745, 1262, 1352, 1672, 1747
We now examine the quality of the classification using the testing
dataset.
threshold <- 0.5
values <- 1 * (predict(fit_1, newdata = testdat, type = "response") > threshold)
(confusionmat <- table(Actual = testdat$Y, Predicted = values))
## Predicted
## Actual 0 1
## 0 66 29
## 1 21 84
(accuracy <- sum(diag(confusionmat)) / sum(confusionmat))
## [1] 0.75
(ppv <- confusionmat[2, 2] / colSums(confusionmat)[2])
## 1
## 0.7433628
(npv <- confusionmat[1, 1] / colSums(confusionmat)[1])
## 0
## 0.7586207
Xu, Chen, and Jiahua Chen. 2014. “The Sparse MLE for Ultrahigh-Dimensional Feature Screening.” Journal of the American Statistical Association 109 (507): 1257–69.
Zang, Qianxiang, Chen Xu, and Kelly Burkett. 2022. “SMLE: An R Package for Joint Feature Screening in Ultrahigh-Dimensional GLMs.” https://arxiv.org/abs/2201.03512.