Title: | T-Rex Selector: High-Dimensional Variable Selection & FDR Control |
Version: | 1.0.0 |
Date: | 2024-02-23 |
Description: | Performs fast variable selection in high-dimensional settings while controlling the false discovery rate (FDR) at a user-defined target level. The package is based on the paper Machkour, Muma, and Palomar (2022) <doi:10.48550/arXiv.2110.06048>. |
Maintainer: | Jasin Machkour <jasin.machkour@tu-darmstadt.de> |
URL: | https://github.com/jasinmachkour/TRexSelector, https://arxiv.org/abs/2110.06048 |
BugReports: | https://github.com/jasinmachkour/TRexSelector/issues |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.1 |
Suggests: | knitr, rmarkdown, ggplot2, patchwork, WGCNA, fastcluster, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Imports: | MASS, stats, tlars, parallel, doParallel, foreach, doRNG, methods, glmnet, boot |
Depends: | R (≥ 2.10) |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2024-02-23 22:36:14 UTC; jasinmachkour |
Author: | Jasin Machkour [aut, cre], Simon Tien [aut], Daniel P. Palomar [aut], Michael Muma [aut] |
Repository: | CRAN |
Date/Publication: | 2024-02-23 23:20:02 UTC |
False discovery proportion (FDP)
Description
Computes the FDP based on the estimated and the true regression coefficient vectors.
Usage
FDP(beta_hat, beta, eps = .Machine$double.eps)
Arguments
beta_hat |
Estimated regression coefficient vector. |
beta |
True regression coefficient vector. |
eps |
Numerical zero. |
Value
False discovery proportion (FDP).
Examples
data("Gauss_data")
X <- Gauss_data$X
y <- c(Gauss_data$y)
beta <- Gauss_data$beta
set.seed(1234)
res <- trex(X, y)
beta_hat <- res$selected_var
FDP(beta_hat = beta_hat, beta = beta)
Toy data generated from a Gaussian linear model
Description
A data set containing a predictor matrix X with n = 50 observations and p = 100 variables (predictors), and a sparse parameter vector beta with associated support vector.
Usage
Gauss_data
Format
A list containing a matrix X and vectors y, beta, and support:
- X
Predictor matrix, n = 50, p = 100.
- y
Response vector.
- beta
Parameter vector.
- support
Support vector.
Examples
# Generated as follows:
set.seed(789)
n <- 50
p <- 100
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
beta <- c(rep(5, times = 3), rep(0, times = 97))
support <- beta > 0
y <- X %*% beta + stats::rnorm(n)
Gauss_data <- list(
X = X,
y = y,
beta = beta,
support = support
)
Computes the Deflated Relative Occurrences
Description
Computes the vector of deflated relative occurrences for all variables (i.e., j = 1,..., p) and T = T_stop.
Usage
Phi_prime_fun(
p,
T_stop,
num_dummies,
phi_T_mat,
Phi,
eps = .Machine$double.eps
)
Arguments
p |
Number of candidate variables. |
T_stop |
Number of included dummies after which the random experiments (i.e., forward selection processes) are stopped. |
num_dummies |
Number of dummies |
phi_T_mat |
Matrix of relative occurrences for all variables (i.e., j = 1,..., p) and for T = 1, ..., T_stop. |
Phi |
Vector of relative occurrences for all variables (i.e., j = 1,..., p) at T = T_stop. |
eps |
Numerical zero. |
Value
Vector of deflated relative occurrences for all variables (i.e., j = 1,..., p) and T = T_stop.
True positive proportion (TPP)
Description
Computes the TPP based on the estimated and the true regression coefficient vectors.
Usage
TPP(beta_hat, beta, eps = .Machine$double.eps)
Arguments
beta_hat |
Estimated regression coefficient vector. |
beta |
True regression coefficient vector. |
eps |
Numerical zero. |
Value
True positive proportion (TPP).
Examples
data("Gauss_data")
X <- Gauss_data$X
y <- c(Gauss_data$y)
beta <- Gauss_data$beta
set.seed(1234)
res <- trex(X, y)
beta_hat <- res$selected_var
TPP(beta_hat = beta_hat, beta = beta)
Add dummy predictors to the original predictor matrix
Description
Sample num_dummies dummy predictors from the univariate standard normal distribution and append them to the predictor matrix X.
Usage
add_dummies(X, num_dummies)
Arguments
X |
Real valued predictor matrix. |
num_dummies |
Number of dummies that are appended to the predictor matrix. |
Value
Enlarged predictor matrix.
Examples
set.seed(123)
n <- 50
p <- 100
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
add_dummies(X = X, num_dummies = p)
Add dummy predictors to the original predictor matrix, as required by the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883)
Description
Generate num_dummies dummy predictors as required for the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883) and append them to the predictor matrix X.
Usage
add_dummies_GVS(X, num_dummies, corr_max = 0.5)
Arguments
X |
Real valued predictor matrix. |
num_dummies |
Number of dummies that are appended to the predictor matrix. Has to be a multiple of the number of original variables. |
corr_max |
Maximum allowed correlation between any two predictors from different clusters. |
Value
Enlarged predictor matrix for the T-Rex+GVS selector.
Examples
set.seed(123)
n <- 50
p <- 100
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
add_dummies_GVS(X = X, num_dummies = p)
Computes the conservative FDP estimate of the T-Rex selector (doi:10.48550/arXiv.2110.06048)
Description
Computes the conservative FDP estimate of the T-Rex selector (doi:10.48550/arXiv.2110.06048)
Usage
fdp_hat(V, Phi, Phi_prime, eps = .Machine$double.eps)
Arguments
V |
Voting level grid. |
Phi |
Vector of relative occurrences. |
Phi_prime |
Vector of deflated relative occurrences. |
eps |
Numerical zero. |
Value
Vector of conservative FDP estimates for each value of the voting level grid.
Perform one random experiment
Description
Run one random experiment of the T-Rex selector (doi:10.48550/arXiv.2110.06048), i.e., generates dummies, appends them to the predictor matrix, and runs the forward selection algorithm until it is terminated after T_stop dummies have been selected.
Usage
lm_dummy(
X,
y,
model_tlars,
T_stop = 1,
num_dummies = ncol(X),
method = "trex",
GVS_type = "IEN",
type = "lar",
corr_max = 0.5,
lambda_2_lars = NULL,
early_stop = TRUE,
verbose = TRUE,
intercept = FALSE,
standardize = TRUE
)
Arguments
X |
Real valued predictor matrix. |
y |
Response vector. |
model_tlars |
Object of the class tlars_cpp. It contains all state variables of the previous T-LARS step (necessary for warm-starts, i.e., restarting the forward selection process exactly where it was previously terminated). |
T_stop |
Number of included dummies after which the random experiments (i.e., forward selection processes) are stopped. |
num_dummies |
Number of dummies that are appended to the predictor matrix. |
method |
'trex' for the T-Rex selector (doi:10.48550/arXiv.2110.06048), 'trex+GVS' for the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883), 'trex+DA+AR1' for the T-Rex+DA+AR1 selector, 'trex+DA+equi' for the T-Rex+DA+equi selector, 'trex+DA+BT' for the T-Rex+DA+BT selector (doi:10.48550/arXiv.2401.15796), 'trex+DA+NN' for the T-Rex+DA+NN selector (doi:10.48550/arXiv.2401.15139). |
GVS_type |
'IEN' for the Informed Elastic Net (doi:10.1109/CAMSAP58249.2023.10403489), 'EN' for the ordinary Elastic Net (doi:10.1111/j.1467-9868.2005.00503.x). |
type |
'lar' for 'LARS' and 'lasso' for Lasso. |
corr_max |
Maximum allowed correlation between any two predictors from different clusters. |
lambda_2_lars |
lambda_2-value for LARS-based Elastic Net. |
early_stop |
Logical. If TRUE, then the forward selection process is stopped after T_stop dummies have been included. Otherwise the entire solution path is computed. |
verbose |
Logical. If TRUE progress in computations is shown when performing T-LARS steps on the created model. |
intercept |
Logical. If TRUE an intercept is included. |
standardize |
Logical. If TRUE the predictors are standardized and the response is centered. |
Value
Object of the class tlars_cpp.
Examples
set.seed(123)
eps <- .Machine$double.eps
n <- 75
p <- 100
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
beta <- c(rep(3, times = 3), rep(0, times = 97))
y <- X %*% beta + rnorm(n)
res <- lm_dummy(X = X, y = y, T_stop = 1, num_dummies = 5 * p)
beta_hat <- res$get_beta()[seq(p)]
support <- abs(beta_hat) > eps
support
Run K random experiments
Description
Run K early terminated T-Rex (doi:10.48550/arXiv.2110.06048) random experiments and compute the matrix of relative occurrences for all variables and all numbers of included variables before stopping.
Usage
random_experiments(
X,
y,
K = 20,
T_stop = 1,
num_dummies = ncol(X),
method = "trex",
GVS_type = "EN",
type = "lar",
corr_max = 0.5,
lambda_2_lars = NULL,
early_stop = TRUE,
lars_state_list,
verbose = TRUE,
intercept = FALSE,
standardize = TRUE,
dummy_coef = FALSE,
parallel_process = FALSE,
parallel_max_cores = min(K, max(1, parallel::detectCores(logical = FALSE))),
seed = NULL,
eps = .Machine$double.eps
)
Arguments
X |
Real valued predictor matrix. |
y |
Response vector. |
K |
Number of random experiments. |
T_stop |
Number of included dummies after which the random experiments (i.e., forward selection processes) are stopped. |
num_dummies |
Number of dummies that are appended to the predictor matrix. |
method |
'trex' for the T-Rex selector (doi:10.48550/arXiv.2110.06048), 'trex+GVS' for the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883), 'trex+DA+AR1' for the T-Rex+DA+AR1 selector, 'trex+DA+equi' for the T-Rex+DA+equi selector, 'trex+DA+BT' for the T-Rex+DA+BT selector (doi:10.48550/arXiv.2401.15796), 'trex+DA+NN' for the T-Rex+DA+NN selector (doi:10.48550/arXiv.2401.15139). |
GVS_type |
'IEN' for the Informed Elastic Net (doi:10.1109/CAMSAP58249.2023.10403489), 'EN' for the ordinary Elastic Net (doi:10.1111/j.1467-9868.2005.00503.x). |
type |
'lar' for 'LARS' and 'lasso' for Lasso. |
corr_max |
Maximum allowed correlation between any two predictors from different clusters (for method = 'trex+GVS'). |
lambda_2_lars |
lambda_2-value for LARS-based Elastic Net. |
early_stop |
Logical. If TRUE, then the forward selection process is stopped after T_stop dummies have been included. Otherwise the entire solution path is computed. |
lars_state_list |
If parallel_process = TRUE: List of state variables of the previous T-LARS steps of the K random experiments (necessary for warm-starts, i.e., restarting the forward selection process exactly where it was previously terminated). If parallel_process = FALSE: List of objects of the class tlars_cpp associated with the K random experiments (necessary for warm-starts, i.e., restarting the forward selection process exactly where it was previously terminated). |
verbose |
Logical. If TRUE progress in computations is shown. |
intercept |
Logical. If TRUE an intercept is included. |
standardize |
Logical. If TRUE the predictors are standardized and the response is centered. |
dummy_coef |
Logical. If TRUE a matrix containing the terminal dummy coefficient vectors of all K random experiments as rows is returned. |
parallel_process |
Logical. If TRUE random experiments are executed in parallel. |
parallel_max_cores |
Maximum number of cores to be used for parallel processing. |
seed |
Seed for random number generator (ignored if parallel_process = FALSE). |
eps |
Numerical zero. |
Value
List containing the results of the K random experiments.
Examples
set.seed(123)
data("Gauss_data")
X <- Gauss_data$X
y <- c(Gauss_data$y)
res <- random_experiments(X = X, y = y)
relative_occurrences_matrix <- res$phi_T_mat
relative_occurrences_matrix
Run the Screen-T-Rex selector (doi:10.1109/SSP53291.2023.10207957)
Description
The Screen-T-Rex selector (doi:10.1109/SSP53291.2023.10207957) performs very fast variable selection in high-dimensional settings while informing the user about the automatically selected false discovery rate (FDR).
Usage
screen_trex(
X,
y,
K = 20,
R = 1000,
method = "trex",
bootstrap = FALSE,
conf_level_grid = seq(0, 1, by = 0.001),
cor_coef = NA,
type = "lar",
corr_max = 0.5,
lambda_2_lars = NULL,
rho_thr_DA = 0.02,
parallel_process = FALSE,
parallel_max_cores = min(K, max(1, parallel::detectCores(logical = FALSE))),
seed = NULL,
eps = .Machine$double.eps,
verbose = TRUE
)
Arguments
X |
Real valued predictor matrix. |
y |
Response vector. |
K |
Number of random experiments. |
R |
Number of bootstrap resamples. |
method |
'trex' for the T-Rex selector (doi:10.48550/arXiv.2110.06048), 'trex+GVS' for the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883), 'trex+DA+AR1' for the T-Rex+DA+AR1 selector, 'trex+DA+equi' for the T-Rex+DA+equi selector. |
bootstrap |
Logical. If TRUE Screen-T-Rex is carried out with bootstrapping. |
conf_level_grid |
Confidence level grid for the bootstrap confidence intervals. |
cor_coef |
AR(1) autocorrelation coefficient for the T-Rex+DA+AR1 selector or equicorrelation coefficient for the T-Rex+DA+equi selector. |
type |
'lar' for 'LARS' and 'lasso' for Lasso. |
corr_max |
Maximum allowed correlation between any two predictors from different clusters. |
lambda_2_lars |
lambda_2-value for LARS-based Elastic Net. |
rho_thr_DA |
Correlation threshold for the T-Rex+DA+AR1 selector and the T-Rex+DA+equi selector (i.e., method = 'trex+DA+AR1' or 'trex+DA+equi'). |
parallel_process |
Logical. If TRUE random experiments are executed in parallel. |
parallel_max_cores |
Maximum number of cores to be used for parallel processing. |
seed |
Seed for random number generator (ignored if parallel_process = FALSE). |
eps |
Numerical zero. |
verbose |
Logical. If TRUE progress in computations is shown. |
Value
A list containing the estimated support vector, the automatically selected false discovery rate (FDR) and additional information.
Examples
data("Gauss_data")
X <- Gauss_data$X
y <- c(Gauss_data$y)
set.seed(123)
res <- screen_trex(X = X, y = y)
selected_var <- res$selected_var
selected_var
Compute set of selected variables
Description
Computes the set of selected variables and returns the estimated support vector for the T-Rex selector (doi:10.48550/arXiv.2110.06048).
Usage
select_var_fun(p, tFDR, T_stop, FDP_hat_mat, Phi_mat, V)
Arguments
p |
Number of candidate variables. |
tFDR |
Target FDR level (between 0 and 1, i.e., 0% and 100%). |
T_stop |
Number of included dummies after which the random experiments (i.e., forward selection processes) are stopped. |
FDP_hat_mat |
Matrix whose rows are the vectors of conservative FDP estimates for each value of the voting level grid. |
Phi_mat |
Matrix of relative occurrences as determined by the T-Rex calibration algorithm. |
V |
Voting level grid. |
Value
Estimated support vector.
Compute set of selected variables for the T-Rex+DA+BT selector T-Rex+DA+BT selector (doi:10.48550/arXiv.2401.15796)
Description
Computes the set of selected variables and returns the estimated support vector for the T-Rex+DA+BT selector (doi:10.48550/arXiv.2401.15796).
Usage
select_var_fun_DA_BT(
p,
tFDR,
T_stop,
FDP_hat_array_BT,
Phi_array_BT,
V,
rho_grid
)
Arguments
p |
Number of candidate variables. |
tFDR |
Target FDR level (between 0 and 1, i.e., 0% and 100%). |
T_stop |
Number of included dummies after which the random experiments (i.e., forward selection processes) are stopped. |
FDP_hat_array_BT |
Array containing the conservative FDP estimates for all variables (dimension 1), values of the voting level grid (dimension 2), and values of the dendrogram grid (dimension 3). |
Phi_array_BT |
Array of relative occurrences as determined by the T-Rex calibration algorithm. |
V |
Voting level grid. |
rho_grid |
Dendrogram grid. |
Value
List containing the estimated support vector, etc.
Run the T-Rex selector (doi:10.48550/arXiv.2110.06048)
Description
The T-Rex selector (doi:10.48550/arXiv.2110.06048) performs fast variable selection in high-dimensional settings while controlling the false discovery rate (FDR) at a user-defined target level.
Usage
trex(
X,
y,
tFDR = 0.2,
K = 20,
max_num_dummies = 10,
max_T_stop = TRUE,
method = "trex",
GVS_type = "IEN",
cor_coef = NA,
type = "lar",
corr_max = 0.5,
lambda_2_lars = NULL,
rho_thr_DA = 0.02,
hc_dist = "single",
hc_grid_length = min(20, ncol(X)),
parallel_process = FALSE,
parallel_max_cores = min(K, max(1, parallel::detectCores(logical = FALSE))),
seed = NULL,
eps = .Machine$double.eps,
verbose = TRUE
)
Arguments
X |
Real valued predictor matrix. |
y |
Response vector. |
tFDR |
Target FDR level (between 0 and 1, i.e., 0% and 100%). |
K |
Number of random experiments. |
max_num_dummies |
Integer factor determining the maximum number of dummies as a multiple of the number of original variables p (i.e., num_dummies = max_num_dummies * p). |
max_T_stop |
If TRUE the maximum number of dummies that can be included before stopping is set to ceiling(n / 2), where n is the number of data points/observations. |
method |
'trex' for the T-Rex selector (doi:10.48550/arXiv.2110.06048), 'trex+GVS' for the T-Rex+GVS selector (doi:10.23919/EUSIPCO55093.2022.9909883), 'trex+DA+AR1' for the T-Rex+DA+AR1 selector, 'trex+DA+equi' for the T-Rex+DA+equi selector, 'trex+DA+BT' for the T-Rex+DA+BT selector (doi:10.48550/arXiv.2401.15796), 'trex+DA+NN' for the T-Rex+DA+NN selector (doi:10.48550/arXiv.2401.15139). |
GVS_type |
'IEN' for the Informed Elastic Net (doi:10.1109/CAMSAP58249.2023.10403489), 'EN' for the ordinary Elastic Net (doi:10.1111/j.1467-9868.2005.00503.x). |
cor_coef |
AR(1) autocorrelation coefficient for the T-Rex+DA+AR1 selector or equicorrelation coefficient for the T-Rex+DA+equi selector. |
type |
'lar' for 'LARS' and 'lasso' for Lasso. |
corr_max |
Maximum allowed correlation between any two predictors from different clusters (for method = 'trex+GVS'). |
lambda_2_lars |
lambda_2-value for LARS-based Elastic Net. |
rho_thr_DA |
Correlation threshold for the T-Rex+DA+AR1 selector and the T-Rex+DA+equi selector (i.e., method = 'trex+DA+AR1' or 'trex+DA+equi'). |
hc_dist |
Distance measure of the hierarchical clustering/dendrogram (only for trex+DA+BT): 'single' for single-linkage, "complete" for complete linkage, "average" for average linkage (see hclust for more options). |
hc_grid_length |
Length of the height-cutoff-grid for the dendrogram (integer between 1 and the number of original variables p). |
parallel_process |
Logical. If TRUE random experiments are executed in parallel. |
parallel_max_cores |
Maximum number of cores to be used for parallel processing. |
seed |
Seed for random number generator (ignored if parallel_process = FALSE). |
eps |
Numerical zero. |
verbose |
Logical. If TRUE progress in computations is shown. |
Value
A list containing the estimated support vector and additional information, including the number of used dummies and the number of included dummies before stopping.
Examples
data("Gauss_data")
X <- Gauss_data$X
y <- c(Gauss_data$y)
set.seed(1234)
res <- trex(X = X, y = y)
selected_var <- res$selected_var
selected_var