SelectBoost for Beta regression

Frédéric Bertrand

2025-11-04

Overview

The new sb_beta() helper connects the beta-regression selectors provided by this package to a SelectBoost-style correlated-resampling loop implemented directly in SelectBoost.beta. It squeezes the response into the open unit interval (unless squeeze = FALSE) and tags the output with the name of the selector that was used.
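
As an illustration of what squeezing means, the sketch below compresses responses that touch 0 or 1 into the open unit interval. It assumes the commonly used Smithson-Verkuilen transformation; squeeze_unit() is a hypothetical helper written for this example, and the formula applied internally by sb_beta() may differ.

```r
# Hypothetical helper: map responses from [0, 1] into (0, 1).
# Assumption: Smithson-Verkuilen style compression, (y * (n - 1) + 0.5) / n.
squeeze_unit <- function(y) {
  n <- length(y)
  (y * (n - 1) + 0.5) / n
}

y_raw <- c(0, 0.25, 0.5, 1)   # boundary values 0 and 1 are not valid beta responses
y_sq <- squeeze_unit(y_raw)
y_sq
#> [1] 0.1250 0.3125 0.5000 0.8750
all(y_sq > 0 & y_sq < 1)
#> [1] TRUE
```

The compression shrinks towards 0.5 and becomes negligible as the sample size grows, so interior values are barely changed in realistic data sets.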

This vignette walks through two complementary perspectives:

Reconstructing the SelectBoost workflow step by step with betareg_step_aic() to highlight where correlated resampling happens.
Calling sb_beta() to obtain the same result with a single function call.

Throughout the examples we rely on the built-in simulator to generate correlated design matrices with a handful of truly associated predictors.

sim <- simulation_DATA.beta(
  n = 150, p = 6, s = 3, beta_size = c(1, -0.8, 0.6),
  corr = "ar1", rho = 0.25,
  mechanism = "jitter"
)
str(sim$X)
#>  num [1:150, 1:6] -0.123 1.635 1.428 -0.508 -0.243 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
summary(sim$Y)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.04934 0.25725 0.49003 0.48969 0.70233 0.99998
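
To get a feel for the correlation level requested above, the following base-R sketch builds the population correlation matrix implied by an AR(1) structure with rho = 0.25. It assumes the usual definition cor(x_i, x_j) = rho^|i - j|; how simulation_DATA.beta() constructs its design internally is not shown here.

```r
# Assumed AR(1) structure: cor(x_i, x_j) = rho^|i - j|
p <- 6
rho <- 0.25
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))

# Top-left corner of the 6 x 6 correlation matrix
round(Sigma[1:3, 1:3], 4)

# Largest off-diagonal correlation at the population level
max(abs(Sigma[upper.tri(Sigma)]))
#> [1] 0.25
```

Under this assumption even neighbouring predictors are only weakly correlated, which is worth keeping in mind when choosing grouping thresholds for such a design.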

Manual SelectBoost workflow with beta selectors

The classic SelectBoost algorithm first normalises the design matrix, computes pairwise correlations, groups variables above a chosen threshold and finally resamples the predictors before applying the selector. All of those stages are available directly in SelectBoost.beta.

# Normalise the predictors (centre + L2 scale)
X_norm <- sb_normalize(sim$X)

# Compute correlations
corr_mat <- sb_compute_corr(X_norm)

# Group variables whose absolute correlation exceeds 0.6
raw_groups <- sb_group_variables(corr_mat, c0 = 0.6)

# Draw eight correlated replicas for the grouped variables
X_draws <- sb_resample_groups(X_norm, raw_groups, B = 8, seed = 11)
#> Warning: All groups are singletons; correlated resampling degenerates to
#> repeated `X_norm`.

dim(X_draws[[1]])
#> [1] 150   6
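
For intuition, the normalisation, correlation and grouping stages can be mimicked in base R on a toy design. This is a sketch of the ideas only, not the package's implementation: the variable names are made up, and the use of a non-strict comparison (>=) for the threshold is an assumption.

```r
set.seed(1)
# Toy design: x2 is built to correlate strongly with x1, x3 is independent
x1 <- rnorm(100)
X_toy <- cbind(x1 = x1, x2 = x1 + rnorm(100, sd = 0.3), x3 = rnorm(100))

# Centre each column, then scale it to unit L2 norm
X_c <- sweep(X_toy, 2, colMeans(X_toy))
X_n <- sweep(X_c, 2, sqrt(colSums(X_c^2)), "/")

# For centred unit-norm columns the cross-product is the correlation matrix
C <- crossprod(X_n)

# Group j collects every variable whose absolute correlation with x_j
# reaches the threshold c0 (assumption: non-strict comparison)
c0 <- 0.6
groups <- lapply(seq_len(ncol(C)), function(j) names(which(abs(C[, j]) >= c0)))
names(groups) <- colnames(C)
groups
```

In this toy design x1 and x2 end up in each other's group while x3 forms a singleton. When every group is a singleton, there is nothing to resample jointly, which is the situation the warning above reports.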

Each element of X_draws stores a correlated copy of the normalised design. Feeding these matrices to sb_apply_selector_manual() together with a beta-regression selector yields coefficient estimates for every resampled data set.

coef_path <- sb_apply_selector_manual(
  X_norm, X_draws, sim$Y, selector = betareg_step_aic
)

dim(coef_path)
#> [1] 8 9
coef_path[, 1:3]
#>                        sim0        sim1        sim2
#> (Intercept)     -0.03588528 -0.03588528 -0.03588528
#> x1              11.34931343 11.34931343 11.34931343
#> x2              -8.95724666 -8.95724666 -8.95724666
#> x3               7.17554325  7.17554325  7.17554325
#> x4               0.00000000  0.00000000  0.00000000
#> x5               0.87055660  0.87055660  0.87055660
#> x6               0.00000000  0.00000000  0.00000000
#> phi|(Intercept)  2.95165950  2.95165950  2.95165950

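The leading column sim0 records the coefficients fitted on the original normalised design, providing a baseline against which the resampled fits can be compared. Here all columns are identical because, as the warning above indicates, every group is a singleton at c0 = 0.6, so each resampled design is simply a copy of X_norm.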

Finally, the sb_selection_frequency() helper counts how often each variable appears with a non-zero coefficient across the replicates. Because betareg_step_aic() returns a glmnet-style coefficient vector (intercept plus predictors), we set version = "glmnet" when computing the selection frequencies.

sel_freq <- sb_selection_frequency(coef_path, version = "glmnet")
sel_freq
#>              x1              x2              x3              x4              x5 
#>               1               1               1               0               1 
#>              x6 phi|(Intercept) 
#>               0               1
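
The counting rule itself is simple and can be sketched in base R. The coefficient values below are made up for illustration, and the sketch only mirrors the idea of dropping the intercept row and counting non-zero entries; it is not the code of sb_selection_frequency(), and whether that helper includes the baseline column in the count is not shown here.

```r
# Toy coefficient path: rows are coefficients, columns are fits
coef_toy <- cbind(
  sim0 = c(-0.1, 1.2, 0.0, 0.4),
  sim1 = c(-0.2, 1.1, 0.0, 0.0),
  sim2 = c(-0.1, 1.3, 0.5, 0.0),
  sim3 = c(-0.1, 0.9, 0.0, 0.3)
)
rownames(coef_toy) <- c("(Intercept)", "x1", "x2", "x3")

# Drop the intercept row, then take the share of fits with a non-zero value
freq_toy <- rowMeans(coef_toy[-1, , drop = FALSE] != 0)
freq_toy
# x1 = 1, x2 = 0.25, x3 = 0.5
```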

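This manual exercise shows that the correlated-resampling loop from the original SelectBoost package can be combined with the beta selectors shipped in SelectBoost.beta. Note that the precision intercept phi|(Intercept) is part of the coefficient vector, so it is reported alongside the predictors with a frequency of 1.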

Running the entire loop with sb_beta()

The sb_beta() wrapper performs the same steps internally while exposing the arguments most relevant to beta regression. By default it uses betareg_step_aic() as the base selector, but any of the exported selectors (betareg_step_bic(), betareg_glmnet(), etc.) can be passed either by name or as a function.

sb <- sb_beta(
  sim$X, sim$Y,
  B = 60,                       # correlated resamples per threshold
  step.num = 0.5,
  steps.seq = c(0.9, 0.7, 0.5)  # inner c0 thresholds; the grid also includes 1 and 0
)

class(sb)
#> [1] "sb_beta" "matrix"  "array"
attr(sb, "selector")
#> [1] "betareg_step_aic"
rownames(sb)
#> [1] "c0 = 1.000" "c0 = 0.900" "c0 = 0.700" "c0 = 0.500" "c0 = 0.000"
round(sb, 3)
#> SelectBoost beta selection frequencies
#> Selector: betareg_step_aic
#> Resamples per threshold: 60
#> Interval mode: none
#> c0 grid: 1.0, 0.9, 0.7, 0.5, 0.0
#> Inner thresholds: 0.9, 0.7, 0.5
#>             x1    x2  x3   x4  x5  x6 phi|(Intercept)
#> c0 = 1.000 1.0 1.000 1.0 0.00 1.0 0.0               1
#> c0 = 0.900 1.0 1.000 1.0 0.00 1.0 0.0               1
#> c0 = 0.700 1.0 1.000 1.0 0.00 1.0 0.0               1
#> c0 = 0.500 1.0 1.000 1.0 0.00 1.0 0.0               1
#> c0 = 0.000 0.2 0.167 0.2 0.25 0.2 0.2               1
#> attr(,"c0.seq")
#> [1] 1.0 0.9 0.7 0.5 0.0
#> attr(,"steps.seq")
#> [1] 0.9 0.7 0.5
#> attr(,"B")
#> [1] 60
#> attr(,"selector")
#> [1] "betareg_step_aic"
#> attr(,"resample_diagnostics")
#> attr(,"resample_diagnostics")$`c0 = 1.000`
#> [1] group                   size                    regenerated            
#> [4] cached                  mean_abs_corr_orig      mean_abs_corr_surrogate
#> [7] mean_abs_corr_cross    
#> <0 rows> (or 0-length row.names)
#> 
#> attr(,"resample_diagnostics")$`c0 = 0.900`
#> [1] group                   size                    regenerated            
#> [4] cached                  mean_abs_corr_orig      mean_abs_corr_surrogate
#> [7] mean_abs_corr_cross    
#> <0 rows> (or 0-length row.names)
#> 
#> attr(,"resample_diagnostics")$`c0 = 0.700`
#> [1] group                   size                    regenerated            
#> [4] cached                  mean_abs_corr_orig      mean_abs_corr_surrogate
#> [7] mean_abs_corr_cross    
#> <0 rows> (or 0-length row.names)
#> 
#> attr(,"resample_diagnostics")$`c0 = 0.500`
#> [1] group                   size                    regenerated            
#> [4] cached                  mean_abs_corr_orig      mean_abs_corr_surrogate
#> [7] mean_abs_corr_cross    
#> <0 rows> (or 0-length row.names)
#> 
#> attr(,"resample_diagnostics")$`c0 = 0.000`
#>               group size regenerated cached mean_abs_corr_orig
#> 1 x1,x2,x3,x4,x5,x6    6          60  FALSE          0.1189262
#>   mean_abs_corr_surrogate mean_abs_corr_cross
#> 1               0.1375874          0.06768812
#> 
#> attr(,"interval")
#> [1] "none"

The resulting matrix comes with several attributes that document how the frequencies were generated. attr(sb, "c0.seq") returns the correlation threshold grid, attr(sb, "B") stores the number of correlated resamples per threshold, attr(sb, "interval") records whether interval sampling was activated ("none" in this run), and attr(sb, "resample_diagnostics") keeps summary statistics on the cached surrogate draws. These metadata mirror the legacy SelectBoost beta implementation and are now documented in ?sb_beta.
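
One way to read such a frequency matrix is to fix a threshold row and keep the predictors whose selection frequency reaches a chosen cutoff. The sketch below re-enters the predictor frequencies printed above as a plain matrix so that it runs without the package; the 0.9 cutoff is a hypothetical choice for illustration, not a recommendation from the package.

```r
# Frequencies copied from the printed output above (predictor columns only)
freq <- matrix(
  c(1.0, 1.000, 1.0, 0.00, 1.0, 0.0,
    1.0, 1.000, 1.0, 0.00, 1.0, 0.0,
    1.0, 1.000, 1.0, 0.00, 1.0, 0.0,
    1.0, 1.000, 1.0, 0.00, 1.0, 0.0,
    0.2, 0.167, 0.2, 0.25, 0.2, 0.2),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    sprintf("c0 = %.3f", c(1, 0.9, 0.7, 0.5, 0)),
    paste0("x", 1:6)
  )
)

# Hypothetical rule: keep predictors selected in at least 90% of resamples
cutoff <- 0.9
stable <- colnames(freq)[freq["c0 = 0.500", ] >= cutoff]
stable
#> [1] "x1" "x2" "x3" "x5"
```

Since the sb object is a matrix with the same row names, the same indexing should carry over to it directly, for example sb["c0 = 0.500", ].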

Changing the selector is simply a matter of passing a different routine. The call below uses the glmnet-based elastic-net selector and asks sb_beta() to pass choose = "bic" to the underlying betareg_glmnet() implementation.

sb_enet <- sb_beta(
  sim$X, sim$Y,
  selector = betareg_glmnet,
  B = 60,
  step.num = 0.5,
  version = "glmnet",
  choose = "bic",
  prestandardize = TRUE
)

attr(sb_enet, "selector")
#> [1] "betareg_glmnet"
colMeans(sb_enet)
#>         x1         x2         x3         x4         x5         x6 
#> 0.35000000 0.33888889 0.33333333 0.34444444 0.34444444 0.01666667

Because the wrapper always builds on the same correlated resamples, results are directly comparable across selectors as long as they adopt the glmnet-style coefficient convention. This makes it straightforward to compare several beta selectors under the exact same resampled design matrices, or to run stability analyses for interval responses by pairing sb_beta() with the convenience wrapper sb_beta_interval() (or the lower-level fastboost_interval()).

Conference communications

The SelectBoost4Beta workflow and its correlated resampling foundations were presented by Frédéric Bertrand and Myriam Maumy in 2023 at two conferences:

Joint Statistical Meetings 2023 (Toronto, Canada) — “Improving variable selection in Beta regression models using correlated resampling”.
BioC2023 (Boston, USA) — “SelectBoost4Beta: Improving variable selection in Beta regression models”.

Both communications emphasised how leveraging correlation-aware resampling improves the recall and precision of variable selection in high-dimensional Beta regression settings.
