A common problem in building statistical models is deciding which features to include. The statistical literature offers several approaches, but there is no consensus. Two examples are the lasso and an exhaustive search over all possible combinations of predictors. On large data sets, both can require extensive computation time.
With a multithreaded BLAS, stepwise search provides a computationally lightweight feature selection method. No resampling is needed because AIC is used, and the feature space is searched efficiently. This vignette tests the method in a variety of situations.
The more parameters a model has, the better it fits the training data. If the model is too complex, however, it performs worse on unseen data. AIC strikes a balance between fitting the training data well and keeping the model simple.
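For reference, the standard definition of AIC can be written as follows, where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model (both symbols are introduced here for illustration and are not used elsewhere in this vignette):

\[\mathrm{AIC} = 2k - 2\ln(\hat{L})\]

The first term penalizes complexity and the second rewards fit, so a lower AIC is better.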
Using AIC, a search starts with no features. \[g(\mathrm{E}[Y]) = \beta_0\] Then each feature is considered in turn. If there are 10 features, there are 10 candidate models. AIC is calculated for each candidate, and the model with the lowest AIC is selected. In this example, X1 is selected. \[g(\mathrm{E}[Y]) = \beta_1X_1 + \beta_0\]
After the first feature is selected, the remaining 9 features are considered. The one that gives the lowest AIC is added, creating a 2-feature model. In this round, X3 is selected. \[g(\mathrm{E}[Y]) = \beta_3X_3 + \beta_1X_1 + \beta_0\]
At each step, the search also checks whether dropping an already selected feature would lower AIC (the rows marked with a minus sign in the output below). When no addition or removal improves AIC, the procedure stops.
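The procedure described above can be sketched in a few lines of base R. This is a simplified, forward-only illustration and not how `stepAIC` is implemented. The helper name `forward_aic`, the toy data, and the gaussian family are all assumptions made for the example.

```r
# Hypothetical helper (not part of MASS or GlmSimulatoR): forward-only
# selection that adds the feature giving the lowest AIC until nothing helps.
forward_aic <- function(data, response, candidates) {
  selected <- character(0)
  # Start from the intercept-only model.
  current <- glm(as.formula(paste(response, "~ 1")), data = data)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Fit one candidate model per remaining feature and record its AIC.
    aics <- vapply(remaining, function(v) {
      AIC(glm(reformulate(c(selected, v), response), data = data))
    }, numeric(1))
    # Stop when the best candidate does not improve on the current model.
    if (min(aics) >= AIC(current)) break
    selected <- c(selected, names(which.min(aics)))
    current <- glm(reformulate(selected, response), data = data)
  }
  current
}

# Toy gaussian data: x1 and x2 drive y, noise1 and noise2 do not.
set.seed(42)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2 + rnorm(n)

fit <- forward_aic(toy, "y", c("x1", "x2", "noise1", "noise2"))
selected_terms <- attr(terms(fit), "term.labels")
print(selected_terms)
```

Each pass refits one model per remaining feature, so the cost grows with the number of candidates times the number of accepted steps, which is far less than fitting every subset.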
library(GlmSimulatoR)
library(ggplot2)
library(MASS)
# Create data to work with.
set.seed(1)
simdata <- simulate_inverse_gaussian(N = 100000, link = "1/mu^2",
                                     weights = c(1, 2, 3), unrelated = 3)
# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)
# Set the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)
# Run the search, starting from the intercept-only model.
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start: AIC=-209832
#> Y ~ 1
#>
#> Df Deviance AIC
#> + X3 1 33541 -211190
#> + X2 1 33792 -210458
#> + X1 1 33956 -209982
#> <none> 34008 -209832
#> + Unrelated3 1 34008 -209830
#> + Unrelated1 1 34008 -209830
#> + Unrelated2 1 34008 -209830
#>
#> Step: AIC=-211211.7
#> Y ~ X3
#>
#> Df Deviance AIC
#> + X2 1 33327 -211844
#> + X1 1 33489 -211366
#> <none> 33541 -211212
#> + Unrelated3 1 33541 -211210
#> + Unrelated1 1 33541 -211210
#> + Unrelated2 1 33541 -211210
#> - X3 1 34008 -209830
#>
#> Step: AIC=-211849.4
#> Y ~ X3 + X2
#>
#> Df Deviance AIC
#> + X1 1 33273 -212009
#> <none> 33327 -211849
#> + Unrelated3 1 33327 -211848
#> + Unrelated1 1 33327 -211847
#> + Unrelated2 1 33327 -211847
#> - X2 1 33541 -211212
#> - X3 1 33792 -210460
#>
#> Step: AIC=-212009.4
#> Y ~ X3 + X2 + X1
#>
#> Df Deviance AIC
#> <none> 33273 -212009
#> + Unrelated3 1 33273 -212008
#> + Unrelated1 1 33273 -212008
#> + Unrelated2 1 33273 -212007
#> - X1 1 33327 -211850
#> - X2 1 33489 -211367
#> - X3 1 33739 -210616
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"),
#> data = simdata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6856 -0.4742 -0.0887 0.2979 2.3616
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.81192 0.20843 13.49 <2e-16 ***
#> X3 3.03191 0.08116 37.36 <2e-16 ***
#> X2 2.05731 0.08101 25.40 <2e-16 ***
#> X1 1.02594 0.08067 12.72 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3335926)
#>
#> Null deviance: 34008 on 99999 degrees of freedom
#> Residual deviance: 33273 on 99996 degrees of freedom
#> AIC: -212009
#>
#> Number of Fisher Scoring iterations: 5
rm(simdata, scopeArg, glmSearch, startingModel)
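Looking at the summary, the correct model was found: the search selected X1, X2, and X3 and none of the unrelated variables. The estimated coefficients (1.03, 2.06, and 3.03) are close to the true weights of 1, 2, and 3. Stepwise search worked perfectly!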
# Create data to work with.
set.seed(2)
simdata <- simulate_inverse_gaussian(N = 100000, link = "1/mu^2",
                                     weights = c(1, 2, 3), unrelated = 20)
# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)
# Set the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 +
    Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 +
    Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 +
    Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)
# Run the search, starting from the intercept-only model.
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start: AIC=-210348.5
#> Y ~ 1
#>
#> Df Deviance AIC
#> + X3 1 33551 -211691
#> + X2 1 33817 -210909
#> + X1 1 33955 -210505
#> + Unrelated4 1 34007 -210353
#> + Unrelated9 1 34008 -210350
#> + Unrelated14 1 34008 -210349
#> + Unrelated19 1 34009 -210349
#> <none> 34009 -210349
#> + Unrelated18 1 34009 -210348
#> + Unrelated5 1 34009 -210348
#> + Unrelated20 1 34009 -210348
#> + Unrelated17 1 34009 -210347
#> + Unrelated1 1 34009 -210347
#> + Unrelated3 1 34009 -210347
#> + Unrelated6 1 34009 -210347
#> + Unrelated2 1 34009 -210347
#> + Unrelated11 1 34009 -210347
#> + Unrelated13 1 34009 -210347
#> + Unrelated16 1 34009 -210347
#> + Unrelated7 1 34009 -210347
#> + Unrelated15 1 34009 -210347
#> + Unrelated10 1 34009 -210347
#> + Unrelated8 1 34009 -210347
#> + Unrelated12 1 34009 -210347
#>
#> Step: AIC=-211704.9
#> Y ~ X3
#>
#> Df Deviance AIC
#> + X2 1 33357 -212281
#> + X1 1 33496 -211865
#> + Unrelated4 1 33548 -211710
#> + Unrelated9 1 33550 -211706
#> + Unrelated14 1 33550 -211706
#> + Unrelated19 1 33550 -211706
#> + Unrelated18 1 33550 -211705
#> <none> 33551 -211705
#> + Unrelated17 1 33550 -211704
#> + Unrelated5 1 33550 -211704
#> + Unrelated20 1 33550 -211704
#> + Unrelated1 1 33550 -211704
#> + Unrelated3 1 33550 -211703
#> + Unrelated6 1 33550 -211703
#> + Unrelated2 1 33551 -211703
#> + Unrelated13 1 33551 -211703
#> + Unrelated8 1 33551 -211703
#> + Unrelated15 1 33551 -211703
#> + Unrelated11 1 33551 -211703
#> + Unrelated7 1 33551 -211703
#> + Unrelated12 1 33551 -211703
#> + Unrelated10 1 33551 -211703
#> + Unrelated16 1 33551 -211703
#> - X3 1 34009 -210338
#>
#> Step: AIC=-212282
#> Y ~ X3 + X2
#>
#> Df Deviance AIC
#> + X1 1 33303 -212442
#> + Unrelated4 1 33355 -212287
#> + Unrelated14 1 33356 -212283
#> + Unrelated9 1 33356 -212283
#> + Unrelated18 1 33356 -212282
#> + Unrelated19 1 33356 -212282
#> <none> 33357 -212282
#> + Unrelated17 1 33356 -212281
#> + Unrelated20 1 33356 -212281
#> + Unrelated5 1 33357 -212281
#> + Unrelated1 1 33357 -212281
#> + Unrelated6 1 33357 -212281
#> + Unrelated3 1 33357 -212280
#> + Unrelated13 1 33357 -212280
#> + Unrelated2 1 33357 -212280
#> + Unrelated8 1 33357 -212280
#> + Unrelated15 1 33357 -212280
#> + Unrelated16 1 33357 -212280
#> + Unrelated10 1 33357 -212280
#> + Unrelated12 1 33357 -212280
#> + Unrelated11 1 33357 -212280
#> + Unrelated7 1 33357 -212280
#> - X2 1 33551 -211702
#> - X3 1 33817 -210899
#>
#> Step: AIC=-212441.3
#> Y ~ X3 + X2 + X1
#>
#> Df Deviance AIC
#> + Unrelated4 1 33301 -212446
#> + Unrelated14 1 33302 -212442
#> + Unrelated18 1 33302 -212442
#> + Unrelated9 1 33302 -212442
#> + Unrelated19 1 33302 -212442
#> <none> 33303 -212441
#> + Unrelated20 1 33303 -212440
#> + Unrelated17 1 33303 -212440
#> + Unrelated5 1 33303 -212440
#> + Unrelated1 1 33303 -212440
#> + Unrelated6 1 33303 -212440
#> + Unrelated3 1 33303 -212440
#> + Unrelated13 1 33303 -212439
#> + Unrelated2 1 33303 -212439
#> + Unrelated15 1 33303 -212439
#> + Unrelated8 1 33303 -212439
#> + Unrelated16 1 33303 -212439
#> + Unrelated10 1 33303 -212439
#> + Unrelated12 1 33303 -212439
#> + Unrelated7 1 33303 -212439
#> + Unrelated11 1 33303 -212439
#> - X1 1 33357 -212281
#> - X2 1 33496 -211861
#> - X3 1 33764 -211055
#>
#> Step: AIC=-212446.4
#> Y ~ X3 + X2 + X1 + Unrelated4
#>
#> Df Deviance AIC
#> + Unrelated18 1 33300 -212447
#> + Unrelated14 1 33300 -212447
#> + Unrelated9 1 33300 -212447
#> + Unrelated19 1 33300 -212447
#> <none> 33301 -212446
#> + Unrelated20 1 33300 -212446
#> + Unrelated17 1 33300 -212445
#> + Unrelated5 1 33300 -212445
#> + Unrelated1 1 33300 -212445
#> + Unrelated6 1 33301 -212445
#> + Unrelated3 1 33301 -212445
#> + Unrelated13 1 33301 -212445
#> + Unrelated2 1 33301 -212445
#> + Unrelated15 1 33301 -212444
#> + Unrelated8 1 33301 -212444
#> + Unrelated16 1 33301 -212444
#> + Unrelated10 1 33301 -212444
#> + Unrelated12 1 33301 -212444
#> + Unrelated7 1 33301 -212444
#> + Unrelated11 1 33301 -212444
#> - Unrelated4 1 33303 -212441
#> - X1 1 33355 -212286
#> - X2 1 33494 -211867
#> - X3 1 33761 -211060
#>
#> Step: AIC=-212447.3
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18
#>
#> Df Deviance AIC
#> + Unrelated14 1 33299 -212448
#> + Unrelated9 1 33299 -212448
#> + Unrelated19 1 33299 -212448
#> <none> 33300 -212447
#> + Unrelated20 1 33299 -212446
#> - Unrelated18 1 33301 -212446
#> + Unrelated17 1 33299 -212446
#> + Unrelated5 1 33299 -212446
#> + Unrelated1 1 33299 -212446
#> + Unrelated6 1 33300 -212446
#> + Unrelated3 1 33300 -212446
#> + Unrelated13 1 33300 -212445
#> + Unrelated2 1 33300 -212445
#> + Unrelated15 1 33300 -212445
#> + Unrelated8 1 33300 -212445
#> + Unrelated16 1 33300 -212445
#> + Unrelated10 1 33300 -212445
#> + Unrelated12 1 33300 -212445
#> + Unrelated7 1 33300 -212445
#> + Unrelated11 1 33300 -212445
#> - Unrelated4 1 33302 -212442
#> - X1 1 33354 -212286
#> - X2 1 33493 -211867
#> - X3 1 33761 -211060
#>
#> Step: AIC=-212448.1
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14
#>
#> Df Deviance AIC
#> + Unrelated9 1 33298 -212449
#> + Unrelated19 1 33298 -212449
#> <none> 33299 -212448
#> - Unrelated14 1 33300 -212447
#> - Unrelated18 1 33300 -212447
#> + Unrelated20 1 33298 -212447
#> + Unrelated17 1 33298 -212447
#> + Unrelated5 1 33298 -212447
#> + Unrelated1 1 33299 -212447
#> + Unrelated6 1 33299 -212447
#> + Unrelated3 1 33299 -212447
#> + Unrelated13 1 33299 -212446
#> + Unrelated2 1 33299 -212446
#> + Unrelated15 1 33299 -212446
#> + Unrelated8 1 33299 -212446
#> + Unrelated16 1 33299 -212446
#> + Unrelated10 1 33299 -212446
#> + Unrelated12 1 33299 -212446
#> + Unrelated7 1 33299 -212446
#> + Unrelated11 1 33299 -212446
#> - Unrelated4 1 33301 -212443
#> - X1 1 33353 -212287
#> - X2 1 33492 -211868
#> - X3 1 33760 -211061
#>
#> Step: AIC=-212448.9
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 + Unrelated9
#>
#> Df Deviance AIC
#> + Unrelated19 1 33297 -212450
#> <none> 33298 -212449
#> - Unrelated9 1 33299 -212448
#> - Unrelated18 1 33299 -212448
#> - Unrelated14 1 33299 -212448
#> + Unrelated20 1 33298 -212448
#> + Unrelated17 1 33298 -212448
#> + Unrelated5 1 33298 -212448
#> + Unrelated1 1 33298 -212448
#> + Unrelated6 1 33298 -212447
#> + Unrelated3 1 33298 -212447
#> + Unrelated13 1 33298 -212447
#> + Unrelated2 1 33298 -212447
#> + Unrelated15 1 33298 -212447
#> + Unrelated8 1 33298 -212447
#> + Unrelated16 1 33298 -212447
#> + Unrelated10 1 33298 -212447
#> + Unrelated12 1 33298 -212447
#> + Unrelated7 1 33298 -212447
#> + Unrelated11 1 33298 -212447
#> - Unrelated4 1 33300 -212444
#> - X1 1 33352 -212288
#> - X2 1 33491 -211869
#> - X3 1 33759 -211062
#>
#> Step: AIC=-212449.5
#> Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 + Unrelated9 +
#> Unrelated19
#>
#> Df Deviance AIC
#> <none> 33297 -212450
#> - Unrelated19 1 33298 -212449
#> - Unrelated9 1 33298 -212449
#> - Unrelated18 1 33298 -212449
#> - Unrelated14 1 33298 -212449
#> + Unrelated20 1 33297 -212449
#> + Unrelated17 1 33297 -212449
#> + Unrelated5 1 33297 -212449
#> + Unrelated1 1 33297 -212448
#> + Unrelated6 1 33297 -212448
#> + Unrelated3 1 33297 -212448
#> + Unrelated13 1 33297 -212448
#> + Unrelated2 1 33297 -212448
#> + Unrelated15 1 33297 -212448
#> + Unrelated8 1 33297 -212448
#> + Unrelated16 1 33297 -212448
#> + Unrelated10 1 33297 -212448
#> + Unrelated12 1 33297 -212448
#> + Unrelated7 1 33297 -212448
#> + Unrelated11 1 33297 -212448
#> - Unrelated4 1 33299 -212444
#> - X1 1 33351 -212288
#> - X2 1 33490 -211870
#> - X3 1 33758 -211062
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1 + Unrelated4 + Unrelated18 + Unrelated14 +
#> Unrelated9 + Unrelated19, family = inverse.gaussian(link = "1/mu^2"),
#> data = simdata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.58085 -0.46828 -0.08548 0.29831 2.38619
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.09437 0.34081 9.080 < 2e-16 ***
#> X3 3.02170 0.08107 37.274 < 2e-16 ***
#> X2 1.95173 0.08092 24.120 < 2e-16 ***
#> X1 1.03102 0.08070 12.776 < 2e-16 ***
#> Unrelated4 0.21603 0.08082 2.673 0.00752 **
#> Unrelated18 -0.13565 0.08087 -1.677 0.09347 .
#> Unrelated14 0.13661 0.08081 1.691 0.09092 .
#> Unrelated9 -0.13463 0.08080 -1.666 0.09567 .
#> Unrelated19 -0.13307 0.08086 -1.646 0.09982 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3318398)
#>
#> Null deviance: 34009 on 99999 degrees of freedom
#> Residual deviance: 33297 on 99991 degrees of freedom
#> AIC: -212450
#>
#> Number of Fisher Scoring iterations: 5
rm(simdata, scopeArg, glmSearch, startingModel)
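Five unrelated variables (Unrelated4, Unrelated18, Unrelated14, Unrelated9, and Unrelated19) made it into the final model, although only Unrelated4 is significant at the 0.05 level. At least all related features are in the model.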
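A rough back-of-the-envelope check suggests this is expected behavior rather than bad luck. Adding one parameter raises the AIC penalty by 2, so a pure-noise feature lowers AIC whenever its likelihood-ratio statistic exceeds 2. The sketch below assumes the usual large-sample chi-square approximation with 1 degree of freedom and treats the 20 unrelated features as independent one-shot checks, which ignores the sequential nature of the search and the estimated dispersion.

```r
# Chance that one pure-noise feature lowers AIC, under the large-sample
# chi-square approximation with 1 degree of freedom: P(chisq_1 > 2).
p_enter <- pchisq(2, df = 1, lower.tail = FALSE)

# Rough expected number of noise features passing that check out of 20,
# treating the checks as independent (an assumption for illustration).
expected_noise <- 20 * p_enter

print(round(c(p_enter = p_enter, expected_noise = expected_noise), 3))
# p_enter is about 0.157 and expected_noise is about 3.146
```

Under these assumptions, roughly three of the 20 unrelated features would be expected to slip in, which is the same order of magnitude as what the search produced.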
# Create data to work with.
set.seed(3)
simdata <- simulate_inverse_gaussian(N = 1000, link = "1/mu^2",
                                     weights = c(1, 2, 3), unrelated = 3)
# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)
# Set the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3
)
# Run the search, starting from the intercept-only model.
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start: AIC=-2091.87
#> Y ~ 1
#>
#> Df Deviance AIC
#> + X3 1 344.37 -2100.2
#> + X2 1 346.42 -2094.4
#> + X1 1 347.08 -2092.6
#> <none> 348.05 -2091.9
#> + Unrelated1 1 347.48 -2091.5
#> + Unrelated3 1 347.86 -2090.4
#> + Unrelated2 1 348.05 -2089.9
#>
#> Step: AIC=-2100.52
#> Y ~ X3
#>
#> Df Deviance AIC
#> + X2 1 342.77 -2103.1
#> + X1 1 343.29 -2101.6
#> <none> 344.37 -2100.5
#> + Unrelated1 1 343.80 -2100.2
#> + Unrelated3 1 344.24 -2098.9
#> + Unrelated2 1 344.35 -2098.6
#> - X3 1 348.05 -2092.0
#>
#> Step: AIC=-2103.17
#> Y ~ X3 + X2
#>
#> Df Deviance AIC
#> + X1 1 341.61 -2104.5
#> <none> 342.77 -2103.2
#> + Unrelated1 1 342.23 -2102.7
#> + Unrelated3 1 342.68 -2101.4
#> + Unrelated2 1 342.74 -2101.2
#> - X2 1 344.37 -2100.6
#> - X3 1 346.42 -2094.7
#>
#> Step: AIC=-2104.55
#> Y ~ X3 + X2 + X1
#>
#> Df Deviance AIC
#> <none> 341.61 -2104.6
#> + Unrelated1 1 341.07 -2104.1
#> - X1 1 342.77 -2103.2
#> + Unrelated3 1 341.48 -2102.9
#> + Unrelated2 1 341.58 -2102.6
#> - X2 1 343.29 -2101.7
#> - X3 1 345.36 -2095.8
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + X1, family = inverse.gaussian(link = "1/mu^2"),
#> data = simdata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.75911 -0.49424 -0.08638 0.30464 1.85673
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.8866 2.1941 1.316 0.18860
#> X3 2.7355 0.8310 3.292 0.00103 **
#> X2 1.8694 0.8491 2.202 0.02792 *
#> X1 1.5190 0.8304 1.829 0.06767 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3466483)
#>
#> Null deviance: 348.05 on 999 degrees of freedom
#> Residual deviance: 341.61 on 996 degrees of freedom
#> AIC: -2104.6
#>
#> Number of Fisher Scoring iterations: 5
rm(simdata, scopeArg, glmSearch, startingModel)
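The correct model was found: only X1, X2, and X3 were selected. With the smaller sample, the coefficient estimates are noisier (standard errors near 0.83 rather than 0.08), but stepwise search again worked perfectly!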
# Create data to work with.
set.seed(4)
simdata <- simulate_inverse_gaussian(N = 1000, link = "1/mu^2",
                                     weights = c(1, 2, 3), unrelated = 20)
# Y looks like an inverse Gaussian distribution.
ggplot(simdata, aes(x = Y)) +
  geom_histogram(bins = 30)
# Set the simplest model and the most complex model.
scopeArg <- list(
  lower = Y ~ 1,
  upper = Y ~ X1 + X2 + X3 + Unrelated1 + Unrelated2 + Unrelated3 +
    Unrelated4 + Unrelated5 + Unrelated6 + Unrelated7 + Unrelated8 + Unrelated9 +
    Unrelated10 + Unrelated11 + Unrelated12 + Unrelated13 + Unrelated14 + Unrelated15 +
    Unrelated16 + Unrelated17 + Unrelated18 + Unrelated19 + Unrelated20
)
# Run the search, starting from the intercept-only model.
startingModel <- glm(Y ~ 1, data = simdata, family = inverse.gaussian(link = "1/mu^2"))
glmSearch <- stepAIC(startingModel, scopeArg)
#> Start: AIC=-2125.88
#> Y ~ 1
#>
#> Df Deviance AIC
#> + X3 1 340.71 -2136.2
#> + X2 1 343.81 -2127.4
#> + X1 1 343.86 -2127.3
#> + Unrelated8 1 343.88 -2127.2
#> + Unrelated20 1 344.22 -2126.3
#> + Unrelated15 1 344.32 -2126.0
#> + Unrelated19 1 344.32 -2126.0
#> <none> 345.07 -2125.9
#> + Unrelated7 1 344.59 -2125.2
#> + Unrelated11 1 344.74 -2124.8
#> + Unrelated6 1 344.81 -2124.6
#> + Unrelated4 1 344.82 -2124.6
#> + Unrelated1 1 344.84 -2124.5
#> + Unrelated2 1 344.89 -2124.4
#> + Unrelated10 1 344.95 -2124.2
#> + Unrelated12 1 344.97 -2124.2
#> + Unrelated14 1 345.01 -2124.0
#> + Unrelated13 1 345.02 -2124.0
#> + Unrelated3 1 345.06 -2123.9
#> + Unrelated18 1 345.06 -2123.9
#> + Unrelated5 1 345.06 -2123.9
#> + Unrelated9 1 345.06 -2123.9
#> + Unrelated17 1 345.06 -2123.9
#> + Unrelated16 1 345.07 -2123.9
#>
#> Step: AIC=-2136.6
#> Y ~ X3
#>
#> Df Deviance AIC
#> + X2 1 339.40 -2138.4
#> + Unrelated8 1 339.50 -2138.1
#> + X1 1 339.54 -2138.0
#> + Unrelated15 1 339.88 -2137.0
#> + Unrelated20 1 339.91 -2136.9
#> + Unrelated19 1 339.94 -2136.8
#> <none> 340.71 -2136.6
#> + Unrelated7 1 340.18 -2136.1
#> + Unrelated11 1 340.30 -2135.8
#> + Unrelated2 1 340.48 -2135.2
#> + Unrelated1 1 340.53 -2135.1
#> + Unrelated4 1 340.54 -2135.1
#> + Unrelated6 1 340.56 -2135.0
#> + Unrelated12 1 340.58 -2135.0
#> + Unrelated10 1 340.62 -2134.8
#> + Unrelated16 1 340.66 -2134.7
#> + Unrelated13 1 340.66 -2134.7
#> + Unrelated17 1 340.67 -2134.7
#> + Unrelated5 1 340.68 -2134.7
#> + Unrelated9 1 340.69 -2134.6
#> + Unrelated18 1 340.70 -2134.6
#> + Unrelated14 1 340.70 -2134.6
#> + Unrelated3 1 340.70 -2134.6
#> - X3 1 345.07 -2126.0
#>
#> Step: AIC=-2138.46
#> Y ~ X3 + X2
#>
#> Df Deviance AIC
#> + Unrelated8 1 338.21 -2139.9
#> + X1 1 338.28 -2139.7
#> + Unrelated19 1 338.43 -2139.2
#> + Unrelated15 1 338.58 -2138.8
#> + Unrelated20 1 338.58 -2138.8
#> <none> 339.40 -2138.5
#> + Unrelated7 1 338.90 -2137.9
#> + Unrelated11 1 339.01 -2137.6
#> + Unrelated2 1 339.16 -2137.1
#> + Unrelated4 1 339.20 -2137.0
#> + Unrelated1 1 339.22 -2136.9
#> + Unrelated12 1 339.25 -2136.9
#> + Unrelated6 1 339.25 -2136.9
#> + Unrelated10 1 339.29 -2136.8
#> - X2 1 340.71 -2136.7
#> + Unrelated16 1 339.34 -2136.6
#> + Unrelated17 1 339.34 -2136.6
#> + Unrelated13 1 339.35 -2136.6
#> + Unrelated9 1 339.38 -2136.5
#> + Unrelated5 1 339.38 -2136.5
#> + Unrelated18 1 339.39 -2136.5
#> + Unrelated14 1 339.39 -2136.5
#> + Unrelated3 1 339.39 -2136.5
#> - X3 1 343.81 -2127.7
#>
#> Step: AIC=-2139.96
#> Y ~ X3 + X2 + Unrelated8
#>
#> Df Deviance AIC
#> + X1 1 337.03 -2141.4
#> + Unrelated19 1 337.29 -2140.6
#> + Unrelated20 1 337.45 -2140.2
#> + Unrelated15 1 337.46 -2140.1
#> <none> 338.21 -2140.0
#> + Unrelated7 1 337.73 -2139.3
#> + Unrelated11 1 337.78 -2139.2
#> + Unrelated4 1 337.96 -2138.7
#> + Unrelated2 1 338.00 -2138.6
#> - Unrelated8 1 339.40 -2138.5
#> + Unrelated1 1 338.03 -2138.5
#> + Unrelated12 1 338.08 -2138.3
#> + Unrelated6 1 338.10 -2138.3
#> + Unrelated10 1 338.10 -2138.3
#> - X2 1 339.50 -2138.2
#> + Unrelated17 1 338.15 -2138.1
#> + Unrelated13 1 338.15 -2138.1
#> + Unrelated16 1 338.16 -2138.1
#> + Unrelated5 1 338.18 -2138.0
#> + Unrelated14 1 338.19 -2138.0
#> + Unrelated9 1 338.20 -2138.0
#> + Unrelated3 1 338.20 -2138.0
#> + Unrelated18 1 338.20 -2138.0
#> - X3 1 342.65 -2129.1
#>
#> Step: AIC=-2141.47
#> Y ~ X3 + X2 + Unrelated8 + X1
#>
#> Df Deviance AIC
#> + Unrelated19 1 336.14 -2142.1
#> + Unrelated20 1 336.23 -2141.8
#> + Unrelated15 1 336.29 -2141.6
#> <none> 337.03 -2141.5
#> + Unrelated11 1 336.53 -2140.9
#> + Unrelated7 1 336.57 -2140.8
#> + Unrelated2 1 336.78 -2140.2
#> + Unrelated4 1 336.82 -2140.1
#> - X1 1 338.21 -2140.0
#> + Unrelated1 1 336.85 -2140.0
#> + Unrelated12 1 336.87 -2139.9
#> - X2 1 338.26 -2139.9
#> - Unrelated8 1 338.28 -2139.8
#> + Unrelated6 1 336.92 -2139.8
#> + Unrelated10 1 336.94 -2139.7
#> + Unrelated16 1 336.95 -2139.7
#> + Unrelated13 1 336.96 -2139.7
#> + Unrelated17 1 336.96 -2139.7
#> + Unrelated9 1 337.00 -2139.6
#> + Unrelated5 1 337.01 -2139.5
#> + Unrelated14 1 337.01 -2139.5
#> + Unrelated3 1 337.01 -2139.5
#> + Unrelated18 1 337.02 -2139.5
#> - X3 1 341.43 -2130.6
#>
#> Step: AIC=-2142.1
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19
#>
#> Df Deviance AIC
#> + Unrelated15 1 335.37 -2142.3
#> + Unrelated20 1 335.38 -2142.3
#> <none> 336.14 -2142.1
#> + Unrelated7 1 335.62 -2141.6
#> + Unrelated11 1 335.63 -2141.6
#> - Unrelated19 1 337.03 -2141.5
#> + Unrelated2 1 335.87 -2140.9
#> + Unrelated4 1 335.91 -2140.8
#> - X1 1 337.29 -2140.7
#> + Unrelated1 1 335.96 -2140.6
#> - Unrelated8 1 337.35 -2140.6
#> + Unrelated12 1 336.00 -2140.5
#> + Unrelated6 1 336.03 -2140.4
#> + Unrelated17 1 336.06 -2140.3
#> + Unrelated16 1 336.07 -2140.3
#> + Unrelated13 1 336.07 -2140.3
#> + Unrelated10 1 336.09 -2140.2
#> + Unrelated5 1 336.11 -2140.2
#> + Unrelated9 1 336.11 -2140.2
#> + Unrelated14 1 336.13 -2140.1
#> + Unrelated3 1 336.13 -2140.1
#> + Unrelated18 1 336.14 -2140.1
#> - X2 1 337.55 -2140.0
#> - X3 1 340.60 -2131.1
#>
#> Step: AIC=-2142.38
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15
#>
#> Df Deviance AIC
#> + Unrelated20 1 334.59 -2142.7
#> <none> 335.37 -2142.4
#> - Unrelated15 1 336.14 -2142.1
#> + Unrelated7 1 334.78 -2142.1
#> + Unrelated11 1 334.89 -2141.8
#> - Unrelated19 1 336.29 -2141.7
#> + Unrelated2 1 335.12 -2141.1
#> - X1 1 336.50 -2141.1
#> - Unrelated8 1 336.51 -2141.0
#> + Unrelated1 1 335.18 -2140.9
#> + Unrelated4 1 335.19 -2140.9
#> + Unrelated12 1 335.21 -2140.9
#> + Unrelated6 1 335.27 -2140.7
#> + Unrelated17 1 335.30 -2140.6
#> + Unrelated16 1 335.30 -2140.6
#> + Unrelated13 1 335.31 -2140.6
#> + Unrelated10 1 335.33 -2140.5
#> + Unrelated9 1 335.35 -2140.4
#> + Unrelated5 1 335.35 -2140.4
#> + Unrelated14 1 335.36 -2140.4
#> + Unrelated3 1 335.36 -2140.4
#> + Unrelated18 1 335.37 -2140.4
#> - X2 1 336.77 -2140.2
#> - X3 1 339.91 -2131.0
#>
#> Step: AIC=-2142.72
#> Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15 + Unrelated20
#>
#> Df Deviance AIC
#> <none> 334.59 -2142.7
#> + Unrelated7 1 333.95 -2142.6
#> - Unrelated20 1 335.37 -2142.4
#> - Unrelated15 1 335.38 -2142.4
#> + Unrelated11 1 334.09 -2142.2
#> - Unrelated19 1 335.48 -2142.1
#> - Unrelated8 1 335.66 -2141.6
#> + Unrelated2 1 334.31 -2141.5
#> + Unrelated4 1 334.38 -2141.3
#> - X1 1 335.75 -2141.3
#> + Unrelated1 1 334.40 -2141.3
#> + Unrelated6 1 334.44 -2141.2
#> + Unrelated12 1 334.44 -2141.2
#> + Unrelated17 1 334.49 -2141.0
#> + Unrelated13 1 334.53 -2140.9
#> + Unrelated16 1 334.53 -2140.9
#> + Unrelated10 1 334.54 -2140.9
#> + Unrelated14 1 334.56 -2140.8
#> + Unrelated9 1 334.56 -2140.8
#> + Unrelated5 1 334.58 -2140.8
#> + Unrelated3 1 334.58 -2140.7
#> + Unrelated18 1 334.58 -2140.7
#> - X2 1 335.99 -2140.6
#> - X3 1 339.08 -2131.5
summary(glmSearch)
#>
#> Call:
#> glm(formula = Y ~ X3 + X2 + Unrelated8 + X1 + Unrelated19 + Unrelated15 +
#> Unrelated20, family = inverse.gaussian(link = "1/mu^2"),
#> data = simdata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.36161 -0.48520 -0.09361 0.29986 1.65164
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.2105 3.3207 0.967 0.333882
#> X3 3.0476 0.8385 3.635 0.000293 ***
#> X2 1.7017 0.8362 2.035 0.042112 *
#> Unrelated8 -1.4773 0.8320 -1.776 0.076118 .
#> X1 1.5461 0.8362 1.849 0.064758 .
#> Unrelated19 1.3528 0.8358 1.618 0.105880
#> Unrelated15 -1.3128 0.8585 -1.529 0.126531
#> Unrelated20 1.2960 0.8523 1.520 0.128707
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for inverse.gaussian family taken to be 0.3392529)
#>
#> Null deviance: 345.07 on 999 degrees of freedom
#> Residual deviance: 334.59 on 992 degrees of freedom
#> AIC: -2142.7
#>
#> Number of Fisher Scoring iterations: 5
rm(simdata, scopeArg, glmSearch, startingModel)
Four unrelated features (Unrelated8, Unrelated19, Unrelated15, and Unrelated20) made it into the model, none of them significant at the 0.05 level, but at least all true predictors were selected.
Stepwise search provides a computationally fast way to select features. When half the features were unrelated, the search found the correct model for both small and large n. When the majority of features were unrelated, stepwise search found all related features but also erroneously selected a few unrelated ones (five of 20 at N = 100,000 and four of 20 at N = 1,000).