Cross-validation of regression models

John Fox and Georges Monette

2024-01-23

Cross-validation

Cross-validation (CV) is an essentially simple and intuitively reasonable approach to estimating the predictive accuracy of regression models. CV is developed in many standard sources on regression modeling and “machine learning”—we particularly recommend James, Witten, Hastie, & Tibshirani (2021, secs. 5.1, 5.3)—and so we will describe the method only briefly here before taking up computational issues and some examples. See Arlot & Celisse (2010) for a wide-ranging, if technical, survey of cross-validation and related methods that emphasizes the statistical properties of CV.

Validating research by replication on independently collected data is a common scientific norm. Emulating this process in a single study by data-division is less common: The data are randomly divided into two, possibly equal-size, parts; the first part is used to develop and fit a statistical model; and then the second part is used to assess the adequacy of the model fit to the first part of the data. Data-division, however, suffers from two problems: (1) Dividing the data decreases the sample size and thus increases sampling error; and (2), even more disconcertingly, particularly in smaller samples, the results can vary substantially based on the random division of the data: See Harrell (2015, sec. 5.3) for this and other remarks about data-division and cross-validation.

Cross-validation speaks to both of these issues. In CV, the data are randomly divided as equally as possible into several, say \(k\), parts, called “folds.” The statistical model is fit \(k\) times, leaving each fold out in turn. Each fitted model is then used to predict the response variable for the cases in the omitted fold. A CV criterion or “cost” measure, such as the mean-squared error (“MSE”) of prediction, is then computed using these predicted values. In the extreme case \(k = n\), the number of cases in the data, each fold consists of a single case: The model is refit \(n\) times, omitting each case in turn—a procedure termed “leave-one-out (LOO) cross-validation.”
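
To make the procedure concrete, here is a bare-bones sketch of \(k\)-fold CV for a linear regression with MSE as the cost measure, computed directly rather than with the cv package; the model formula y ~ x and the data frame D are placeholders:

k <- 10
folds <- sample(rep(1:k, length.out=nrow(D))) # random assignment of cases to folds
yhat <- numeric(nrow(D))
for (i in 1:k){
  fit <- lm(y ~ x, data=D[folds != i, ])      # refit with fold i omitted
  yhat[folds == i] <- predict(fit, newdata=D[folds == i, ])
}
mean((D$y - yhat)^2)                          # CV estimate of the MSE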

Because the \(n\) models are each fit to \(n - 1\) cases, LOO CV produces a nearly unbiased estimate of prediction error. The \(n\) regression models are highly statistically dependent, however, based as they are on nearly the same data, and so the resulting estimate of prediction error has relatively large variance. In contrast, estimates of prediction error for \(k\)-fold CV with \(k = 5\) or \(10\) (commonly employed choices) are somewhat biased but have smaller variance. It is also possible to correct \(k\)-fold CV for bias (see below).

Examples

Polynomial regression for the Auto data

The data for this example are drawn from the ISLR2 package for R, associated with James et al. (2021). The presentation here is close (though not identical) to that in the original source (James et al., 2021, secs. 5.1, 5.3), and it demonstrates the use of the cv() function in the cv package.1

The Auto dataset contains information about 392 cars:

data("Auto", package="ISLR2")
head(Auto)
#>   mpg cylinders displacement horsepower weight acceleration year origin
#> 1  18         8          307        130   3504         12.0   70      1
#> 2  15         8          350        165   3693         11.5   70      1
#> 3  18         8          318        150   3436         11.0   70      1
#> 4  16         8          304        150   3433         12.0   70      1
#> 5  17         8          302        140   3449         10.5   70      1
#> 6  15         8          429        198   4341         10.0   70      1
#>                        name
#> 1 chevrolet chevelle malibu
#> 2         buick skylark 320
#> 3        plymouth satellite
#> 4             amc rebel sst
#> 5               ford torino
#> 6          ford galaxie 500
dim(Auto)
#> [1] 392   9

With the exception of origin (which we don’t use here), these variables are largely self-explanatory, apart possibly from their units of measurement; for details, see help("Auto", package="ISLR2").

We’ll focus here on the relationship of mpg (miles per gallon) to horsepower, as displayed in the following scatterplot:

plot(mpg ~ horsepower, data=Auto)
mpg vs horsepower for the Auto data

The relationship between the two variables is monotone, decreasing, and nonlinear. Following James et al. (2021), we’ll consider approximating the relationship by a polynomial regression, with the degree of the polynomial \(p\) ranging from 1 (a linear regression) to 10.2 Polynomial fits for \(p = 1\) to \(5\) are shown in the following figure:

plot(mpg ~ horsepower, data=Auto)
horsepower <- with(Auto, 
                   seq(min(horsepower), max(horsepower), 
                       length=1000))
for (p in 1:5){
  m <- lm(mpg ~ poly(horsepower,p), data=Auto)
  mpg <- predict(m, newdata=data.frame(horsepower=horsepower))
  lines(horsepower, mpg, col=p + 1, lty=p, lwd=2)
}
legend("topright", legend=1:5, col=2:6, lty=1:5, lwd=2,
       title="Degree", inset=0.02)
mpg vs horsepower for the Auto data

The linear fit is clearly inappropriate; the fits for \(p = 2\) (quadratic) through \(4\) are very similar; and the fit for \(p = 5\) may over-fit the data by chasing one or two relatively high mpg values at the right (but see the CV results reported below).

The following graph shows two measures of estimated (squared) error as a function of polynomial-regression degree: The mean-squared error (“MSE”), defined as \(\mathsf{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\), and the usual residual variance, defined as \(\widehat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \widehat{y}_i)^2\). The former necessarily declines with \(p\) (or, more strictly, can’t increase with \(p\)), while the latter gets slightly larger for the largest values of \(p\), with the “best” value, by a small margin, for \(p = 7\).

library("cv") # for mse() and other functions

var <- mse <- numeric(10)
for (p in 1:10){
  m <- lm(mpg ~ poly(horsepower, p), data=Auto)
  mse[p] <- mse(Auto$mpg, fitted(m))
  var[p] <- summary(m)$sigma^2
}

plot(c(1, 10), range(mse, var), type="n",
     xlab="Degree of polynomial, p",
     ylab="Estimated Squared Error")
lines(1:10, mse, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, var, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c(expression(hat(sigma)^2), "MSE"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)
Estimated squared error as a function of polynomial degree, \(p\)

The code for this graph uses the mse() function from the cv package to compute the MSE for each fit.

Using cv()

The generic cv() function has an "lm" method, which by default performs \(k = 10\)-fold CV:

m.auto <- lm(mpg ~ poly(horsepower, 2), data=Auto)
summary(m.auto)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -14.714  -2.594  -0.086   2.287  15.896 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, 2)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, 2)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared:  0.688,  Adjusted R-squared:  0.686 
#> F-statistic:  428 on 2 and 389 DF,  p-value: <2e-16

cv(m.auto)
#> R RNG seed set to 398475
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.238
#> bias-adjusted cross-validation criterion = 19.224
#> full-sample criterion = 18.985

The "lm" method by default uses mse() as the CV criterion and the Woodbury matrix identity to update the regression with each fold deleted without having literally to refit the model. Computational details are discussed in the final section of this vignette. The function reports the CV estimate of MSE, a biased-adjusted estimate of the MSE (the bias adjustment is explained in the final section), and the MSE is also computed for the original, full-sample regression. Because the division of the data into 10 folds is random, cv() explicitly (randomly) generates and saves a seed for R’s pseudo-random number generator, to make the results replicable. The user can also specify the seed directly via the seed argument to cv().

To perform LOO CV, we can set the k argument to cv() to the number of cases in the data, here k=392, or, more conveniently, to k="loo" or k="n":

cv(m.auto, k="loo")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: mse
#> cross-validation criterion = 19.248

For LOO CV of a linear model, cv() by default uses the hatvalues from the model fit to the full data for the LOO updates, and reports only the CV estimate of MSE. Alternative methods are to use the Woodbury matrix identity or the “naive” approach of literally refitting the model with each case omitted. All three methods produce exact results for a linear model (within the precision of floating-point computations):
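
For a linear model, the LOO residuals are available directly from the full-sample fit via the identity \(e_{-i} = e_i/(1 - h_i)\), where \(h_i\) is the \(i\)th hatvalue; the following check (a sketch, not the package’s internal code) reproduces the LOO CV criterion for m.auto:

e <- residuals(m.auto)
h <- hatvalues(m.auto)
mean((e/(1 - h))^2) # equals the LOO CV criterion reported above, 19.248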

cv(m.auto, k="loo", method="naive")
#> n-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985

cv(m.auto, k="loo", method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985

The "naive" and "Woodbury" methods also return the bias-adjusted estimate of MSE and the full-sample MSE, but bias isn’t an issue for LOO CV.

This is a small regression problem and all three computational approaches are essentially instantaneous, but it is still of interest to investigate their relative speed. In this comparison, we include the cv.glm() function from the boot package, which takes the naive approach, and for which we have to fit the linear model as an equivalent Gaussian GLM. We use the microbenchmark() function from the package of the same name for the timings (Mersmann, 2023):

m.auto.glm <- glm(mpg ~ poly(horsepower, 2), data=Auto)
boot::cv.glm(Auto, m.auto.glm)$delta
#> [1] 19.248 19.248

microbenchmark::microbenchmark(
  hatvalues = cv(m.auto, k="loo"),
  Woodbury = cv(m.auto, k="loo", method="Woodbury"),
  naive = cv(m.auto, k="loo", method="naive"),
  cv.glm = boot::cv.glm(Auto, m.auto.glm),
  times=10
)
#> Warning in microbenchmark::microbenchmark(hatvalues = cv(m.auto, k = "loo"), :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#>       expr       min       lq     mean   median       uq      max neval cld
#>  hatvalues    986.13   1010.6   1131.7   1176.3   1197.7   1205.1    10 a  
#>   Woodbury   9970.58  10173.4  10292.8  10351.1  10419.4  10462.7    10 a  
#>      naive 209703.40 210921.2 233033.5 219749.1 270959.1 277474.9    10  b 
#>     cv.glm 375319.86 382713.4 395619.9 384162.0 387250.0 448388.6    10   c

On our computer, using the hatvalues is about an order of magnitude faster than employing Woodbury matrix updates, and more than two orders of magnitude faster than refitting the model.3

Comparing competing models

The cv() function also has a method that can be applied to a list of regression models for the same data, composed using the models() function. For \(k\)-fold CV, the same folds are used for the competing models, which reduces random error in their comparison. This result can also be obtained by specifying a common seed for R’s random-number generator while applying cv() separately to each model, but employing a list of models is more convenient for both \(k\)-fold and LOO CV (where there is no random component to the composition of the \(n\) folds).
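
For example (a sketch, in which m.auto.3 is a cubic fit introduced only for illustration), separate calls to cv() that specify the same seed employ the same folds and so are directly comparable:

m.auto.3 <- lm(mpg ~ poly(horsepower, 3), data=Auto)
cv(m.auto, seed=2120)   # quadratic model fit earlier
cv(m.auto.3, seed=2120) # cubic model, same folds because of the common seed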

We illustrate with the polynomial regression models of varying degree for the Auto data (discussed previously), beginning by fitting and saving the 10 models:

for (p in 1:10){
  assign(paste0("m.", p),
         lm(mpg ~ poly(horsepower, p), data=Auto))
}
objects(pattern="m\\.[0-9]")
#>  [1] "m.1"  "m.10" "m.2"  "m.3"  "m.4"  "m.5"  "m.6"  "m.7"  "m.8"  "m.9"
summary(m.2) # for example, the quadratic fit
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, p), data = Auto)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -14.714  -2.594  -0.086   2.287  15.896 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, p)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, p)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared:  0.688,  Adjusted R-squared:  0.686 
#> F-statistic:  428 on 2 and 389 DF,  p-value: <2e-16

We then apply cv() to the list of 10 models (the data argument is required):

# 10-fold CV
cv.auto.10 <- cv(models(m.1, m.2, m.3, m.4, m.5,
                     m.6, m.7, m.8, m.9, m.10),
              data=Auto, seed=2120)
cv.auto.10[1:2] # for the linear and quadratic models
#> 
#> Model model.1:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 24.246
#> bias-adjusted cross-validation criterion = 24.23
#> full-sample criterion = 23.944 
#> 
#> Model model.2:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 19.346
#> bias-adjusted cross-validation criterion = 19.327
#> full-sample criterion = 18.985

# LOO CV
cv.auto.loo <- cv(models(m.1, m.2, m.3, m.4, m.5,
                        m.6, m.7, m.8, m.9, m.10),
                 data=Auto, k="loo")
cv.auto.loo[1:2] # linear and quadratic models
#> 
#> Model model.1:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 24.232
#> Model model.2:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 19.248

Because we didn’t supply names for the models in the calls to the models() function, the names model.1, model.2, etc., are generated by the function.

Finally, we extract and graph the adjusted MSEs for \(10\)-fold CV and the MSEs for LOO CV:

cv.mse.10 <- sapply(cv.auto.10, function(x) x[["adj CV crit"]])
cv.mse.loo <- sapply(cv.auto.loo, function(x) x[["CV crit"]])
plot(c(1, 10), range(cv.mse.10, cv.mse.loo), type="n",
     xlab="Degree of polynomial, p",
     ylab="Cross-Validated MSE")
lines(1:10, cv.mse.10, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, cv.mse.loo, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c("10-Fold CV", "LOO CV"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)
Cross-validated 10-fold and LOO MSE as a function of polynomial degree, \(p\)

Alternatively, we can use the plot() method for "cvModList" objects to compare the models, though with separate graphs for 10-fold and LOO CV:

plot(cv.auto.10, main="Polynomial Regressions, 10-Fold CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")
plot(cv.auto.loo, main="Polynomial Regressions, LOO CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")
Cross-validated 10-fold and LOO MSE as a function of polynomial degree, \(p\)

In this example, 10-fold and LOO CV produce generally similar results, and also results that are similar to those produced by the estimated error variance \(\widehat{\sigma}^2\) for each model, reported above (except for the highest-degree polynomials, where the CV results more clearly suggest over-fitting).

Logistic regression for the Mroz data

The Mroz data set from the carData package (associated with Fox & Weisberg, 2019) has been used by several authors to illustrate binary logistic regression; see, in particular, Fox & Weisberg (2019). The data were originally drawn from the U.S. Panel Study of Income Dynamics and pertain to married women. Here are a few cases in the data set:

data("Mroz", package="carData")
head(Mroz, 3)
#>   lfp k5 k618 age wc hc    lwg   inc
#> 1 yes  1    0  32 no no 1.2102 10.91
#> 2 yes  0    2  30 no no 0.3285 19.50
#> 3 yes  1    3  35 no no 1.5141 12.04
tail(Mroz, 3)
#>     lfp k5 k618 age wc hc     lwg    inc
#> 751  no  0    0  43 no no 0.88814  9.952
#> 752  no  0    0  60 no no 1.22497 24.984
#> 753  no  0    3  39 no no 0.85321 28.363

The response variable in the logistic regression is lfp, labor-force participation, a factor coded "yes" or "no". The remaining variables (k5, k618, age, wc, hc, lwg, and inc) are predictors; see help("Mroz", package="carData") for their definitions.

We use the glm() function to fit a binary logistic regression to the Mroz data:

m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
summary(m.mroz)
#> 
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234    
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762    
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  905.27  on 745  degrees of freedom
#> AIC: 921.3
#> 
#> Number of Fisher Scoring iterations: 4

BayesRule(ifelse(Mroz$lfp == "yes", 1, 0), 
          fitted(m.mroz, type="response"))
#> [1] 0.30677
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"

In addition to the usual summary output for a GLM, we show the result of applying the BayesRule() function from the cv package to predictions derived from the fitted model. Bayes rule predicts a “success” in a binary regression model when the fitted probability of success [i.e., the estimate \(\widehat{\phi}\) of \(\phi = \Pr(y = 1)\)] is \(\widehat{\phi} \ge .5\) and a “failure” when \(\widehat{\phi} \lt .5\).4 The first argument to BayesRule() is the binary {0, 1} response, and the second argument is the predicted probability of success. BayesRule() returns the proportion of predictions that are in error, as appropriate for a “cost” function.
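
The casewise loss underlying BayesRule() is simple to compute by hand (a sketch, using the same response and fitted probabilities as above):

y <- ifelse(Mroz$lfp == "yes", 1, 0) # binary {0, 1} response
phat <- fitted(m.mroz)               # fitted probabilities of success
mean(y != round(phat))               # proportion misclassified, approx. 0.307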

The value returned by BayesRule() is associated with an “attribute” named "casewise loss" and set to "y != round(yhat)", signifying that the Bayes rule CV criterion is computed as the mean of casewise values, here 0 if the prediction for a case matches the observed value and 1 if it does not (signifying a prediction error). The mse() function for numeric responses is also calculated as a casewise average. Some other criteria, such as the median absolute error, computed by the medAbsErr() function in the cv package, aren’t averages of casewise components. The distinction is important because, to our knowledge, the statistical theory of cross-validation, for example, in Davison & Hinkley (1997), S. Bates, Hastie, & Tibshirani (2023), and Arlot & Celisse (2010), is developed for CV criteria like MSE that are means of casewise components. As a consequence, we limit computation of bias adjustment and confidence intervals (see below) to criteria that are casewise averages.

In this example, the fitted logistic regression incorrectly predicts 31% of the responses; we expect this estimate to be optimistic given that the model is used to “predict” the data to which it is fit.

The "glm" method for cv() is largely similar to the "lm" method, although the default algorithm, selected explicitly by method="exact", refits the model with each fold removed (and is thus equivalent to method="naive" for "lm" models). For generalized linear models, method="Woodbury" or (for LOO CV) method="hatvalues" provide approximate results (see the last section of the vignette for details):

cv(m.mroz, criterion=BayesRule, seed=248)
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31952
#> 95% CI for bias-adjusted CV criterion = (0.28607, 0.35297)
#> full-sample criterion = 0.30677

cv(m.mroz, criterion=BayesRule, seed=248, method="Woodbury")
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677

To ensure that the two methods use the same 10 folds, we specify the seed for R’s random-number generator explicitly; here, and as is common in our experience, the "exact" and "Woodbury" algorithms produce nearly identical results. The CV estimates of prediction error are slightly higher than the estimate based on all of the cases.

The printed output includes a 95% confidence interval for the bias-adjusted Bayes rule CV criterion. S. Bates et al. (2023) show that these confidence intervals are unreliable for models fit to small samples, and by default cv() computes them only when the sample size is 400 or larger and when the CV criterion employed is an average of casewise components, as is the case for Bayes rule. See the final section of the vignette for details of the computation of confidence intervals for bias-adjusted CV criteria.

Here are results of applying LOO CV to the Mroz model, using both the exact and the approximate methods:

cv(m.mroz, k="loo", criterion=BayesRule)
#> n-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677

cv(m.mroz, k="loo", criterion=BayesRule, method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677

cv(m.mroz, k="loo", criterion=BayesRule, method="hatvalues")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: BayesRule
#> cross-validation criterion = 0.32005

To the number of decimal digits shown, the three methods produce identical results for this example.

As for linear models, we report some timings for the various cv() methods of computation in LOO CV as well as for the cv.glm() function from the boot package (which, recall, refits the model with each case removed, and thus is comparable to cv() with method="exact"):

microbenchmark::microbenchmark(
  hatvalues=cv(m.mroz, k="loo", criterion=BayesRule, method="hatvalues"),
  Woodbury=cv(m.mroz, k="loo", criterion=BayesRule, method="Woodbury"),
  exact=cv(m.mroz, k="loo", criterion=BayesRule),
  cv.glm=boot::cv.glm(Mroz, m.mroz,
               cost=BayesRule),
  times=10)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq       max neval
#>  hatvalues    1.3145    1.3296    1.6209    1.3431    1.3785    4.0053    10
#>   Woodbury   37.0752   39.6014   40.5334   40.0595   40.5473   47.5972    10
#>      exact 1714.0892 1747.4177 1782.3877 1785.8192 1813.3066 1848.2108    10
#>     cv.glm 2003.4197 2032.1928 2066.9809 2072.7632 2104.7543 2115.3950    10
#>   cld
#>  a   
#>   b  
#>    c 
#>     d

There is a substantial time penalty associated with exact computations.

Cross-validating mixed-effects models

The fundamental analogy for cross-validation is to the collection of new data. That is, predicting the response in each fold from the model fit to data in the other folds is like using the model fit to all of the data to predict the response for new cases from the values of the predictors for those new cases. As we explained, the application of this idea to independently sampled cases is straightforward—simply partition the data into random folds of equal size and leave each fold out in turn, or, in the case of LOO CV, simply omit each case in turn.

In contrast, mixed-effects models are fit to dependent data, in which cases are clustered, as in hierarchical data, where the clusters comprise higher-level units (e.g., students clustered in schools), or longitudinal data, where the clusters are individuals and the cases are repeated observations on the individuals over time.5

We can think of two approaches to applying cross-validation to clustered data:6

  1. Treat CV as analogous to predicting the response for one or more cases in a newly observed cluster. In this instance, the folds comprise one or more whole clusters; we refit the model with all of the cases in clusters in the current fold removed; and then we predict the response for the cases in clusters in the current fold. These predictions are based only on fixed effects because the random effects for the omitted clusters are presumably unknown, as they would be for data on cases in newly observed clusters.

  2. Treat CV as analogous to predicting the response for a newly observed case in an existing cluster. In this instance, the folds comprise one or more individual cases, and the predictions can use both the fixed and random effects (both kinds of predictions are sketched below).
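
In lme4, for example, these two kinds of predictions correspond to the re.form argument of predict() (a minimal sketch, where m stands for a model fit by lmer() and newFold for a data frame containing the cases in an omitted fold; both are placeholder names):

predict(m, newdata=newFold, re.form=~0) # fixed effects only (approach 1)
predict(m, newdata=newFold)             # fixed plus predicted random effects (approach 2)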

Example: The High-School and Beyond data

Following their use by Raudenbush & Bryk (2002), data from the 1982 High School and Beyond (HSB) survey have become a staple of the literature on mixed-effects models. The HSB data are used by Fox & Weisberg (2019, sec. 7.2.2) to illustrate the application of linear mixed models to hierarchical data, and we’ll closely follow their example here.

The HSB data are included in the MathAchieve and MathAchSchool data sets in the nlme package (Pinheiro & Bates, 2000). MathAchieve includes individual-level data on 7185 students in 160 high schools, and MathAchSchool includes school-level data:

data("MathAchieve", package="nlme")
dim(MathAchieve)
#> [1] 7185    6
head(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>   School Minority    Sex    SES MathAch MEANSES
#> 1   1224       No Female -1.528   5.876  -0.428
#> 2   1224       No Female -0.588  19.708  -0.428
#> 3   1224       No   Male -0.528  20.349  -0.428
tail(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>      School Minority    Sex    SES MathAch MEANSES
#> 7183   9586       No Female  1.332  19.641   0.627
#> 7184   9586       No Female -0.008  16.241   0.627
#> 7185   9586       No Female  0.792  22.733   0.627

data("MathAchSchool", package="nlme")
dim(MathAchSchool)
#> [1] 160   7
head(MathAchSchool, 2)
#>      School Size Sector PRACAD DISCLIM HIMINTY MEANSES
#> 1224   1224  842 Public   0.35   1.597       0  -0.428
#> 1288   1288 1855 Public   0.27   0.174       0   0.128
tail(MathAchSchool, 2)
#>      School Size   Sector PRACAD DISCLIM HIMINTY MEANSES
#> 9550   9550 1532   Public   0.45   0.791       0   0.059
#> 9586   9586  262 Catholic   1.00  -2.416       0   0.627

The first few students are in school number 1224 and the last few in school 9586.

We’ll use only the School, SES (students’ socioeconomic status), and MathAch (their score on a standardized math-achievement test) variables in the MathAchieve data set, and Sector ("Catholic" or "Public") in the MathAchSchool data set.

Some data-management is required before fitting a mixed-effects model to the HSB data, for which we use the dplyr package (Wickham, François, Henry, Müller, & Vaughan, 2023):

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
MathAchieve %>% group_by(School) %>%
  summarize(mean.ses = mean(SES)) -> Temp
Temp <- merge(MathAchSchool, Temp, by="School")
HSB <- merge(Temp[, c("School", "Sector", "mean.ses")],
             MathAchieve[, c("School", "SES", "MathAch")], by="School")
names(HSB) <- tolower(names(HSB))

HSB$cses <- with(HSB, ses - mean.ses)

In the process, we created two new variables: mean.ses, the average SES for students in each school; and cses, each student’s SES centered at his or her school’s mean. For details, see Fox & Weisberg (2019, sec. 7.2.2).

Still following Fox and Weisberg, we proceed to use the lmer() function in the lme4 package (D. Bates, Mächler, Bolker, & Walker, 2015) to fit a mixed model for math achievement to the HSB data:

library("lme4")
#> Loading required package: Matrix
hsb.lmer <- lmer(mathach ~ mean.ses*cses + sector*cses
                   + (cses | school), data=HSB)
summary(hsb.lmer, correlation=FALSE)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ mean.ses * cses + sector * cses + (cses | school)
#>    Data: HSB
#> 
#> REML criterion at convergence: 46504
#> 
#> Scaled residuals: 
#>    Min     1Q Median     3Q    Max 
#> -3.159 -0.723  0.017  0.754  2.958 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr
#>  school   (Intercept)  2.380   1.543        
#>           cses         0.101   0.318    0.39
#>  Residual             36.721   6.060        
#> Number of obs: 7185, groups:  school, 160
#> 
#> Fixed effects:
#>                     Estimate Std. Error t value
#> (Intercept)           12.128      0.199   60.86
#> mean.ses               5.333      0.369   14.45
#> cses                   2.945      0.156   18.93
#> sectorCatholic         1.227      0.306    4.00
#> mean.ses:cses          1.039      0.299    3.48
#> cses:sectorCatholic   -1.643      0.240   -6.85

We can then cross-validate at the cluster (i.e., school) level,

cv(hsb.lmer, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.148
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.231)
#> full-sample criterion = 39.006

or at the case (i.e., student) level,

cv(hsb.lmer, seed=1575)
#> R RNG seed set to 1575
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00587228 (tol = 0.002, component 1)
#> boundary (singular) fit: see help('isSingular')
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.445
#> bias-adjusted cross-validation criterion = 37.338
#> 95% CI for bias-adjusted CV criterion = (36.288, 38.388)
#> full-sample criterion = 36.068

For cluster-level CV, the clusterVariables argument tells cv() how the clusters are defined. Were there more than one clustering variable, say classes within schools, these would be provided as a character vector of variable names: clusterVariables = c("school", "class"). For cluster-level CV, the default is k = "loo", that is, leave one cluster out at a time; we instead specify k = 10 folds of clusters, each fold therefore comprising \(160/10 = 16\) schools.

If the clusterVariables argument is omitted, then case-level CV is employed, with k = 10 folds as the default, here each with \(7185/10 \approx 719\) students. Notice that one of the 10 models refit with a fold removed failed to converge. Convergence problems are common in mixed-effects modeling. The apparent issue here is that an estimated variance component is close to or equal to 0, which is at a boundary of the parameter space. That shouldn’t disqualify the fitted model for the kind of prediction required for cross-validation.

There is also a cv() method for linear mixed models fit by the lme() function in the nlme package, and the arguments for cv() in this case are the same as for a model fit by lmer() or glmer(). We illustrate with the mixed model fit to the HSB data:

library("nlme")
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#> 
#>     lmList
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
hsb.lme <- lme(mathach ~ mean.ses*cses + sector*cses,
                 random = ~ cses | school, data=HSB,
               control=list(opt="optim"))
summary(hsb.lme)
#> Linear mixed-effects model fit by REML
#>   Data: HSB 
#>     AIC   BIC logLik
#>   46525 46594 -23252
#> 
#> Random effects:
#>  Formula: ~cses | school
#>  Structure: General positive-definite, Log-Cholesky parametrization
#>             StdDev   Corr  
#> (Intercept) 1.541177 (Intr)
#> cses        0.018174 0.006 
#> Residual    6.063492       
#> 
#> Fixed effects:  mathach ~ mean.ses * cses + sector * cses 
#>                       Value Std.Error   DF t-value p-value
#> (Intercept)         12.1282   0.19920 7022  60.886   0e+00
#> mean.ses             5.3367   0.36898  157  14.463   0e+00
#> cses                 2.9421   0.15122 7022  19.456   0e+00
#> sectorCatholic       1.2245   0.30611  157   4.000   1e-04
#> mean.ses:cses        1.0444   0.29107 7022   3.588   3e-04
#> cses:sectorCatholic -1.6421   0.23312 7022  -7.044   0e+00
#>  Correlation: 
#>                     (Intr) men.ss cses   sctrCt mn.ss:
#> mean.ses             0.256                            
#> cses                 0.000  0.000                     
#> sectorCatholic      -0.699 -0.356  0.000              
#> mean.ses:cses        0.000  0.000  0.295  0.000       
#> cses:sectorCatholic  0.000  0.000 -0.696  0.000 -0.351
#> 
#> Standardized Within-Group Residuals:
#>       Min        Q1       Med        Q3       Max 
#> -3.170106 -0.724877  0.014892  0.754263  2.965498 
#> 
#> Number of Observations: 7185
#> Number of Groups: 160

cv(hsb.lme, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.149
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.232)
#> full-sample criterion = 39.006

cv(hsb.lme, seed=1575)
#> R RNG seed set to 1575
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.442
#> bias-adjusted cross-validation criterion = 37.402
#> 95% CI for bias-adjusted CV criterion = (36.351, 38.453)
#> full-sample criterion = 36.147

We used the same random-number generator seeds as in the previous example cross-validating the model fit by lmer(), and so the same folds are employed in both cases.7 The estimated covariance components and fixed effects in the summary output differ slightly between the lmer() and lme() solutions, although both functions seek to maximize the REML criterion. This is, of course, to be expected when different algorithms are used for numerical optimization. To the precision reported, the cluster-level CV results for the lmer() and lme() models are identical, while the case-level CV results are very similar but not identical.

Example: Contrived hierarchical data

We introduce an artificial data set that exemplifies aspects of cross-validation particular to hierarchical models. Using this data set, we show that model comparisons employing cluster-based and those employing case-based cross-validation may not agree on a “best” model. Furthermore, commonly used measures of fit, such as mean-squared error, do not necessarily become smaller as models become larger, even when the models are nested, and even when the measure of fit is computed for the whole data set.

Consider a researcher studying improvement in a skill, yodeling, for example, among students enrolled in a four-year yodeling program. The plan is to measure each student’s skill level at the beginning of the program and every year thereafter until the end of the program, resulting in five annual measurements for each student. It turns out that yodeling appeals to students of all ages, and students enrolling in the program range in age from 20 to 70. Moreover, participants’ untrained yodeling skill is similar at all ages, as is their rate of progress with training. All students complete the four-year program.

The researcher, who has more expertise in yodeling than in modeling, decides to model the response, \(y\), yodeling skill, as a function of age, \(x\), reasoning that students get older during their stay in the program, and (incorrectly) that age can serve as a proxy for elapsed time. The researcher knows that a mixed model should be used to account for clustering due to the expected similarity of measurements taken from each student.

We start by generating the data, using parameters consistent with the description above and meant to highlight the issues that arise in cross-validating mixed-effects models:8

# Parameters:
set.seed(9693) 
Nb <- 100     # number of groups
Nw <- 5       # number of individuals within groups
Bb <- 0       # between-group regression coefficient on group mean
SDre <- 2.0   # between-group SD of random level relative to group mean of x
SDwithin <- 0.5  # within group SD
Bw <- 1          # within group effect of x
Ay <- 10         # intercept for response
Ax <- 20         # starting level of x
Nx <- Nw*10      # number of distinct x values

Data <- data.frame(
  group = factor(rep(1:Nb, each=Nw)),
  x = Ax + rep(1:Nx, length.out = Nw*Nb)
) |>
  within(
    {
      xm  <- ave(x, group, FUN = mean) # within-group mean
      y <- Ay +
        Bb * xm +                    # contextual effect
        Bw * (x - xm) +              # within-group effect
        rnorm(Nb, sd=SDre)[group] +  # random level by group
        rnorm(Nb*Nw, sd=SDwithin)    # random error within groups
    }
  )

Here is a scatterplot of the data for a representative group of 10 (without loss of generality, the first 10) of 100 students, showing the 95% concentration ellipse for each cluster:9

library("lattice")
library("latticeExtra")
plot <- xyplot(y ~ x, data=Data[1:Nx, ], group=group,
               ylim=c(4, 16),
               par.settings=list(superpose.symbol=list(pch=1, cex=0.7))) +
    layer(panel.ellipse(..., center.cex=0))
plot # display graph
Hierarchical data set, showing the first 10 of 100 students.

The between-student effect of age is 0 but the within-student effect is 1. Due to the large variation in ages between students, the least-squares regression of yodeling skill on age (for the 500 observations among all 100 students) produces an estimated slope close to 0 (though with a small \(p\)-value), because the slope is heavily weighted toward the between-student effect:

summary(lm(y ~ x, data=Data))
#> 
#> Call:
#> lm(formula = y ~ x, data = Data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.771 -1.658 -0.089  1.552  7.624 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  9.05043    0.34719   26.07   <2e-16 ***
#> x            0.02091    0.00727    2.87   0.0042 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.35 on 498 degrees of freedom
#> Multiple R-squared:  0.0163, Adjusted R-squared:  0.0143 
#> F-statistic: 8.26 on 1 and 498 DF,  p-value: 0.00422

The initial mixed-effects model that we fit to the data is a simple random-intercepts model:

# random intercept only:
mod.0 <- lmer(y ~ 1 + (1 | group), Data)
summary(mod.0)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ 1 + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 2103.1
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.0351 -0.7264 -0.0117  0.7848  2.0438 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 2.90     1.70    
#>  Residual             2.71     1.65    
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.186    53.9

We will shortly consider three other, more complex, mixed models; because of data-management considerations, it is convenient to fit them now, but we defer discussion of these models:

# effect of x and random intercept:
mod.1 <- lmer(y ~ x + (1 | group), Data)

# effect of x, contextual (student) mean of x, and random intercept:
mod.2 <- lmer(y ~ x + xm + (1 | group), Data)
        # equivalent to y ~ I(x - xm) + xm + (1 | group)

# model generating the data (where Bb = 0)
mod.3 <- lmer(y ~ I(x - xm) + (1 | group), Data)

We proceed to obtain predictions from the random-intercept model (mod.0) and the other models (mod.1, mod.2, and mod.3) based on fixed effects alone, as would be used for cross-validation based on clusters (i.e., students), and for fixed and random effects—so-called best linear unbiased predictions or BLUPs—as would be used for cross-validation based on cases (i.e., occasions within students):

Data <- within(Data, {
  fit_mod0.fe <- predict(mod.0, re.form = ~ 0) # fixed effects only
  fit_mod0.re <- predict(mod.0) # fixed and random effects (BLUPs)
  fit_mod1.fe <- predict(mod.1, re.form = ~ 0)
  fit_mod1.re <- predict(mod.1)
  fit_mod2.fe <- predict(mod.2, re.form = ~ 0)
  fit_mod2.re <- predict(mod.2)
  fit_mod3.fe <- predict(mod.3, re.form = ~ 0)
  fit_mod3.re <- predict(mod.3)
})

We then prepare the data for plotting:

Data_long <- reshape(Data[1:Nx, ], direction = "long", sep = ".", 
              timevar = "effect", varying = grep("\\.", names(Data[1:Nx, ])))
Data_long$id <- 1:nrow(Data_long)
Data_long <- reshape(Data_long, direction = "long", sep = "_", 
              timevar = "modelcode",  varying = grep("_", names(Data_long)))
Data_long$model <- factor(
  c("~ 1", "~ 1 + x", "~ 1 + x + xm", "~ 1 + I(x - xm)")
  [match(Data_long$modelcode, c("mod0", "mod1", "mod2", "mod3"))]
)

Predictions based on the random-intercept model mod.0 for the first 10 students are shown in the following graph:

(plot +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" & effect == "fe"),
         groups=group, type="l", lwd=2) +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" &  effect == "re"),
         groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + (1 | group)",
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only","fixed and random")),
    lines=list(lty=c(1, 3))))
Predictions from the random intercept model.

The fixed-effect predictions for the various individuals are identical—the estimated fixed-effects intercept or estimated general mean of \(y\)—while the BLUPs are the sums of the fixed-effects intercept and the random intercepts, and are only slightly shrunken towards the general mean. Because in our artificial data there is no population relationship between age and skill, the fixed-effect-only predictions and the BLUPs are not very different.

Our next model, mod.1, includes a fixed intercept and fixed effect of x along with a random intercept:

summary(mod.1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1564.5
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.9016 -0.6350  0.0188  0.5541  2.8293 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 192.941  13.890  
#>  Residual               0.257   0.507  
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept) -33.9189     1.5645   -21.7
#> x             0.9653     0.0158    61.0
#> 
#> Correlation of Fixed Effects:
#>   (Intr)
#> x -0.460

Predictions from this model appear in the following graph:

(plot +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "fe"),
         groups=group, type="l", lwd=2) +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "re"),
         groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + x + (1 | group)",
  ylim=c(-15, 35),
  key=list(
    corner=c(0.95, 0.05),
    text=list(c("fixed effects only","fixed and random")),
    lines=list(lty=c(1, 3))))
Predictions from the model with random intercepts and \(x\) as a fixed-effect predictor.

The BLUPs fit the observed data very closely, but predictions based on the fixed effects alone, with a common intercept and slope for all clusters, are very poor—indeed, much worse than the fixed-effects-only predictions based on the simpler random-intercept model, mod.0. We therefore anticipate (and show later in this section) that case-based cross-validation will prefer mod.1 to mod.0, but that cluster-based cross-validation will prefer mod.0 to mod.1.

Our third model, mod.2, includes the contextual effect of \(x\)—that is, the cluster mean xm—along with \(x\) and the intercept in the fixed-effect part of the model, and a random intercept:

summary(mod.2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + xm + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1169.2
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.9847 -0.6375  0.0019  0.5568  2.7325 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.399    1.844   
#>  Residual             0.255    0.505   
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   9.4787     0.6171    15.4
#> x             0.9915     0.0160    62.1
#> xm           -0.9800     0.0206   -47.7
#> 
#> Correlation of Fixed Effects:
#>    (Intr) x     
#> x   0.000       
#> xm -0.600 -0.777

This model is equivalent to fitting y ~ I(x - xm) + xm + (1 | group), which is the model that generated the data once the coefficient of the contextual predictor xm is set to 0 (as it is in mod.3, discussed below).
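
We can verify the equivalence directly (a sketch, in which mod.2a is a name introduced only for this check); because the two parameterizations have the same fixed-effects column space, they yield the same fitted values up to numerical optimization error:

mod.2a <- lmer(y ~ I(x - xm) + xm + (1 | group), Data)
all.equal(fitted(mod.2), fitted(mod.2a), tolerance=1e-6) # expect TRUE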

Predictions from model mod.2 appear in the following graph:

(plot +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "fe"),
         groups=group, type="l", lwd=2) +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "re"),
         groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + x + xm + (1 | group)",
  ylim=c(4, 16),
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only","fixed and random")),
    lines=list(lty=c(1, 3))))
Predictions from the model with random intercepts, \(x\), and the group (student) mean of \(x\) as predictors.

Depending on the estimated variance parameters of the model, a mixed model like mod.2 will apply varying degrees of shrinkage to the random-intercept BLUPs that correspond to variation in the heights of the parallel fitted lines for the individual students. In our contrived data, mod.2 applies little shrinkage, allowing substantial variability in the heights of the fitted lines, which closely approach the observed values for each student. The fit of the mixed model mod.2 is consequently similar to that of a fixed-effects model with age and a categorical predictor for individual students (i.e., treating students as a factor; not shown here).

The mixed model mod.2 therefore fits individual observations well, and we anticipate a favorable assessment using individual-based cross-validation. In contrast, the large variability in the BLUPs results in larger residuals for predictions based on fixed effects alone, and so we expect that cluster-based cross-validation won’t show an advantage for model mod.2 compared to the smaller model mod.0, which includes only fixed and random intercepts.

Had the mixed model applied considerable shrinkage, then neither cluster-based nor case-based cross-validation would show much improvement over the random-intercept-only model. In our experience, the degree of shrinkage does not vary smoothly as parameters are changed but tends to be “all or nothing,” and near the tipping point, the behavior of estimates can be affected considerably by the choice of algorithm used to fit the model.

Finally, mod.3 directly estimates the model used to generate the data. As mentioned, it is a constrained version of mod.2, with the coefficient of xm set to 0, and with x expressed as a deviation from the cluster mean xm:

summary(mod.3)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ I(x - xm) + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1163.2
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.9770 -0.6320  0.0063  0.5603  2.7249 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.391    1.842   
#>  Residual             0.255    0.505   
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.185    53.9
#> I(x - xm)      0.992      0.016    62.1
#> 
#> Correlation of Fixed Effects:
#>           (Intr)
#> I(x - xm) 0.000

The predictions from mod.3 are therefore similar to those from mod.2:

(plot +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "fe"),
         groups=group, type="l", lwd=2) +
  xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "re"),
         groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + I(x - xm) + (1 | group)",
  ylim=c(4, 16),
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only","fixed and random")),
    lines=list(lty=c(1, 3))))
Predictions from the estimated model generating the data.

We next carry out case-based cross-validation, which, as we have explained, is based on both fixed and predicted random effects (i.e., BLUPs), and cluster-based cross-validation, which is based on fixed effects only. In order to reduce between-model random variability in comparisons of models, we apply cv() to the list of models created by the models() function (introduced previously), performing cross-validation with the same folds for each model:

modlist <- models("~ 1"=mod.0, "~ 1 + x"=mod.1, 
                  "~ 1 + x + xm"=mod.2, "~ 1 + I(x - xm)"=mod.3)
cvs_clusters <- cv(modlist, data=Data, cluster="group", k=10, seed=6449)
plot(cvs_clusters, main="Model Comparison, Cluster-Based CV")
10-fold cluster-based cross-validation comparing random intercept models with varying fixed effects. The error bars show the 95% confidence interval around the CV estimate of the MSE for each model.

cvs_cases <- cv(modlist, data=Data, seed=9693)
plot(cvs_cases, main="Model Comparison, Case-Based CV")
10-fold case-based cross-validation comparing random intercept models with varying fixed effects.

In summary, model mod.1, with \(x\) alone and without the contextual mean of \(x\), is assessed as fitting very poorly by cluster-based CV, but relatively much better by case-based CV. Model mod.2, which includes both \(x\) and its contextual mean, produces better results using both cluster-based and case-based CV. The data-generating model, mod.3, which includes the fixed effect of x - xm in place of separate terms in x and xm, isn’t distinguishable from model mod.2, which includes x and xm separately, even though mod.2 has an unnecessary parameter (recall that the population coefficient of xm is 0 when x is expressed as deviations from the contextual mean). These conclusions are consistent with our observations based on graphing predictions from the various models, and they illustrate the desirability of assessing mixed-effect models at different hierarchical levels.
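
To examine the comparison numerically rather than graphically, the CV criteria can be extracted from the "cvModList" objects in the same manner as for the polynomial regressions earlier (a sketch, assuming the same component names):

sapply(cvs_clusters, function(x) x[["CV crit"]]) # cluster-based CV criteria
sapply(cvs_cases, function(x) x[["CV crit"]])    # case-based CV criteria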

Example: Crossed random effects

Crossed random effects arise when the structure of the data isn’t strictly hierarchical. Nevertheless, crossed and nested random effects can be handled in much the same manner, by refitting the mixed-effects model to the data with a fold of clusters or cases removed and using the refitted model to predict the response in the removed fold.

We’ll illustrate with data on pig growth, introduced by Diggle, Liang, & Zeger (1994, Table 3.1). The data are in the Pigs data frame in the cv package:

head(Pigs, 9)
#>   id week weight
#> 1  1    1   24.0
#> 2  1    2   32.0
#> 3  1    3   39.0
#> 4  1    4   42.5
#> 5  1    5   48.0
#> 6  1    6   54.5
#> 7  1    7   61.0
#> 8  1    8   65.0
#> 9  1    9   72.0
head(xtabs(~ id + week, data=Pigs), 3)
#>    week
#> id  1 2 3 4 5 6 7 8 9
#>   1 1 1 1 1 1 1 1 1 1
#>   2 1 1 1 1 1 1 1 1 1
#>   3 1 1 1 1 1 1 1 1 1
tail(xtabs(~ id + week, data=Pigs), 3)
#>     week
#> id   1 2 3 4 5 6 7 8 9
#>   46 1 1 1 1 1 1 1 1 1
#>   47 1 1 1 1 1 1 1 1 1
#>   48 1 1 1 1 1 1 1 1 1

Each of 48 pigs is observed weekly over a period of 9 weeks, with the weight of the pig recorded in kg. The data are in “long” format, as is appropriate for use with the lmer() function in the lme4 package. The data are very regular, with no missing cases.

The following graph, showing the growth trajectories of the pigs, is similar to Figure 3.1 in Diggle et al. (1994); we add an overall least-squares line and a loess smooth, which are nearly indistinguishable:

plot(weight ~ week, data=Pigs, type="n")
for (i in unique(Pigs$id)){
  with(Pigs, lines(x=1:9, y=Pigs[id == i, "weight"],
                   col="gray"))
}
abline(lm(weight ~ week, data=Pigs), col="blue", lwd=2)
lines(with(Pigs, loess.smooth(week, weight, span=0.5)),
      col="magenta", lty=2, lwd=2)
Growth trajectories for 48 pigs, with overall least-squares line (solid blue) and loess line (broken magenta).

The individual “growth curves” and the overall trend are generally linear, with some tendency for variability of pig weight to increase over weeks (a feature of the data that we ignore in the mixed model that we fit to the data below).

The Stata mixed-effects models manual proposes a model with crossed random effects for the Pigs data (StataCorp LLC, 2023, p. 37):

[S]uppose that we wish to fit \[ \mathrm{weight}_{ij} = \beta_0 + \beta_1 \mathrm{week}_{ij} + u_i + v_j + \varepsilon_{ij} \] for the \(i = 1, \ldots, 9\) weeks and \(j = 1, \dots, 48\) pigs and \[ u_i \sim N(0, \sigma^2_u); v_j \sim N(0, \sigma^2_v ); \varepsilon_{ij} \sim N(0, \sigma^2_\varepsilon) \] all independently. That is, we assume an overall population-average growth curve \(\beta_0 + \beta_1 \mathrm{week}\) and a random pig-specific shift. In other words, the effect due to week, \(u_i\), is systematic to that week and common to all pigs. The rationale behind [this model] could be that, assuming that the pigs were measured contemporaneously, we might be concerned that week-specific random factors such as weather and feeding patterns had significant systematic effects on all pigs.

Although we might prefer an alternative model,10 we think that this is a reasonable specification.

The Stata manual fits the mixed model by maximum likelihood (rather than REML), and we duplicate the results reported there using lmer():

m.p <- lmer(weight ~ week + (1 | id) + (1 | week),
            data=Pigs, REML=FALSE, # i.e., ML
            control=lmerControl(optimizer="bobyqa"))
summary(m.p)
#> Linear mixed model fit by maximum likelihood  ['lmerMod']
#> Formula: weight ~ week + (1 | id) + (1 | week)
#>    Data: Pigs
#> Control: lmerControl(optimizer = "bobyqa")
#> 
#>      AIC      BIC   logLik deviance df.resid 
#>   2037.6   2058.0  -1013.8   2027.6      427 
#> 
#> Scaled residuals: 
#>    Min     1Q Median     3Q    Max 
#> -3.775 -0.542  0.005  0.476  3.982 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  id       (Intercept) 14.836   3.852   
#>  week     (Intercept)  0.085   0.292   
#>  Residual              4.297   2.073   
#> Number of obs: 432, groups:  id, 48; week, 9
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  19.3556     0.6334    30.6
#> week          6.2099     0.0539   115.1
#> 
#> Correlation of Fixed Effects:
#>      (Intr)
#> week -0.426

We opt for the non-default "bobyqa" optimizer because it provides more numerically stable results for subsequent cross-validation in this example.

We can then cross-validate the model by omitting folds composed of pigs, folds composed of weeks, or folds composed of pig-weeks (which in the Pigs data set correspond to individual cases); in each case, predictions for the omitted folds are based only on the fixed effects:

cv(m.p, clusterVariables="id")
#> n-Fold Cross Validation based on 48 {id} clusters
#> cross-validation criterion = 19.973
#> bias-adjusted cross-validation criterion = 19.965
#> 95% CI for bias-adjusted CV criterion = (17.125, 22.805)
#> full-sample criterion = 19.201

cv(m.p, clusterVariables="week")
#> boundary (singular) fit: see help('isSingular')
#> n-Fold Cross Validation based on 9 {week} clusters
#> cross-validation criterion = 19.312
#> bias-adjusted cross-validation criterion = 19.305
#> 95% CI for bias-adjusted CV criterion = (16.566, 22.044)
#> full-sample criterion = 19.201

cv(m.p, clusterVariables=c("id", "week"), k=10, seed=8469)
#> R RNG seed set to 8469
#> 10-Fold Cross Validation based on 432 {id, week} clusters
#> cross-validation criterion = 19.235
#> bias-adjusted cross-validation criterion = 19.233
#> 95% CI for bias-adjusted CV criterion = (16.493, 21.973)
#> full-sample criterion = 19.201

We can also cross-validate the individual cases taking account of the random effects (employing the same 10 folds):

cv(m.p, k=10, seed=8469)
#> R RNG seed set to 8469
#> 10-Fold Cross Validation
#> cross-validation criterion = 5.1583
#> bias-adjusted cross-validation criterion = 5.0729
#> 95% CI for bias-adjusted CV criterion = (4.123, 6.0229)
#> full-sample criterion = 3.796

Because these predictions are based on BLUPs, they are more accurate than the predictions based only on fixed effects.11 As well, the difference between the MSE computed for the model fit to the full data and the CV estimates of the MSE is greater here than for cluster-based predictions.
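
The contrast is the same as that between lme4 predictions that include the random effects (the BLUPs) and predictions based on the fixed effects alone. A minimal illustration for the model fit to the full data, which should essentially reproduce the two full-sample criteria reported above (about 3.8 and 19.2):

mse(Pigs$weight, predict(m.p))              # BLUP-based predictions (default re.form=NULL)
mse(Pigs$weight, predict(m.p, re.form=NA))  # predictions from the fixed effects only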

Replicating cross-validation

Assuming that the number of cases \(n\) is a multiple of the number of folds \(k\)—a slightly simplifying assumption—the number of possible partitions of cases into folds is \(\frac{n!}{[(n/k)!]^k}\), a number that grows very large very quickly. For example, for \(n = 10\) and \(k = 5\), so that the folds are each of size \(n/k = 2\), there are \(113,400\) possible partitions; for \(n=100\) and \(k=5\), where \(n/k = 20\), still a small problem, the number of possible partitions is truly astronomical, \(1.09\times 10^{66}\).
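
These counts are easy to verify on the log scale to avoid overflow; a quick sketch (the helper function is ours, not part of the cv package):

nPartitions <- function(n, k){
  # number of ways to divide n cases into k folds of size n/k
  exp(lfactorial(n) - k*lfactorial(n/k))
}
nPartitions(10, 5)   # 113400
nPartitions(100, 5)  # about 1.09e66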

Because the partition into folds that’s employed is selected randomly, the resulting CV criterion estimates are subject to sampling error. (An exception is LOO cross-validation, which is not at all random.) To get a sense of the magnitude of the sampling error, we can repeat the CV procedure with different randomly selected partitions into folds. All of the CV functions in the cv package are capable of repeated cross-validation, with the number of repetitions controlled by the reps argument, which defaults to 1.

Here, for example, is 10-fold CV for the Mroz logistic regression, repeated 5 times:

cv(m.mroz, criterion=BayesRule, seed=248, reps=5, 
   method="Woodbury")
#> R RNG seed set to 248
#> R RNG seed set to 68134
#> R RNG seed set to 767359
#> R RNG seed set to 556270
#> R RNG seed set to 882966
#> 
#> Replicate 1:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.31301
#> 95% CI for bias-adjusted CV criterion = (0.27967, 0.34635)
#> full-sample criterion = 0.30677 
#> 
#> Replicate 2:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31607
#> bias-adjusted cross-validation criterion = 0.3117
#> 95% CI for bias-adjusted CV criterion = (0.27847, 0.34493)
#> full-sample criterion = 0.30677 
#> 
#> Replicate 3:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31474
#> bias-adjusted cross-validation criterion = 0.30862
#> 95% CI for bias-adjusted CV criterion = (0.27543, 0.34181)
#> full-sample criterion = 0.30677 
#> 
#> Replicate 4:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31807
#> 95% CI for bias-adjusted CV criterion = (0.28462, 0.35152)
#> full-sample criterion = 0.30677 
#> 
#> Replicate 5:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677 
#> 
#> Average:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31983 (0.003887)
#> bias-adjusted cross-validation criterion = 0.31394 (0.0040093)
#> full-sample criterion = 0.30677

When reps > 1, the result returned by cv() is an object of class "cvList"—literally a list of "cv" objects. The results are reported for each repetition and then averaged across repetitions, with the standard deviations of the CV criterion and the bias-adjusted CV criterion given in parentheses. In this example, the standard deviations are small: There is little variation across repetitions, increasing our confidence in the reliability of the results.

Notice that the seed that’s set in the cv() command pertains to the first repetition and the seeds for the remaining repetitions are then selected pseudo-randomly.12 Setting the first seed, however, makes the entire process easily replicable, and the seed for each repetition is stored in the corresponding element of the "cvList" object (which isn’t, however, saved in the example).

It’s also possible to replicate CV when comparing competing models via the cv() method for "modList" objects. Recall our comparison of polynomial regressions of varying degree fit to the Auto data; we performed 10-fold CV for each of 10 models. Here, we replicate that process 5 times for each model and graph the results:

cv.auto.reps <- cv(models(m.1, m.2, m.3, m.4, m.5,
                        m.6, m.7, m.8, m.9, m.10),
                 data=Auto, seed=8004, reps=5)
plot(cv.auto.reps)
Replicated 10-fold CV as a function of polynomial degree, \(p\)

The graph shows both the average CV criterion and its range for each of the competing models.

Cross-validating model selection

A preliminary example

As Hastie, Tibshirani, & Friedman (2009, sec. 7.10.2: “The Wrong and Right Way to Do Cross-validation”) explain, if the whole data are used to select or fine-tune a statistical model, subsequent cross-validation of the model is intrinsically misleading, because the model is selected to fit the whole data, including the part of the data that remains when each fold is removed.

The following example is similar in spirit to one employed by Hastie et al. (2009). Suppose that we randomly generate \(n = 1000\) independent observations for a response variable \(y \sim N(\mu = 10, \sigma^2 = 1)\), and independently sample \(1000\) observations for \(p = 100\) “predictors,” \(x_1, \ldots, x_{100}\), each from \(x_j \sim N(0, 1)\). The response has nothing to do with the predictors and so the population linear-regression model \(y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_{100} x_{i,100} + \varepsilon_i\) has \(\alpha = 10\) and all \(\beta_j = 0\).

set.seed(24361) # for reproducibility
D <- data.frame(
  y = rnorm(1000, mean=10),
  X = matrix(rnorm(1000*100), 1000, 100)
)
head(D[, 1:6])
#>         y      X.1      X.2      X.3       X.4       X.5
#> 1 10.0316 -1.23886 -0.26487 -0.03539 -2.576973  0.811048
#> 2  9.6650  0.12287 -0.17744  0.37290 -0.935138  0.628673
#> 3 10.0232 -0.95052 -0.73487 -1.05978  0.882944  0.023918
#> 4  8.9910  1.13571  0.32411  0.11037  1.376303 -0.422114
#> 5  9.0712  1.49474  1.87538  0.10575  0.292140 -0.184568
#> 6 11.3493 -0.18453 -0.78037 -1.23804 -0.010949  0.691034

Least-squares provides accurate estimates of the regression constant \(\alpha = 10\) and the error variance \(\sigma^2 = 1\) for the “null model” including only the regression constant; moreover, the omnibus \(F\)-test of the correct null hypothesis that all of the \(\beta\)s are 0 for the “full model” with all 100 \(x\)s is associated with a large \(p\)-value:

m.full <- lm(y ~ ., data=D)
m.null <- lm(y ~ 1, data=D)
anova(m.null, m.full)
#> Analysis of Variance Table
#> 
#> Model 1: y ~ 1
#> Model 2: y ~ X.1 + X.2 + X.3 + X.4 + X.5 + X.6 + X.7 + X.8 + X.9 + X.10 + 
#>     X.11 + X.12 + X.13 + X.14 + X.15 + X.16 + X.17 + X.18 + X.19 + 
#>     X.20 + X.21 + X.22 + X.23 + X.24 + X.25 + X.26 + X.27 + X.28 + 
#>     X.29 + X.30 + X.31 + X.32 + X.33 + X.34 + X.35 + X.36 + X.37 + 
#>     X.38 + X.39 + X.40 + X.41 + X.42 + X.43 + X.44 + X.45 + X.46 + 
#>     X.47 + X.48 + X.49 + X.50 + X.51 + X.52 + X.53 + X.54 + X.55 + 
#>     X.56 + X.57 + X.58 + X.59 + X.60 + X.61 + X.62 + X.63 + X.64 + 
#>     X.65 + X.66 + X.67 + X.68 + X.69 + X.70 + X.71 + X.72 + X.73 + 
#>     X.74 + X.75 + X.76 + X.77 + X.78 + X.79 + X.80 + X.81 + X.82 + 
#>     X.83 + X.84 + X.85 + X.86 + X.87 + X.88 + X.89 + X.90 + X.91 + 
#>     X.92 + X.93 + X.94 + X.95 + X.96 + X.97 + X.98 + X.99 + X.100
#>   Res.Df RSS  Df Sum of Sq    F Pr(>F)
#> 1    999 974                          
#> 2    899 888 100      85.2 0.86   0.82

summary(m.null)
#> 
#> Call:
#> lm(formula = y ~ 1, data = D)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.458 -0.681  0.019  0.636  2.935 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.9370     0.0312     318   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.987 on 999 degrees of freedom

Next, using the stepAIC() function in the MASS package (Venables & Ripley, 2002), let us perform a forward stepwise regression to select a “best” model, starting with the null model, and using AIC as the model-selection criterion (see the help page for stepAIC() for details):13

library("MASS")  # for stepAIC()
m.select <- stepAIC(m.null,
                    direction="forward", trace=FALSE,
                    scope=list(lower=~1, upper=formula(m.full)))
summary(m.select)
#> 
#> Call:
#> lm(formula = y ~ X.99 + X.90 + X.87 + X.40 + X.65 + X.91 + X.53 + 
#>     X.45 + X.31 + X.56 + X.61 + X.60 + X.46 + X.35 + X.92, data = D)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.262 -0.645  0.024  0.641  3.118 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.9372     0.0310  320.80   <2e-16 ***
#> X.99         -0.0910     0.0308   -2.95   0.0032 ** 
#> X.90         -0.0820     0.0314   -2.62   0.0090 ** 
#> X.87         -0.0694     0.0311   -2.24   0.0256 *  
#> X.40         -0.0476     0.0308   -1.55   0.1221    
#> X.65         -0.0552     0.0315   -1.76   0.0795 .  
#> X.91          0.0524     0.0308    1.70   0.0894 .  
#> X.53         -0.0492     0.0305   -1.61   0.1067    
#> X.45          0.0554     0.0318    1.74   0.0818 .  
#> X.31          0.0452     0.0311    1.46   0.1457    
#> X.56          0.0543     0.0327    1.66   0.0972 .  
#> X.61         -0.0508     0.0317   -1.60   0.1091    
#> X.60         -0.0513     0.0319   -1.61   0.1083    
#> X.46          0.0516     0.0327    1.58   0.1153    
#> X.35          0.0470     0.0315    1.49   0.1358    
#> X.92          0.0443     0.0310    1.43   0.1533    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.973 on 984 degrees of freedom
#> Multiple R-squared:  0.0442, Adjusted R-squared:  0.0296 
#> F-statistic: 3.03 on 15 and 984 DF,  p-value: 8.34e-05
mse(D$y, fitted(m.select))
#> [1] 0.93063
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

The resulting model has 15 predictors, a very modest \(R^2 = .044\), but a small \(p\)-value for its omnibus \(F\)-test (which, of course, is entirely spurious because the same data were used to select and test the model). The MSE for the selected model is smaller than the true error variance \(\sigma^2 = 1\), as is the estimated error variance for the selected model, \(\widehat{\sigma}^2 = 0.973^2 = 0.947\).

If we cross-validate the selected model, we also obtain an optimistic estimate of its predictive power (although the confidence interval for the bias-adjusted MSE includes 1):

cv(m.select, seed=2529)
#> R RNG seed set to 2529
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 0.95937
#> bias-adjusted cross-validation criterion = 0.95785
#> 95% CI for bias-adjusted CV criterion = (0.87661, 1.0391)
#> full-sample criterion = 0.93063

The cvSelect() function in the cv package allows us to cross-validate the whole model-selection procedure. The first argument to cvSelect() is a model-selection function capable of refitting the model with a fold omitted and returning a CV criterion. The selectStepAIC() function, also in cv and based on stepAIC(), is suitable for use with cvSelect():

cv.select <- cvSelect(selectStepAIC, data=D, seed=3791,
                      model=m.null, direction="forward",
                      scope=list(lower=~1, 
                                 upper=formula(m.full)))
#> R RNG seed set to 3791
cv.select
#> 10-Fold Cross Validation
#> cross-validation criterion = 1.0687
#> bias-adjusted cross-validation criterion = 1.0612
#> 95% CI for bias-adjusted CV criterion = (0.97172, 1.1506)
#> full-sample criterion = 0.93063

The other arguments to cvSelect() here are data (the data set to which the models are fit), seed (for R’s random-number generator), and model, direction, and scope, which are passed through to selectStepAIC() and from there to stepAIC().

By default, cvSelect() performs 10-fold CV; in this example, it produces an estimate of MSE for the model-selection procedure that is even larger than the true error variance, \(\sigma^2 = 1\).

Also by default, when the number of folds is 10 or fewer, cvSelect() saves the coefficients of the selected models. In this example, the compareFolds() function reveals that the variables retained by the model-selection process in the several folds are quite different:

compareFolds(cv.select)
#>         (Intercept)    X.87    X.90    X.99    X.91    X.54    X.53    X.56
#> Fold 1       9.9187 -0.0615 -0.0994 -0.0942  0.0512  0.0516                
#> Fold 2       9.9451 -0.0745 -0.0899 -0.0614          0.0587          0.0673
#> Fold 3       9.9423 -0.0783 -0.0718 -0.0987  0.0601                  0.0512
#> Fold 4       9.9410 -0.0860 -0.0831 -0.0867  0.0570         -0.0508        
#> Fold 5       9.9421 -0.0659 -0.0849 -0.1004  0.0701  0.0511 -0.0487  0.0537
#> Fold 6       9.9633 -0.0733 -0.0874 -0.0960  0.0555  0.0629 -0.0478        
#> Fold 7       9.9279 -0.0618 -0.0960 -0.0838  0.0533         -0.0464        
#> Fold 8       9.9453 -0.0610 -0.0811 -0.0818          0.0497 -0.0612  0.0560
#> Fold 9       9.9173 -0.0663 -0.0894 -0.1100  0.0504  0.0524          0.0747
#> Fold 10      9.9449 -0.0745 -0.0906 -0.0891  0.0535  0.0482 -0.0583  0.0642
#>            X.40    X.45    X.65    X.68    X.92    X.15    X.26    X.46    X.60
#> Fold 1                  -0.0590                 -0.0456  0.0658  0.0608        
#> Fold 2                                   0.0607          0.0487                
#> Fold 3  -0.0496         -0.0664          0.0494                                
#> Fold 4  -0.0597  0.0579 -0.0531          0.0519 -0.0566                 -0.0519
#> Fold 5                           0.0587                          0.0527 -0.0603
#> Fold 6  -0.0596  0.0552          0.0474                                        
#> Fold 7           0.0572          0.0595                                        
#> Fold 8           0.0547 -0.0617  0.0453  0.0493 -0.0613  0.0591  0.0703 -0.0588
#> Fold 9  -0.0552  0.0573 -0.0635  0.0492         -0.0513  0.0484         -0.0507
#> Fold 10 -0.0558                          0.0529                  0.0710        
#>            X.61     X.8    X.28    X.29    X.31    X.35    X.70    X.89    X.17
#> Fold 1  -0.0490          0.0616 -0.0537                  0.0638                
#> Fold 2           0.0671                  0.0568                  0.0523        
#> Fold 3  -0.0631          0.0616                                                
#> Fold 4           0.0659         -0.0549          0.0527                  0.0527
#> Fold 5           0.0425                  0.0672  0.0613          0.0493        
#> Fold 6           0.0559         -0.0629  0.0498          0.0487                
#> Fold 7                                                           0.0611  0.0472
#> Fold 8  -0.0719                                          0.0586                
#> Fold 9                   0.0525                                                
#> Fold 10 -0.0580                                  0.0603                        
#>            X.25     X.4    X.64    X.81    X.97    X.11     X.2    X.33    X.47
#> Fold 1                                   0.0604          0.0575                
#> Fold 2   0.0478          0.0532  0.0518                                        
#> Fold 3                           0.0574                          0.0473        
#> Fold 4                   0.0628                                                
#> Fold 5   0.0518                                                                
#> Fold 6                                           0.0521                        
#> Fold 7           0.0550                                                        
#> Fold 8                                                                         
#> Fold 9                                   0.0556                          0.0447
#> Fold 10          0.0516                                                        
#>             X.6    X.72    X.73    X.77    X.79 X.88
#> Fold 1   0.0476                                     
#> Fold 2                   0.0514                     
#> Fold 3                                              
#> Fold 4                                  -0.0473     
#> Fold 5           0.0586                         0.07
#> Fold 6                          -0.0489             
#> Fold 7                                              
#> Fold 8                                              
#> Fold 9                                              
#> Fold 10

Mroz’s logistic regression revisited

For a contrasting example we apply model selection to Mroz’s logistic regression for married women’s labor-force participation. First, recall the logistic regression model that we fit to the Mroz data:

summary(m.mroz)
#> 
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234    
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762    
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  905.27  on 745  degrees of freedom
#> AIC: 921.3
#> 
#> Number of Fisher Scoring iterations: 4

Applying stepwise model selection to Mroz’s logistic regression, using BIC as the model-selection criterion (via the argument k=log(nrow(Mroz)) to stepAIC()), selects 5 of the 7 original predictors:

m.mroz.sel <- stepAIC(m.mroz, k=log(nrow(Mroz)),
                      trace=FALSE)
summary(m.mroz.sel)
#> 
#> Call:
#> glm(formula = lfp ~ k5 + age + wc + lwg + inc, family = binomial, 
#>     data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   2.9019     0.5429    5.35  9.0e-08 ***
#> k5           -1.4318     0.1932   -7.41  1.3e-13 ***
#> age          -0.0585     0.0114   -5.13  2.9e-07 ***
#> wcyes         0.8724     0.2064    4.23  2.4e-05 ***
#> lwg           0.6157     0.1501    4.10  4.1e-05 ***
#> inc          -0.0337     0.0078   -4.32  1.6e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  906.46  on 747  degrees of freedom
#> AIC: 918.5
#> 
#> Number of Fisher Scoring iterations: 3
BayesRule(Mroz$lfp == "yes",
          predict(m.mroz.sel, type="response"))
#> [1] 0.31873
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"

Bayes rule applied to the selected model misclassifies 32% of the cases in the Mroz data.
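
In effect, BayesRule() reports the proportion of misclassified cases when fitted probabilities above 0.5 are classified as “yes” (cf. the casewise loss y != round(yhat) shown above); a quick check by hand:

yhat.sel <- predict(m.mroz.sel, type="response")
mean((Mroz$lfp == "yes") != round(yhat.sel))  # reproduces 0.31873 above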

Cross-validating the selected model produces a similar, slightly larger, estimate of misclassification, about 33%:

cv(m.mroz.sel, criterion=BayesRule, seed=345266)
#> R RNG seed set to 345266
#> 10-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.33068
#> bias-adjusted cross-validation criterion = 0.33332
#> 95% CI for bias-adjusted CV criterion = (0.2997, 0.36695)
#> full-sample criterion = 0.31873

Is this estimate of predictive performance optimistic?

We proceed to cross-validate the whole model-selection procedure, producing more or less the same result:

m.mroz.sel.cv <- cvSelect(selectStepAIC, Mroz, 
                          seed=6681,
                          criterion=BayesRule,
                          model=m.mroz,
                          AIC=FALSE)
#> R RNG seed set to 6681
m.mroz.sel.cv
#> 10-Fold Cross Validation
#> cross-validation criterion = 0.33068
#> bias-adjusted cross-validation criterion = 0.33452
#> 95% CI for bias-adjusted CV criterion = (0.3009, 0.36815)
#> full-sample criterion = 0.31873

Setting AIC=FALSE in the call to cvSelect() uses the BIC rather than the AIC as the model-selection criterion. As it turns out, exactly the same predictors are selected when each of the 10 folds is omitted, and the coefficient estimates are very similar across folds, as we show using compareFolds():

compareFolds(m.mroz.sel.cv)
#>         (Intercept)     age     inc      k5     lwg wcyes
#> Fold 1       2.5014 -0.0454 -0.0388 -1.3613  0.5653  0.85
#> Fold 2       3.0789 -0.0659 -0.0306 -1.5335  0.6923  0.79
#> Fold 3       3.0141 -0.0595 -0.0305 -1.3994  0.5428  0.86
#> Fold 4       2.7251 -0.0543 -0.0354 -1.4474  0.6298  1.09
#> Fold 5       2.7617 -0.0566 -0.0320 -1.4752  0.6324  0.74
#> Fold 6       3.0234 -0.0621 -0.0348 -1.4537  0.6618  0.94
#> Fold 7       2.9615 -0.0600 -0.0351 -1.4127  0.5835  0.97
#> Fold 8       2.9598 -0.0603 -0.0329 -1.3865  0.6210  0.69
#> Fold 9       3.2481 -0.0650 -0.0381 -1.4138  0.6093  0.94
#> Fold 10      2.7724 -0.0569 -0.0295 -1.4503  0.6347  0.85

In this example, therefore, we appear to obtain a realistic estimate of model performance directly from the selected model, because there is little added uncertainty induced by model selection.

Cross-validating choice of transformations in regression

The cv package also provides a model-selection procedure for use with cvSelect(), selectTrans(), which chooses transformations of the predictors and the response in regression.

Some background: As Weisberg (2014, sec. 8.2) explains, there are technical advantages to having (numeric) predictors in linear regression analysis that are themselves linearly related. If the predictors aren’t linearly related, then the relationships between them can often be straightened by power transformations. Transformations can be selected after graphical examination of the data, or by analytic methods. Once the relationships between the predictors are linearized, it can be advantageous similarly to transform the response variable towards normality.

Selecting transformations analytically raises the possibility of automating the process, as would be required for cross-validation. One could, in principle, apply graphical methods to select transformations for each fold, but because a data analyst couldn’t forget the choices made for previous folds, the process wouldn’t really be applied independently to the folds.

To illustrate, we adapt an example appearing in several places in Fox & Weisberg (2019) (for example in Chapter 3 on transforming data), using data on the prestige and other characteristics of 102 Canadian occupations circa 1970. The data are in the Prestige data frame in the carData package:

data("Prestige", package="carData")
head(Prestige)
#>                     education income women prestige census type
#> gov.administrators      13.11  12351 11.16     68.8   1113 prof
#> general.managers        12.26  25879  4.02     69.1   1130 prof
#> accountants             12.77   9271 15.70     63.4   1171 prof
#> purchasing.officers     11.42   8865  9.11     56.8   1175 prof
#> chemists                14.62   8403 11.68     73.5   2111 prof
#> physicists              15.64  11030  5.13     77.6   2113 prof
summary(Prestige)
#>    education         income          women          prestige        census    
#>  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.8   Min.   :1113  
#>  1st Qu.: 8.45   1st Qu.: 4106   1st Qu.: 3.59   1st Qu.:35.2   1st Qu.:3120  
#>  Median :10.54   Median : 5930   Median :13.60   Median :43.6   Median :5135  
#>  Mean   :10.74   Mean   : 6798   Mean   :28.98   Mean   :46.8   Mean   :5402  
#>  3rd Qu.:12.65   3rd Qu.: 8187   3rd Qu.:52.20   3rd Qu.:59.3   3rd Qu.:8312  
#>  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.2   Max.   :9517  
#>    type   
#>  bc  :44  
#>  prof:31  
#>  wc  :23  
#>  NA's: 4  
#>           
#> 

The variables in the Prestige data set are the average education of occupational incumbents (education, in years); their average income (income, in dollars); the percentage of incumbents who are women (women); the Pineo-Porter prestige score for the occupation (prestige); the Canadian Census occupational code (census); and the type of occupation (type: "bc", blue collar; "prof", professional or managerial; "wc", white collar).

The object of a regression analysis for the Prestige data (and their original purpose) is to predict occupational prestige from the other variables in the data set.

A scatterplot matrix (using the scatterplotMatrix() function in the car package) of the numeric variables in the data reveals that the distributions of income and women are positively skewed, and that some of the relationships among the three predictors, and between the predictors and the response (i.e., prestige), are nonlinear:

library("car")
#> Loading required package: carData
scatterplotMatrix(~ prestige + income + education + women,
                  data=Prestige, smooth=list(spread=FALSE))
Scatterplot matrix for the Prestige data.

The powerTransform() function in the car package transforms variables towards multivariate normality by a generalization of Box and Cox’s maximum-likelihood-like approach (Box & Cox, 1964). Several “families” of power transformations can be used, including the original Box-Cox family, simple powers (and roots), and two adaptations of the Box-Cox family to data that may include negative values and zeros: the Box-Cox-with-negatives family and the Yeo-Johnson family; see Weisberg (2014, Chapter 8), and Fox & Weisberg (2019, Chapter 3) for details. Because women has some zero values, we use the Yeo-Johnson family:

trans <- powerTransform( cbind(income, education, women) ~ 1,
                         data=Prestige, family="yjPower")
summary(trans)
#> yjPower Transformations to Multinormality 
#>           Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> income       0.2678        0.33       0.1051       0.4304
#> education    0.5162        1.00      -0.2822       1.3145
#> women        0.1630        0.16       0.0112       0.3149
#> 
#>  Likelihood ratio test that all transformation parameters are equal to 0
#>                              LRT df    pval
#> LR test, lambda = (0 0 0) 15.739  3 0.00128

We thus have evidence of the desirability of transforming income (by the \(1/3\) power) and women (by the \(0.16\) power—which is close to the “0” power, i.e., the log transformation), but not education. Applying the “rounded” power transformations makes the predictors better-behaved:

P <- Prestige[, c("prestige", "income", "education", "women")]
(lambdas <- trans$roundlam)
#>    income education     women 
#>   0.33000   1.00000   0.16302
names(lambdas) <- c("income", "education", "women")
for (var in c("income", "education", "women")){
  P[, var] <- yjPower(P[, var], lambda=lambdas[var])
}
summary(P)
#>     prestige        income       education         women     
#>  Min.   :14.8   Min.   :22.2   Min.   : 6.38   Min.   :0.00  
#>  1st Qu.:35.2   1st Qu.:44.2   1st Qu.: 8.45   1st Qu.:1.73  
#>  Median :43.6   Median :50.3   Median :10.54   Median :3.36  
#>  Mean   :46.8   Mean   :50.8   Mean   :10.74   Mean   :3.50  
#>  3rd Qu.:59.3   3rd Qu.:56.2   3rd Qu.:12.65   3rd Qu.:5.59  
#>  Max.   :87.2   Max.   :83.6   Max.   :15.97   Max.   :6.83

scatterplotMatrix(~ prestige + income + education + women,
                  data=P, smooth=list(spread=FALSE))
Scatterplot matrix for the Prestige data with the predictors transformed.

Comparing the MSE for the regressions with the original and transformed predictors shows an advantage for the latter:

m.pres <- lm(prestige ~ income + education + women, data=Prestige)
m.pres.trans <- lm(prestige ~ income + education + women, data=P)
mse(Prestige$prestige, fitted(m.pres))
#> [1] 59.153
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"
mse(P$prestige, fitted(m.pres.trans))
#> [1] 50.6
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

Similarly, component+residual plots for the two regressions, produced by the crPlots() function in the car package, suggest that the partial relationship of prestige to income is more nearly linear in the transformed data, but the transformation of women fails to capture what appears to be a slight quadratic partial relationship; the partial relationship of prestige to education is close to linear in both regressions:

crPlots(m.pres)
Component+residual plots for the Prestige regression with the original predictors.

crPlots(m.pres.trans)
Component+residual plots for the Prestige regression with transformed predictors.

Having transformed the predictors towards multinormality, we now consider whether there’s evidence for transforming the response (using powerTransform() for Box and Cox’s original method), and we discover that there’s not:

summary(powerTransform(m.pres.trans))
#> bcPower Transformation to Normality 
#>    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> Y1    1.0194           1       0.6773       1.3615
#> 
#> Likelihood ratio test that transformation parameter is equal to 0
#>  (log transformation)
#>                          LRT df     pval
#> LR test, lambda = (0) 32.217  1 1.38e-08
#> 
#> Likelihood ratio test that no transformation is needed
#>                            LRT df  pval
#> LR test, lambda = (1) 0.012384  1 0.911

The selectTrans() function in the cv package automates the process of selecting predictor and response transformations. The function takes a data set and “working” model as arguments, along with the candidate predictors and response for transformation, and the transformation family to employ. If the predictors argument is missing then only the response is transformed, and if the response argument is missing, only the supplied predictors are transformed. The default family for transforming the predictors is "bcPower"—the original Box-Cox family—as is the default family.y for transforming the response; here we specify family="yjPower" because of the zeros in women. selectTrans() returns the result of applying a lack-of-fit criterion to the model after the selected transformations are applied, with the default criterion=mse:

selectTrans(data=Prestige, model=m.pres,
            predictors=c("income", "education", "women"),
            response="prestige", family="yjPower")
#> [1] 50.6
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

selectTrans() also takes an optional indices argument, making it suitable for doing computations on a subset of the data (i.e., a CV fold), and hence for use with cvSelect() (see ?selectTrans for details):

cvs <- cvSelect(selectTrans, data=Prestige, model=m.pres, seed=1463,
                predictors=c("income", "education", "women"),
                response="prestige",
                family="yjPower")
#> R RNG seed set to 1463
cvs
#> 10-Fold Cross Validation
#> cross-validation criterion = 54.487
#> bias-adjusted cross-validation criterion = 54.308
#> full-sample criterion = 50.6

cv(m.pres, seed=1463) # untransformed model with same folds
#> R RNG seed set to 1463
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 63.293
#> bias-adjusted cross-validation criterion = 63.073
#> full-sample criterion = 59.153

compareFolds(cvs)
#>         lam.education lam.income lam.women lambda
#> Fold 1          1.000      0.330     0.330      1
#> Fold 2          1.000      0.330     0.169      1
#> Fold 3          1.000      0.330     0.330      1
#> Fold 4          1.000      0.330     0.330      1
#> Fold 5          1.000      0.330     0.000      1
#> Fold 6          1.000      0.330     0.330      1
#> Fold 7          1.000      0.330     0.330      1
#> Fold 8          1.000      0.330     0.000      1
#> Fold 9          1.000      0.330     0.000      1
#> Fold 10         1.000      0.330     0.000      1

The results suggest that the predictive power of the transformed regression is reliably greater than that of the untransformed regression (though in both cases, the cross-validated MSE is considerably higher than the MSE computed for the whole data). Examining the selected transformations for each fold reveals that the predictor education and the response prestige are never transformed; that the \(1/3\) power is selected for income in all of the folds; and that the transformation selected for women varies narrowly across the folds between the \(0\)th power (i.e., log) and the \(1/3\) power.

Selecting both transformations and predictors14

As we mentioned, Hastie et al. (2009, sec. 7.10.2: “The Wrong and Right Way to Do Cross-validation”) explain that honest cross-validation has to take account of model specification and selection. Statistical modeling is at least partly a craft, and one could imagine applying that craft to successive partial data sets, each with a fold removed. The resulting procedure would be tedious, though possibly worth the effort, but it would also be difficult to realize in practice: After all, we can hardly erase our memory of statistical modeling choices between analyzing partial data sets.

Alternatively, if we’re able to automate the process of model selection, then we can more realistically apply CV mechanically. That’s what we did in the preceding two sections, first for predictor selection and then for selection of transformations in regression. In this section, we consider the case where we first select variable transformations and then proceed to select predictors. It’s insufficient to apply these steps sequentially, first, for example, using cvSelect() with selectTrans() and then with selectStepAIC(); rather we should apply the whole model-selection procedure with each fold omitted. The selectTransStepAIC() function, also supplied by the cv package, does exactly that.

To illustrate this process, we return to the Auto data set:

summary(Auto)
#>       mpg         cylinders     displacement   horsepower        weight    
#>  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
#>  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
#>  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
#>  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
#>  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
#>  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
#>                                                                            
#>   acceleration       year        origin                     name    
#>  Min.   : 8.0   Min.   :70   Min.   :1.00   amc matador       :  5  
#>  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   ford pinto        :  5  
#>  Median :15.5   Median :76   Median :1.00   toyota corolla    :  5  
#>  Mean   :15.5   Mean   :76   Mean   :1.58   amc gremlin       :  4  
#>  3rd Qu.:17.0   3rd Qu.:79   3rd Qu.:2.00   amc hornet        :  4  
#>  Max.   :24.8   Max.   :82   Max.   :3.00   chevrolet chevette:  4  
#>                                             (Other)           :365
xtabs(~ year, data=Auto)
#> year
#> 70 71 72 73 74 75 76 77 78 79 80 81 82 
#> 29 27 28 40 26 30 34 28 36 29 27 28 30
xtabs(~ origin, data=Auto)
#> origin
#>   1   2   3 
#> 245  68  79
xtabs(~ cylinders, data=Auto)
#> cylinders
#>   3   4   5   6   8 
#>   4 199   3  83 103

We previously used the Auto data in a preliminary example where we employed CV to inform the selection of the order of a polynomial regression of mpg on horsepower. Here, we consider more generally the problem of predicting mpg from the other variables in the Auto data. We begin with a bit of data management, and then examine the pairwise relationships among the numeric variables in the data set:

Auto$cylinders <- factor(Auto$cylinders,
                         labels=c("3.4", "3.4", "5.6", "5.6", "8"))
Auto$year <- as.factor(Auto$year)
Auto$origin <- factor(Auto$origin,
                      labels=c("America", "Europe", "Japan"))
rownames(Auto) <- make.names(Auto$name, unique=TRUE)
Auto$name <- NULL

scatterplotMatrix(~ mpg + displacement + horsepower + weight + acceleration, 
                  smooth=list(spread=FALSE), data=Auto)
Scatterplot matrix for the numeric variables in the Auto data

A comment before we proceed: origin is clearly categorical and so converting it to a factor is natural, but we could imagine treating cylinders and year as numeric predictors. There are, however, only 5 distinct values of cylinders (ranging from 3 to 8), cars with 3 or 5 cylinders are rare, and none of the cars has 7 cylinders. There are similarly only 13 distinct years between 1970 and 1982 in the data, and the relationship between mpg and year is difficult to characterize.15 It’s apparent that most of these variables are positively skewed and that many of the pairwise relationships among them are nonlinear.

We begin with a “working model” that specifies linear partial relationships of the response to the numeric predictors:

m.auto <- lm(mpg ~ ., data = Auto)
summary(m.auto)
#> 
#> Call:
#> lm(formula = mpg ~ ., data = Auto)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -9.006 -1.745 -0.092  1.525 10.950 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  37.034132   1.969393   18.80  < 2e-16 ***
#> cylinders5.6 -2.602941   0.655200   -3.97  8.5e-05 ***
#> cylinders8   -0.582458   1.171452   -0.50  0.61934    
#> displacement  0.017425   0.006734    2.59  0.01004 *  
#> horsepower   -0.041353   0.013379   -3.09  0.00215 ** 
#> weight       -0.005548   0.000632   -8.77  < 2e-16 ***
#> acceleration  0.061527   0.088313    0.70  0.48643    
#> year71        0.968058   0.837390    1.16  0.24841    
#> year72       -0.601435   0.825115   -0.73  0.46652    
#> year73       -0.687689   0.740272   -0.93  0.35351    
#> year74        1.375576   0.876500    1.57  0.11741    
#> year75        0.929929   0.859072    1.08  0.27974    
#> year76        1.559893   0.822505    1.90  0.05867 .  
#> year77        2.909416   0.841729    3.46  0.00061 ***
#> year78        3.175198   0.798940    3.97  8.5e-05 ***
#> year79        5.019299   0.845759    5.93  6.8e-09 ***
#> year80        9.099763   0.897293   10.14  < 2e-16 ***
#> year81        6.688660   0.885218    7.56  3.3e-13 ***
#> year82        8.071125   0.870668    9.27  < 2e-16 ***
#> originEurope  2.046664   0.517124    3.96  9.1e-05 ***
#> originJapan   2.144887   0.507717    4.22  3.0e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.92 on 371 degrees of freedom
#> Multiple R-squared:  0.867,  Adjusted R-squared:  0.86 
#> F-statistic:  121 on 20 and 371 DF,  p-value: <2e-16

Anova(m.auto)
#> Anova Table (Type II tests)
#> 
#> Response: mpg
#>              Sum Sq  Df F value  Pr(>F)    
#> cylinders       292   2   17.09 7.9e-08 ***
#> displacement     57   1    6.70  0.0100 *  
#> horsepower       82   1    9.55  0.0021 ** 
#> weight          658   1   76.98 < 2e-16 ***
#> acceleration      4   1    0.49  0.4864    
#> year           3017  12   29.40 < 2e-16 ***
#> origin          190   2   11.13 2.0e-05 ***
#> Residuals      3173 371                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

crPlots(m.auto)
Component+residual plots for the working model fit to the Auto data

The component+residual plots, created with the crPlots() function in the previously loaded car package, clearly reveal the inadequacy of the model.

We proceed to transform the numeric predictors towards multi-normality:

num.predictors <- c("displacement", "horsepower", "weight", "acceleration")
tr.x <- powerTransform(Auto[, num.predictors])
summary(tr.x)
#> bcPower Transformations to Multinormality 
#>              Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> displacement   -0.0509           0      -0.2082       0.1065
#> horsepower     -0.1249           0      -0.2693       0.0194
#> weight         -0.0870           0      -0.2948       0.1208
#> acceleration    0.3061           0      -0.0255       0.6376
#> 
#> Likelihood ratio test that transformation parameters are equal to 0
#>  (all log transformations)
#>                                LRT df  pval
#> LR test, lambda = (0 0 0 0) 4.8729  4 0.301
#> 
#> Likelihood ratio test that no transformations are needed
#>                                LRT df   pval
#> LR test, lambda = (1 1 1 1) 390.08  4 <2e-16

We then apply the (rounded) transformations—all, as it turns out, logs—to the data and re-estimate the model:

A <- Auto
powers <- tr.x$roundlam
for (pred in num.predictors){
  A[, pred] <- bcPower(A[, pred], lambda=powers[pred])
}
head(A)
#>                           mpg cylinders displacement horsepower weight
#> chevrolet.chevelle.malibu  18         8       5.7268     4.8675 8.1617
#> buick.skylark.320          15         8       5.8579     5.1059 8.2142
#> plymouth.satellite         18         8       5.7621     5.0106 8.1421
#> amc.rebel.sst              16         8       5.7170     5.0106 8.1412
#> ford.torino                17         8       5.7104     4.9416 8.1458
#> ford.galaxie.500           15         8       6.0615     5.2883 8.3759
#>                           acceleration year  origin
#> chevrolet.chevelle.malibu       2.4849   70 America
#> buick.skylark.320               2.4423   70 America
#> plymouth.satellite              2.3979   70 America
#> amc.rebel.sst                   2.4849   70 America
#> ford.torino                     2.3514   70 America
#> ford.galaxie.500                2.3026   70 America

m <- update(m.auto, data=A)

Finally, we perform Box-Cox regression to transform the response (also obtaining a log transformation):

summary(powerTransform(m))
#> bcPower Transformation to Normality 
#>    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> Y1    0.0024           0      -0.1607       0.1654
#> 
#> Likelihood ratio test that transformation parameter is equal to 0
#>  (log transformation)
#>                              LRT df  pval
#> LR test, lambda = (0) 0.00080154  1 0.977
#> 
#> Likelihood ratio test that no transformation is needed
#>                          LRT df   pval
#> LR test, lambda = (1) 124.13  1 <2e-16

m <- update(m, log(mpg) ~ .)
summary(m)
#> 
#> Call:
#> lm(formula = log(mpg) ~ cylinders + displacement + horsepower + 
#>     weight + acceleration + year + origin, data = A)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.3341 -0.0577  0.0041  0.0607  0.3808 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    8.8965     0.3582   24.84  < 2e-16 ***
#> cylinders5.6  -0.0636     0.0257   -2.47    0.014 *  
#> cylinders8    -0.0769     0.0390   -1.97    0.049 *  
#> displacement   0.0280     0.0515    0.54    0.587    
#> horsepower    -0.2901     0.0563   -5.15  4.2e-07 ***
#> weight        -0.5427     0.0819   -6.62  1.2e-10 ***
#> acceleration  -0.1421     0.0563   -2.52    0.012 *  
#> year71         0.0250     0.0289    0.87    0.387    
#> year72        -0.0168     0.0289   -0.58    0.562    
#> year73        -0.0426     0.0260   -1.64    0.103    
#> year74         0.0493     0.0304    1.62    0.106    
#> year75         0.0472     0.0296    1.59    0.112    
#> year76         0.0709     0.0284    2.49    0.013 *  
#> year77         0.1324     0.0293    4.52  8.2e-06 ***
#> year78         0.1447     0.0278    5.21  3.1e-07 ***
#> year79         0.2335     0.0292    7.99  1.7e-14 ***
#> year80         0.3238     0.0317   10.22  < 2e-16 ***
#> year81         0.2565     0.0309    8.29  2.1e-15 ***
#> year82         0.3076     0.0304   10.13  < 2e-16 ***
#> originEurope   0.0492     0.0195    2.52    0.012 *  
#> originJapan    0.0441     0.0195    2.26    0.024 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.104 on 371 degrees of freedom
#> Multiple R-squared:  0.911,  Adjusted R-squared:  0.906 
#> F-statistic:  189 on 20 and 371 DF,  p-value: <2e-16

Anova(m)
#> Anova Table (Type II tests)
#> 
#> Response: log(mpg)
#>              Sum Sq  Df F value  Pr(>F)    
#> cylinders      0.07   2    3.05   0.048 *  
#> displacement   0.00   1    0.30   0.587    
#> horsepower     0.29   1   26.54 4.2e-07 ***
#> weight         0.48   1   43.88 1.2e-10 ***
#> acceleration   0.07   1    6.37   0.012 *  
#> year           4.45  12   34.13 < 2e-16 ***
#> origin         0.08   2    3.71   0.025 *  
#> Residuals      4.03 371                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The transformed numeric variables are much better-behaved:

scatterplotMatrix(~ log(mpg) + displacement + horsepower + weight 
                  + acceleration, 
                  smooth=list(spread=FALSE), data=A)
Scatterplot matrix for the transformed numeric variables in the Auto data

And the partial relationships in the model fit to the transformed data are much more nearly linear:

crPlots(m)
Component+residual plots for the model fit to the transformed Auto data

Having transformed both the numeric predictors and the response, we proceed to use the stepAIC() function in the MASS package to perform predictor selection, employing the BIC model-selection criterion (by setting the k argument of stepAIC() to \(\log(n)\)):

m.step <- stepAIC(m, k=log(nrow(A)), trace=FALSE)
summary(m.step)
#> 
#> Call:
#> lm(formula = log(mpg) ~ horsepower + weight + acceleration + 
#>     year + origin, data = A)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.3523 -0.0568  0.0068  0.0674  0.3586 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.43459    0.26153   36.07  < 2e-16 ***
#> horsepower   -0.27625    0.05614   -4.92  1.3e-06 ***
#> weight       -0.60907    0.05600  -10.88  < 2e-16 ***
#> acceleration -0.13138    0.05319   -2.47  0.01397 *  
#> year71        0.02798    0.02894    0.97  0.33412    
#> year72       -0.00711    0.02845   -0.25  0.80274    
#> year73       -0.03953    0.02601   -1.52  0.12947    
#> year74        0.05275    0.02999    1.76  0.07936 .  
#> year75        0.05320    0.02928    1.82  0.07004 .  
#> year76        0.07432    0.02821    2.63  0.00878 ** 
#> year77        0.13793    0.02888    4.78  2.6e-06 ***
#> year78        0.14588    0.02753    5.30  2.0e-07 ***
#> year79        0.23604    0.02908    8.12  7.0e-15 ***
#> year80        0.33527    0.03115   10.76  < 2e-16 ***
#> year81        0.26287    0.03056    8.60  < 2e-16 ***
#> year82        0.32339    0.02961   10.92  < 2e-16 ***
#> originEurope  0.05582    0.01678    3.33  0.00097 ***
#> originJapan   0.04355    0.01748    2.49  0.01314 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.105 on 374 degrees of freedom
#> Multiple R-squared:  0.909,  Adjusted R-squared:  0.905 
#> F-statistic:  220 on 17 and 374 DF,  p-value: <2e-16

Anova(m.step)
#> Anova Table (Type II tests)
#> 
#> Response: log(mpg)
#>              Sum Sq  Df F value  Pr(>F)    
#> horsepower     0.27   1   24.21 1.3e-06 ***
#> weight         1.30   1  118.28 < 2e-16 ***
#> acceleration   0.07   1    6.10  0.0140 *  
#> year           4.76  12   36.05 < 2e-16 ***
#> origin         0.14   2    6.21  0.0022 ** 
#> Residuals      4.11 374                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The selected model includes three of the numeric predictors, horsepower, weight, and acceleration, along with the factors year and origin. We can calculate the MSE for this model, but we expect that the result will be optimistic because we used the whole data to help specify the model:

mse(Auto$mpg, exp(fitted(m.step)))
#> [1] 6.5121
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

This is considerably smaller than the MSE for the original working model:

mse(Auto$mpg, fitted(m.auto))
#> [1] 8.0932
#> attr(,"casewise loss")
#> [1] "(y - yhat)^2"

A perhaps subtle point is that we compute the MSE for the selected model on the original mpg response scale rather than the log scale, so as to make the selected model comparable to the working model. That’s slightly uncomfortable given the skewed distribution of mpg. An alternative is to use the median absolute error instead of the mean-squared error, employing the medAbsErr() function from the cv package:

medAbsErr(Auto$mpg, exp(fitted(m.step)))
#> [1] 1.3396
medAbsErr(Auto$mpg, fitted(m.auto))
#> [1] 1.6661
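
In effect, medAbsErr() is just the median of the absolute casewise prediction errors, so (assuming no internal rescaling) the first value above can be reproduced by hand:

median(abs(Auto$mpg - exp(fitted(m.step))))  # cf. 1.3396 above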

Now let’s use cvSelect() with selectTransStepAIC() to automate and cross-validate the whole model-specification process:

num.predictors
#> [1] "displacement" "horsepower"   "weight"       "acceleration"
cvs <- cvSelect(selectTransStepAIC, data=Auto, seed=76692, model=m.auto,
                predictors=num.predictors,
                response="mpg", AIC=FALSE, criterion=medAbsErr)
#> R RNG seed set to 76692
cvs
#> 10-Fold Cross Validation
#> cross-validation criterion = 1.4951
#> full-sample criterion = 1.3396

compareFolds(cvs)
#>         (Intercept) horsepower lam.acceleration lam.displacement lam.horsepower
#> Fold 1      9.71384   -0.17408          0.50000          0.00000        0.00000
#> Fold 2      9.21713   -0.31480          0.00000          0.00000        0.00000
#> Fold 3      9.61824   -0.19248          0.00000          0.00000        0.00000
#> Fold 4      8.69910   -0.25523          0.50000          0.00000        0.00000
#> Fold 5      9.14403   -0.14934          0.00000          0.00000        0.00000
#> Fold 6      9.63481   -0.16739          0.50000          0.00000        0.00000
#> Fold 7      9.98933   -0.36847          0.00000          0.00000       -0.15447
#> Fold 8      9.06301   -0.29721          0.00000          0.00000        0.00000
#> Fold 9      8.88315   -0.22684          0.00000          0.00000        0.00000
#> Fold 10     9.61727   -0.17086          0.00000          0.00000        0.00000
#>         lam.weight   lambda   weight   year71   year72   year73   year74
#> Fold 1     0.00000  0.00000 -0.74636  0.03764 -0.00327 -0.02477  0.05606
#> Fold 2     0.00000  0.00000 -0.47728  0.02173 -0.01488 -0.03770  0.04312
#> Fold 3     0.00000  0.00000 -0.72085  0.01128 -0.02569 -0.03872  0.05187
#> Fold 4     0.00000  0.00000 -0.53846  0.02153 -0.02922 -0.05181  0.04136
#> Fold 5     0.00000  0.00000 -0.69081  0.02531 -0.01062 -0.04625  0.05039
#> Fold 6     0.00000  0.00000 -0.74049  0.02456  0.00759 -0.03412  0.06266
#> Fold 7     0.00000  0.00000 -0.72843  0.02532 -0.01271 -0.04144  0.04568
#> Fold 8     0.00000  0.00000 -0.46392  0.02702 -0.02041 -0.05605  0.04437
#> Fold 9     0.00000  0.00000 -0.47136  0.00860 -0.03620 -0.04835  0.01906
#> Fold 10    0.00000  0.00000 -0.73550  0.02937 -0.00899 -0.03814  0.05408
#>           year75   year76   year77   year78   year79   year80   year81   year82
#> Fold 1   0.07080  0.07250  0.14420  0.14281  0.23266  0.35127  0.25635  0.30546
#> Fold 2   0.04031  0.06718  0.13094  0.14917  0.21871  0.33192  0.26196  0.30943
#> Fold 3   0.03837  0.06399  0.11593  0.12601  0.20499  0.32821  0.24478  0.29204
#> Fold 4   0.04072  0.05537  0.12292  0.14083  0.22878  0.32947  0.25140  0.27248
#> Fold 5   0.05596  0.07044  0.13356  0.14724  0.24675  0.33331  0.26938  0.32594
#> Fold 6   0.06940  0.07769  0.14211  0.14647  0.23532  0.34761  0.26737  0.33062
#> Fold 7   0.03614  0.07385  0.12976  0.14040  0.23976  0.33998  0.27652  0.30659
#> Fold 8   0.06573  0.08135  0.13158  0.13987  0.23011  0.32880  0.25886  0.30538
#> Fold 9   0.03018  0.05846  0.10536  0.11722  0.20665  0.31533  0.23352  0.29375
#> Fold 10  0.04881  0.07862  0.14101  0.14313  0.23258  0.35649  0.26214  0.32421
#>         acceleration displacement cylinders5.6 cylinders8 originEurope
#> Fold 1                                                                
#> Fold 2      -0.18909     -0.09197                                     
#> Fold 3                                                                
#> Fold 4      -0.03484                  -0.09080   -0.10909             
#> Fold 5                                                         0.06261
#> Fold 6                                                                
#> Fold 7                                                                
#> Fold 8      -0.17676     -0.10542                                     
#> Fold 9      -0.14514     -0.13452                                     
#> Fold 10                                                               
#>         originJapan
#> Fold 1             
#> Fold 2             
#> Fold 3             
#> Fold 4             
#> Fold 5         0.04
#> Fold 6             
#> Fold 7             
#> Fold 8             
#> Fold 9             
#> Fold 10

Here, as for selectTrans(), the predictors and response arguments specify candidate variables for transformation, and AIC=FALSE uses the BIC for model selection. The starting model, m.auto, is the working model fit to the Auto data. The CV criterion isn’t bias-adjusted because median absolute error isn’t a mean of casewise error components.

Some noteworthy points: the log transformation is selected for the response and for nearly all of the numeric predictors in every fold; the predictors retained by the BIC-based selection vary somewhat from fold to fold; and, as expected, the cross-validated median absolute error (about 1.50) is larger than the median absolute error for the model selected using the whole data (about 1.34).

Parallel computations

The CV functions in the cv package are all capable of performing parallel computations, by setting the ncores argument (specifying the number of computer cores to be used) to a number > 1; the default is ncores=1. Parallel computation can be advantageous for large problems, reducing the execution time of the program.

To illustrate, let’s time model selection in Mroz’s logistic regression, repeating the computation as performed previously and then doing it in parallel using 2 cores:

system.time(m.mroz.sel.cv <- cvSelect(selectStepAIC, Mroz,
                          seed=6681,
                          criterion=BayesRule,
                          model=m.mroz,
                          AIC=FALSE))
#> R RNG seed set to 6681
#>    user  system elapsed 
#>   0.313   0.000   0.315

system.time(m.mroz.sel.cv.p <- cvSelect(selectStepAIC, Mroz,
                          seed=6681,
                          criterion=BayesRule,
                          model=m.mroz,
                          AIC=FALSE,
                          ncores=2))
#> R RNG seed set to 6681
#>    user  system elapsed 
#>   0.031   0.010   0.910
all.equal(m.mroz.sel.cv, m.mroz.sel.cv.p)
#> [1] TRUE

In this small problem, the parallel computation is actually slower, because there is an overhead cost to parallelization, but we can see that it produces the same result as before.

Computational notes

Efficient computations for linear and generalized linear models

The most straightforward way to implement cross-validation in R for statistical modeling functions that are written in the canonical manner is to use update() to refit the model with each fold removed. This is the approach taken in the default method for cv(), and it is appropriate if the cases are independently sampled. Refitting the model in this manner for each fold is generally feasible when the number of folds is modest, but it can be prohibitively costly for leave-one-out cross-validation when the number of cases is large.
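
To fix ideas, here is a minimal sketch of that straightforward approach for a linear model with independently sampled cases and no missing data; kfoldMSE() is our own illustrative helper, not the cv() implementation:

kfoldMSE <- function(model, data, k=10){
  n <- nrow(data)
  folds <- sample(rep(1:k, length.out=n))  # randomly assign cases to folds
  y <- model.response(model.frame(model))
  yhat <- numeric(n)
  for (j in 1:k){
    m.minus.j <- update(model, data=data[folds != j, ])           # refit without fold j
    yhat[folds == j] <- predict(m.minus.j, newdata=data[folds == j, ])
  }
  mean((y - yhat)^2)  # CV estimate of the MSE
}
# e.g., kfoldMSE(m.pres, Prestige) should be comparable to cv(m.pres, k=10)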

The "lm" and "glm" methods for cv() take advantage of computational efficiencies by avoiding refitting the model with each fold removed. Consider, in particular, the weighted linear model \(\mathbf{y}_{n \times 1} = \mathbf{X}_{n \times p}\boldsymbol{\beta}_{p \times 1} + \boldsymbol{\varepsilon}_{n \times 1}\), where \(\boldsymbol{\varepsilon} \sim \mathbf{N}_n \left(\mathbf{0}, \sigma^2 \mathbf{W}^{-1}_{n \times n}\right)\). Here, \(\mathbf{y}\) is the response vector, \(\mathbf{X}\) the model matrix, and \(\boldsymbol{\varepsilon}\) the error vector, each for \(n\) cases, and \(\boldsymbol{\beta}\) is the vector of \(p\) population regression coefficients. The errors are assumed to be multivariately normally distributed with 0 means and covariance matrix \(\sigma^2 \mathbf{W}^{-1}\), where \(\mathbf{W} = \mathrm{diag}(w_i)\) is a diagonal matrix of inverse-variance weights. For the linear model with constant error variance, the weight matrix is taken to be \(\mathbf{W} = \mathbf{I}_n\), the order-\(n\) identity matrix.

The weighted-least-squares (WLS) estimator of \(\boldsymbol{\beta}\) is (see, e.g., Fox, 2016, sec. 12.2.2)16 \[ \mathbf{b}_{\mathrm{WLS}} = \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y} \]

Fitted values are then \(\widehat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{\mathrm{WLS}}\).
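
As a quick numerical check of these formulas, the following sketch (simulated data; not code from the cv package) compares the explicit WLS computation with the coefficients returned by lm():

set.seed(123)
n <- 50
X <- cbind(1, x1=rnorm(n), x2=rnorm(n))            # model matrix with an intercept
w <- runif(n, 0.5, 2)                              # inverse-variance weights
W <- diag(w)
y <- as.vector(X %*% c(1, 2, -1)) + rnorm(n, sd=1/sqrt(w))

b.wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y) # b_WLS from the formula above
all.equal(as.vector(b.wls),
          as.vector(coef(lm(y ~ X - 1, weights=w)))) # should be TRUE (up to rounding)
yhat <- as.vector(X %*% b.wls)                     # fitted values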

The LOO fitted value for the \(i\)th case can be efficiently computed as \(\widehat{y}_{-i} = y_i - e_i/(1 - h_i)\), where \(e_i = y_i - \widehat{y}_i\) is the residual for the \(i\)th case and \(h_i = w_i \mathbf{x}^T_i \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{x}_i\) is the corresponding “hatvalue.” Here, \(\mathbf{x}^T_i\) is the \(i\)th row of \(\mathbf{X}\), and \(\mathbf{x}_i\) is the \(i\)th row written as a column vector. This approach breaks down when one or more hatvalues are equal to 1, in which case the formula for \(\widehat{y}_{-i}\) requires division by 0.
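
The hatvalue shortcut is easily verified. Continuing in sketch form (again, not the package's internal code), the following computes all of the LOO fitted values for a quadratic regression of mpg on horsepower in one pass; it should reproduce the brute-force loo.mse() result from the earlier sketch:

m.quad <- lm(mpg ~ poly(horsepower, 2), data=Auto)  # a hypothetical example model
yhat.loo <- Auto$mpg - residuals(m.quad)/(1 - hatvalues(m.quad))
mean((Auto$mpg - yhat.loo)^2)  # LOO MSE; compare with loo.mse(m.quad, Auto)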

To compute cross-validated fitted values when the folds contain more than one case, we make use of the Woodbury matrix identity ("Woodbury matrix identity", 2023), \[ \left(\mathbf{A}_{m \times m} + \mathbf{U}_{m \times k} \mathbf{C}_{k \times k} \mathbf{V}_{k \times m} \right)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U} \left(\mathbf{C}^{-1} + \mathbf{VA}^{-1}\mathbf{U} \right)^{-1} \mathbf{VA}^{-1} \] where \(\mathbf{A}\) is a nonsingular matrix of order \(m\). We apply this result by letting \[\begin{align*} \mathbf{A} &= \mathbf{X}^T \mathbf{W} \mathbf{X} \\ \mathbf{U} &= \mathbf{X}_\mathbf{j}^T \\ \mathbf{V} &= - \mathbf{X}_\mathbf{j} \\ \mathbf{C} &= \mathbf{W}_\mathbf{j} \\ \end{align*}\] where the subscript \(\mathbf{j} = (i_{j1}, \ldots, i_{jm})^T\) represents the vector of indices for the cases in the \(j\)th fold, \(j = 1, \ldots, k\). The negative sign in \(\mathbf{V} = - \mathbf{X}_\mathbf{j}\) reflects the removal, rather than addition, of the cases in \(\mathbf{j}\).
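
Continuing with the simulated X, W, and w from the WLS sketch above (and again purely for illustration), we can confirm numerically that this Woodbury "downdate" reproduces the inverse of \(\mathbf{X}^T \mathbf{W} \mathbf{X}\) computed directly from the retained cases:

A  <- t(X) %*% W %*% X                  # X'WX for all cases
Ai <- solve(A)                          # its inverse
jj <- 1:5                               # indices of a hypothetical fold
Xj <- X[jj, , drop=FALSE]
Wj <- diag(w[jj])

# (A + U C V)^{-1} with U = t(Xj), C = Wj, and V = -Xj:
Ai.down <- Ai -
  Ai %*% t(Xj) %*% solve(solve(Wj) - Xj %*% Ai %*% t(Xj)) %*% (-Xj) %*% Ai

all.equal(Ai.down,
          solve(t(X[-jj, ]) %*% diag(w[-jj]) %*% X[-jj, ]))  # should be TRUE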

Applying the Woodbury identity isn’t quite as fast as using the hatvalues, but it is generally much faster than refitting the model. A disadvantage of the Woodbury identity, however, is that it entails explicit matrix inversion and thus may be numerically unstable. The inverse of \(\mathbf{A} = \mathbf{X}^T \mathbf{W} \mathbf{X}\) is available directly in the "lm" object, but the second term on the right-hand side of the Woodbury identity requires a matrix inversion with each fold deleted. (In contrast, the inverse of each \(\mathbf{C} = \mathbf{W}_\mathbf{j}\) is straightforward because \(\mathbf{W}\) is diagonal.)

The Woodbury identity also requires that the model matrix be of full rank. We impose that restriction in our code by removing redundant regressors from the model matrix for all of the cases, but that doesn’t preclude rank deficiency from surfacing when a fold is removed. Rank deficiency of \(\mathbf{X}\) doesn’t itself invalidate cross-validation, however, because all we need are fitted values under the estimated model, and the fitted values are unique even when the coefficients are not.

glm() computes the maximum-likelihood estimates for a generalized linear model by iterated weighted least squares (see, e.g., Fox & Weisberg, 2019, sec. 6.12). The last iteration is therefore just a WLS fit of the “working response” on the model matrix using “working weights.” Both the working weights and the working response at convergence are available from the information in the object returned by glm().
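
The following sketch (simulated data; not the cv package's code) illustrates the point: one WLS step applied to the working response and working weights saved in a "glm" object essentially reproduces the maximum-likelihood estimates, with any discrepancy at the level of the IWLS convergence tolerance:

set.seed(321)
x <- rnorm(500)
yb <- rbinom(500, 1, plogis(1 + 2*x))            # simulated binary response
fit <- glm(yb ~ x, family=binomial)

X.glm <- model.matrix(fit)
eta <- fit$linear.predictors
mu  <- fitted(fit)
z  <- eta + (fit$y - mu)/fit$family$mu.eta(eta)  # working response at convergence
ww <- fit$weights                                # working weights at convergence

b.iwls <- coef(lm.wfit(X.glm, z, ww))            # one more WLS step
max(abs(b.iwls - coef(fit)))                     # near zero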

We then treat re-estimation of the model with a case or cases deleted as a WLS problem, using the hatvalues or the Woodbury matrix identity. The resulting fitted values for the deleted fold aren’t exact—that is, except for the Gaussian family, the result isn’t identical to what we would obtain by literally refitting the model—but in our (limited) experience, the approximation is very good, especially for LOO CV, which is when we would be most tempted to use it. Nevertheless, because these results are approximate, the default for the "glm" cv() method is to perform the exact computation, which entails refitting the model with each fold omitted.

Computation of the bias-corrected CV criterion and confidence intervals

Let \(\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}})\) represent a cross-validation cost criterion, such as mean-squared error, computed for all of the \(n\) values of the response \(\mathbf{y}\) based on fitted values \(\widehat{\mathbf{y}}\) from the model fit to all of the data. We require that \(\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}})\) is the mean of casewise components, that is, \(\mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) = \frac{1}{n}\sum_{i=1}^n\mathrm{cv}(y_i, \widehat{y}_i)\).17 For example, \(\mathrm{MSE}(\mathbf{y}, \widehat{\mathbf{y}}) = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\).

We divide the \(n\) cases into \(k\) folds of \(n_j \approx n/k\) cases each, where \(n = \sum n_j\). As above, let \(\mathbf{j}\) denote the indices of the cases in the \(j\)th fold.

Now define \(\mathrm{CV}_j = \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}^{(j)})\). The superscript \((j)\) on \(\widehat{\mathbf{y}}^{(j)}\) represents fitted values computed for all of the cases from the model with fold \(j\) omitted. Let \(\widehat{\mathbf{y}}^{(-i)}\) represent the vector of fitted values for all \(n\) cases where the fitted value for the \(i\)th case is computed from the model fit with the fold including the \(i\)th case omitted (i.e., fold \(j\) for which \(i \in \mathbf{j}\)).

Then the cross-validation criterion is just \(\mathrm{CV} = \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}^{(-i)})\). Following Davison & Hinkley (1997, pp. 293–295), the bias-adjusted cross-validation criterion is \[ \mathrm{CV}_{\mathrm{adj}} = \mathrm{CV} + \mathrm{CV}(\mathbf{y}, \widehat{\mathbf{y}}) - \frac{1}{n} \sum_{j=1}^{k} n_j \mathrm{CV}_j \]
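
The following sketch implements these formulas directly for 10-fold CV of the MSE in a linear model, refitting with update(). It is intended only to make the notation concrete (the cv package's internal computations differ), and the example model, seed, and object names are our own illustrative choices:

m.quad <- lm(mpg ~ poly(horsepower, 2), data=Auto)  # as in the earlier sketch
y <- Auto$mpg
n <- nrow(Auto)
k <- 10
set.seed(8421)                                # an arbitrary seed, for reproducibility
folds <- sample(rep(1:k, length.out=n))       # random assignment of cases to folds

yhat.cv <- numeric(n)                         # will hold the elements of y-hat^(-i)
cv.j <- n.j <- numeric(k)
for (j in 1:k) {
  in.j <- folds == j
  m.j <- update(m.quad, data=Auto[!in.j, ])   # refit with fold j omitted
  yhat.j <- predict(m.j, newdata=Auto)        # y-hat^(j): fitted values for all cases
  yhat.cv[in.j] <- yhat.j[in.j]
  cv.j[j] <- mean((y - yhat.j)^2)             # CV_j
  n.j[j] <- sum(in.j)                         # n_j
}

CV <- mean((y - yhat.cv)^2)                   # CV
CV.full <- mean((y - fitted(m.quad))^2)       # CV(y, y-hat) for the full-data fit
CV.adj <- CV + CV.full - sum(n.j*cv.j)/n      # bias-adjusted CV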

We compute the standard error of CV as \[ \mathrm{SE}(\mathrm{CV}) = \frac{1}{\sqrt n} \sqrt{ \frac{\sum_{i=1}^n \left[ \mathrm{cv}(y_i, \widehat{y}_i^{(-i)} ) - \mathrm{CV} \right]^2 }{n - 1} } \] that is, as the standard deviation of the casewise components of CV divided by the square-root of the number of cases.

We then use \(\mathrm{SE}(\mathrm{CV})\) to construct a \(100 \times (1 - \alpha)\)% confidence interval around the adjusted CV estimate of error: \[ \left[ \mathrm{CV}_{\mathrm{adj}} - z_{1 - \alpha/2}\mathrm{SE}(\mathrm{CV}), \mathrm{CV}_{\mathrm{adj}} + z_{1 - \alpha/2}\mathrm{SE}(\mathrm{CV}) \right] \] where \(z_{1 - \alpha/2}\) is the \(1 - \alpha/2\) quantile of the standard-normal distribution (e.g., \(z \approx 1.96\) for a 95% confidence interval, for which \(1 - \alpha/2 = .975\)).
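
Continuing the sketch above, the casewise components, the standard error of CV, and the corresponding 95% confidence interval around the adjusted criterion are then:

cv.i <- (y - yhat.cv)^2                  # casewise components cv(y_i, y-hat_i^(-i))
SE.cv <- sd(cv.i)/sqrt(n)                # SE(CV)
c(lower = CV.adj - qnorm(0.975)*SE.cv,   # 95% confidence interval
  upper = CV.adj + qnorm(0.975)*SE.cv)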

S. Bates et al. (2023) show that the coverage of this confidence interval is poor for small samples, and they suggest a much more computationally intensive procedure, called nested cross-validation, to compute better estimates of error and confidence intervals with better coverage for small samples. We may implement Bates et al.’s approach in a later release of the cv package. At present we use the confidence interval above for sufficiently large \(n\), which, based on Bates et al.’s results, we take by default to be \(n \ge 400\).

References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. Retrieved from https://doi.org/10.1214/09-SS054
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Bates, S., Hastie, T., & Tibshirani, R. (2023). Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association, in press. Retrieved from https://doi.org/10.1080/01621459.2023.2197686
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.
Canty, A., & Ripley, B. D. (2022). Boot: Bootstrap R (S-plus) functions.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge: Cambridge University Press.
Diggle, P. J., Liang, K.-Y., & Zeger, S. L. (1994). Analysis of longitudinal data. Oxford: Oxford University Press.
Fox, J. (2016). Applied regression analysis and generalized linear models (Second edition). Thousand Oaks CA: Sage.
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (Third edition). Thousand Oaks CA: Sage.
Harrell, F., Jr. (2015). Regression modeling strategies (Second edition). New York: Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (Second edition). New York: Springer. Retrieved from https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (Second edition). New York: Springer.
Mersmann, O. (2023). Microbenchmark: Accurate timing functions. Retrieved from https://CRAN.R-project.org/package=microbenchmark
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York: Springer.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Second edition). Thousand Oaks CA: Sage.
Sarkar, D. (2008). Lattice: Multivariate data visualization with R. New York: Springer. Retrieved from http://lmdvr.r-forge.r-project.org
Sarkar, D., & Andrews, F. (2022). latticeExtra: Extra graphical utilities based on lattice. Retrieved from https://CRAN.R-project.org/package=latticeExtra
StataCorp LLC. (2023). Stata multilevel mixed-effects reference manual, release 18. College Station TX: Stata Press. Retrieved from https://www.stata.com/manuals/me.pdf
Vehtari, A. (2023). Cross-validation FAQ. Retrieved October 15, 2023, from https://users.aalto.fi/~ave/CV-FAQ.html
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (Fourth edition). New York: Springer.
Weisberg, S. (2014). Applied linear regression (Fourth edition). Hoboken NJ: Wiley.
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
"Woodbury matrix identity". (2023). Woodbury matrix identity—Wikipedia, the free encyclopedia. Retrieved from https://en.wikipedia.org/wiki/Woodbury_matrix_identity

  1. James et al. (2021) use the cv.glm() function in the boot package (Canty & Ripley, 2022; Davison & Hinkley, 1997). Despite its name, cv.glm() is an independent function and not a method of a cv() generic function.↩︎

  2. Although it serves to illustrate the use of CV, a polynomial is probably not the best choice here. Consider, for example, the scatterplot for log-transformed mpg and horsepower, produced by plot(mpg ~ horsepower, data=Auto, log="xy") (execution of which is left to the reader).↩︎

  3. Out of impatience, we asked microbenchmark() to execute each command only 10 times rather than the default 100. With the exception of the last column, the output is self-explanatory: the last column shows which methods have average timings that are statistically distinguishable. Because of the small number of repetitions (i.e., 10), the "hatvalues" and "Woodbury" methods aren’t distinguishable, but the difference between these methods persists when we perform more repetitions—we invite the reader to redo this computation with the default times=100 repetitions.↩︎

  4. BayesRule() does some error checking; BayesRule2() is similar, but omits the error checking, and so can be faster for large problems.↩︎

  5. There are, however, more complex situations that give rise to so-called crossed (rather than nested) random effects. For example, consider students within classes within schools. In primary schools, students typically are in a single class, and so classes are nested within schools. In secondary schools, however, students typically take several classes and students who are together in a particular class may not be together in other classes; consequently, random effects based on classes within schools are crossed. The lmer() function in the lme4 package is capable of modeling both nested and crossed random effects, and the cv() methods for mixed models in the cv package pertain to both nested and crossed random effects. We present an example of the latter later in the vignette.↩︎

  6. We subsequently discovered that Vehtari (2023, sec. 8) makes similar points.↩︎

  7. The observant reader will notice that we set the argument control=list(opt="optim") in the call to lme(), changing the optimizer employed from the default "nlminb". We did this because with the default optimizer, lme() encountered the same convergence issue as lmer(), but rather than issuing a warning, lme() failed, reporting an error. As it turns out, setting the optimizer to "optim" avoids this problem.↩︎

  8. We invite the interested reader to experiment with varying the parameters of our example.↩︎

  9. We find it convenient to use the lattice (Sarkar, 2008) and latticeExtra (Sarkar & Andrews, 2022) packages for this and other graphs in this section.↩︎

  10. These are repeated-measures data, which would more conventionally be modeled with autocorrelated errors within pigs. The lme() function in the nlme package, for example, is capable of fitting a mixed model of this form.↩︎

  11. Even though there is only one observation per combination of pigs and weeks, we can use the BLUP for the omitted case because of the crossed structure of the random effects; that is, each pig-week combination has a pig random effect and a week random effect. Although it probably isn’t sensible, we can imagine a mixed model for the pig data that employs nested random effects, which would be specified by lmer(weight ~ week + (1 | id/week), data=Pigs)—that is, a random intercept that varies by combinations of id (pig) and week. This model can’t be fit, however: With only one case per combination of id and week, the nested random-effect variance is indistinguishable from the case-level variance.↩︎

  12. Because of the manner in which the computation is performed, the order of the replicates in the "cvList" object returned by cv() isn’t the same as the order in which the replicates are computed. Each element of the result, however, is a "cv" object with the correct random-number seed saved, and so this technical detail can be safely ignored. The individual "cv" objects are printed in the order in which they are stored rather than the order in which they are computed.↩︎

  13. It’s generally advantageous to start with the largest model, here the one with 100 predictors, and proceed by backward elimination. In this demonstration, however, where all of the \(\beta\)s are really 0, the selected model will be small, and so we proceed by forward selection from the null model to save computing time.↩︎

  14. The presentation in this section benefits from an email conversation with Bill Venables, who of course isn’t responsible for the use to which we’ve put his insightful remarks.↩︎

  15. Of course, making the decision to treat year as a factor on this basis could be construed as cheating in the current context, which illustrates the difficulty of automating the whole model-selection process. It’s rarely desirable, in our opinion, to forgo exploration of the data to ensure the purity of model validation. We believe, however, that it’s still useful to automate as much of the process as we can to obtain a more realistic, if still biased, estimate of the predictive power of a model.↩︎

  16. This is a definitional formula, which assumes that the model matrix \(\mathbf{X}\) is of full column rank, and which can be subject to numerical instability when \(\mathbf{X}\) is ill-conditioned. lm() uses the singular-value decomposition of the model matrix to obtain computationally more stable results.↩︎

  17. Arlot & Celisse (2010) term the casewise loss, \(\mathrm{cv}(y_i, \widehat{y}_i)\), the “contrast function.”↩︎