We built estimatr to provide accurate standard errors quickly. This document benchmarks the speed of our linear regression estimator against other estimators. Our performance is slightly better than base R when using classical standard errors, but most of our improvements come when estimating robust standard errors.
Furthermore, we provide an option in our `lm_robust()` and `lm_lin()` estimators, `try_cholesky`, which users should set to `TRUE` if they are concerned about speed and are certain their analysis does not suffer from perfect multicollinearity (linear dependencies).
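To see why `try_cholesky` matters, here is a base-R sketch (not estimatr's actual internals) of the two solve strategies: a Cholesky factorization of X'X is fast but fails under perfect multicollinearity, while a rank-revealing QR decomposition is slower but handles linear dependencies gracefully.

```r
# Illustration (base R only): solving least squares via Cholesky vs. QR.
set.seed(1)
n <- 1000; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- rnorm(n)

# Cholesky route: factor X'X directly -- fast, but requires full rank
beta_chol <- chol2inv(chol(crossprod(X))) %*% crossprod(X, y)

# QR route: slower, but robust to rank deficiency
beta_qr <- qr.coef(qr(X), y)

all.equal(c(beta_chol), unname(beta_qr))  # should agree up to numerical tolerance

# With a duplicated column, X'X is singular: the Cholesky factorization errors,
# which is why try_cholesky = TRUE is only safe for full-rank designs
X_bad <- cbind(X, X[, 2])
inherits(try(chol(crossprod(X_bad)), silent = TRUE), "try-error")
```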
I test our speed in estimating coefficients, standard errors, and doing inference on four different datasets (500 and 5,000 observations; 5 and 50 covariates) and across several different specifications. Below I preview the results comparing `lm_robust()` to base R for fitting coefficients and to commonly used packages for robust standard errors, `sandwich` and `clubSandwich`. In the two largest datasets, our method is almost always faster; at worst it matches base R, and only when computing classical standard errors. The biggest gains come with robust standard errors: using `lm_robust()` to get HC2 or Stata-like cluster-robust standard errors will roughly halve your waiting time. If you want CR2 standard errors, `lm_robust()` can reduce your run time by a factor of 10!
N. Obs | N. Coefs | Estimator | Classical SEs | HC2 SEs | Stata clustered SEs | CR2 SEs |
---|---|---|---|---|---|---|
500 | 5 | `estimatr::lm_robust()` | 1.9 | 2.3 | 2 | 6 |
500 | 5 | base + `sandwich`/`clubSandwich` | 1.7 | 5.2 | 4.4 | 66 |
5000 | 5 | `estimatr::lm_robust()` | 4.6 | 7.9 | 7.8 | 172 |
5000 | 5 | base + `sandwich`/`clubSandwich` | 4.6 | 22.4 | 21.7 | 2268 |
500 | 50 | `estimatr::lm_robust()` | 5.8 | 8.2 | 8.2 | 62 |
500 | 50 | base + `sandwich`/`clubSandwich` | 6.7 | 20.2 | 29.2 | 160 |
5000 | 50 | `estimatr::lm_robust()` | 26.3 | 41.9 | 55 | 2504 |
5000 | 50 | base + `sandwich`/`clubSandwich` | 32.2 | 114.8 | 253.8 | 10166 |
The times are in milliseconds and are medians over 200 runs for all but the CR2 case, which was benchmarked over 50 runs, using the `microbenchmark` package. This benchmarking was done on a 2017 MacBook Air with a 1.8 GHz Intel Core i5 CPU and 8 GB of memory.
To see the exact comparisons, see below.
```r
library(estimatr)
library(microbenchmark)

# Create some data sets of different sizes for testing below
set.seed(42)
data_size <- expand.grid(list(ns = c(500, 5000), ps = c(5, 50)))
data_list <- lapply(
  1:nrow(data_size),
  function(i) {
    n <- data_size$ns[i]
    p <- data_size$ps[i]
    y <- rnorm(n)
    X <- matrix(rnorm(n * p), n, p)
    return(data.frame(y, X))
  }
)
```
First, I compare a few methods of computing classical standard errors: base R, RcppEigen's `fastLm()` function (from which we borrow much of our algorithm), and RcppArmadillo's `fastLm()` function.
```r
library(RcppEigen)
library(RcppArmadillo)

test_base <- lapply(data_list, function(dat) {
  mbo <- summary(microbenchmark(
    'lm_robust' = lm_robust(y ~ ., data = dat, se_type = "classical"),
    'base' = summary(lm(y ~ ., data = dat)),
    'RcppEigen' = RcppEigen:::summary.fastLm(
      RcppEigen::fastLm(y ~ ., data = dat)
    ),
    'RcppArmadillo' = RcppArmadillo:::summary.fastLm(
      RcppArmadillo::fastLm(y ~ ., data = dat)
    ),
    times = 200L
  ),
  unit = "ms")
  return(mbo[, c("expr", "median")])
})
```
The following table has the median time in milliseconds across 200 runs of each estimator for each of the different data sets.
Estimator | N=500, P=5 | N=5000, P=5 | N=500, P=50 | N=5000, P=50 |
---|---|---|---|---|
lm_robust | 2 | 5 | 6 | 26 |
base | 2 | 5 | 7 | 32 |
RcppEigen | 1 | 5 | 6 | 32 |
RcppArmadillo | 2 | 6 | 10 | 54 |
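All four fitters are computing the same quantity, so the choice is about speed. As a sanity check, the classical standard errors can be reproduced in a few lines of base R as the square root of the diagonal of sigma^2 * (X'X)^-1 (a sketch using `lm`, not any of the benchmarked internals):

```r
# Manual classical SEs (base R only), compared against summary(lm)
set.seed(2)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)
dat <- data.frame(y = rnorm(n), X)
fit <- lm(y ~ ., data = dat)

Xm <- model.matrix(fit)
# Residual variance with the usual n - k degrees-of-freedom correction
sigma2 <- sum(residuals(fit)^2) / (n - ncol(Xm))
se_manual <- sqrt(diag(sigma2 * chol2inv(chol(crossprod(Xm)))))

all.equal(unname(se_manual), unname(coef(summary(fit))[, "Std. Error"]))
```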
However, the real speed gains come with robust standard errors. Let's compare `lm_robust` to getting "HC2" standard errors and doing inference using them via the `coeftest` and `sandwich` packages.
```r
library(sandwich)
library(lmtest)

test_rob <- lapply(data_list, function(dat) {
  mbo <- summary(microbenchmark(
    'lm_robust' = lm_robust(y ~ ., data = dat, se_type = "HC2"),
    'lm + coeftest + sandwich' = {
      lmo <- lm(y ~ ., data = dat)
      coeftest(lmo, vcov = vcovHC(lmo, type = "HC2"))
    },
    times = 200L
  ),
  unit = "ms")
  return(mbo[, c("expr", "median")])
})
```
Estimator | N=500, P=5 | N=5000, P=5 | N=500, P=50 | N=5000, P=50 |
---|---|---|---|---|
lm_robust | 2 | 8 | 8 | 42 |
lm + coeftest + sandwich | 5 | 22 | 20 | 115 |
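For reference, the HC2 variance that both implementations target can be written out directly as V = (X'X)^-1 X' diag(e_i^2 / (1 - h_i)) X (X'X)^-1, where h_i is the leverage of observation i. A base-R sketch (not estimatr's C++ code):

```r
# Manual HC2 "sandwich" variance (base R only)
set.seed(3)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)
dat <- data.frame(y = rnorm(n), X)
fit <- lm(y ~ ., data = dat)

Xm <- model.matrix(fit)
XtX_inv <- chol2inv(chol(crossprod(Xm)))          # "bread"
h <- rowSums((Xm %*% XtX_inv) * Xm)               # leverages, diag of hat matrix
w <- residuals(fit)^2 / (1 - h)                   # HC2 weights
meat <- crossprod(Xm * w, Xm)                     # X' diag(w) X
V_hc2 <- XtX_inv %*% meat %*% XtX_inv
se_hc2 <- sqrt(diag(V_hc2))
```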
What about Stata's clustered standard errors, replicated using `tapply` and `sandwich`?
```r
# Commonly used function, attributed mostly to M. Arai, replicating Stata
# clustered SEs in R using the sandwich and lmtest packages
cluster_robust_se <- function(model, cluster) {
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M / (M - 1)) * ((N - 1) / (N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj) / N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

test_cl <- lapply(data_list, function(dat) {
  cluster <- sample(nrow(dat) / 5, size = nrow(dat), replace = TRUE)
  mbo <- summary(microbenchmark(
    'lm_robust' = lm_robust(
      y ~ .,
      data = dat,
      clusters = cluster,
      se_type = "stata"
    ),
    'lm + coeftest + sandwich' = {
      lmo <- lm(y ~ ., data = dat)
      cluster_robust_se(lmo, cluster)
    },
    times = 200L
  ),
  unit = "ms")
  return(mbo[, c("expr", "median")])
})
```
Estimator | N=500, P=5 | N=5000, P=5 | N=500, P=50 | N=5000, P=50 |
---|---|---|---|---|
lm_robust | 2 | 8 | 8 | 55 |
lm + coeftest + sandwich | 4 | 22 | 29 | 254 |
The original authors of the generalized CR2 errors and the accompanying Satterthwaite-like corrected degrees of freedom have their own package, `clubSandwich`, which provides these estimators for many models. We show here how much faster our implementation is for simple linear regression.
```r
library(clubSandwich)

test_cr2 <- lapply(data_list, function(dat) {
  cluster <- sample(nrow(dat) / 5, size = nrow(dat), replace = TRUE)
  mbo <- summary(microbenchmark(
    'lm_robust' = lm_robust(
      y ~ .,
      data = dat,
      clusters = cluster,
      se_type = "CR2"
    ),
    'lm + clubSandwich' = {
      lmo <- lm(y ~ ., data = dat)
      coef_test(lmo, vcov = vcovCR(lmo, cluster = cluster, type = "CR2"))
    },
    times = 50L
  ),
  unit = "ms")
  return(mbo[, c("expr", "median")])
})
```
Estimator | N=500, P=5 | N=5000, P=5 | N=500, P=50 | N=5000, P=50 |
---|---|---|---|---|
lm_robust | 6 | 173 | 62 | 2504 |
lm + clubSandwich | 66 | 2268 | 160 | 10166 |
```r
print(test_cr2[[4]]$median)
#> [1]  2504 10166
```
```r
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin14.5.0 (64-bit)
#> Running under: OS X Yosemite 10.10.5
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
#>  [5] tools_3.4.3     htmltools_0.3.6 yaml_2.1.15     Rcpp_0.12.15
#>  [9] stringi_1.1.6   rmarkdown_1.8   highr_0.6       knitr_1.17
#> [13] stringr_1.2.0   digest_0.6.14   evaluate_0.10.1
```