\newcommand{\X}{\mathbf{X}} \newcommand{\Pb}{\mathbf{P}} \newcommand{\Gb}{\mathbf{G}} \newcommand{\XtXinv}{(\X^{\top}\X)^{-1}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\E}{\mathbb{E}} \newcommand{\e}{\mathbf{e}} \newcommand{\V}{\mathbb{V}}

estimatr is a package in R dedicated to providing fast estimators that take into consideration designs often used by social scientists. Estimators are statistical methods for estimating quantities of interest like treatment effects or regression parameters. Many of the estimators included with the R programming language or popular R packages are slow and have default settings that lead to statistically inappropriate estimates. Certain estimators that reflect cutting-edge advances in statistics are not yet implemented in R packages for convenient use. estimatr is designed to solve these problems and provide estimators tuned for design-based inference.

The most up-to-date version of this vignette can be found on the DeclareDesign website.

Estimators

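The current estimators we provide are lm_robust(), lm_lin(), difference_in_means(), and horvitz_thompson(). Each is described in its own section below.

We first create some sample data to demonstrate how to use each of these estimators.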

library(estimatr)

# Example dataset to be used throughout built using fabricatr and randomizr
library(fabricatr)
library(randomizr)
dat <- fabricate(
  N = 100,                        # sample size
  x = runif(N, 0, 1),             # pre-treatment covariate
  y0 = rnorm(N, mean = x),        # control potential outcome
  y1 = y0 + 0.35,                 # treatment potential outcome
  z = complete_ra(N),             # complete random assignment to treatment
  y = ifelse(z, y1, y0),          # observed outcome

  # We will also consider clustered data
  clust = sample(rep(letters[1:20], each = 5)),
  z_clust = cluster_ra(clust),
  y_clust = ifelse(z_clust, y1, y0)
)

head(dat)
ID   x     y0    y1   z  y     clust  z_clust  y_clust
001  0.91  1.24  1.6  0  1.24  k      1        1.59
002  0.94  0.15  0.5  0  0.15  m      0        0.15
003  0.29  1.86  2.2  0  1.86  i      0        1.86
004  0.83  1.47  1.8  1  1.82  r      1        1.82
005  0.64  0.73  1.1  1  1.08  c      0        0.73
006  0.52  0.80  1.1  0  0.80  s      0        0.80
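
For readers who do not have fabricatr and randomizr installed, a roughly equivalent dataset can be sketched in base R. This is an illustration, not the package authors' code. The variable names mirror those above, the name dat_base and the seed are our own, and the sketch assumes that complete and cluster random assignment place exactly half of the units (or clusters) in treatment, which is our reading of the defaults of complete_ra() and cluster_ra() when the counts are even.

```r
set.seed(123)  # arbitrary seed, for reproducibility only
N <- 100

x  <- runif(N, 0, 1)        # pre-treatment covariate
y0 <- rnorm(N, mean = x)    # control potential outcome
y1 <- y0 + 0.35             # treatment potential outcome

# Complete random assignment (assumed: exactly N / 2 units treated)
z <- sample(rep(c(0, 1), each = N / 2))
y <- ifelse(z == 1, y1, y0) # observed outcome

# 20 clusters of 5 units each (assumed: exactly 10 clusters treated)
clust <- sample(rep(letters[1:20], each = 5))
treated_clusters <- sample(letters[1:20], size = 10)
z_clust <- as.numeric(clust %in% treated_clusters)
y_clust <- ifelse(z_clust == 1, y1, y0)

dat_base <- data.frame(x, y0, y1, z, y, clust, z_clust, y_clust)

# The true average treatment effect is 0.35 by construction
mean(dat_base$y1 - dat_base$y0)
```

Because both potential outcomes are stored, the true treatment effect is known, which makes it easy to judge how close each estimator comes.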

lm_robust

The estimatr package provides lm_robust() to quickly fit linear models with the most common variance estimators and degrees of freedom corrections used in social science. You can easily estimate heteroskedasticity-robust standard errors, cluster-robust standard errors, and classical standard errors.

Usage largely mimics lm(), although it defaults to using Eicker-Huber-White robust standard errors, specifically “HC2” standard errors. More about the exact specifications used can be found in the technical notes and more about the estimator can be found on its reference page: lm_robust().

res <- lm_robust(y ~ z + x, data = dat)
summary(res)
#> 
#> Call:
#> lm_robust(formula = y ~ z + x, data = dat)
#> 
#> Standard error type =  HC2 
#> 
#> Coefficients:
#>             Estimate Std. Error Pr(>|t|) CI Lower CI Upper DF
#> (Intercept)   -0.187      0.207 3.69e-01   -0.598    0.224 97
#> z              0.235      0.187 2.11e-01   -0.135    0.605 97
#> x              1.418      0.286 2.97e-06    0.851    1.985 97
#> 
#> Multiple R-squared:  0.182 , Adjusted R-squared:  0.165 
#> F-statistic: 10.8 on 2 and 97 DF,  p-value: 5.83e-05
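
To make the "HC2" label concrete, the sketch below computes HC2 standard errors by hand using only base R. It relies on the textbook HC2 formula, in which each squared residual is divided by one minus that observation's leverage. The simulated data and all object names are illustrative, and this is not a description of how lm_robust() is implemented internally.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 100
x <- runif(n)
z <- sample(rep(c(0, 1), each = n / 2))
y <- 0.35 * z + x + rnorm(n)

fit <- lm(y ~ z + x)
X <- model.matrix(fit)   # n x k design matrix
e <- residuals(fit)      # OLS residuals
h <- hatvalues(fit)      # leverage of each observation

# Sandwich estimator: (X'X)^{-1} X' diag(e_i^2 / (1 - h_i)) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / sqrt(1 - h)))  # scales row i of X by e_i / sqrt(1 - h_i)
vcov_hc2 <- bread %*% meat %*% bread

se_hc2 <- sqrt(diag(vcov_hc2))
se_hc2
```

Because the leverage adjustment inflates each squared residual, these standard errors are never smaller than the unadjusted (HC0) robust standard errors from the same fit.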

Users can also easily get the output as a data.frame by using tidy().

tidy(res)
coefficient_name  coefficients  se    p     ci_lower  ci_upper  df  outcome
(Intercept)       -0.19         0.21  0.37  -0.60     0.22      97  y
z                 0.23          0.19  0.21  -0.14     0.60      97  y
x                 1.42          0.29  0.00  0.85      1.99      97  y

It is straightforward to do cluster-robust inference by passing the name of your cluster variable to the clusters = argument. The default variance estimator with clusters is dubbed 'CR2' because it is analogous to 'HC2' for the clustered case, and it uses recent advances proposed by @pustejovskytipton2016 to correct hypothesis tests for small samples and to work with commonly specified fixed effects and weights. Note that lm_robust() is quicker if your cluster variable is a factor!

res_cl <- lm_robust(
  y_clust ~ z_clust + x,
  data = dat,
  clusters = clust
)
tidy(res_cl)
coefficient_name  coefficients  se    p     ci_lower  ci_upper  df  outcome
(Intercept)       -0.48         0.22  0.05  -0.97     0.0       12  y_clust
z_clust           0.81          0.17  0.00  0.45      1.2       18  y_clust
x                 1.43          0.34  0.00  0.72      2.1       17  y_clust

Researchers can also replicate Stata's standard errors by using the se_type = argument both with and without clusters:

res_stata <- lm_robust(
  y_clust ~ z_clust + x,
  data = dat,
  clusters = clust,
  se_type = "stata"
)
tidy(res_stata)
coefficient_name  coefficients  se    p     ci_lower  ci_upper  df  outcome
(Intercept)       -0.48         0.22  0.04  -0.95     -0.02     19  y_clust
z_clust           0.81          0.17  0.00  0.46      1.16      19  y_clust
x                 1.43          0.33  0.00  0.73      2.13      19  y_clust
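
Stata's clustered standard errors are commonly described as the "CR1" estimator: a cluster-level sandwich variance scaled by G / (G - 1) * (N - 1) / (N - K), with G - 1 degrees of freedom, where G is the number of clusters, N the number of observations, and K the number of coefficients. The degrees of freedom of 19 reported above for 20 clusters are consistent with that description. The following base R sketch assumes that formula; the simulated data and object names are illustrative only.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
G <- 20      # number of clusters
m <- 5       # units per cluster
clust <- rep(seq_len(G), each = m)
z <- as.numeric(clust %in% sample(G, G / 2))  # cluster-level treatment
x <- runif(G * m)
y <- 0.35 * z + x + rep(rnorm(G), each = m) + rnorm(G * m)

fit <- lm(y ~ z + x)
X <- model.matrix(fit)
e <- residuals(fit)
N <- nrow(X)
K <- ncol(X)

# Sum the score contributions x_i * e_i within each cluster (G x K matrix)
scores <- rowsum(X * e, group = clust)

bread <- solve(crossprod(X))
meat  <- crossprod(scores)
adj   <- (G / (G - 1)) * ((N - 1) / (N - K))  # finite-sample correction
vcov_cr1 <- adj * bread %*% meat %*% bread

se_cr1 <- sqrt(diag(vcov_cr1))
df_cr1 <- G - 1
ci_cr1 <- cbind(
  lower = coef(fit) - qt(0.975, df_cr1) * se_cr1,
  upper = coef(fit) + qt(0.975, df_cr1) * se_cr1
)
ci_cr1
```

Unlike CR2, this correction uses a single degrees-of-freedom value for every coefficient, which is why the df column is constant in the Stata-style output.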

Furthermore, users can take advantage of the margins package to get marginal effects, average marginal effects and their standard errors, and more.

library(margins)

res_int <- lm_robust(y ~ x * z, data = dat)
mar_int <- margins(res_int, vce = "delta")
summary(mar_int)
#>  factor    AME     SE      z      p   lower  upper
#>       x 1.4319 0.2894 4.9468 0.0000  0.8645 1.9992
#>       z 0.2355 0.1864 1.2633 0.2065 -0.1298 0.6008

Users who want their regression output in LaTeX or HTML can use the texreg package, which we extend here to work with our linear regression estimators.

library(texreg)

tex_int <- extract(res_int)
texreg(tex_int, file = "ex.tex")

lm_lin

Adjusting for pre-treatment covariates when using regression to estimate treatment effects is common practice across scientific disciplines. However, @freedman2008 demonstrated that pre-treatment covariate adjustment can bias estimates of average treatment effects. In response, @lin2013 proposed an alternative estimator that reduces this bias and improves precision: center all pre-treatment covariates, interact them with the treatment variable, and regress the outcome on the treatment, the centered pre-treatment covariates, and all of the interaction terms. This can require a non-trivial amount of data pre-processing.

To facilitate this, we provide a wrapper that processes the data and estimates the model. We dub this estimator the Lin estimator, and it can be accessed using lm_lin(). This function is a wrapper for lm_robust(), and all arguments that work for lm_robust() work here. The only difference is the second argument, covariates, which takes a right-sided formula listing all of the pre-treatment covariates. Below is an example; more can be seen on the function reference page, lm_lin(), and some formal notation can be seen in the technical notes.

res_lin <- lm_lin(
  y ~ z,
  covariates = ~ x,
  data = dat
)
tidy(res_lin)
coefficient_name  coefficients  se    p     ci_lower  ci_upper  df  outcome
(Intercept)       0.55          0.15  0.00  0.25      0.84      96  y
z                 0.24          0.19  0.21  -0.13     0.61      96  y
x_bar             1.72          0.47  0.00  0.79      2.65      96  y
z:x_bar           -0.58         0.58  0.32  -1.73     0.57      96  y
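
The pre-processing that lm_lin() automates can be sketched by hand with base R's lm(): center the covariate, then interact it with treatment. The example below covers point estimates only (lm_lin() additionally reports robust standard errors), and the simulated data and object names are illustrative rather than taken from the package.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
n <- 100
x <- runif(n)
z <- sample(rep(c(0, 1), each = n / 2))
y <- 0.35 * z + x + rnorm(n)

# Step 1: center the pre-treatment covariate at its full-sample mean
x_c <- x - mean(x)

# Step 2: regress the outcome on treatment, the centered covariate,
# and their interaction (z * x_c expands to z + x_c + z:x_c)
fit_lin <- lm(y ~ z * x_c)
coef(fit_lin)

# Equivalent view: fit a separate regression in each arm and compare
# the intercepts, i.e. the predicted outcomes at the mean of x
fit1 <- lm(y ~ x_c, subset = z == 1)
fit0 <- lm(y ~ x_c, subset = z == 0)
unname(coef(fit1)[1] - coef(fit0)[1])
```

Because the covariate is centered, the coefficient on z is the estimated difference between the two arms' fitted outcomes at the average covariate value, which is what makes it interpretable as an average treatment effect estimate.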

The output of a lm_lin() call can be used with the same methods as lm_robust(), including the margins package.

difference_in_means

While estimating differences in means may seem straightforward, we provide a function that appropriately adjusts estimates for experimental design. We provide support for unit-randomized, cluster-randomized, block-randomized, matched-pair randomized, and matched-pair clustered designs. Usage is similar to usage in regression functions. More examples can be seen on the function reference page, difference_in_means(), and the actual estimators used can be found in the technical notes.

# Simple version
res_dim <- difference_in_means(
  y ~ z,
  data = dat
)
tidy(res_dim)
coefficient_name  coefficients  se   p     ci_lower  ci_upper  df  outcome
z                 0.16          0.2  0.44  -0.25     0.56      90  y

# Clustered version
res_dim_cl <- difference_in_means(
  y_clust ~ z_clust,
  data = dat,
  clusters = clust
)
tidy(res_dim_cl)
coefficient_name  coefficients  se    p  ci_lower  ci_upper  df  outcome
z_clust           0.82          0.17  0  0.45      1.2       18  y_clust
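
For the simplest case, a unit-randomized design with no blocks or clusters, our understanding is that the estimator reduces to the difference in sample means, with variance equal to the sum of each arm's sample variance divided by its size and with Welch-Satterthwaite degrees of freedom. The technical notes remain the authoritative source. The base R sketch below works through that arithmetic under this assumption, using simulated data and illustrative names, and cross-checks it against t.test().

```r
set.seed(11)  # arbitrary seed, for reproducibility only
n <- 100
z <- sample(rep(c(0, 1), each = n / 2))
y <- 0.35 * z + rnorm(n)

y1 <- y[z == 1]  # treated outcomes
y0 <- y[z == 0]  # control outcomes

est <- mean(y1) - mean(y0)

# Variance contribution of each arm
v1 <- var(y1) / length(y1)
v0 <- var(y0) / length(y0)
se <- sqrt(v1 + v0)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 + v0)^2 / (v1^2 / (length(y1) - 1) + v0^2 / (length(y0) - 1))

ci <- est + c(-1, 1) * qt(0.975, df) * se
p  <- 2 * pt(abs(est / se), df, lower.tail = FALSE)

# Cross-check: base R's default two-sample t-test is the Welch test
welch <- t.test(y1, y0)
c(estimate = est, se = se, df = df, p = p)
```

The clustered, blocked, and matched-pair designs use different variance formulas, so this sketch should not be read as covering them.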

You can check which design was learned, and therefore which kind of estimator was used, by examining the design element of the output.

data(sleep)
res_mps <- difference_in_means(extra ~ group, data = sleep, blocks = ID)
res_mps$design
#> [1] "Matched-pair"

horvitz_thompson

Horvitz-Thompson estimators can be used to obtain unbiased estimates of treatment effects when the randomization scheme is known. This is particularly useful when clusters of different sizes are randomized into treatment, or when the treatment assignment is complex and the probability of being treated depends on other units' assignments. Horvitz-Thompson estimators require the probability that each unit is in treatment and in control, as well as the joint probabilities that each pair of units is both in treatment, both in control, or in opposite treatment conditions.

The estimator we implement here, horvitz_thompson(), estimates treatment effects for two-armed trials. The easiest way to specify your design and recover the full set of joint and marginal probabilities is to declare your randomization scheme using declare_ra() from the randomizr package. We show some examples of how to do that below. Again, the technical details for this estimator can be found in the technical notes and in the references cited there.

# Complete random assignment declaration
crs_decl <- declare_ra(
  N = nrow(dat),
  prob = 0.5,
  simple = FALSE
)

ht_comp <- horvitz_thompson(
  y ~ z,
  data = dat,
  declaration = crs_decl
)
tidy(ht_comp)
coefficient_name  coefficients  se   p     ci_lower  ci_upper  df  outcome
z                 0.16          0.2  0.44  -0.24     0.56      NA  y
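
The point estimate matches the difference in means reported earlier, which is expected under complete random assignment with equal-sized arms. A small base R sketch of the standard Horvitz-Thompson point estimate shows why: each observed outcome is weighted by the inverse of the probability of the condition it was observed in. The data and names below are illustrative, and the variance (which needs the joint probabilities) is not computed here.

```r
set.seed(3)  # arbitrary seed, for reproducibility only
N <- 100
z <- sample(rep(c(0, 1), each = N / 2))  # complete assignment, 50 treated
y <- 0.35 * z + rnorm(N)

# Marginal probability of treatment for each unit under this design
p_treat <- rep(0.5, N)

# Horvitz-Thompson point estimate of the average treatment effect:
# inverse-probability-weighted totals, divided by the number of units
ht_est <- (sum(y[z == 1] / p_treat[z == 1]) -
           sum(y[z == 0] / (1 - p_treat[z == 0]))) / N

# With equal probabilities and equal arm sizes this is the
# ordinary difference in means
dim_est <- mean(y[z == 1]) - mean(y[z == 0])

c(horvitz_thompson = ht_est, difference_in_means = dim_est)
```

The two estimates diverge once treatment probabilities differ across units, which is the situation the estimator is designed for.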

We can also easily estimate treatment effects from a cluster-randomized experiment. Letting horvitz_thompson() know that the design is clustered means it uses a collapsed estimator for the variance, described in @aronowmiddleton2013.

# Clustered random assignment declaration
crs_clust_decl <- declare_ra(
  N = nrow(dat),
  clusters = dat$clust,
  prob = 0.5,
  simple = FALSE
)

ht_clust <- horvitz_thompson(
  y_clust ~ z_clust,
  data = dat,
  declaration = crs_clust_decl
)
tidy(ht_clust)
coefficient_name  coefficients  se    p  ci_lower  ci_upper  df  outcome
z_clust           0.82          0.25  0  0.33      1.3       NA  y_clust

You can also build the condition probability matrix (condition_pr_mat =) that horvitz_thompson() needs from a declaration from the randomizr package, using declaration_to_condition_pr_mat(), or from a matrix of permutations of the treatment vector, using permutations_to_condition_pr_mat(). This is largely intended for experienced users. Note that if one passes a condition_pr_mat that indicates clustering but does not specify the clusters argument, then the collapsed estimator will not be used.

# arbitrary permutation matrix
possible_treats <- cbind(
  c(1, 1, 0, 1, 0, 0, 0, 1, 1, 0),
  c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1),
  c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0)
)
arb_pr_mat <- permutations_to_condition_pr_mat(possible_treats)

# Simulating a column to be realized treatment
dat <- data.frame(
  z = possible_treats[, sample(ncol(possible_treats), size = 1)],
  y = rnorm(nrow(possible_treats))
)

ht_arb <- horvitz_thompson(
  y ~ z,
  data = dat,
  condition_pr_mat = arb_pr_mat
)
tidy(ht_arb)
coefficient_name  coefficients  se    p     ci_lower  ci_upper  df  outcome
z                 -0.82         0.84  0.33  -2.5      0.83      NA  y
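
To see what kind of information a permutation matrix carries, the base R sketch below computes marginal and joint assignment probabilities directly from the same three permutations, treating each column as equally likely. It illustrates the quantities involved and does not reproduce the exact layout of the matrix that permutations_to_condition_pr_mat() returns; the object names other than possible_treats are our own.

```r
# Same permutation matrix as above: rows are units, columns are
# equally likely treatment vectors
possible_treats <- cbind(
  c(1, 1, 0, 1, 0, 0, 0, 1, 1, 0),
  c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1),
  c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0)
)
n_perms <- ncol(possible_treats)

# Marginal probability that each unit is treated
pr_treat <- rowMeans(possible_treats)

# Joint probability that units i and j are both treated
pr_both_treat <- tcrossprod(possible_treats) / n_perms

# Joint probability that units i and j are both in control
pr_both_control <- tcrossprod(1 - possible_treats) / n_perms

# Joint probability that unit i is treated while unit j is in control
pr_treat_control <- tcrossprod(possible_treats, 1 - possible_treats) / n_perms

round(pr_treat, 2)
```

With only three permutations, many pairs of units have a joint probability of zero for some combination of conditions, which is one reason variance estimation under arbitrary designs needs care.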

References