Recursive Two-Stage Models to Address Endogeneity


Jing Peng

2025-10-12

1. Introduction

Endogeneity is a key challenge in causal inference. In the absence of plausible instrumental variables, empirical researchers often have little choice but to rely on model-based identification, which makes parametric assumptions about the endogeneity structure.

Model-based identification is usually operationalized in the form of recursive two-stage models, where the dependent variable of the first stage is also the endogenous variable of interest in the second stage. Depending on the types of variables involved in the first and second stages (e.g., continuous, binary, and count), the recursive two-stage models can take many different forms.

The endogeneity package supports the estimation of the recursive two-stage models listed in Table 1, which are discussed in Peng (2023). The models implemented in this package can be used to address the endogeneity of treatment variables in observational studies or the endogeneity of mediators in randomized experiments.

**Table 1. Recursive Two-Stage Models Supported by the Endogeneity Package**
Model	First Stage	Second Stage	Endogenous Variable	Outcome Variable
biprobit	probit	probit	binary	binary
biprobit_latent	probit	probit	binary (unobserved)	binary
biprobit_partial	probit	probit	binary (partially observed)	binary
probit_linear	probit	linear	binary	continuous
probit_linear_latent	probit	linear	binary (unobserved)	continuous
probit_linear_partial	probit	linear	binary (partially observed)	continuous
probit_linearRE	probit	linearRE	binary	continuous
pln_linear	pln	linear	count	continuous
pln_probit	pln	probit	count	binary

2. Models

Let M and Y denote the endogenous variable and the outcome variable, respectively. The models listed in Table 1 are specified as follows.

2.1. biprobit

This model can be used when the endogenous variable and the outcome variable are both binary. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. As is customary in a Probit model, the variance of the error term is assumed to be one in both stages to ensure that the parameter estimates are unique.
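To make the endogeneity problem concrete, the sketch below simulates data from the biprobit specification above using base R only. All parameter values (intercepts, slopes of 1, and \(\rho=-0.5\)) are illustrative assumptions, not values taken from the package. It then fits a naive single-equation probit that ignores the correlation between \(u_i\) and \(v_i\), which is the bias the two-stage model is meant to remove.

```r
# Simulate the biprobit data generating process (assumed parameter values).
set.seed(42)
N   <- 5000
rho <- -0.5                                  # assumed correlation of (u, v)

x <- rbinom(N, 1, 0.5)
z <- rnorm(N)

# Draw (u, v) from a standard bivariate normal with correlation rho.
u <- rnorm(N)
v <- rho * u + sqrt(1 - rho^2) * rnorm(N)

m <- as.numeric(1 + x + z + u > 0)           # first stage (probit)
y <- as.numeric(-1 + x + z + m + v > 0)      # second stage (probit), gamma = 1

# Naive probit that treats m as exogenous.
naive <- glm(y ~ x + z + m, family = binomial(link = "probit"))
coef(naive)[["m"]]                           # falls below the true gamma = 1
```

Because \(\rho\) is negative here, units with \(m_i=1\) tend to have smaller \(v_i\), so the naive coefficient on \(m_i\) is pulled below its true value.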

2.2. biprobit_latent and biprobit_partial

These two models can be used when the endogenous variable and the outcome variable are both binary, but the endogenous variable is unobserved or partially observed. A typical example is a mediator that is unobserved or only partially observed.

The first and second stages of the biprobit_latent model are given by:

First stage (Latent Probit): \[m_i^*=1(\alpha'w_i+u_i>0)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i^*+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the unobserved endogenous variable \(m_i^*\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. To ensure that the estimates of the above model are unique, \(\gamma\) is restricted to be positive. Even with this constraint, the identification of this model can still be weak.
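One way to see why the sign restriction is needed is to write out the probability of the only observed outcome, \(y_i\), implied by the equations above. This is a derivation from the stated model, with \(\Phi_2(\cdot,\cdot;\rho)\) denoting the standard bivariate normal CDF with correlation \(\rho\) (notation introduced here), and is not necessarily the exact expression used in the package:

\[\Pr(y_i=1\mid w_i,x_i)=\Phi_2(\alpha'w_i,\ \beta'x_i+\gamma;\ \rho)+\Phi_2(-\alpha'w_i,\ \beta'x_i;\ -\rho)\]

The first term corresponds to \(m_i^*=1\) and the second to \(m_i^*=0\). Replacing \((\alpha,\beta,\gamma,\rho)\) with \((-\alpha,\ \beta+\gamma,\ -\gamma,\ -\rho)\) merely swaps the two terms, so the likelihood is unchanged when the labels of \(m_i^*\) are flipped. Fixing the sign of \(\gamma\) removes this ambiguity.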

The only difference between biprobit_latent and biprobit_partial is that the latter allows the endogenous variable M to be partially observed. Compared to the case when M is fully unobserved, measuring M for 10% to 20% of units can substantially improve the identification of the model.
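As a small illustration of what "partially observed" can look like in data, the sketch below keeps M for a random 20% of units and masks the rest. Coding the unmeasured units as `NA` is an assumption made here for illustration; check the package documentation for the encoding that biprobit_partial expects. The variable names are hypothetical.

```r
# Mask a binary mediator so that it is observed for only 20% of units.
set.seed(7)
N <- 2000
m <- rbinom(N, 1, 0.6)              # placeholder binary mediator (assumed)

observed_share <- 0.2               # share of units for which M is measured
keep <- sample.int(N, size = round(N * observed_share))

m_partial <- rep(NA_real_, N)       # NA marks units where M was not measured
m_partial[keep] <- m[keep]

mean(!is.na(m_partial))             # 0.2
```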

2.3. probit_linear

This model can be used when the endogenous variable is binary and the outcome variable is continuous. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Linear):

\[y_i=\beta'x_i+\gamma m_i+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\sigma^2\) represents the variance of the error term in the outcome equation.
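The source of the bias in a naive regression of \(y_i\) on \(m_i\) can be read off from the conditional means implied by the model. The expressions below follow from standard truncated-normal results, assuming \((u_i,v_i)\) is independent of \(w_i\) and \(x_i\); \(\phi\) and \(\Phi\) denote the standard normal density and CDF (notation introduced here):

\[E[y_i\mid x_i,w_i,m_i=1]=\beta'x_i+\gamma+\sigma\rho\,\frac{\phi(\alpha'w_i)}{\Phi(\alpha'w_i)}\]

\[E[y_i\mid x_i,w_i,m_i=0]=\beta'x_i-\sigma\rho\,\frac{\phi(\alpha'w_i)}{1-\Phi(\alpha'w_i)}\]

When \(\rho=0\) the extra terms vanish and ordinary least squares recovers \(\gamma\). When \(\rho\neq 0\), the difference between the two conditional means mixes \(\gamma\) with a selection term proportional to \(\sigma\rho\).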

2.4. probit_linear_latent and probit_linear_partial

These two models can be used when the outcome variable is continuous and the endogenous variable is an unobserved or partially observed binary variable. A typical example is a mediator that is unobserved or only partially observed.

The first and second stages of the probit_linear_latent model are given by:

First stage (Latent Probit): \[m_i^*=1(\alpha'w_i+u_i>0)\]

Second stage (Linear):

\[y_i=\beta'x_i+\gamma m_i^*+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

The only difference between probit_linear_latent and probit_linear_partial is that the latter allows the endogenous variable M to be partially observed. Compared to the case when M is fully unobserved, measuring M for 10% to 20% of units can substantially improve the identification of the model.

2.5. probit_linearRE

This model is an extension of the probit_linear model to panel data. The outcome variable is a time-variant continuous variable, and the endogenous variable is a time-invariant binary variable. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Panel linear model with individual-level random effects):

\[y_{it}=\beta'x_{it}+\gamma m_i+\lambda v_i+\sigma \varepsilon_{it}\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_{it}\) denotes the set of covariates influencing the outcome variable \(y_{it}\). \(v_i\) represents the individual-level random effect and is assumed to follow a standard bivariate normal distribution with \(u_i\), so \(\lambda^2\) is the variance of the random effect. \(\sigma^2\) represents the variance of the idiosyncratic error term \(\sigma\varepsilon_{it}\) in the outcome equation.
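The panel structure may be easier to follow with simulated data. The sketch below builds a long-format panel from the equations above in base R. The parameter values, the variable names, and the assumption that \(\varepsilon_{it}\) is standard normal and independent of \((u_i,v_i)\) are illustrative choices made here. It also checks that the composite error \(\lambda v_i+\sigma\varepsilon_{it}\) is correlated across periods within an individual, with correlation \(\lambda^2/(\lambda^2+\sigma^2)\).

```r
# Simulate a panel with a time-invariant binary endogenous variable.
set.seed(123)
n_id <- 1000; n_t <- 5
rho <- -0.5; lambda <- 1; sigma <- 1        # assumed parameter values

w <- rnorm(n_id)                            # individual-level covariate
u <- rnorm(n_id)
v <- rho * u + sqrt(1 - rho^2) * rnorm(n_id)
m <- as.numeric(1 + w + u > 0)              # first stage, one value per individual

id  <- rep(seq_len(n_id), each = n_t)       # long format: n_t rows per individual
x   <- rnorm(n_id * n_t)                    # time-varying covariate
eps <- rnorm(n_id * n_t)                    # idiosyncratic error (assumed N(0, 1))
y   <- 1 + x + m[id] + lambda * v[id] + sigma * eps

panel <- data.frame(id = id, t = rep(seq_len(n_t), times = n_id),
                    x = x, m = m[id], y = y)

# Within-individual correlation of the composite error across two periods.
composite <- matrix(lambda * v[id] + sigma * eps, nrow = n_id, byrow = TRUE)
icc <- cor(composite[, 1], composite[, 2])
icc                                         # near lambda^2 / (lambda^2 + sigma^2) = 0.5
```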

2.6. pln_linear

This model can be used when the endogenous variable is a count measure and the outcome variable is continuous. The first and second stages of the model are given by:

First stage (Poisson lognormal): \[E[m_i\mid w_i,u_i]=\exp(\alpha'w_i+\lambda u_i)\]

Second stage (linear):

\[y_i=\beta'x_i+\gamma m_i+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\lambda^2\) and \(\sigma^2\) represent the variance of the error terms in the first and second stages, respectively.
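A short simulation can show what the Poisson lognormal first stage implies for the data. The sketch below draws a count \(m_i\) from a Poisson distribution whose log mean includes the normal error \(\lambda u_i\), then builds the continuous outcome. All parameter values and variable names are assumptions chosen for illustration.

```r
# Simulate the pln_linear data generating process (assumed parameter values).
set.seed(2024)
N <- 5000
rho <- -0.5; lambda <- 0.8; sigma <- 1

x <- rbinom(N, 1, 0.5)
z <- rnorm(N)
u <- rnorm(N)
v <- rho * u + sqrt(1 - rho^2) * rnorm(N)

mu <- exp(-0.5 + 0.5 * x + 0.5 * z + lambda * u)   # conditional mean of m
m  <- rpois(N, mu)                                 # first stage (count)
y  <- 1 + x + z + 0.5 * m + sigma * v              # second stage (linear)

# The lognormal error makes the count overdispersed: variance exceeds mean.
c(mean = mean(m), variance = var(m))
```

The error \(u_i\) plays two roles here: it generates overdispersion in the count, and through \(\rho\) it links the first stage to the outcome equation.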

2.7. pln_probit

This model can be used when the endogenous variable is a count measure and the outcome variable is binary. The first and second stages of the model are given by:

First stage (Poisson lognormal): \[E[m_i\mid w_i,u_i]=\exp(\alpha'w_i+\lambda u_i)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\lambda^2\) represents the variance of the error term in the first stage. The variance of the error term in the second stage Probit model is normalized to 1.

3. Examples

After loading the endogeneity package, type “example(model_name)” to see sample code for each model. For example, the code below runs the probit_linear model on a simulated dataset with the following data generating process (DGP):

\[m_i=1(1+x_i+z_i+u_i>0)\]

\[y_i=1+x_i+z_i+m_i+v_i\]

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}\right). \]

library(endogeneity)
example(probit_linear, prompt.prefix=NULL)
#> 
#> > library(MASS)
#> 
#> > N = 2000
#> 
#> > rho = -0.5
#> 
#> > set.seed(1)
#> 
#> > x = rbinom(N, 1, 0.5)
#> 
#> > z = rnorm(N)
#> 
#> > e = mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1), nrow=2))
#> 
#> > e1 = e[,1]
#> 
#> > e2 = e[,2]
#> 
#> > m = as.numeric(1 + x + z + e1 > 0)
#> 
#> > y = 1 + x + z + m + e2
#> 
#> > est = probit_linear(m~x+z, y~x+z+m)
#> ==== Converged after 65 iterations, LL=-3424.12, gtHg=0.000000 ****
#> LR test of rho=0, chi2(1)=20.632, p-value=0.0000
#> Time difference of 0.0328331 secs
#> 
#> > print(est$estimates, digits=3)
#>                    estimate     se     z        p    lci    uci
#> linear.(Intercept)    0.970 0.1232  7.88 3.33e-15  0.729  1.212
#> linear.x              0.996 0.0526 18.91 0.00e+00  0.893  1.099
#> linear.z              0.971 0.0338 28.68 0.00e+00  0.904  1.037
#> linear.m              1.046 0.1566  6.68 2.42e-11  0.739  1.353
#> probit.(Intercept)    1.019 0.0549 18.55 0.00e+00  0.911  1.126
#> probit.x              0.948 0.0853 11.11 0.00e+00  0.780  1.115
#> probit.z              0.983 0.0497 19.77 0.00e+00  0.886  1.081
#> sigma                 1.034 0.0206 50.08 0.00e+00  0.994  1.075
#> rho                  -0.488 0.0772 -6.31 2.71e-10 -0.624 -0.322

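The parameter estimates are very close to the true values in the DGP: every intercept and slope coefficient equals 1, sigma equals 1, and rho equals -0.5. Each true value lies within the reported 95% confidence interval.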
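For comparison, the sketch below regenerates data from the same DGP in base R and fits a naive linear regression that treats \(m_i\) as exogenous. It does not call the endogeneity package and draws the errors differently from the example above (a Cholesky-style construction instead of `mvrnorm`), so the simulated values are not identical to those in the example output.

```r
# Same DGP as the example, but estimated with naive OLS.
set.seed(1)
N   <- 2000
rho <- -0.5

x  <- rbinom(N, 1, 0.5)
z  <- rnorm(N)
e1 <- rnorm(N)
e2 <- rho * e1 + sqrt(1 - rho^2) * rnorm(N)   # corr(e1, e2) = rho

m <- as.numeric(1 + x + z + e1 > 0)
y <- 1 + x + z + m + e2

naive <- lm(y ~ x + z + m)                    # ignores the endogeneity of m
coef(naive)[["m"]]                            # well below the true value of 1
```

With a negative \(\rho\), the naive estimate of the coefficient on \(m_i\) is biased downward, whereas the probit_linear estimate reported above is close to 1.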

4. Notes

When the first stage is nonlinear, the identification of a recursive two-stage model does not require an instrumental variable that appears in the first stage but not in the second stage. The identification strength generally increases with the explanatory power of the first-stage covariates. Therefore, one can improve the identification by including more control variables. Comprehensive simulation studies and sensitivity analyses for the recursive two-stage models are available in Peng (2023).

Empirical researchers are encouraged to try both instrument-based and model-based identification whenever possible. If the two identification strategies relying on different assumptions lead to consistent results, we can be more certain about the validity of our findings.

Citations

Peng, Jing (2023). Identification of Causal Mechanisms from Randomized Experiments: A Framework for Endogenous Mediation Analysis. Information Systems Research, 34(1):67-84. Available at https://doi.org/10.1287/isre.2022.1113
