The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Multibias Validation

library(multibias)

This example demonstrates how multibias is validated using simulation data. Here we are interested in quantifying the effect of exposure X on outcome Y. The causal system can be represented in the following directed acyclic graph (DAG):

The variables are defined:

The DAG indicates that there are three sources of bias: 1. There is uncontrolled confounding from (unobserved) variable U. 2. The true exposure, X, is unobserved, and the misclassified exposure X* is dependent on both the exposure and outcome. 3. Lastly, there is collider stratification at variable S since exposure and outcome both affect selection. The study naturally only assesses those who were selected into the study (i.e. those with S=1), which represents a fraction of all people in the source population from which we are trying to draw inference.

A simulated dataframe corresponding to this DAG, df_uc_emc_sel can be loaded from the multibias package.

head(df_uc_em_sel)
#>   Xstar Y C1 C2 C3
#> 1     1 0  0  0  1
#> 2     0 0  0  0  1
#> 3     0 0  0  0  0
#> 4     0 0  1  0  1
#> 5     1 0  0  0  1
#> 6     1 0  0  1  1

In this data, the true, unbiased exposure-outcome odds ratio (ORYX) equals ~2. However, when we run a logistic regression of the outcome on the exposure and confounders, we do not observe an odds ratio of 2 due to the multiple bias sources.

biased_model <- glm(Y ~ Xstar + C1 + C2 + C3, ,
                    family = binomial(link = "logit"),
                    data = df_uc_em_sel)
biased_or <- round(exp(coef(biased_model)[2]), 2)
print(paste0("Biased Odds Ratio: ", biased_or))
#> [1] "Biased Odds Ratio: 1.5"

The function adjust_uc_emc_sel() can be used here to “reconstruct” the unbiased data and return the exposure-outcome odds ratio that would be observed in the unbiased setting.

Models for the missing variables (U, X, S) are used to facilitate this data reconstruction. For the above DAG, the corresponding bias models are:

where j indicates the number of measured confounders.

To perform the bias adjustment, it is necessary to obtain values of these bias parameters. Potential sources of these bias parameters include internal validation data, estimates in the literature, and expert opinion. For purposes of demonstrating the methodology, we will obtain the exact values of these bias parameters. This is possible because for purposes of validation we have access to the data of missing values that would otherwise be absent in real-world practice. This source data is available in multibias as df_uc_emc_sel_source.

u_model <- glm(U ~ X + Y,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)
x_model <- glm(X ~ Xstar + Y + C1 + C2 + C3,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)
s_model <- glm(S ~ Xstar + Y + C1 + C2 + C3,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)

In this example we’ll perform probabilistic bias analysis, representing each bias parameter as a single draw from a probability distribution. For this reason, we will run the analysis over 1,000 iterations with bootstrap samples to obtain a valid confidence interval. To speed up the computation we will run the for loop in parallel using the foreach() function in the doParallel package. We create a cluster, make a seed for consistent results, and specify the desired number of bootstrap repitions.

library(doParallel)

no_cores <- detectCores() - 1
registerDoParallel(cores = no_cores)
cl <- makeCluster(no_cores)

set.seed(1234)
nreps <- 1000
est <- vector(length = nreps)

Next we run the parallel for loop in which we apply the adjust_uc_em_sel() function to bootstrap samples of the df_uc_em_sel data. Within the function, we include the following arguments: 1. The data_observed object, which includes the dataframe and key column names. 2. The bias parameters: U model coefficients, X model coefficients, and S model coefficients.

Using the results from the fitted bias models above, we’ll use Normal distribution draws for each bias parameter where the mean correponds to the estimated coefficient from the bias model and the standard deviation comes from the estimated standard deviation (i.e., standard error) of the coefficient in the bias model. Each loop iteration will thus have slightly different values for the bias parameters.

or <- foreach(i = 1:nreps, .combine = c,
              .packages = c("dplyr", "multibias")) %dopar% {

  df_sample <- df_uc_em_sel[sample(seq_len(nrow(df_uc_em_sel)),
                                   nrow(df_uc_em_sel),
                                   replace = TRUE), ]

  est[i] <- adjust_uc_em_sel(
    data_observed = data_observed(
      data = df_sample,
      exposure = "Xstar",
      outcome = "Y",
      confounders = c("C1", "C2", "C3")
    ),
    u_model_coefs = c(
      rnorm(1, mean = u_model$coef[1], sd = summary(u_model)$coef[1, 2]),
      rnorm(1, mean = u_model$coef[2], sd = summary(u_model)$coef[2, 2]),
      rnorm(1, mean = u_model$coef[3], sd = summary(u_model)$coef[3, 2])
    ),
    x_model_coefs = c(
      rnorm(1, mean = x_model$coef[1], sd = summary(x_model)$coef[1, 2]),
      rnorm(1, mean = x_model$coef[2], sd = summary(x_model)$coef[2, 2]),
      rnorm(1, mean = x_model$coef[3], sd = summary(x_model)$coef[3, 2]),
      rnorm(1, mean = x_model$coef[4], sd = summary(x_model)$coef[4, 2]),
      rnorm(1, mean = x_model$coef[5], sd = summary(x_model)$coef[5, 2]),
      rnorm(1, mean = x_model$coef[6], sd = summary(x_model)$coef[6, 2])
    ),
    s_model_coefs = c(
      rnorm(1, mean = s_model$coef[1], sd = summary(s_model)$coef[1, 2]),
      rnorm(1, mean = s_model$coef[2], sd = summary(s_model)$coef[2, 2]),
      rnorm(1, mean = s_model$coef[3], sd = summary(s_model)$coef[3, 2]),
      rnorm(1, mean = s_model$coef[4], sd = summary(s_model)$coef[4, 2]),
      rnorm(1, mean = s_model$coef[5], sd = summary(s_model)$coef[5, 2]),
      rnorm(1, mean = s_model$coef[6], sd = summary(s_model)$coef[6, 2])
    )
  )$estimate
}

Finally, we obtain the ORYX estimate and 95% confidence interval from the distribution of 1,000 odds ratio estimates. As expected, ORYX ~ 2, indicating that we were successfully able to obtain a bias-adjusted estimate in our biased data that approximates the known, unbiased estimate.

# odds ratio estimate
round(median(or), 2)
#> 2.02

# confidence interval
round(quantile(or, c(.025, .975)), 2)
#> 1.93 2.11

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.