Causal Conditional Distance Correlation

Eric W. Bridgeford

2025-01-07

require(causalBatch)
require(ggplot2)
require(tidyr)
n = 200

To start, we will work through a simulation example similar to those in the simulations vignette, which you can access with:

vignette("cb.simulations", package="causalBatch")

# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="", 
                     xlabel="Covariate",
                     ylabel="Outcome (1st dimension)") {
  data = data.frame(Y1=Ys[,1], Y2=Ys[,2], 
                    Group=factor(Ts, levels=c(0, 1), ordered=TRUE), 
                    Covariates=Xs)
  
  data %>%
    ggplot(aes(x=Covariates, y=Y1, color=Group)) +
    geom_point() +
    labs(title=title, x=xlabel, y=ylabel) +
    scale_x_continuous(limits = c(-1, 1)) +
    scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"), 
                       name="Group/Batch") +
    theme_bw()
}

sim = cb.sims.sim_sigmoid(n=n, eff_sz=1, unbalancedness=1.5)

plot.sim(sim$Ys, sim$Ts, sim$Xs, title="Sigmoidal Simulation")

Although the covariate distributions for the two groups/batches do not overlap perfectly (note that unbalancedness is not \(1\)), the outcomes of the two batches still appear to differ slightly. We can test this using the causal conditional distance correlation, like so:

result <- cb.detect.caus_cdcorr(sim$Ys, sim$Ts, sim$Xs, R=100)

Here, we set the number of null replicates R to \(100\) so that the example runs quickly; in practice, you should typically use at least \(1000\) null replicates. To speed up the computation, we suggest setting num.threads close to the number of cores available on your machine, which you can find using parallel::detectCores().
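As a minimal sketch of the advice above, the snippet below picks a thread count from parallel::detectCores(). The choice to leave one core free and the fallback to a single thread are assumptions made for illustration, not recommendations from the package; the final call is shown commented out and assumes num.threads is passed as a named argument.

```r
# detectCores() can return NA on some platforms, so fall back to 1 core
n.cores <- parallel::detectCores()
if (is.na(n.cores)) n.cores <- 1L

# assumption: leave one core free for the rest of the system
num.threads <- max(1L, n.cores - 1L)

# hypothetical usage with more null replicates (not run here):
# result <- cb.detect.caus_cdcorr(sim$Ys, sim$Ts, sim$Xs, R=1000,
#                                 num.threads=num.threads)
```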

print(sprintf("p-value: %.4f", result$Test$p.value))
#> [1] "p-value: 0.0099"

Since the \(p\)-value is below the significance level \(\alpha\) (for example, a conventional choice such as \(\alpha = 0.05\)), we reject the null hypothesis in favor of the alternative; that is, that the group/batch causes a difference in the outcome variable.
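The reported value of 0.0099 also illustrates why the number of null replicates matters. The sketch below assumes the common permutation-test convention of adding one to both the numerator and the denominator; whether causalBatch computes its \(p\)-value exactly this way is an assumption, and perm.pvalue and the uniform null statistics are hypothetical stand-ins. Under that convention, the smallest attainable \(p\)-value with R replicates is \(1/(R+1)\).

```r
# hypothetical helper: permutation p-value with the "+1" correction
perm.pvalue <- function(observed, null.stats) {
  (sum(null.stats >= observed) + 1) / (length(null.stats) + 1)
}

set.seed(1)
R <- 100
null.stats <- runif(R)  # stand-in null statistics, all below 1
observed <- 2           # an observed statistic exceeding every null draw

p <- perm.pvalue(observed, null.stats)
print(sprintf("p-value: %.4f", p))
#> [1] "p-value: 0.0099"
```

With R set to \(1000\), the same situation would give a floor of \(1/1001\), which allows finer resolution of small \(p\)-values.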

We could optionally have pre-computed a distance matrix for the outcomes, like so:

# compute distance matrix for outcomes
DY = dist(sim$Ys)

In your own use cases, you could replace this distance function with any distance function of your choosing and pass the resulting distance matrix directly to the detection algorithm by specifying distance=TRUE:

result <- cb.detect.caus_cdcorr(DY, sim$Ts, sim$Xs, distance=TRUE, R=100)
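As one illustration of swapping in a different distance, the sketch below computes Manhattan distances with base R's dist() on a small random matrix. The matrix Ys.example is a hypothetical stand-in for sim$Ys, and the choice of Manhattan distance is only an example; the commented-out call mirrors the one above.

```r
set.seed(1)
# stand-in outcomes: 10 samples, 2 dimensions
Ys.example <- matrix(rnorm(20), nrow=10, ncol=2)

# pairwise Manhattan (L1) distances between the rows
DY.manhattan <- dist(Ys.example, method="manhattan")

# hypothetical usage with the simulated data (not run here):
# DY <- dist(sim$Ys, method="manhattan")
# result <- cb.detect.caus_cdcorr(DY, sim$Ts, sim$Xs, distance=TRUE, R=100)
```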
