The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Introduction to VizTest: Optimal Confidence Intervals for Visual Testing

William Poirier

David A. Armstrong

2025-03-10

Introduction

The package provides methods for visualizing pairwise comparisons. We propose a method for building inferential confidence intervals that allow readers to visually test whether two estimates are statistically different from each other by evaluating the (non-)overlaps in the inferential confidence intervals. The idea of inferential confidence intervals is relatively simple. First, we imagine that you want to do pairwise tests of estimates at the $\alpha$ level. The normal workflow would be to present $(1-\alpha)\times 100\%$ confidence intervals around each of the estimates. It has been shown by various different scholars Radean (2023) that using (non-)overlaps in the $(1-\alpha)\times 100\%$ confidence intervals to evaluate whether differences are statistically significant or not is problematic and can lead to erroneous inferences. We know that when intervals do not overlap that the estimates are statistically different from each other, but the converse is not necessarily true. We propose that there may be a different confidence interval $(1-\gamma)\times 100\%$ such that whether the intervals overlap or not corresponds perfectly (or as closely as possible) to whether the pairwise tests are statistically significant or not. We call these $(1-\gamma)\times 100\%$ confidence intervals - Inferential Confidence Intervals. Our paper (Armstrong II and Poirier 2025) discusses this method in more detail and our paper in the JOURNAL provides more detail about the implementation of the methods in R (Poirier and Armstrong II 2025).

Below, we walk through a few different examples that should get you started using the package, though the paper in the JOURNAL provides more details.

Descriptive Statistics: Chick Weights

We start with a simple example using the chickwts dataset that is included in base R. This dataset contains weights of chicks (weight) fed with different types of feed (feed). First we re-order the feed variable so that the values will be increasing in average weight. We then calculate average weight, standard deviation of weights, sample size and then calculate the standard error - the square root of the sampling variance.

data(chickwts)
chickwts$feed <- reorder(chickwts$feed, 
                         chickwts$weight, 
                         FUN=mean)
chick_preds <- chickwts %>% 
  group_by(feed) %>% 
  summarise(ave_weight = mean(weight),
            n = n(),
            sd = sd(weight)) %>%
  mutate(se = sd/sqrt(n))
chick_est <- chick_preds$ave_weight
names(chick_est) <- chick_preds$feed

We can make the results amenable to using the viztest() function by using make_vt_data(), which, in this case, requires estimates and either a vector of variances or a variance-covariance matrix in variances. We use the default, type="est_var", though there is a simulation option available.

chick_vt_data <- make_vt_data(estimates = chick_est, variances = chick_preds$se^2, type="est_var")

We then use the viztest() function from the package to calculate the optimal inferential confidence intervals.

chick_vt <- viztest(chick_vt_data, include_zero = FALSE)
chick_vt
#> 
#> Correspondents of PW Tests with CI Tests
#>   level psame     pdiff        easy  method
#> 1  0.78     1 0.7333333 -10.5743601  Lowest
#> 2  0.82     1 0.7333333  -2.3688874  Middle
#> 3  0.87     1 0.7333333 -10.0834571 Highest
#> 4  0.83     1 0.7333333  -0.1097695 Easiest
#> 
#> All  15  tests properly represented for by CI overlaps.

Here, we see that any confidence level between $78\%$ and $87\%$ will yield inferential confidence intervals whose (non-)overlaps correspond perfectly with the pairwise tests. Our papers Poirier and Armstrong II (2025) discuss how these levels are calculated in more detail and why we choose the one labeled “Easiest”, which we argue makes the comparisons easiest to distinguish visually. We could use the plot method on our viztest object:

plot(chick_vt)

Chick Weights with Inferential COnfidence Intervals (wide gray bars) and standard 95% confidence interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not statistically different from each other. If the gray bars do not overlap, the pair of estimates are statistically different from each other.

Coefficient Plot: Wooldridge GPA data.

Using the gpa1 data from the package, we can predict university GPA (colGPA) with variables skipped (average classes skipped per week), alcohol (average number of days per week drinking alcohol), PC (has a personal computer at school), male (male binary variable), car (has a car binary variable), and job20 (works more than 20 hours per week binary variable). We fit the model below:

data(gpa1, package="wooldridge")
model1 <- lm(colGPA~skipped+alcohol+PC+male+car+job20,data=gpa1)
summary(model1)
#> 
#> Call:
#> lm(formula = colGPA ~ skipped + alcohol + PC + male + car + job20, 
#>     data = gpa1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.95171 -0.23883 -0.02712  0.21319  0.82124 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.15592    0.08386  37.634  < 2e-16 ***
#> skipped     -0.09116    0.02996  -3.043  0.00282 ** 
#> alcohol      0.04004    0.02451   1.634  0.10470    
#> PC           0.13380    0.06318   2.118  0.03602 *  
#> male        -0.04708    0.06394  -0.736  0.46282    
#> car         -0.13098    0.07223  -1.813  0.07203 .  
#> job20       -0.02573    0.08146  -0.316  0.75261    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3536 on 134 degrees of freedom
#> Multiple R-squared:  0.1364, Adjusted R-squared:  0.09777 
#> F-statistic: 3.529 on 6 and 134 DF,  p-value: 0.002832

The viztest() function will take any object that has a coef() and vcov() method, so we can input the model directly if we want to make a coefficient plot. There is an include_zero option as well as an include_intercept option that can be set to TRUE or FALSE. By default, the function returns standard confidence intervals (e.g., $95\%$) along with the appropriate visual testing intervals. The standard intervals will capture tests relative to zero, so unless you are only presenting the visual testing intervals, you may want to set include_zero=FALSE.

gpa_vt <- viztest(model1, test_level = 0.05, 
                  range_levels = c(0.25,0.99), 
                  level_increment = 0.01, 
                  include_zero=FALSE)

gpa_vt
#> 
#> Correspondents of PW Tests with CI Tests
#>   level psame     pdiff         easy  method
#> 1  0.73     1 0.3333333 -0.039414304  Lowest
#> 2  0.78     1 0.3333333 -0.009945377  Middle
#> 3  0.84     1 0.3333333 -0.028853018 Highest
#> 4  0.80     1 0.3333333 -0.002000690 Easiest
#> 
#> All  15  tests properly represented for by CI overlaps.

Any level between $73\%$ and $84\%$ would work, but $80\%$ will be the easiest to interpret visually. If we want a bit more control over plotting the results, we can have the plot method return the data and then make a custom plot ourselves. This would allow us to order the estimates by size if they are not that way naturally.

plot(gpa_vt, make_plot = FALSE) %>%
  mutate(vbl = factor(vbl), 
         vbl = reorder(vbl, est)) %>%
  ggplot(aes(x = est, y = vbl)) +
  geom_vline(xintercept = 0, linetype="dashed", color="black") +
  geom_linerange(aes(xmin = lwr, xmax = upr, color="Optimal Visual Testing Intervals (80%)", linewidth="Optimal Visual Testing Intervals (80%)")) +
  geom_linerange(aes(xmin = lwr_add, xmax = upr_add, color="Standard Confidence Interval (95%)", linewidth="Standard Confidence Interval (95%)")) +
  geom_point(size=3) +
  scale_color_manual(values=c("gray75", "black")) + 
  scale_linewidth_manual(values=c(3.5, .5)) + 
  labs(x = "Coefficient Estimate", y = "",
       color = "", linewidth="") +
  theme_classic() + 
  theme(legend.position="top", 
        legend.box="vertical")

GPA Model Coefficients with Inferential Confidence Intervals (80%, wide gray bars) and standard 95% confidence interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not statistically different from each other. If the gray bars do not overlap, the pair of estimates are statistically different from each other.

If you were trying to make a forest plot, where the symbol shapes and sizes varied by some other variable, you could merge the relevant data into the data output from plot(..., make_plot=FALSE) and map the size and shape aesthetics in geom_point() to the appropriate variables.

Predicted Probabilities: Titanic Data

Any quantity whose uncertainty can be appropriately captured could be subjected to the viztest() function. Consider the TitanicSurvival data from the package. Here, we model survival with a binomial GLM controlling for passenger class and the interaction of sex and age category. We generate predictions for the combinations of gender and age groups using the avg_predictions() function from the package.

data(TitanicSurvival, package="carData")
NewTitanicSurvival <- TitanicSurvival %>% 
  mutate(ageCat = case_when(age <= 10 ~ "0-10",
                            age > 10 & age <=18 ~ "11-18",
                            age > 18 & age <=30 ~ "19-30",
                            age > 30 & age <=40 ~ "31-40",
                            age > 40 & age <=50 ~ "41-50",
                            age >50 ~ "51+"))
# The model
model3 <- glm(survived~sex*ageCat+passengerClass,
              data=NewTitanicSurvival,family = binomial(link="logit"))

Using the default settings in avg_predictions() we get results on the response scale with standard errors using the delta method. We could then do normal theory tests on the predicted probabilities using the viztest() function. Further, we have two options here - use all estimates as input to the viztest() function in which case all pairwise tests will be conducted - both across and within gender groups. The effects could also be calculated independently and then viztest() executed individually for each group. In that case, only within-group comparisons would be valid. We show both ways below.

## all pairwise tests
mes <- avg_predictions(model3, 
                variables = list(ageCat = levels(NewTitanicSurvival$ageCat),
                                 sex=levels(NewTitanicSurvival$sex)))

all_vt <- viztest(mes, include_zero=FALSE)
plot_data <- plot(all_vt, make_plot=FALSE) %>% 
  dplyr::select(est, lwr, upr, lwr_add, upr_add) 
mes <- mes %>% 
  as.data.frame() %>% 
  left_join(plot_data, by = join_by(estimate == est))

ggplot(mes, aes(x=estimate, y = ageCat)) + 
  geom_linerange(aes(xmin = lwr, xmax = upr, color="Optimal Visual Testing Intervals (91%)", linewidth="Optimal Visual Testing Intervals (91%)")) +
  geom_linerange(aes(xmin = lwr_add, xmax = upr_add, color="Standard Confidence Intervals (95%)", linewidth="Standard Confidence Intervals (95%)")) +
  geom_point(size=3) + 
  scale_color_manual(values=c("gray75", "black")) +
  scale_linewidth_manual(values=c(3.5, .5)) + 
  facet_grid(sex~.) +
  theme_classic() + 
  theme(legend.position="top") + 
  labs(x="Predicted Pr(Survival)", y="Age Category",
       color="", linewidth="")

Predicted Probabilities of Survival on the Titanic with Inferential Confidence Intervals (91%, wide gray bars) and standard 95% confidence interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not statistically different from each other. If the gray bars do not overlap, the pair of estimates are statistically different from each other. Tests were done using normal-theory on the response scale with standard errors computed by the delta method. Comparisons both within and across panels are valid.

Likely, the response scale is the wrong scale on which to do normal-theory tests. Instead, we should probably do them on the link scale. Changing the type argument in avg_predictions() to "link" will do this for us. Then, we can run viztest() again and the tests will be on the appropriate scale. The results do change a bit here. All tests are accounted for by the (non-)overlaps, but on the response scale, the range of valid testing intervals is from $84\%$ to $95\%$. On the link scale, the range is from $82\%$ to $93\%$.

## all pairwise tests
mes_link <- avg_predictions(model3, 
                variables = list(ageCat = levels(NewTitanicSurvival$ageCat),
                                 sex=levels(NewTitanicSurvival$sex)), 
                type="link")

all_vt <- viztest(mes_link, include_zero=FALSE)

We could subject the estimates as well as the lower and upper confidence interval bounds to the inverse logit transform to get from the link scale to the response scale. This could be accomplished with either plogis() because we know this was a binary logistic regression, or more generally for the GLM object model3, we could use model3$family$linkinv(). Here, we use plogis().

plot_data <- plot(all_vt, make_plot=FALSE) %>% 
  dplyr::select(est, lwr, upr, lwr_add, upr_add) 
mes_resp <- mes_link %>% 
  as.data.frame() %>% 
  left_join(plot_data, by = join_by(estimate == est)) %>%
  mutate(across(c(estimate, lwr, upr, lwr_add, upr_add), plogis))

Then, we could make the plot. Notice, the non-linear transformation into the response space makes the confidence intervals on that scale asymmetric - especially those closer to the bounds of 0 and 1. In this case, we do the tests on the appropriate scale and then transform the estimates and the bounds into the response space.

ggplot(mes_resp, aes(x=estimate, y = ageCat)) + 
  geom_linerange(aes(xmin = lwr, xmax = upr, color="Optimal Visual Testing Intervals (88%)", linewidth="Optimal Visual Testing Intervals (88%)")) +
  geom_linerange(aes(xmin = lwr_add, xmax = upr_add, color="Standard Confidence Intervals (95%)", linewidth="Standard Confidence Intervals (95%)")) +
  geom_point(size=3) + 
  scale_color_manual(values=c("gray75", "black")) +
  scale_linewidth_manual(values=c(3.5, .5)) + 
  facet_grid(sex~.) +
  theme_classic() + 
  theme(legend.position="top") + 
  labs(x="Predicted Pr(Survival)", y="Age Category",
       color="", linewidth="")

Predicted Probabilities of Survival on the Titanic with Inferential Confidence Intervals (88%, wide gray bars) and standard 95% confidence interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not statistically different from each other. If the gray bars do not overlap, the pair of estimates are statistically different from each other. Tests were done using normal-theory on the link scale. Comparisons both within and across panels are valid.

Another alternative here would be to use the simulation method. We could get simulation output using the inferences() function from the package.

mes_sim <- avg_predictions(model3, 
                variables = list(ageCat = levels(NewTitanicSurvival$ageCat),
                                 sex=levels(NewTitanicSurvival$sex))) %>% 
  inferences(method="simulation", R=1000)

We can extract the posterior draws from the inferences object and then use make_vt_data() with type="sim" to prepare the data for viztest(). In the output below, you can see that using the simulation method, the range between $84\%$ and $93\%$ gives inferential credible intervals whose (non-)overlaps correspond perfectly with the Bayesian pairwise tests. In this framework, two estimates are credibly different if the posterior probability that the estimate with the larger posterior mean is larger than the estimate with the smaller posterior mean is greater than $1-\alpha$.

post <- posterior_draws(mes_sim) %>% dplyr::select(drawid, draw, ageCat, sex) %>% 
  pivot_wider(names_from = "drawid", values_from = "draw", names_prefix="draw")
sim <- post %>% dplyr::select(starts_with("draw")) %>% as.matrix() %>% t()
colnames(sim) <- paste0("b", 1:ncol(sim))
sim_vt_data <- make_vt_data(sim, type="sim")
all_vt_sim <- viztest(sim_vt_data, include_zero=FALSE)
all_vt_sim
#> 
#> Correspondents of PW Tests with CI Tests
#>   level psame     pdiff         easy  method
#> 1  0.83     1 0.5909091 -0.001443650  Lowest
#> 2  0.83     1 0.5909091 -0.001443650  Middle
#> 3  0.84     1 0.5909091 -0.003722143 Highest
#> 4  0.83     1 0.5909091 -0.001443650 Easiest
#> 
#> All  66  tests properly represented for by CI overlaps.

We make the plot data below. We select the relevant variables and ensure that the data frame is sorted in the original order. This works here because the vbl variable is b1-b12 and we can sort by the numeric part of that string. We then bind the plot data to the original mes_sim data frame for plotting.

plot_data <- plot(all_vt_sim, make_plot=FALSE) %>% 
  dplyr::select(vbl, est, lwr, upr, lwr_add, upr_add) %>% 
  arrange(as.numeric(gsub("b", "", vbl)))
mes_sim <- mes_sim %>% 
  as.data.frame() %>% 
  bind_cols(plot_data)

Finally, we can make the plot.

ggplot(mes_sim, aes(x=estimate, y = ageCat)) + 
  geom_linerange(aes(xmin = lwr, xmax = upr, color="Optimal Visual Testing Intervals (89%)", linewidth="Optimal Visual Testing Intervals (89%)")) +
  geom_linerange(aes(xmin = lwr_add, xmax = upr_add, color="Standard Credible Intervals (95%)", linewidth="Standard Credible Intervals (95%)")) +
  geom_point(size=3) + 
  scale_color_manual(values=c("gray75", "black")) +
  scale_linewidth_manual(values=c(3.5, .5)) + 
  facet_grid(sex~.) +
  theme_classic() + 
  theme(legend.position="top") + 
  labs(x="Predicted Pr(Survival)", y="Age Category",
       color="", linewidth="")

Predicted Probabilities of Survival on the Titanic with Inferential Credible Intervals (89%, wide gray bars) and standard 95% credible interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not credibly different from each other. If the gray bars do not overlap, the pair of estimates are credibly different from each other. Tests were done using quasi-Bayesian simulation with quantile credible intervals on the response scale. Comparisons both within and across panels are valid.

In all three Titanic examples above, comparisons both within and across panels are valid because we used all 12 estimates as input to viztest(). This means that the (non-)overlaps of the 66 possible pairs of intervals correspond perfectly with the pairwise tests. In some cases, this may be either unnecessary or because of some idiosyncrasies of the data, no interval perfectly captures the tests across groups, but the within-group tests can be perfectly captured (and the within-group tests are of greatest interest). You can generate these by making separate predictions for each group and subjecting each group’s predictions to viztest(). Below is a workflow that should work generally.

First, make a vector of values of the group variable.

sex_vals <- c("female", "male")

Next, use a loop (either a for loop, lapply() or map()) to generate the predictions for each group. We use lapply() below.

preds <- lapply(sex_vals, \(s){
  avg_predictions(model3, newdata=datagrid(sex = s, grid_type="counterfactual"), 
                  variables="ageCat")
})

Next, we can use another loop to subject each element of the preds list to viztest(). To make our lives a bit easier, we will save the predictions and name them so that plot() called on the object will use those same names. We use lapply() again.

names(preds) <- sex_vals
group_vt <- lapply(preds, \(p){
  est <- p$estimate
  names(est) <- p$ageCat
  d <- make_vt_data(est, vcov(p), type="est_var")
  viztest(d, include_zero=FALSE)
})

We can see the results of the individual viztest() runs by printing group_vt:

group_vt
#> $female
#> 
#> Correspondents of PW Tests with CI Tests
#>   level psame pdiff          easy  method
#> 1  0.45     1     0 -0.0007714616  Lowest
#> 2  0.72     1     0 -0.0444245993  Middle
#> 3  0.99     1     0 -0.1797110060 Highest
#> 4  0.45     1     0 -0.0007714616 Easiest
#> 
#> All  15  tests properly represented for by CI overlaps.
#> 
#> $male
#> 
#> Correspondents of PW Tests with CI Tests
#>   level psame     pdiff          easy  method
#> 1  0.84     1 0.5333333 -0.0345304940  Lowest
#> 2  0.89     1 0.5333333 -0.0112390027  Middle
#> 3  0.95     1 0.5333333 -0.0323924883 Highest
#> 4  0.91     1 0.5333333 -0.0004843863 Easiest
#> 
#> All  15  tests properly represented for by CI overlaps.

We could then get the plot data for both objects, bind them together and merge with the original estimates data.

plot_data <- lapply(names(group_vt), \(g){
  d <- plot(group_vt[[g]], make_plot=FALSE, level = ifelse(g == "female", "median", "ce")) %>% 
    dplyr::select(vbl, est, lwr, upr, lwr_add, upr_add) 
  d
})
names(plot_data) <- sex_vals
plot_data <- bind_rows(plot_data, .id="sex") %>% 
  rename(ageCat = vbl)

preds <- preds %>% 
  bind_rows(.id="sex") %>% 
  as.data.frame() %>% 
  left_join(plot_data)

Finally, we could make the plot:

ggplot(preds, aes(x=estimate, y = ageCat)) + 
  geom_linerange(aes(xmin = lwr, xmax = upr, color="Optimal Visual Testing Intervals (91%)", linewidth="Optimal Visual Testing Intervals (91%)")) +
  geom_linerange(aes(xmin = lwr_add, xmax = upr_add, color="Standard Confidence Intervals (95%)", linewidth="Standard Confidence Intervals (95%)")) +
  geom_point(size=3) + 
  scale_color_manual(values=c("gray75", "black")) +
  scale_linewidth_manual(values=c(3.5, .5)) + 
  facet_grid(sex~.) +
  theme_classic() + 
  theme(legend.position="top") + 
  labs(x="Predicted Pr(Survival)", y="Age Category",
       color="", linewidth="")

Predicted Probabilities of Survival on the Titanic with Inferential Confidence Intervals (wide gray bars, 91% for males and 73% for females) and standard 95% confidence interval (thin black line). If the gray bars overlap, it indicates the pair of estimates whose intervals overlap are not statistically different from each other. If the gray bars do not overlap, the pair of estimates are statistically different from each other. Tests were done using normal-theory on the response scale with standard errors computed by the delta method. Intervals were built so comparisons within, but not across panels are valid.

More examples are available in the accompanying papers Armstrong II and Poirier (2025) as well as in the package documentation.

References

Afshartous, David, and Richard A. Preston. 2010. “Confidence Intervals for Dependent Data: Equating Non-Overlap with Statistical Significance.” Computational Statistics and Data Analysis 54: 2296–2305.

Armstrong II, David A., and William Poirier. 2025. “Decoupling Visualization and Testing When Presenting Confidence Intervals.” Political Analysis 33: 252–57. https://doi.org/doi:10.1017/pan.2024.24.

Browne, Richard H. 1979. “On Visual Assessment of the Significance of a Mean Difference.” Biometrics 35 (3): 657–65.

Goldstein, Harvey, and Michael J. R. Healey. 1995. “The Graphical Presentation of a Collection of Means.” Journal of the Royal Statistical Society, Series A 158 (1): 175–77.

Payton, Mark E., Matthew H. Greenstone, and Nathaniel Schenker. 2003. “Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance?” Journal of Insect Science 3 (1): 34.

Poirier, William, and David A. Armstrong II. 2025. “VizTest: Optimal Confidence Intervals for Visual Testing.” Working Paper.

Radean, Marius. 2023. “The Significance of Differences Interval: Assessing the Statistical and Substantive Difference Between Two Quantities of Interest.” Journal of Politics 85 (3): 969–83.

Schenker, Nathanial, and Jane F. Gentleman. 2001. “On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals.” The American Statistician 55 (3).

Tukey, John. 1991. “The Philosophy of Multiple Comparisons.” Statistical Science 6 (1): 100–116.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.