Choosing a transformation: Box-Cox in practice

library(shewhartr)
library(ggplot2)

Box & Cox (1964) introduced a one-parameter family of power transformations,

\[ y(\lambda) = \begin{cases} (x^\lambda - 1)/\lambda & \lambda \neq 0 \\ \log(x) & \lambda = 0, \end{cases} \]

defined for \(x > 0\), and a procedure for choosing \(\lambda\) by maximum likelihood. The goal is to find a scale on which the residuals are approximately normal and homoscedastic, which are the assumptions that classical inferential tools, including Shewhart charts, presuppose.
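The piecewise definition above can be sketched in a few lines of base R. This is an illustration only, not the code used inside shewhartr; the function name `box_cox_transform` and the tolerance `tol` are hypothetical.

```r
# Minimal base-R sketch of the Box-Cox transform (not the shewhartr implementation).
# `box_cox_transform` and `tol` are hypothetical names used for illustration.
box_cox_transform <- function(x, lambda, tol = 1e-8) {
  stopifnot(is.numeric(x), all(x > 0))   # the transform needs positive data
  if (abs(lambda) < tol) {
    log(x)                               # limiting case as lambda -> 0
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(0.5, 1, 2, 4)
box_cox_transform(x, 1)     # equals x - 1: a shift only, shape unchanged
box_cox_transform(x, 0.5)   # equals 2 * (sqrt(x) - 1)
box_cox_transform(x, 0)     # equals log(x)
```

Subtracting 1 and dividing by \(\lambda\) is what makes the family continuous at \(\lambda = 0\): for small \(\lambda\) the power branch approaches \(\log(x)\).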

shewhart_box_cox() returns the profile log-likelihood, the maximiser \(\hat \lambda\), and a 95% confidence interval based on the chi-square approximation to twice the log-likelihood drop.
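To make these three quantities concrete, here is a rough base-R sketch of a grid-based profile likelihood for an i.i.d. positive sample. It assumes an intercept-only model, for which the profile log-likelihood is \(-\tfrac{n}{2}\log \hat\sigma^2(\lambda) + (\lambda - 1)\sum \log x_i\) up to a constant. The helper `profile_loglik`, the grid, and the seed are hypothetical choices, and shewhart_box_cox() may differ in its details.

```r
# Sketch of a grid-based Box-Cox profile likelihood (assumption: intercept-only model).
# `profile_loglik` is a hypothetical helper, not part of shewhartr.
profile_loglik <- function(x, lambda) {
  n <- length(x)
  y <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
  sigma2 <- mean((y - mean(y))^2)                     # ML variance estimate
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))   # second term is the Jacobian
}

set.seed(1)
x <- rlnorm(200, meanlog = 0, sdlog = 0.5)
grid <- seq(-2, 2, by = 0.01)
ll <- vapply(grid, function(l) profile_loglik(x, l), numeric(1))

lambda_hat <- grid[which.max(ll)]                     # maximiser on the grid
cutoff <- max(ll) - qchisq(0.95, df = 1) / 2          # allowed drop, about 1.92
ci <- range(grid[ll >= cutoff])                       # 95% interval for lambda
```

The interval collects every \(\lambda\) whose log-likelihood lies within half the 95% chi-square quantile (one degree of freedom) of the maximum.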

A textbook example

set.seed(2025)
y <- rlnorm(200, meanlog = 0, sdlog = 0.5)   # log-normal -> lambda = 0
bc <- shewhart_box_cox(y)
bc
#> 
#> ── Box-Cox profile likelihood ──────────────────────────────────────────────────
#> • n = 200
#> • lambda_hat = 0
#> • 95% CI: [-0.25, 0.2]

The estimate \(\hat \lambda = 0\) corresponds to the log transformation, and the 95% CI \([-0.25, 0.2]\) covers zero, as expected for log-normal data. Let’s plot the profile:

autoplot(bc)

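Read the interval against the conventional powers. If the CI for \(\lambda\) contains 1, there is no evidence that a transformation is needed (the data are approximately normal as is). If it contains 0, take logs. If it contains 0.5, take square roots, and so on. When the interval contains more than one of these values, prefer the simplest: no transformation if 1 is included, otherwise the most interpretable power.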
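This rule of thumb can be written as a small helper. The sketch below is hypothetical and not part of shewhartr: `suggest_power` and its candidate list are illustrative names and choices. The candidates are ordered by preference, so the identity wins whenever 1 lies inside the interval.

```r
# Hypothetical helper: pick the first conventional power inside a CI for lambda.
# Candidates are ordered by preference: identity, log, then roots and reciprocals.
suggest_power <- function(ci, candidates = c(1, 0, 0.5, -0.5, -1)) {
  stopifnot(length(ci) == 2, ci[1] <= ci[2])
  inside <- candidates[candidates >= ci[1] & candidates <= ci[2]]
  if (length(inside) > 0) inside[1] else NA_real_   # NA: fall back to lambda_hat
}

suggest_power(c(-0.25, 0.2))   # 0: take logs (the interval from the example above)
suggest_power(c(0.3, 1.2))     # 1: leave the data untransformed
suggest_power(c(0.3, 0.7))     # 0.5: square root
```

Returning `NA` when no conventional power is covered signals that the estimate itself, rather than a rounded value, is the better guide.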

Interaction with shewhart_regression(model = "auto")

The "auto" model in shewhart_regression() calls shewhart_box_cox() internally on the response (with a +1 shift to keep zeros valid) and selects among linear, log, and loglog according to the value of \(\hat \lambda\).

This is a guidance step, not a guarantee. Always inspect the residual diagnostics afterwards via shewhart_diagnostics().

When not to transform

If your data are counts, proportions, times-to-event, or other quantities with a known parametric family, model that family explicitly. Box was clear about this: if you can model the right distribution, do so. Transforms exist for the case where the right distribution isn’t tractable and a normal approximation on a suitably chosen scale is the best available compromise.

The c, u, p, and np charts in this package implement that advice: they support limits = "poisson" (or "binomial") for exact distribution-aware limits, instead of relying on a transformation to coerce counts into approximate normality.
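To see why distribution-aware limits matter for counts, the base-R sketch below compares 3-sigma limits with Poisson probability limits for a c chart. It illustrates the idea only: the mean count `c_bar` is a made-up value, and the rule that limits = "poisson" applies inside the package may differ from the quantile rule assumed here.

```r
# Sketch: 3-sigma vs. Poisson probability limits for a c chart.
# Assumption: illustration only; the package's limits = "poisson" may use another rule.
c_bar <- 4        # hypothetical mean count per inspection unit
alpha <- 0.0027   # two-sided tail area matching 3-sigma limits under normality

normal_limits <- c(lower = max(0, c_bar - 3 * sqrt(c_bar)),   # truncated at zero
                   upper = c_bar + 3 * sqrt(c_bar))
poisson_limits <- c(lower = qpois(alpha / 2, lambda = c_bar),
                    upper = qpois(1 - alpha / 2, lambda = c_bar))

normal_limits    # lower 0, upper 10
poisson_limits   # lower 0, upper 11
```

With a small mean the Poisson distribution is right-skewed, so the upper quantile limit sits above the symmetric 3-sigma limit and no transformation of the counts is needed to obtain it.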

References