Validation: is your power number trustworthy?

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

A power estimate is only as trustworthy as the test it is built on. If a design/analysis combination does not control its Type I error rate, a high “power” number is meaningless — the test is just rejecting too often, under the null as well as the alternative. mixpower makes this check a first-class step.

Calibrate: does the test hold its alpha?

mp_calibrate() runs the scenario under the null (the focal effect set to zero) and estimates the empirical Type I error rate, with an exact interval and a verdict. For a correctly specified model it should sit near alpha.

d <- mp_design(clusters = list(subject = 24), trials_per_cell = 6)
a <- mp_assumptions(
  fixed_effects = list(`(Intercept)` = 0, condition = 0.4),
  random_effects = list(subject = list(intercept_sd = 0.5)),
  residual_sd = 1
)
scn <- mp_scenario_lme4(y ~ condition + (1 | subject), design = d, assumptions = a)

mp_calibrate(scn, nsim = 60, seed = 11)
#> <mp_calibration>
#>   term:    condition (true effect set to 0)
#>   nominal alpha: 0.05
#>   empirical Type I: 0.0500  (95% CI 0.0104, 0.1392)
#>   verdict: well-calibrated

Catching a misspecified model

The classic trap: the data have a by-subject random slope, but the analysis model omits it. This inflates Type I error (Barr et al., 2013). mp_calibrate() flags it.

a_slope <- mp_assumptions(
  fixed_effects = list(`(Intercept)` = 0, condition = 0.4),
  random_effects = list(subject = list(
    intercept_sd = 0.5, slopes = list(condition = 0.8)
  )),
  residual_sd = 1
)
# Data have the slope; the fitted model (1 | subject) ignores it.
scn_mis <- mp_scenario_lme4(y ~ condition + (1 | subject), design = d, assumptions = a_slope)

mp_calibrate(scn_mis, nsim = 60, seed = 7)
#> <mp_calibration>
#>   term:    condition (true effect set to 0)
#>   nominal alpha: 0.05
#>   empirical Type I: 0.1667  (95% CI 0.0829, 0.2852)
#>   verdict: anti-conservative
#>   -> This test rejects too often under the null; its power is not
#>      trustworthy. Try a df-corrected or bootstrap test, or a model
#>      that matches the data-generating random-effects structure.

The verdict turns to anti-conservative: any power computed from this model would be inflated. The fix is to fit the maximal model (1 + condition | subject), which mixpower simulates and tests directly.

Recommend: which inference method?

Wald (z/t) tests are anti-conservative when the number of clusters is small, because their degrees of freedom are overstated (Luke, 2017). mp_recommend_method() gives fast, design-based guidance.

scn_few <- mp_scenario_lme4(
  y ~ condition + (1 | subject),
  design = mp_design(list(subject = 12), trials_per_cell = 8),
  assumptions = a
)
mp_recommend_method(scn_few)
#> <mp_recommendation>
#>   current method: wald
#>   smallest grouping factor: 12 levels
#>   caution: yes
#>   recommended: kenward-roger, satterthwaite, pb
#>   Smallest grouping factor has 12 levels (< 30). 'wald' tests can be anti-conservative with few clusters; prefer 'kenward-roger' or 'satterthwaite' or 'pb'. Verify with mp_calibrate().

With few clusters it steers you toward a degrees-of-freedom-corrected test (Satterthwaite or Kenward-Roger for LMMs) or a parametric bootstrap. You can then measure the improvement with mp_calibrate():

scn_kr <- mp_scenario_lme4(
  y ~ condition + (1 | subject),
  design = mp_design(list(subject = 12), trials_per_cell = 8),
  assumptions = a, test_method = "kenward-roger"
)
mp_calibrate(scn_kr, nsim = 40, seed = 3)$type1
#> [1] 0.075

Reporting

Calibration results drop into the same reporting layer as power results:

mp_report_table(mp_calibrate(scn, nsim = 40, seed = 1))
#>        term alpha type1     ci_low   ci_high         verdict nsim
#> 1 condition  0.05 0.075 0.01574218 0.2038647 well-calibrated   40

Run mp_calibrate() whenever you change the design size or the analysis model. A trustworthy power analysis reports both the power and the calibration that backs it.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.