You are planning an item response theory (IRT) study and you need an
answer to one question: how many examinees do I need?
Power-analysis formulas exist for simple designs, but real assessments
combine multiple items, multiple parameters per item, missing-data
mechanisms, and — increasingly — model misspecification you cannot fully
characterize a priori. irtsim answers the question by
simulation: you specify a plausible data-generating model, sweep across
candidate sample sizes, fit the estimation model many times, and report
the sample size at which a chosen performance criterion (mean squared
error, bias, coverage, …) crosses a target threshold.
The package implements the 10-decision framework from Schroeders & Gnambs (2025). This vignette walks you through the abridged version: pick a design, pick sample sizes, run, interpret.
Three function calls, three S3 objects:
irt_design() → irt_study() → irt_simulate() → summary() / plot()

irt_design() holds the data-generating model; irt_study() adds the
conditions (sample sizes, missing data); irt_simulate() runs the Monte
Carlo iterations; summary() and plot() report the performance criteria
and recommendations.
Every step is immutable: you can re-use a design across
many study objects, and re-run irt_simulate()
without rebuilding upstream state.
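For example, the same design can feed several study grids without being
rebuilt (a sketch; design_helper is the design object constructed later
in this vignette):

# one design, two study grids — nothing upstream is recomputed
study_pilot <- irt_study(design = design_helper, sample_sizes = c(100, 250))
study_full  <- irt_study(design = design_helper, sample_sizes = c(500, 1000, 2000))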
irt_design() takes three required arguments: the IRT
model ("1PL", "2PL", or
"GRM"), the number of items, and a list of true item
parameters. There are three common ways to supply the parameters.
If you have specific values in mind — from a content blueprint, a prior pilot, or a paper you are replicating — pass them directly.
design_byhand <- irt_design(
model = "2PL",
n_items = 10,
item_params = list(
a = c(0.8, 1.0, 1.1, 1.2, 1.3, 0.9, 1.4, 1.0, 1.2, 1.1),
b = seq(-2, 2, length.out = 10)
)
)
design_byhand
#> IRT Design
#> Model: 2PL
#> Items: 10 items
#> Theta dist: normal
#> Factors: 1
#> a range: [0.8, 1.4]
#> b range: [-2, 2]

For a typical I/O or education assessment, you usually want
discriminations drawn from a lognormal and difficulties spanning the
trait range. irt_params_2pl() does this in one line:
set.seed(2026)
ip <- irt_params_2pl(
n_items = 10,
a_mean = 0, a_sd = 0.25, # log-normal: median a = 1
b_mean = 0, b_sd = 1,
b_range = c(-2, 2)
)
design_helper <- irt_design(
model = "2PL",
n_items = 10,
item_params = ip
)

Use irt_params_grm() for graded-response items.
If you have already calibrated a similar instrument, treat the prior
estimates as the truth for planning purposes. mirt::LSAT7
ships with mirt and gives a clean, fast worked example.
prior_data <- mirt::expand.table(mirt::LSAT7)
prior_fit <- mirt::mirt(prior_data, 1, "2PL", verbose = FALSE)
co <- mirt::coef(prior_fit, IRTpars = TRUE, simplify = TRUE)$items
design_prior <- irt_design(
model = "2PL",
n_items = nrow(co),
item_params = list(a = co[, "a"], b = co[, "b"])
)
co
#> a b g u
#> Item.1 0.9879254 -1.8787456 0 1
#> Item.2 1.0808847 -0.7475160 0 1
#> Item.3 1.7058006 -1.0576962 0 1
#> Item.4 0.7651853 -0.6351358 0 1
#> Item.5 0.7357980 -2.5204102 0 1

For the rest of this vignette we use design_helper, a
generic 2PL with 10 items.
irt_study() adds the things that vary across the
simulation grid: the sample sizes you want to compare, optionally a
missing-data mechanism, and optionally an estimation model that differs
from the data-generating model (model misspecification studies). For a
no-missing-data planning question, two arguments are enough.
study <- irt_study(
design = design_helper,
sample_sizes = c(100, 250, 500, 1000)
)
study
#> IRT Study
#> Model: 2PL
#> Items: 10
#> Sample sizes: 100, 250, 500, 1000
#> Missing data: none (complete data)

The four sample sizes span a typical planning range: 100 is small for a 10-item 2PL; 1000 should be ample. The simulation will give us the curve in between.
irt_simulate() is where the work happens: for each
(sample_size, iteration) cell, it generates data under
design, fits the estimation model, and stores parameter
estimates. Two arguments control runtime: iterations (more
iterations → tighter Monte Carlo standard errors, longer runtime) and
parallel (off by default; turn on for production-scale
runs).
A fixed seed is required, so every simulation is
reproducible. progress = FALSE suppresses the
cli progress bar so the vignette renders cleanly. For real studies,
leave it on.
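A sketch of the call that produces the results object used in
the rest of this vignette (iterations, parallel, and progress
are the arguments described above; the name of the seed argument is an
assumption):

results <- irt_simulate(
  study,
  iterations = 50,   # deliberately small; see the note below
  seed = 2026,       # assumed argument name; a fixed seed is required
  progress = FALSE   # keep the rendered vignette clean
)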
iterations = 50 is small — chosen here to keep the
vignette build fast. For production planning use 500–1000 iterations (or
use irt_iterations() to compute the count needed for a
target Monte Carlo standard error). Set parallel = TRUE to
dispatch iterations across future workers; reproducibility
is preserved within mode.
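The target-MCSE arithmetic is the usual Monte Carlo one: the standard
error of a simulation-based mean shrinks with the square root of the
iteration count, so the count needed is roughly (SD / target MCSE)^2. A
back-of-the-envelope check (a sketch; the SD value is a planning guess,
and irt_iterations() may use a different interface):

sd_estimate <- 0.3    # anticipated SD of the estimates across iterations (assumption)
mcse_target <- 0.01   # desired Monte Carlo standard error
ceiling((sd_estimate / mcse_target)^2)
#> [1] 900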
summary() returns one row per
(sample_size, item, parameter) combination, with all
performance criteria attached.
res_summary <- summary(results,
criterion = c("mse", "bias", "rmse", "coverage"))
head(res_summary$item_summary)
#> sample_size item param true_value mse bias rmse
#> 1 100 1 a 1.1389961 0.32611950 0.065844261 0.5710687
#> 2 100 1 b -0.4082147 0.08559994 -0.023437985 0.2925747
#> 3 100 2 a 0.7634385 0.40789706 0.006876364 0.6386682
#> 4 100 2 b -0.7304333 22.44647061 0.678859473 4.7377706
#> 5 100 3 a 1.0354225 0.22133080 0.117301867 0.4704581
#> 6 100 3 b -0.2214366 0.07274307 -0.091348376 0.2697092
#> coverage n_converged
#> 1 0.9791667 48
#> 2 0.9166667 48
#> 3 0.9375000 48
#> 4 0.7708333 48
#> 5 0.9791667 48
#> 6 0.9583333 48

Plot the criterion of interest against sample size to see where the
curve flattens. Each line is one item’s b (difficulty) MSE
trajectory.
The dashed horizontal line is the planning threshold (here 0.05 — a common default for parameter recovery). Items whose lines cross below the threshold are adequately recovered at that sample size; items still above need a larger N.
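A sketch of the plotting call behind such a figure (the criterion,
param, and threshold argument names are assumptions inferred from the
description above, not a documented signature):

plot(results, criterion = "mse", param = "b", threshold = 0.05)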
recommended_n() reads off the smallest sample size at
which the criterion crosses the threshold. The default rolls the
per-item recommendations up to a single number (the maximum, so no item
is left under-powered):
n_rec <- recommended_n(res_summary,
criterion = "mse",
threshold = 0.05,
param = "b")
#> Warning: No tested sample size meets mse <= 0.05 for some item/param combinations.
#> ℹ Affected: (item 5, param b) and (item 6, param b)
#> ℹ Aggregate returned as NA. Inspect `attr(result, "details")` for per-item
#> values.
n_rec
#> [1] NA
#> attr(,"details")
#> item param recommended_n criterion threshold
#> 1 1 b 250 mse 0.05
#> 2 2 b 500 mse 0.05
#> 3 3 b 250 mse 0.05
#> 4 4 b 250 mse 0.05
#> 5 5 b NA mse 0.05
#> 6 6 b NA mse 0.05
#> 7 7 b 250 mse 0.05
#> 8 8 b 250 mse 0.05
#> 9 9 b 500 mse 0.05
#> 10 10 b 250 mse 0.05
#> attr(,"aggregate")
#> [1] "max"
#> attr(,"criterion")
#> [1] "mse"
#> attr(,"threshold")
#> [1] 0.05

The scalar return is the headline answer; here it is NA because items 5
and 6 never meet the mse threshold at any tested sample size. The details
attribute preserves the per-item table so you can inspect which items
drove the recommendation:
attr(n_rec, "details")
#> item param recommended_n criterion threshold
#> 1 1 b 250 mse 0.05
#> 2 2 b 500 mse 0.05
#> 3 3 b 250 mse 0.05
#> 4 4 b 250 mse 0.05
#> 5 5 b NA mse 0.05
#> 6 6 b NA mse 0.05
#> 7 7 b 250 mse 0.05
#> 8 8 b 250 mse 0.05
#> 9 9 b 500 mse 0.05
#> 10 10 b 250 mse 0.05

If you want a less conservative summary, pass
aggregate = "mean" or aggregate = "median". To
get the legacy per-item data frame back, use
aggregate = "none".
The paper-example-* vignettes walk through the worked examples from Schroeders & Gnambs (2025).

Pass missing = "mcar", "mar", "booklet", or "linking" to irt_study() to add a missing-data mechanism. See ?irt_study and the paper-example-2-mcar vignette.

Pass estimation_model to irt_study() to fit a different model than the one you generated under. See paper-example-1b-misspecification.

Pass criterion_fn to summary() to compute a user-defined performance criterion alongside the built-ins.

Set parallel = TRUE in irt_simulate() and configure a future::plan() for the workers. Reproducibility is preserved within mode.

Schroeders, U., & Gnambs, T. (2025). Sample size planning for item response models: A tutorial for the quantitative researcher. Methodology, 21(1), 1–28. https://doi.org/10.1177/25152459251314798