This vignette is meant to alleviate some common misunderstandings of mixed-subjects IRT calibration. While the core idea of augmenting human data with LLM-generated data and estimating a \(\lambda\) value to tune how much you want the LLM-generated to contribute is straightforward, the nuances of the ability-risk objective are not. In this vignette, we will go into the data requirements for mixed-subject calibrations and the ability-risk objective itself to help set users up for success. If there is one major takeaway from this document, it is that the tuning parameter \(\lambda\) is not asking whether LLM-generated responses look human in the aggregate. It is, instead, asking whether an LLM-based response-generation procedure can predict the row-level human response structure enough to reduce downstream ability-estimation error. The most important object is therefore the paired prediction matrix, \(P\).
Mixed-subjects IRT requires three item response matrices. Let \(J\) be the number of items, \(n\) be the number of observed human respondents, and \(N\) be the number of additional generated respondents.
This is the real human pilot calibration response matrix, with structure
\[O \in \{0,1\}^{n \times J}.\]
Each row is the full observed response string from one human respondent. Each column corresponds to an item. Each entry is an observed, dichotomously scored, response.
This is the LLM-predicted response matrix for the same human respondents in \(O\), with structure
\[P \in \{0,1\}^{n \times J}.\]
Pay special attention to the requirement of “same human respondents.” Row \(i\) in \(P\) must be predicted responses for the respondent in row \(i\) in \(O\). Column \(j\) in \(P\) must correspond to column \(j\) in \(O\). This is the diagnostic matrix. It tells the method whether the LLM response-generation procedure is informative about the human response process and defines the magnitude and confidence in the mixed-subjects correction.
Importantly, this means that when generating \(P\), users need to transmit both the content of the items being responded to and some sort of information about the rows of \(O\) to the LLM generating the predicted responses. This can be in the form of narrative or covariate information about the respondents, held out item responses, or something else.
This is the larger synthetic or LLM-generated response matrix, with structure
\[G \in \{0,1\}^{N \times J}.\]
Typically, \(N \gg n\) to maximize the potential improvement in post-calibration ability estimation precision. The rows in \(G\) are not paired with rows in \(O\), but instead additional generated respondents. However, the crucial requirement is that \(G\) is meant to be sampled from the same distribution as \(P\), meaning it should be created using a procedure that is as close as possible to the procedure used to create \(P\) and some amount of information about the ability distribution in \(P\).
A useful way to remember the design is:
| Matrix | Shape | Rows | Purpose |
|---|---|---|---|
| \(O\) | \(n \times J\) | observed human respondents | anchor the calibration target to the human population |
| \(P\) | \(n \times J\) | LLM predictions for those same human respondents | estimate how the LLM procedure agrees with or deviates from human responses |
| \(G\) | \(N \times J\) | additional LLM-generated rows | provide extra precision derived from synthetic information after correction |
Another helpful rule of thumb is that if \(P\) is not row-aligned with \(O\), you should expect \(\lambda \to 0\).
The recommended scalar-\(\lambda\)
workflow in mixedsubjectsirt uses a marginal maximum
likelihood (MML) objective of the form
\[\hat\gamma_\lambda =\arg\min_\gamma\big[L_O^{\mathrm{marg}}(\gamma)+\lambda\big(L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\big)\big].\]
Here \(\gamma = \{a_1, \ldots, a_J, d_1, \ldots, d_J \}\) is the vector of item parameters. For a 2PL model, we write item \(j\)’s response probability as
\[p_j(\theta;\gamma_j) =\mathrm{logit}^{-1}(d_j + a_j \theta),\]
where \(d_j\) is the intercept and \(a_j\) is the discrimination.
The three pieces of the objective each have different jobs in the mixed-subjects loss: \[L_O^{\mathrm{marg}}(\gamma)\] is the usual human-data marginal IRT likelihood evaluated at \(\gamma\). This term anchors the calibration to humans. \[L_G^{\mathrm{marg}}(\gamma)\] is the likelihood contribution from the additional generated response rows. \[L_P^{\mathrm{marg}}(\gamma)\] is the correction term. It estimates what the same LLM-generation procedure says about the humans whose actual responses are observed.
The term \(L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}\) is the prediction-powered correction. The generated matrix \(G\) adds information, while the paired prediction matrix \(P\) allows us to estimate the LLM procedure’s bias, noise, and covariance with the human response process.
The tuning parameter \(\lambda\) weights how much the LLM-generated responses are allowed to contribute to parameter estimation. When \(\lambda = 0\),
\[ L_O^{\mathrm{marg}}(\gamma) + \lambda \{L_G^{\mathrm{marg}}(\gamma)-L_P^{\mathrm{marg}}(\gamma)\} = L_O^{\mathrm{marg}}(\gamma), \]
so the method falls back to human-only calibration. When \(\lambda > 0\), the method borrows information from \(G\), corrected by \(P\).
This means \(\lambda=0\) is not a failure of the software, instead it is a protective outcome that occurs when the LLM responses do not reduce expected ability-estimation error. A high \(\lambda\) requires more than plausible synthetic response rows. It requires the paired LLM predictions in \(P\) to track the human responses in \(O\) in a way that is consistent across respondents.
There are two related (but distinct) tuning ideas in the package. The
original PPI++ paper minimizes the standard errors of the estimated
parameters (the trace of the covariance matrix). The function
tune_lambda_ppi_score() implements this.
The function tune_lambda_ability_risk() asks a more
practical psychometric question: Which value of \(\lambda\) minimizes expected downstream
ability-estimation error?
The approximate target is \[ \widehat R(\lambda) = \frac{1}{M} \sum_{m=1}^{M} g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m, \] where:
\(g_m^\top \widehat\Sigma_{\gamma,\lambda} g_m\) thus propagates item-parameter uncertainty and covariance structure into ability-estimation uncertainty.
This matters because ability-risk tuning is not the same as minimizing average item-parameter standard errors. The off-diagonal elements of the item-parameter covariance matrix now matter, describing how uncertainty in one item parameter moves with uncertainty in another. Some covariance patterns may cancel out for ability scoring; others may distort the scale in high-information regions. Ability-risk tuning weights these covariance patterns by their downstream impact on \(\hat\theta\), averaged over an expected ability distribution.
The easiest way to understand \(\lambda\) is to compare three cases.
Suppose
\[ P = O. \]
Here, the paired LLM prediction is exactly equal to the human response matrix. From the perspective of maximizing the contribution of \(G\) to estimation, this is the best possible version of \(P\). In this case, \(\lambda\) should be large, though not necessarily \(\lambda = 1\). The finite-\(N\) benchmark is
\[ \lambda_{\max} = \frac{1}{1+n/N} = \frac{N}{n+N}. \]
If \(n=400\) and \(N=1200\), we see
\[ \lambda_{\max} = \frac{1200}{400+1200}=0.75. \]
So even when \(P=O\), if \(N \not\gg n\), \(\lambda < 1\) should be expected.
Now suppose
\[ P = \mathrm{shuffle\_rows}(O). \]
The marginal item means are identical. The total-score distribution is identical. The item difficulty information is identical. The only thing that’s changed is row \(i\) in \(P\) is no longer a prediction for row \(i\) in \(O\). This should produce \(\lambda \approx 0\), because the row-aligned covariance structure has been destroyed. To observe this for both \(\lambda\) tuning objectives, you can run:
predicted_shuffled <- observed[sample(nrow(observed)), ]
lambda_shuffled_ppi <- tune_lambda_ppi_score(
observed = observed,
predicted = predicted_shuffled,
item_pars = human_fit$pars,
n_generated = nrow(generated)
)
lambda_shuffled_ppi$lambda
lambda_shuffled_ability_risk <- tune_lambda_ability_risk(
observed = observed,
predicted = predicted_shuffled,
item_pars = human_fit$pars,
n_generated = nrow(generated)
)
lambda_shuffled_ability_risk$lambdaSuppose both \(O\) and \(P\) are generated from the same IRT model:
\[ O_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\}, \]
\[ P_{ij} \mid \theta_i \sim \mathrm{Bernoulli}\{p_j(\theta_i)\}. \]
This is “same DGP,” but \(P\) is still a fresh stochastic response. It is not the same as \(O\). The two matrices share person ability and item parameters, but they do not share response noise.
For a single item,
\[ \operatorname{Cov}(O_{ij},P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)]. \]
But
\[ \operatorname{Var}(P_{ij}) = \operatorname{Var}_\theta[p_j(\theta)] + \mathbb{E}_\theta[p_j(\theta)\{1-p_j(\theta)\}]. \]
The Bernoulli noise in \(P\) dilutes the control-variate signal. As a result, a fresh same-DGP draw may produce a modest \(\lambda\), not a large one. This distinction is important: Merely producing the same item parameters does not imply strong paired prediction.
A useful \(P\) matrix has to predict row-level response structure. A good \(P\) should have these properties:
When you have a human response matrix \(O\), an approach to build \(P\) is leave-one-item-out response prediction. For each respondent \(i\) and item \(j\):
If you have construct-relevant covariates, you can build \(P\) without using other item responses or augment the LOO-response prediction approach outlined above.
Example covariates include:
Item-text-only generation usually predicts column properties (like item parameters and relative item spacings), not row-aligned responses. Row-aligned responses are important because they allow the method to link the underlying ability distributions between human respondents and LLM-generated respondents. This is why an item-text-only approach may produce synthetic data that looks plausible in aggregate, but still produces \(\lambda=0\).
The generated matrix \(G\) should be produced using a procedure that mirrors the procedure used to produce \(P\).
If \(P\) is generated with leave-one-item-out prompts, then \(G\) should also be generated with leave-one-item-out or masked-item prompts.
One possible procedure:
The most important rule is that \(G\) and \(P\) should be generated by the same prediction mechanism. If \(P\) is generated with row-aligned covariates and response history, but \(G\) is generated by asking the LLM to invent full response strings from scratch, then \(L_G^{\mathrm{marg}} - L_P^{\mathrm{marg}}\) may not be zero in expectation, producing asymptotically biased parameter estimates.
Mixed-subjects IRT is useful when the LLM response-generation procedure captures respondent-level response structure. The generated matrix \(G\) matters, but the paired matrix \(P\) is what lets the method learn whether \(G\) should be trusted. The key understanding is that \(G\) supplies extra synthetic rows, but \(P\) tells us how much to trust them.
If \(P\) is row-aligned and predictive of \(O\), \(\lambda\) can be positive and the generated data can improve calibration. If \(P\) only reproduces marginal item difficulty, or if its rows are not aligned with \(O\), then \(\lambda\) should shrink toward zero. This is a feature, not a bug.
The Choosing Lambda vignette explains which tuning function to use and when. This section derives the mathematics those functions implement: how item-parameter uncertainty is propagated into ability scores, and why minimizing the resulting ability risk is a fundamentally different objective from the original PPI++ trace criterion.
Throughout, let \(\gamma = \{a_1, \dots,
a_J, d_1, \dots, d_J\}\) collect the \(2J\) item parameters of a \(J\)-item 2PL model, ordered as all
discriminations followed by all intercepts. This is the ordering
convention used by package functions like fit$par,
vcov(), and ability_gradient(). The item
response function is again
\[ p_{j}(\theta;\gamma_j) \;=\; \Pr(Y_j = 1 \mid \theta) \;=\; \operatorname{logit}^{-1}\!\big(d_j + a_j \theta\big). \]
Ability-risk tuning chains four quantities together:
Tuning chooses \(\lambda\) to minimize that average. The sections below build up each link in turn.
The mixed-subjects estimator minimizes a PPI++-style combined objective over human (observed), paired-predicted, and generated responses,
\[ L_\lambda(\gamma) \;=\; L_O(\gamma) \;+\; \lambda\,\big[L_G(\gamma) - L_P(\gamma)\big], \]
where each \(L\) is a marginal (or Bock–Aitkin expected-count) negative log-likelihood. The estimator \(\hat\gamma(\lambda)\) solves the estimating equation
\[ \Psi_\lambda(\gamma) \;=\; \psi_O(\gamma) + \lambda\,\big[\psi_G(\gamma) - \psi_P(\gamma)\big] \;=\; 0, \qquad \psi = \nabla_\gamma L. \]
Setting \(\lambda = 0\) recovers the human-only calibration; \(\lambda > 0\) borrows strength from the LLM responses while the \(-\psi_P\) term de-biases that contribution at the population level (the PPI++ correction). The per-person score contributions to \(\psi\) are, for item \(j\),
\[ s_{ij}^{a} = (y_{ij} - \bar p_{ij})\,\bar\theta_i, \qquad s_{ij}^{d} = (y_{ij} - \bar p_{ij}), \]
evaluated under the posterior over \(\theta\).
\(\hat\gamma(\lambda)\) has asymptotic covariance of the form
\[ \Sigma_\gamma(\lambda) \;=\; A_\lambda^{-1}\, B_\lambda\, A_\lambda^{-1}, \]
with bread \(A_\lambda = \mathbb{E}[\nabla_\gamma \Psi_\lambda]\) and meat \(B_\lambda = \operatorname{Cov}(\Psi_\lambda)\).
Bread. Normally we would combine the three Hessians block by block,
\[ A_\lambda \;=\; H_O + \lambda\,\big(H_G - H_P \big), \]
where each \(H\) is the appropriate
Hessian. Since he MML estimator marginalizes over \(\theta\), its bread must use Louis’s (1982)
observed-information identity \(A^{\text{marg}} = H^{\text{comp}} -
I^{\text{miss}}\), subtracting the missing information
louis_missing_info() from the complete-data Hessian. Using
the complete-data Hessian alone would overstate efficiency. This is what
the package computes by default.
Meat. The meat is the covariance of the labeled correction plus the independent generated contribution,
\[ B_\lambda \;=\; \frac{1}{n}\operatorname{Cov}\!\big(S_{\text{obs}} - \lambda S_{\text{pred}}\big) \;+\; \frac{\lambda^2}{N}\operatorname{Cov}\!\big(S_{\text{gen}}\big), \]
with \(n\) labeled and \(N\) generated subjects. The
vcov() S3 method dispatches automatically.
Given item parameters \(\gamma\), the bounded maximum-likelihood ability estimate for response pattern \(y_i\) solves the scoring equation
\[ S(\theta;\gamma, y_i) \;=\; \sum_{j} a_j\,\big(y_{ij} - p_j(\theta)\big) \;=\; 0, \]
which score_theta() finds by 1-D optimization on the
interval bounds. The risk machinery needs the sensitivity
of that solution \(\hat\theta_i\) to
the item parameters, \(g_i = \partial
\hat\theta_i / \partial \gamma\). Because \(\hat\theta_i\) is defined
implicitly by \(S(\hat\theta_i;
\gamma) = 0\), the implicit function theorem gives
\[ \frac{\partial \hat\theta_i}{\partial \gamma_k} \;=\; -\,\Big(\frac{\partial S}{\partial \theta}\Big)^{-1} \frac{\partial S}{\partial \gamma_k}. \]
The denominator is the (negative) test information at \(\hat\theta_i\),
\[ \frac{\partial S}{\partial \theta} \;=\; -\sum_j a_j^2\, p_j(1 - p_j), \]
and the numerators, for the discrimination and intercept of item \(j\), are
\[ \frac{\partial S}{\partial a_j} = (y_{ij} - p_j) - a_j\, p_j(1 - p_j)\,\hat\theta_i, \qquad \frac{\partial S}{\partial d_j} = -\,a_j\, p_j(1 - p_j). \]
Where the gradient is undefined. The
implicit-function argument requires an interior optimum. At a
boundary estimate (all-correct or all-incorrect patterns push \(\hat\theta_i\) to a bound), the score
equation does not hold and the gradient is theoretically undefined;
ability_gradient() returns NA for those rows,
and they drop out of the risk average via na.rm = TRUE.
Rows with vanishing test information \((|\partial S/\partial\theta| <
\varepsilon)\) are treated the same way.
With \(g_i\) in hand, the delta method propagates item-parameter uncertainty into the score:
\[ \rho_i(\lambda) = \operatorname{Var}\big(\hat\theta_i\big) \;\approx\; g_i'\, \Sigma_\gamma(\lambda)\, g_i. \]
Averaging over a target population of \(M\) response patterns gives the scalar objective that the tuners minimize,
\[ R(\lambda) \;=\; \frac{1}{M}\sum_{i=1}^{M} \rho_i(\lambda) \;=\; \mathbb{E}_{\text{target}}\!\big[\, g'\,\Sigma_\gamma(\lambda)\, g \,\big]. \]
The expectation is over the target population’s ability
distribution, which is why target_resp matters: tuning for
the observed calibration sample (target_resp = observed)
and tuning for a separate operational scoring population generally give
different \(\lambda\).
The original PPI++ tuning rule minimizes the trace of the item-parameter covariance,
\[ \lambda^{\star}_{\text{PPI}} = \arg\min_\lambda \operatorname{Tr}\big[\Sigma_\gamma(\lambda)\big], \]
which tune_lambda_ppi_score() evaluates in closed form.
Writing \(\Sigma_\gamma =
(\sigma_{kl})\), the two objectives expand as
\[ \operatorname{Tr}(\Sigma_\gamma) = \sum_k \sigma_{kk}, \qquad g'\Sigma_\gamma g = \sum_{k} g_k^2\,\sigma_{kk} + \sum_{k \neq l} g_k g_l\,\sigma_{kl}. \]
The trace sees only the diagonal variances and weights every parameter equally. The ability risk weights each variance by \(g_k^2\) (how much that particular parameter moves the score) and uses the off-diagonal covariances \(\sigma_{kl}\), which encode the scale/identification structure of the 2PL. Errors in \(a_j\) and \(d_j\) that are correlated in a direction that leaves \(\hat\theta\) unchanged are penalized by the trace but (correctly) ignored by the ability risk. The two criteria therefore select different \(\lambda\) in general; use the ability risk for operational scoring and the trace as a theoretical diagnostic.
| Symbol | Meaning | Computed by |
|---|---|---|
| \(\hat\gamma(\lambda)\) | Mixed-subjects item parameters | fit_mixed_subjects_mml() |
| \(\Sigma_\gamma(\lambda)\) | Sandwich covariance of \(\hat\gamma\) | vcov() → vcov_mixed_subjects_mml() |
| \(\hat\theta_i\) | Bounded ML ability score | score_theta() |
| \(g_i = \partial\hat\theta_i/\partial\gamma\) | Implicit ability gradient | ability_gradient() |
| \(g_i'\Sigma_\gamma g_i\) | Propagated score variance | ability_risk() |
| \(R(\lambda) = \mathbb{E}[g'\Sigma_\gamma g]\) | Ability-risk objective | tune_lambda_ability_risk() |
For the practical workflow built on these pieces — cross-fitting, target populations, and the choice between estimators — see the Choosing Lambda vignette.