IRT Development: State of Play

Where the modelling journey is now, and how the pieces fit together

1 Executive summary

The screener is currently compared against M0, a simple raw-score benchmark: score earned across scorable responses divided by the scorable-response denominator. M0 is transparent, familiar, and easy to compute; it is not assumed to be psychometrically best.
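As a minimal sketch of the M0 convention, the ratio can be written as follows. It assumes dichotomous 0/1 item scores, so that the scorable-response denominator is a simple count, and a hypothetical per-row `scorable` flag; neither the function name nor the field names come from the actual pipeline.

```python
from typing import Optional, Sequence


def m0_raw_score(scores: Sequence[float], scorable: Sequence[bool]) -> Optional[float]:
    """Score earned on scorable responses divided by the number of scorable responses.

    Returns None when no response is scorable, because the ratio is undefined.
    (Hypothetical helper: assumes 0/1 item scores.)
    """
    if len(scores) != len(scorable):
        raise ValueError("scores and scorable must have the same length")
    earned = sum(s for s, ok in zip(scores, scorable) if ok)
    denominator = sum(1 for ok in scorable if ok)
    if denominator == 0:
        return None
    return earned / denominator


# Five items: three scorable (two of them correct), two not scorable.
example_m0 = m0_raw_score([1, 0, 1, 0, 0], [True, True, True, False, False])
```

Because only scorable rows enter the denominator, two students with the same number correct can receive different M0 values when their scorable counts differ, which is the completion/reach confound noted for M0.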

The best IRT model developed to date is A2 (accuracy with probe/testlet effects). A2 is the IRT development baseline; it is not used for live scoring. The clearest finding from the achievement side is that a single broad numeracy score with probe effects is cleaner and more interpretable than a battery of separate ability scores (see Accuracy Modelling).

The response-process and speed work is not a dead end. Raw speed and raw completion are not safe as adjustments to achievement scores, but guarded response-process profiles may become useful after validity, fairness, stability, logging, and monitoring evidence gates are cleared.

The joint accuracy–response-time models (J0/J1/J2) are an active research line. J0 and J1 pass basic two-chain diagnostics and can be summarised as research evidence. The original J2 is now labelled J2a and treated as legacy sensitivity evidence only, because it uses a global RT <= 2s rapid flag. The J2 Foundation fit is interpretable as sensitivity evidence; the J2 Year 1 fit failed diagnostics and is not usable. The preferred next candidate is J2b, which uses family/item rapid thresholds, but it has not yet been fit. None of the joint models is decision evidence.

Operational implication: no change to live scoring or live bands.

2 What the screener and IRT pipeline are doing

The numeracy screener contains several short probes. Students answer items that differ in content, format, and timing demands. Item response theory (IRT) is used to separate student achievement from item difficulty and from task-specific dependencies. The current development programme then asks whether response-process evidence (reach, completion, nonresponse states, and response time) can add useful context without contaminating the achievement score.
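For orientation only, a generic Rasch-type testlet form (in the spirit of Wang & Wilson, 2005) illustrates the kind of separation described above. The symbols are assumptions introduced for illustration; this is not the A2 likelihood, parameterisation, or prior specification.

```latex
\[
\Pr\bigl(X_{pi} = 1 \mid \theta_p, b_i, \gamma_{p\,d(i)}\bigr)
  = \frac{\exp\bigl(\theta_p - b_i + \gamma_{p\,d(i)}\bigr)}
         {1 + \exp\bigl(\theta_p - b_i + \gamma_{p\,d(i)}\bigr)}
\]
```

Here \(\theta_p\) is the achievement of student \(p\), \(b_i\) is the difficulty of item \(i\), and \(\gamma_{p\,d(i)}\) is a person-by-probe effect for the probe \(d(i)\) that contains item \(i\). The probe effect absorbs dependence among items within the same probe so that it is not attributed to \(\theta_p\).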

3 Method in brief

Each model branch tests a different measurement claim. The diagram below shows how the branches relate; the master status table in Section 4 makes each branch’s purpose, current status, and decision explicit.

flowchart TD
    M0["<b>M0</b><br/>Raw-score benchmark"]:::baseline
    A1["<b>A1</b><br/>IRT accuracy foundation"]:::superseded
    A2["<b>A2</b><br/>Accuracy + probe effects<br/><i>IRT development baseline</i>"]:::baseline
    A3["<b>A3</b><br/>Completion-sensitivity<br/><i>Research-only</i>"]:::research
    A4["<b>A4</b><br/>Response-state audit<br/><i>QC/context only</i>"]:::research
    T1b["<b>T1b</b><br/>Speed (pooled)"]:::research
    T1c["<b>T1c</b><br/>Speed (family-specific)"]:::research
    J0["<b>J0</b><br/>Joint accuracy–speed"]:::research
    J1["<b>J1</b><br/>Joint + family speed"]:::research
    J2["<b>J2a</b><br/>Legacy rapid-response<br/><i>Y1 failed diagnostics</i>"]:::failed
    J2b["<b>J2b</b><br/>Planned family/item rapid thresholds<br/><i>not fit</i>"]:::research

    M0 --> A1 --> A2
    A2 --> A3
    A2 --> A4
    A2 --> T1b
    A2 --> T1c
    T1c --> J0 --> J1 --> J2 --> J2b

    classDef baseline fill:#d6f0d8,stroke:#2f7a36,color:#1c3a1f;
    classDef superseded fill:#eef0f2,stroke:#90969d,color:#444,stroke-dasharray: 3 3;
    classDef research fill:#fdecc8,stroke:#a3741a,color:#3a2c0a;
    classDef failed fill:#f6d3d0,stroke:#8b3a3a,color:#3a1414;

Figure 1: Model family. Status colour: green = current benchmark/baseline; grey (dashed border) = superseded; amber = research-only; red = not ready.

Across all branches, success is not better fit or greater complexity. A model advances only if it improves the validity, precision, fairness, or instructional usefulness of screening decisions.

4 Master status table

This table replaces the earlier model-overview, model-purpose, current-judgements, and operational-position tables. The validity argument (key assumption + main threat) is kept as a separate table below.

Code	Label	What it models	Status	Key threat	Decision	What would change it
M0	Raw-score benchmark	Score earned divided by scorable-response denominator. No IRT calibration.	Established benchmark	Denominator confounds completion/reach.	Retain as comparison benchmark.	Documented denominator/band evidence that survives fairness and reach checks.
A1	IRT accuracy foundation	IRT accuracy likelihood; separates student achievement from item difficulty.	Superseded by A2	Ignores probe/testlet local dependence.	Not used; A2 supersedes.	—
A2	Accuracy + probe effects	Adds probe/testlet random effects so local dependence is absorbed rather than mistaken for achievement.	Established (development)	Form effects, drift, unjustified band movement vs M0.	IRT development baseline; not live scoring.	Classification gain over M0 near cut-points without fairness or comparability harms.
A3	Completion-sensitivity	Adds a latent valid/scorable-response dimension correlated with achievement.	Partial	Missingness mechanism ambiguous; mixes reach, blanks, rapid placeholders.	Research-only.	Clear mechanism + classification gain + no subgroup burden.
A4	Response-state audit	Row-level response-state classification (valid, trailing nonreach, RT+ nonvalid, etc.).	Partial	Over-interpretation as cognition or exposure.	QC/context only.	Independent reach/exposure logging; states stable by form/device.
T1b	Speed-only (pooled)	Interval-censored lognormal RT model on valid positive-RT timed-math rows; pooled speed.	Pending	Speed entangled with completion, modality, device.	Research-only; under validation.	Diagnostics + fairness + decision value beyond achievement alone.
T1c	Speed-only (family)	As T1b, with separate speed factors per probe family.	Pending	Family-speed entanglement with valid-rate and trailing-rate (large).	Research-only; under validation.	Disentangling of speed from valid/trailing rates; family-speed reproducibility.
J0	Joint accuracy–speed	Joint likelihood for accuracy and RT with shared person distribution.	Partial (research)	Diagnostic fragility; limited temporal generalisation.	Research evidence only; not decisions.	Decision-relevant gain over A2 + temporal generalisation.
J1	Joint + family speed	Joint model with family-specific speed factors.	Partial (research)	As J0 plus family-speed identification.	Research evidence only; not decisions.	As J0 plus family-speed identification holding under sensitivity checks.
J2a	Legacy joint + rapid-response	Legacy J2 sensitivity model with global RT <= 2s rapid flag.	Legacy sensitivity; not ready (Y1)	Global RT <= 2s rapid rule is provisional; Y1 fit failed (544 divergences; E-BFMI 0.029; max R-hat 12.39).	Legacy sensitivity only; not decision evidence.	Only retained for comparability; not a promotion path unless rapid definition is replaced.
J2b	Planned family/item rapid-threshold model	Planned successor using family/item rapid thresholds; not yet fit.	Planned; not fit	Not yet fit; threshold and family specification must be validated.	Planned shadow candidate; no evidence yet.	Fit J2b with family/item thresholds, clean diagnostics, external/protection/fairness evidence beyond A2.
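The J2a Year 1 diagnostics reported in the table can be checked against an explicit gate. The sketch below is illustrative: the default thresholds (zero divergences, E-BFMI of at least 0.3, R-hat of at most 1.01) are commonly used conventions assumed here, not gate values stated by this project, and the class and function names are hypothetical. It also assumes the reported E-BFMI is the lowest value across chains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FitDiagnostics:
    divergences: int  # post-warmup divergent transitions, summed over chains
    min_ebfmi: float  # lowest E-BFMI across chains (assumed meaning)
    max_rhat: float   # largest R-hat across monitored parameters


def diagnostic_failures(
    d: FitDiagnostics,
    max_divergences: int = 0,    # assumed convention, not a project gate
    ebfmi_floor: float = 0.3,    # assumed convention, not a project gate
    rhat_ceiling: float = 1.01,  # assumed convention, not a project gate
) -> List[str]:
    """Return a human-readable list of failed checks; an empty list means all passed."""
    failures = []
    if d.divergences > max_divergences:
        failures.append(f"divergences={d.divergences} > {max_divergences}")
    if d.min_ebfmi < ebfmi_floor:
        failures.append(f"E-BFMI={d.min_ebfmi} < {ebfmi_floor}")
    if d.max_rhat > rhat_ceiling:
        failures.append(f"max R-hat={d.max_rhat} > {rhat_ceiling}")
    return failures


# Values reported for the J2a Year 1 fit in the master status table.
j2a_year1 = FitDiagnostics(divergences=544, min_ebfmi=0.029, max_rhat=12.39)
j2a_year1_failures = diagnostic_failures(j2a_year1)
```

Under these assumed thresholds all three checks fail for the J2a Year 1 fit, consistent with its "not ready" status.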

5 Five-domain evidence framework

Domain	Main question	Examples of evidence
1. Construct and interpretation	What does the model measure, and is that interpretation defensible?	Achievement vs completion vs speed; local dependence; response-process mechanism.
2. Technical adequacy	Can the model be estimated and trusted for the stated purpose?	Model diagnostics, fit/PPC, item stability, reliability/precision near cut-points.
3. Screening usefulness	Does it improve or preserve risk identification?	Sensitivity, specificity, false negatives, false positives, band movement/stability; AUC where useful.
4. Fairness and comparability	Is the signal construct-relevant and stable across forms, cohorts, modes, devices, and subgroups?	DIF, differential prediction/classification, form effects, link error, mode/device effects, subgroup burden.
5. Reporting and actionability	Can the result be explained and used safely by teachers?	Clear teacher action, misuse warnings, monitoring, and no faster-is-better interpretation.

External outcomes such as PAT Maths, teacher ratings, and later outcomes are useful but imperfect validation anchors. ENSSA is designed as an early numeracy risk screener focused on foundational number sense and response-process evidence, whereas broad standardised achievement tests measure wider curriculum performance under different timing, instructional, and response-demand conditions. Model evaluation therefore uses converging evidence, not a single external gold standard.

6 Validity argument snapshot

Claim	Intended use	Key assumption	Current status	Main threat	Decision
M0 provides a raw-score comparison benchmark	Comparison benchmark	Scorable-response denominator is a documented benchmark choice for current comparisons.	Current benchmark	Denominator confounds completion/reach.	Retain as benchmark
A2 supports broad achievement interpretation	IRT development baseline	Probe effects are nuisance structure, not separate reportable traits.	Partial / strongest IRT evidence	Local dependence, form effects, or unjustified band movement.	Use as development baseline
A3 should not alter achievement scoring	Research-only	Completion is separate from achievement unless decision evidence proves otherwise.	Not supported for scoring	Missingness mechanism ambiguous.	Research-only
A4 supports QC/context	Audit/context	Row states are process proxies, not traits.	Partial	Over-interpretation as cognition or exposure.	QC/context only
T1b/T1c estimate effective speed	Research-only profiles	RT signal is not mostly device/completion artefact.	Pending	Construct-irrelevant speed and modality effects.	Research-only
J0/J1/J2a/J2b improve decisions	Research-only	Joint model adds decision value beyond simpler models, passes diagnostics, and generalises beyond the current Term 3/4 validation slice.	Reviewed: J0/J1 research-only; J2a legacy only; J2b planned/not fit	Complexity, diagnostic fragility, and limited temporal generalisation.	Not decision evidence; J2a legacy only; J2b not yet fit

7 Benchmark and classification logic

A high correlation between M0 and A2 is expected because both are based on the same scored item responses. It is not proof that A2 is better. The key question is where A2 disagrees with M0 and whether those disagreements are psychometrically and operationally justified: for example, whether movement is explained by item difficulty, probe/testlet dependence, form or item-mix differences, improved precision near cut-points, or removal of local dependence.

For a screener, model promotion requires decision-relevant improvement. Public reporting should focus on a compact classification set: sensitivity, specificity, false negatives, false positives, and band movement/stability; AUC may be included where useful. False negatives are a major validity risk because genuinely at-risk students may miss support, but false positives also matter because intervention resources are finite.
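The compact classification set can be computed from a screener flag and an outcome label. This is a minimal sketch: the function name and the boolean inputs (`flagged` for "screener places the student in a risk band", `at_risk` for "the criterion says the student is at risk") are illustrative assumptions, and the example data are invented.

```python
from typing import Dict, Sequence


def screening_metrics(flagged: Sequence[bool], at_risk: Sequence[bool]) -> Dict[str, float]:
    """Confusion-matrix counts plus sensitivity and specificity for a binary screen."""
    if len(flagged) != len(at_risk):
        raise ValueError("flagged and at_risk must have the same length")
    tp = fp = tn = fn = 0
    for f, r in zip(flagged, at_risk):
        if f and r:
            tp += 1  # correctly flagged
        elif f and not r:
            fp += 1  # over-flagged: consumes intervention resources
        elif not f and r:
            fn += 1  # missed risk: student may not receive support
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "true_positives": tp,
        "false_positives": fp,
        "true_negatives": tn,
        "false_negatives": fn,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Invented toy data: six students.
metrics = screening_metrics(
    flagged=[True, True, False, False, True, False],
    at_risk=[True, False, True, False, True, False],
)
```

Reporting the raw false-negative and false-positive counts alongside the rates keeps both kinds of error visible when comparing a candidate model against M0.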

Requirement	Promotion convention
Classification gain	Any gain must be practically meaningful, not just non-zero.
Specificity protection	Sensitivity gains must not create unacceptable over-flagging.
False-negative protection	Target risk groups must not experience increased missed-risk rates.
Band movement	Students moved between bands must show plausible criterion or psychometric evidence.
Fairness	No subgroup, form, device, or administration group should bear disproportionate burden.
Cut-point precision	Precision gains should matter near risk thresholds, not only on average.
Complexity	More complex models must justify added burden through decision value.
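Band movement between two scoring approaches can be summarised as a cross-tabulation of band pairs. The sketch below assumes each student already has a band label under both M0 and a candidate model; the function names and example data are hypothetical, and no cut scores are implied.

```python
from collections import Counter
from typing import Sequence

# Band order as used in this document.
BAND_ORDER = ("Very Low", "Low", "On Track")


def band_movement(m0_bands: Sequence[str], candidate_bands: Sequence[str]) -> Counter:
    """Count students in each (M0 band, candidate band) cell."""
    if len(m0_bands) != len(candidate_bands):
        raise ValueError("band sequences must have the same length")
    for band in (*m0_bands, *candidate_bands):
        if band not in BAND_ORDER:
            raise ValueError(f"unknown band: {band!r}")
    return Counter(zip(m0_bands, candidate_bands))


def movement_rate(table: Counter) -> float:
    """Share of students whose band differs between the two approaches."""
    total = sum(table.values())
    moved = sum(n for (old, new), n in table.items() if old != new)
    return moved / total if total else float("nan")


# Invented toy data: one of four students moves from Low to Very Low.
movement_table = band_movement(
    ["Very Low", "Low", "Low", "On Track"],
    ["Very Low", "Very Low", "Low", "On Track"],
)
example_movement_rate = movement_rate(movement_table)
```

The off-diagonal cells identify the students whose movement needs criterion or psychometric justification; the same table can be built within forms, devices, or subgroups where cell sizes allow.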

8 Known content gaps (next revision)

These are explicit content gaps still to be written into the public pages. Listing them here keeps the gap visible rather than hidden behind “Pending” cells.

A2 specification block. Likelihood, parameterisation, and priors for the IRT development baseline, in the same style as the J0/J1/J2 block on the Response Time Modelling page.
A3 specification block. Likelihood, parameterisation, priors, and a precise definition of the internal reliability-like index used to compare A3 against A2.
Number-line PCM sensitivity evidence. Category occupancy and conclusion-stability tables across candidate number-line binning policies; see Number-Line PCM Policy.
Cut scores, decision consistency, and test information near the cut points. Including the threshold-setting method.
T1c entanglement headline. Estimated family speed correlates strongly with valid-rate (0.51–0.71) and trailing-rate (−0.63 to −0.79). This belongs near the top of the Response Time Modelling page, with a partial-correlation paragraph.
J0/J1/J2 nesting diagram and a small J2 Year 1 diagnostic visual (divergences, R-hat, E-BFMI).
J2a/J2b rapid-response threshold spec. J2a used the legacy \(r_{pi} = \mathbb{1}\{\mathrm{round}(\mathrm{rt}_{pi}) \le 2\mathrm{s}\}\) indicator. That global raw-second rule is now provisional only; the planned J2b uses family/item lower-tail thresholds and excludes or defers magnitude comparison.
Term 3/4 pooling. One-sentence definition: speed/joint evidence is pooled at the person-administration level (student × administration event) across Terms 3 and 4; it is not a longitudinal Term 1→Term 3→Term 4 model.
Anchor citations. Six inline citations to add: Yen (1984) at Q3; van der Linden (2007) at the lognormal RT model; Wang & Wilson (2005) at testlet/bifactor; Schnipke & Scrams (1997) / Wise & Kong (2005) at rapid-guessing; Standards for Educational and Psychological Testing (2014) at the validity argument framing.
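To make the J2a/J2b contrast in the list above concrete, the sketch below implements the legacy global indicator and one possible family-specific lower-tail rule. The J2b thresholds have not been specified, so the nearest-rank quantile, the value of `q`, the family names, and all function names are hypothetical. Python's built-in `round()` rounds halves to even (2.5 rounds to 2); whether the legacy rule used the same tie-breaking is not stated.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def legacy_rapid_flag(rt_seconds: float) -> int:
    """Legacy J2a-style indicator: 1 if the rounded RT is at most 2 seconds."""
    return int(round(rt_seconds) <= 2)


def family_thresholds(rows: Iterable[Tuple[str, float]], q: float = 0.05) -> Dict[str, float]:
    """Per-family lower-tail RT threshold via a nearest-rank empirical quantile.

    `q` is a hypothetical choice, not a J2b specification.
    """
    by_family = defaultdict(list)
    for family, rt in rows:
        by_family[family].append(rt)
    thresholds = {}
    for family, rts in by_family.items():
        rts.sort()
        rank = max(1, math.ceil(q * len(rts)))  # 1-indexed nearest rank
        thresholds[family] = rts[rank - 1]
    return thresholds


def family_rapid_flag(family: str, rt_seconds: float, thresholds: Dict[str, float]) -> int:
    """1 if the RT is at or below the threshold for its own probe family."""
    return int(rt_seconds <= thresholds[family])


# Invented toy data with hypothetical family names.
toy_rows = [("counting", t) for t in (1.0, 2.0, 3.0, 4.0)]
toy_rows += [("number_line", t) for t in (4.0, 6.0, 8.0, 10.0)]
toy_thresholds = family_thresholds(toy_rows, q=0.25)
```

In this toy example a 3.5-second response on the slower family is flagged by the family rule but not by the global rule, while a 2.0-second response on the faster family is flagged by the global rule but not by the family rule.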

9 External and classification validation status

Evidence	Validation role	Current status	Validation strength	Limitation
PAT Maths	Broad external maths achievement anchor	Pending	Pending	Broader construct, timing and instructional effects.
Teacher ratings	Ecological/instructional judgement	Pending	Pending	Subjective and may include classroom behaviour or expectations.
Later ENSSA waves	Longitudinal consistency within instrument family	Pending	Pending	Same instrument family; not fully independent.
Classification metrics	Screening decision utility	Pending	Pending	Requires cut-score/band definitions and outcome labels.
Band movement near thresholds	Operational impact and stability	Pending	Pending	Requires uncertainty near cut-points.
Fairness / subgroup checks	Construct-irrelevant variance and differential burden	Pending	Pending	Requires small-cell safeguards and appropriate subgroup data.

10 Internal structure, comparability, and scale gates

Gate	Required evidence	Current status
A2 internal structure	Dimensionality, local dependence before/after probe effects, probe/testlet variance, item fit, item parameter stability, and evidence that probe effects are nuisance rather than reportable subscores.	Partial
Form/term/cohort comparability	Anchor stability, item drift, form effects, term effects, link error, probe-family comparability, mode/device equivalence, and common-scale evidence.	Pending
Score/report scale	Theta orientation, any transformation to reporting scale, and comparability across forms/terms/year levels.	Pending documentation
Bands/cut scores	Definitions of the Very Low, Low, and On Track thresholds; threshold-setting method; uncertainty and decision consistency near cut-points.	Pending documentation

Achievement band order, where bands are shown, should remain: Very Low → Low → On Track.

11 What we have learned so far

The achievement side is not best described as a collection of unrelated probe scores. The current evidence supports a broad numeracy achievement score with probe/testlet effects rather than a large operational battery of separate ability scores.
A2 should remain the IRT development baseline. It is the model to beat before anything else can be considered for live scoring.
The legacy is_attempted field should be read as valid/scorable response, not as true exposure or effort. This is why the completion-sensitivity model (A3) is not a true reach model. A3 did show useful signal, but it did not consistently justify changing achievement scoring and remains research-only.
Response-state categories (A4) are useful as context and QC. They should not be counted as wrong answers inside the achievement likelihood.
Effective speed (T1b/T1c) is measurable under a clean data contract, but raw speed is not an achievement score and should not drive live risk bands.
The useful direction is profile-aware: protect slow-accurate students, flag rapid-risk patterns for audit, and distinguish completion-constrained profiles from low achievement.

12 Fairness, disclosure, and reproducibility safeguards

Fairness / comparability risk	Method check	Public reporting rule
Differential item functioning	DIF or item-parameter stability checks where subgroup data permit.	Suppress small cells
Differential prediction	Compare outcome relations across relevant groups where cell sizes permit.	Suppress small cells
Differential classification	Compare sensitivity, specificity, false negatives, false positives, and band movement by group where safe.	Suppress small cells
Profile-rate differences	Check whether speed/completion profiles concentrate by form, device, probe, school/class, or subgroup.	Suppress small cells
Device/mode effects	Audit whether timing or profile signals are driven by interface, modality, or platform behaviour.	No device-identifiable or school-identifiable reporting
School/class administration effects	Inspect administration clustering only with aggregation and suppression safeguards.	No school/class/teacher-identifiable reporting
Accessibility/accommodation	Check whether speed/profile interpretations would penalise students needing accommodations.	Use only where ethically and legally appropriate

Small-cell suppression applies to fairness and subgroup summaries. Avoid school-, class-, teacher-, device-, or subgroup-identifiable summaries where N is too small.
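A small-cell rule of this kind can be sketched as a filter over group summaries. The minimum cell size below (`min_n = 10`) is a placeholder assumption, since no threshold is stated here, and the function and group names are hypothetical.

```python
from typing import Dict, Optional


def suppress_small_cells(
    cell_counts: Dict[str, int],
    cell_values: Dict[str, float],
    min_n: int = 10,  # placeholder threshold, not a documented project rule
) -> Dict[str, Optional[float]]:
    """Return each group's summary value, or None where the group's N is below min_n."""
    missing = set(cell_values) - set(cell_counts)
    if missing:
        raise KeyError(f"no count supplied for groups: {sorted(missing)}")
    return {
        group: (value if cell_counts[group] >= min_n else None)
        for group, value in cell_values.items()
    }


# Invented toy data: group_b is too small to report.
reported = suppress_small_cells(
    cell_counts={"group_a": 42, "group_b": 4},
    cell_values={"group_a": 0.81, "group_b": 0.50},
)
```

A production rule would also need to guard against recovering a suppressed cell by subtraction from published totals; that step is not shown.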

Model family	Required reproducibility metadata	Current public status
M0	Scoring convention version, denominator definition, band thresholds, extract date.	Pending documentation
A1/A2/A3	Data extract date, Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, output artefact path.	Pending documentation
A4	Response-state builder version, row inclusion rules, state definitions, extract date.	Pending documentation
T1b/T1c	RT data contract version, Stan/data builder files, exclusions, run ID, diagnostics artefact path.	Pending documentation
J0/J1/J2a/J2b	Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, diagnostic extraction artefact path.	Pending diagnostics

Key claims in the final public version should be supported with short references to psychometric standards and response-time modelling sources, including the Standards for Educational and Psychological Testing, Wu et al. on validity/reliability/IRT, van der Linden on response-time and speed-accuracy modelling, rapid-guessing/testlet-dependence literature, and ENSSA screening design materials.

13 What would change the recommendation?

The master status table above lists what would change each model’s status individually. The table below covers cross-cutting scenarios that would change the overall recommendation.

Evidence finding	Decision implication
A2 improves classification and precision near cut-points over M0 without fairness or comparability harms.	Consider operational evaluation pathway for A2.
A2-vs-M0 band movers are psychometrically plausible and not concentrated by form, device, school/class, probe, or subgroup.	Supports A2 transition case; otherwise investigate before promotion.
A3 improves fit but not external/classification evidence, or increases subgroup burden.	Keep A3 research-only.
A4 states vary strongly by form/device/administration.	Use A4 for QC/admin review only; do not student-report.
T1c speed improves false-negative detection without penalising slow-accurate students.	Consider guarded profile reporting after further validation and actionability checks.
J0/J1/J2a/J2b have persistent divergences, unstable correlations, no temporal generalisation, or no decision gain over simpler models.	Do not use joint models for decisions.

14 Path to operationalising speed responsibly

The objective is not to avoid speed forever but to operationalise it responsibly. A promoted speed-related output should be a guarded response-process profile, not a faster-is-better score.

Gate	Promotion requirement
Construct	Profiles distinguish meaningful response processes, not artefacts of form, device, or scoring.
Data contract	Logging distinguishes seen/available, positive RT or focus-event proxy, responded, valid/scorable, correct, and timing stages.
Fairness	Profile rates and effects are checked by cohort, school/class where appropriate, subgroup proxies, device/mode, and administration conditions.
Stability	Profiles are stable across terms, cohorts, probe families, and reasonable thresholds.
Interpretation	Reports protect slow-accurate students and avoid implying faster is better.
Actionability	No score, profile, or flag is reported unless it maps to a defensible teacher action and has misuse warnings.
Monitoring	Live use has drift checks, fairness checks, override rules, and review cadence.

Candidate profile labels for future validation:

Efficient-accurate: accurate and timely, but not rewarded merely for being fast.
Slow-accurate: accurate with slower response time; must be protected from speed penalties.
Rapid-risk: very rapid responding with low accuracy or suspicious response-process pattern.
Slow-low-accuracy: low accuracy plus slow/effortful response process.
Completion-constrained: evidence suggests items were not reached or process was interrupted.
Inconclusive: logging or pattern does not support a defensible profile.

15 Where to read more

Accuracy Modelling: deeper evidence for a single achievement score with testlet effects.
Number-Line PCM Policy: how continuous number-line accuracy is converted into ordered model categories for GPCM.
Omitted and Non-Scored Responses: what response records tell us, A3 sensitivity findings, and A4 response-state categories.
Response Time Modelling: RT data contract, T1b/T1c, rapid-risk, slow-accurate protection, J0/J1/J2a, and planned J2b.
IRT Model - Foundation 2025 Term 1: archival Foundation T1 calibration and methodology reference.