IRT Development: State of Play

Where the modelling journey is now, and how the pieces fit together

1 Executive summary

The screener is currently compared against M0, a simple raw-score benchmark: score earned across scorable responses divided by the scorable-response denominator. M0 is transparent, familiar, and easy to compute, but it is not assumed to be psychometrically best.
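A minimal sketch of the M0 calculation may help make the denominator concrete. It assumes that non-scorable responses are encoded as `None` and that the denominator is the sum of the maximum scores of the scorable items; both are illustrative assumptions, and the function name `m0_raw_score` is hypothetical rather than taken from the scoring code.

```python
from typing import Optional, Sequence


def m0_raw_score(item_scores: Sequence[Optional[float]],
                 max_scores: Sequence[float]) -> Optional[float]:
    """Score earned across scorable responses / scorable-response denominator.

    item_scores: earned score per item, or None where the response is not
                 scorable (assumed encoding, not the production one).
    max_scores:  maximum available score per item.
    """
    if len(item_scores) != len(max_scores):
        raise ValueError("item_scores and max_scores must have equal length")
    earned = 0.0
    denominator = 0.0
    for score, max_score in zip(item_scores, max_scores):
        if score is None:      # non-scorable: excluded from both sums
            continue
        earned += score
        denominator += max_score
    if denominator == 0:
        return None            # no scorable responses: score is undefined
    return earned / denominator


# Two invented students: same score on scorable items, very different reach.
full_reach = m0_raw_score([1, 1, 0, 1], [1, 1, 1, 1])
short_reach = m0_raw_score([1, 1, 0, 1, None, None, None, None], [1] * 8)
print(full_reach, short_reach)
```

Both invented students receive the same M0 value even though one produced scorable responses on only half of the items, which is the sense in which the denominator can confound completion and reach.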

The best IRT model developed to date is A2 (accuracy with probe/testlet effects). A2 is the IRT development baseline; it is not used for live scoring. The clearest finding from the achievement side is that a single broad numeracy score with probe effects is cleaner and more interpretable than a battery of separate ability scores (see Accuracy Modelling).

The response-process and speed work is not a dead end. Raw speed and raw completion are not safe as adjustments to achievement scores, but guarded response-process profiles may become useful after validity, fairness, stability, logging, and monitoring evidence gates are cleared.

The joint accuracy–response-time models (J0/J1/J2) are an active research line. J0 and J1 pass basic two-chain diagnostics and can be summarised as research evidence. The original J2 is now labelled J2a and treated as legacy sensitivity evidence only, because it uses a global RT <= 2s rapid flag. Within J2a, the Foundation fit is interpretable as sensitivity evidence; the Year 1 fit failed diagnostics and is not usable. The preferred next candidate is J2b, which uses family/item rapid thresholds but has not yet been fit. None of the joint models is decision evidence.

Operational implication: no change to live scoring or live bands.

Tip: Suggested reading order
  1. This page — State of Play for the model map, status, and evidence gates.
  2. Accuracy Modelling — evidence for one broad achievement score with probe/testlet effects.
  3. Number-Line PCM Policy — how continuous number-line accuracy is converted into ordered model categories.
  4. Omitted and Non-Scored Responses — what response records tell us, A3 and A4 findings.
  5. Response Time Modelling — RT data contract, T1b/T1c, J0/J1/J2a and planned J2b.
  6. Return here for operational position and evidence gates.

2 What the screener and IRT pipeline are doing

The numeracy screener contains several short probes. Students answer items that differ in content, format, and timing demands. Item response theory (IRT) is used to separate student achievement from item difficulty and from task-specific dependencies. The current development programme then asks whether response-process evidence — reach, completion, nonresponse states, and response time — can add useful context without contaminating the achievement score.
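As a point of reference for "task-specific dependencies", one common way to write an accuracy model with probe/testlet effects is the Rasch testlet form (Wang & Wilson, 2005). This is a generic illustration only: the A2 likelihood, parameterisation, and priors are not documented on this page, and the symbols \(\theta_p\), \(b_i\), \(\gamma_{p\,d(i)}\), and \(d(i)\) are assumptions introduced here.

\[
\operatorname{logit} \Pr(X_{pi} = 1) = \theta_p + \gamma_{p\,d(i)} - b_i,
\qquad
\theta_p \sim \mathcal{N}(0, \sigma_\theta^2),
\quad
\gamma_{pd} \sim \mathcal{N}(0, \sigma_{\gamma_d}^2)
\]

Here \(\theta_p\) is the achievement of student \(p\), \(b_i\) is the difficulty of item \(i\), and \(\gamma_{p\,d(i)}\) is a student-specific effect for the probe \(d(i)\) that contains item \(i\). In a form like this, dependence among items within a probe is absorbed by \(\gamma\) instead of inflating the apparent precision of \(\theta_p\).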

3 Method in brief

Each model branch tests a different measurement claim. The diagram below shows how the branches relate; the master status table below makes each branch’s purpose, current status, and decision explicit.

flowchart TD
    M0["<b>M0</b><br/>Raw-score benchmark"]:::baseline
    A1["<b>A1</b><br/>IRT accuracy foundation"]:::superseded
    A2["<b>A2</b><br/>Accuracy + probe effects<br/><i>IRT development baseline</i>"]:::baseline
    A3["<b>A3</b><br/>Completion-sensitivity<br/><i>Research-only</i>"]:::research
    A4["<b>A4</b><br/>Response-state audit<br/><i>QC/context only</i>"]:::research
    T1b["<b>T1b</b><br/>Speed (pooled)"]:::research
    T1c["<b>T1c</b><br/>Speed (family-specific)"]:::research
    J0["<b>J0</b><br/>Joint accuracy–speed"]:::research
    J1["<b>J1</b><br/>Joint + family speed"]:::research
    J2["<b>J2a</b><br/>Legacy rapid-response<br/><i>Y1 failed diagnostics</i>"]:::failed
    J2b["<b>J2b</b><br/>Planned family/item rapid thresholds<br/><i>not fit</i>"]:::research

    M0 --> A1 --> A2
    A2 --> A3
    A2 --> A4
    A2 --> T1b
    A2 --> T1c
    T1c --> J0 --> J1 --> J2 --> J2b

    classDef baseline fill:#d6f0d8,stroke:#2f7a36,color:#1c3a1f;
    classDef superseded fill:#eef0f2,stroke:#90969d,color:#444,stroke-dasharray: 3 3;
    classDef research fill:#fdecc8,stroke:#a3741a,color:#3a2c0a;
    classDef failed fill:#f6d3d0,stroke:#8b3a3a,color:#3a1414;
Figure 1: Model family. Status colour: green = current benchmark/baseline; grey (dashed) = superseded; amber = research-only; red = not ready.

Across all branches, success is not better fit or greater complexity. A model advances only if it improves the validity, precision, fairness, or instructional usefulness of screening decisions.

4 Master status table

This table replaces the earlier model-overview, model-purpose, current-judgements, and operational-position tables. The validity argument (key assumption + main threat) is kept as a separate table below.

| Code | Label | What it models | Status | Key threat | Decision | What would change it |
| --- | --- | --- | --- | --- | --- | --- |
| M0 | Raw-score benchmark | Score earned divided by scorable-response denominator. No IRT calibration. | Established benchmark | Denominator confounds completion/reach. | Retain as comparison benchmark. | Documented denominator/band evidence that survives fairness and reach checks. |
| A1 | IRT accuracy foundation | IRT accuracy likelihood; separates student achievement from item difficulty. | Superseded by A2 | Ignores probe/testlet local dependence. | Not used; A2 supersedes. | |
| A2 | Accuracy + probe effects | Adds probe/testlet random effects so local dependence is absorbed rather than mistaken for achievement. | Established (development) | Form effects, drift, unjustified band movement vs M0. | IRT development baseline; not live scoring. | Classification gain over M0 near cut-points without fairness or comparability harms. |
| A3 | Completion-sensitivity | Adds a latent valid/scorable-response dimension correlated with achievement. | Partial | Missingness mechanism ambiguous; mixes reach, blanks, rapid placeholders. | Research-only. | Clear mechanism + classification gain + no subgroup burden. |
| A4 | Response-state audit | Row-level response-state classification (valid, trailing nonreach, RT+ nonvalid, etc.). | Partial | Over-interpretation as cognition or exposure. | QC/context only. | Independent reach/exposure logging; states stable by form/device. |
| T1b | Speed-only (pooled) | Interval-censored lognormal RT model on valid positive-RT timed-math rows; pooled speed. | Pending | Speed entangled with completion, modality, device. | Research-only; under validation. | Diagnostics + fairness + decision value beyond achievement alone. |
| T1c | Speed-only (family) | As T1b, with separate speed factors per probe family. | Pending | Family-speed entanglement with valid-rate and trailing-rate (large). | Research-only; under validation. | Disentangling of speed from valid/trailing rates; family-speed reproducibility. |
| J0 | Joint accuracy–speed | Joint likelihood for accuracy and RT with shared person distribution. | Partial (research) | Diagnostic fragility; limited temporal generalisation. | Research evidence only; not decisions. | Decision-relevant gain over A2 + temporal generalisation. |
| J1 | Joint + family speed | Joint model with family-specific speed factors. | Partial (research) | As J0 plus family-speed identification. | Research evidence only; not decisions. | As J0 plus family-speed identification holding under sensitivity checks. |
| J2a | Legacy joint + rapid-response | Legacy J2 sensitivity model with global RT <= 2s rapid flag. | Legacy sensitivity; not ready (Y1) | Global RT <= 2s rapid rule is provisional; Y1 fit failed (544 divergences; E-BFMI 0.029; max R-hat 12.39). | Legacy sensitivity only; not decision evidence. | Only retained for comparability; not a promotion path unless rapid definition is replaced. |
| J2b | Planned family/item rapid-threshold model | Planned successor using family/item rapid thresholds; not yet fit. | Planned; not fit | Not yet fit; threshold and family specification must be validated. | Planned shadow candidate; no evidence yet. | Fit J2b with family/item thresholds, clean diagnostics, external/protection/fairness evidence beyond A2. |
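The "interval-censored lognormal RT model" named for T1b/T1c can be illustrated with the likelihood contribution of a single response. The sketch below assumes that a logged time is only known to lie in an interval (for example, a time logged as 3 s under whole-second rounding is treated as lying in (2.5, 3.5]); that rounding interval, the function names, and the parameter values are assumptions for illustration, not the T1b data contract. In a van der Linden style model, `mu` would typically be an item time-intensity minus a person speed parameter.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_censored_lognormal_loglik(lower_s: float, upper_s: float,
                                       mu: float, sigma: float) -> float:
    """Log-probability that a lognormal RT falls in (lower_s, upper_s].

    mu and sigma are the mean and SD of log(RT). Intervals far in the
    tails can underflow to probability 0, in which case math.log raises.
    """
    if not (0.0 <= lower_s < upper_s):
        raise ValueError("need 0 <= lower_s < upper_s")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    upper_cdf = normal_cdf((math.log(upper_s) - mu) / sigma)
    # log(0) is undefined, so a lower bound of 0 contributes CDF = 0.
    lower_cdf = 0.0 if lower_s == 0.0 else normal_cdf(
        (math.log(lower_s) - mu) / sigma)
    return math.log(upper_cdf - lower_cdf)


# A response logged as 3 s, treated as lying in (2.5, 3.5] (assumed rounding).
loglik = interval_censored_lognormal_loglik(2.5, 3.5, mu=math.log(3.0), sigma=0.5)
print(loglik)
```

Using the interval probability instead of the density at the rounded value avoids treating coarse timestamps as exact, which matters most for short response times where a one-second bin is wide on the log scale.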

5 Five-domain evidence framework

| Domain | Main question | Examples of evidence |
| --- | --- | --- |
| 1. Construct and interpretation | What does the model measure, and is that interpretation defensible? | Achievement vs completion vs speed; local dependence; response-process mechanism. |
| 2. Technical adequacy | Can the model be estimated and trusted for the stated purpose? | Model diagnostics, fit/PPC, item stability, reliability/precision near cut-points. |
| 3. Screening usefulness | Does it improve or preserve risk identification? | Sensitivity, specificity, false negatives, false positives, band movement/stability; AUC where useful. |
| 4. Fairness and comparability | Is the signal construct-relevant and stable across forms, cohorts, modes, devices, and subgroups? | DIF, differential prediction/classification, form effects, link error, mode/device effects, subgroup burden. |
| 5. Reporting and actionability | Can the result be explained and used safely by teachers? | Clear teacher action, misuse warnings, monitoring, and no faster-is-better interpretation. |

External outcomes such as PAT Maths, teacher ratings, and later outcomes are useful but imperfect validation anchors. ENSSA is designed as an early numeracy risk screener focused on foundational number sense and response-process evidence, whereas broad standardised achievement tests measure wider curriculum performance under different timing, instructional, and response-demand conditions. Model evaluation therefore uses converging evidence, not a single external gold standard.

6 Validity argument snapshot

| Claim | Intended use | Key assumption | Current status | Main threat | Decision |
| --- | --- | --- | --- | --- | --- |
| M0 provides a raw-score comparison benchmark | Comparison benchmark | Scorable-response denominator is a documented benchmark choice for current comparisons. | Current benchmark | Denominator confounds completion/reach. | Retain as benchmark |
| A2 supports broad achievement interpretation | IRT development baseline | Probe effects are nuisance structure, not separate reportable traits. | Partial / strongest IRT evidence | Local dependence, form effects, or unjustified band movement. | Use as development baseline |
| A3 should not alter achievement scoring | Research-only | Completion is separate from achievement unless decision evidence proves otherwise. | Not supported for scoring | Missingness mechanism ambiguous. | Research-only |
| A4 supports QC/context | Audit/context | Row states are process proxies, not traits. | Partial | Over-interpretation as cognition or exposure. | QC/context only |
| T1b/T1c estimate effective speed | Research-only profiles | RT signal is not mostly device/completion artefact. | Pending | Construct-irrelevant speed and modality effects. | Research-only |
| J0/J1/J2a/J2b improve decisions | Research-only | Joint model adds decision value beyond simpler models, passes diagnostics, and generalises beyond the current Term 3/4 validation slice. | Reviewed: J0/J1 research-only; J2a legacy only; J2b planned/not fit | Complexity, diagnostic fragility, and limited temporal generalisation. | Not decision evidence; J2a legacy only; J2b not yet fit |

7 Benchmark and classification logic

A high correlation between M0 and A2 is expected because both are based on the same scored item responses. It is not proof that A2 is better. The key question is where A2 disagrees with M0 and whether those disagreements are psychometrically and operationally justified: for example, whether movement is explained by item difficulty, probe/testlet dependence, form or item-mix differences, improved precision near cut-points, or removal of local dependence.

For a screener, model promotion requires decision-relevant improvement. Public reporting should focus on a compact classification set: sensitivity, specificity, false negatives, false positives, and band movement/stability; AUC may be included where useful. False negatives are a major validity risk because genuinely at-risk students may miss support, but false positives also matter because intervention resources are finite.
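The compact classification set can be computed from two binary vectors: whether the screener flagged a student, and whether the student was at risk on some outcome label. The sketch below is illustrative only; the function name `screening_metrics` and the eight example students are invented, and the choice of outcome label is left open because external validation is still pending.

```python
from typing import Dict, Sequence


def screening_metrics(flagged: Sequence[bool],
                      at_risk: Sequence[bool]) -> Dict[str, float]:
    """Confusion counts plus sensitivity and specificity for a screener."""
    if len(flagged) != len(at_risk):
        raise ValueError("flagged and at_risk must have equal length")
    tp = sum(int(f and r) for f, r in zip(flagged, at_risk))
    fn = sum(int((not f) and r) for f, r in zip(flagged, at_risk))
    fp = sum(int(f and (not r)) for f, r in zip(flagged, at_risk))
    tn = sum(int((not f) and (not r)) for f, r in zip(flagged, at_risk))
    return {
        "false_negatives": fn,   # at-risk students the screener missed
        "false_positives": fp,   # students flagged without being at risk
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Invented example: 3 at-risk students, 5 not at risk.
flagged = [True, True, False, False, True, False, False, False]
at_risk = [True, True, True, False, False, False, False, False]
metrics = screening_metrics(flagged, at_risk)
print(metrics)
```

Reporting the false-negative and false-positive counts alongside the rates keeps both costs visible: missed support on one side, finite intervention resources on the other.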

| Requirement | Promotion convention |
| --- | --- |
| Classification gain | Any gain must be practically meaningful, not just non-zero. |
| Specificity protection | Sensitivity gains must not create unacceptable over-flagging. |
| False-negative protection | Target risk groups must not experience increased missed-risk rates. |
| Band movement | Students moved between bands must show plausible criterion or psychometric evidence. |
| Fairness | No subgroup, form, device, or administration group should bear disproportionate burden. |
| Cut-point precision | Precision gains should matter near risk thresholds, not only on average. |
| Complexity | More complex models must justify added burden through decision value. |
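Band movement between M0 and a candidate model can be inspected with a simple cross-tabulation before any criterion evidence is examined. This is a hedged sketch: the band labels follow the Very Low, Low, On Track order used on this page, but the function name `band_movement` and the five example students are invented.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

BAND_ORDER = ["Very Low", "Low", "On Track"]


def band_movement(m0_bands: Sequence[str],
                  model_bands: Sequence[str]) -> Tuple[Dict[Tuple[str, str], int], float]:
    """Cross-tabulate (M0 band, model band) pairs and the share left unchanged."""
    if len(m0_bands) != len(model_bands):
        raise ValueError("band sequences must have equal length")
    if not m0_bands:
        raise ValueError("need at least one student")
    for band in list(m0_bands) + list(model_bands):
        if band not in BAND_ORDER:
            raise ValueError(f"unknown band: {band}")
    counts = Counter(zip(m0_bands, model_bands))
    # Fill every cell so that empty cells are visible as zeros.
    table = {(a, b): counts.get((a, b), 0) for a in BAND_ORDER for b in BAND_ORDER}
    stable_share = sum(table[(b, b)] for b in BAND_ORDER) / len(m0_bands)
    return table, stable_share


# Invented example with five students.
m0 = ["Very Low", "Low", "Low", "On Track", "On Track"]
a2 = ["Very Low", "Very Low", "Low", "On Track", "Low"]
table, stable_share = band_movement(m0, a2)
print(stable_share)
print({cell: n for cell, n in table.items() if cell[0] != cell[1] and n > 0})
```

The off-diagonal cells identify the movers whose change of band would need plausible criterion or psychometric evidence, and whose distribution by form, device, or subgroup would need checking.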

8 Known content gaps (next revision)

These are explicit content gaps still to be written into the public pages. Listing them here keeps the gap visible rather than hidden behind “Pending” cells.

  • A2 specification block. Likelihood, parameterisation, and priors for the IRT development baseline, in the same style as the J0/J1/J2 block on the Response Time Modelling page.
  • A3 specification block. Likelihood, parameterisation, priors, and a precise definition of the internal reliability-like index used to compare A3 against A2.
  • Number-line PCM sensitivity evidence. Category occupancy and conclusion-stability tables across candidate number-line binning policies; see Number-Line PCM Policy.
  • Cut scores, decision consistency, and test information near the cut points. Including the threshold-setting method.
  • T1c entanglement headline. Estimated family speed correlates strongly with valid-rate (0.51–0.71) and trailing-rate (−0.63 to −0.79). This belongs near the top of the Response Time Modelling page, with a partial-correlation paragraph.
  • J0/J1/J2 nesting diagram and a small J2 Year 1 diagnostic visual (divergences, R-hat, E-BFMI).
  • J2a/J2b rapid-response threshold spec. J2a used the legacy \(r_{pi} = \mathbb{1}\{\mathrm{round}(\mathrm{rt}_{pi}) \le 2\mathrm{s}\}\) indicator. That global raw-second rule is now provisional only; planned J2b uses family/item lower-tail thresholds and excludes or defers magnitude comparison.
  • Term 3/4 pooling. One-sentence definition: speed/joint evidence is pooled at the person-administration level (student × administration event) across Terms 3 and 4, not a longitudinal Term 1→Term 3→Term 4 model.
  • Anchor citations. Six inline citations to add: Yen (1984) at Q3; van der Linden (2007) at the lognormal RT model; Wang & Wilson (2005) at testlet/bifactor; Schnipke & Scrams (1997) / Wise & Kong (2005) at rapid-guessing; Standards for Educational and Psychological Testing (2014) at the validity argument framing.
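The difference between the J2a global rule and a family-level lower-tail threshold can be sketched as follows. The J2a indicator follows the formula in the list above, with a half-up tie rule assumed for rounding. The family threshold is purely hypothetical: the J2b specification is not yet written, so the nearest-rank quantile, the `lower_tail` value, the family names, and the response times below are all invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def j2a_rapid_flag(rt_seconds: float) -> int:
    """Legacy global rule: 1 if RT rounded to whole seconds is <= 2, else 0."""
    rounded = math.floor(rt_seconds + 0.5)   # half-up tie rule is an assumption
    return int(rounded <= 2)


def family_lower_tail_thresholds(rows: Iterable[Tuple[str, float]],
                                 lower_tail: float = 0.2) -> Dict[str, float]:
    """Hypothetical per-family threshold: nearest-rank lower-tail quantile of RT."""
    by_family: Dict[str, List[float]] = defaultdict(list)
    for family, rt in rows:
        by_family[family].append(rt)
    thresholds = {}
    for family, rts in by_family.items():
        rts.sort()
        rank = max(1, math.ceil(lower_tail * len(rts)))   # 1-based nearest rank
        thresholds[family] = rts[rank - 1]
    return thresholds


# Invented RTs (seconds) for a fast family and a slow family.
counting = [0.8, 1.1, 1.3, 1.6, 1.8, 2.0, 2.1, 2.4, 2.9, 3.5]
number_line = [2.2, 3.0, 4.1, 4.8, 5.5, 6.0, 6.4, 7.2, 8.0, 9.5]
rows = [("counting", rt) for rt in counting] + [("number_line", rt) for rt in number_line]
thresholds = family_lower_tail_thresholds(rows)
print(thresholds)
print(j2a_rapid_flag(1.6), j2a_rapid_flag(2.6))
```

With these invented values, a 1.6 s response in the fast family is flagged by the global rule although it lies above that family's lower-tail threshold, while a 2.6 s response in the slow family escapes the global rule although it lies in that family's lower tail. This is the kind of mismatch that motivates replacing a single raw-second cut-off.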

9 External and classification validation status

| Evidence | Validation role | Current status | Validation strength | Limitation |
| --- | --- | --- | --- | --- |
| PAT Maths | Broad external maths achievement anchor | Pending | Pending | Broader construct, timing and instructional effects. |
| Teacher ratings | Ecological/instructional judgement | Pending | Pending | Subjective and may include classroom behaviour or expectations. |
| Later ENSSA waves | Longitudinal consistency within instrument family | Pending | Pending | Same instrument family; not fully independent. |
| Classification metrics | Screening decision utility | Pending | Pending | Requires cut-score/band definitions and outcome labels. |
| Band movement near thresholds | Operational impact and stability | Pending | Pending | Requires uncertainty near cut-points. |
| Fairness / subgroup checks | Construct-irrelevant variance and differential burden | Pending | Pending | Requires small-cell safeguards and appropriate subgroup data. |

10 Internal structure, comparability, and scale gates

| Gate | Required evidence | Current status |
| --- | --- | --- |
| A2 internal structure | Dimensionality, local dependence before/after probe effects, probe/testlet variance, item fit, item parameter stability, and evidence that probe effects are nuisance rather than reportable subscores. | Partial |
| Form/term/cohort comparability | Anchor stability, item drift, form effects, term effects, link error, probe-family comparability, mode/device equivalence, and common-scale evidence. | Pending |
| Score/report scale | Theta orientation, any transformation to reporting scale, and comparability across forms/terms/year levels. | Pending documentation |
| Bands/cut scores | Definitions of Very Low, Low, On Track thresholds; threshold-setting method; uncertainty and decision consistency near cut-points. | Pending documentation |

Achievement band order, where bands are shown, should remain: Very Low → Low → On Track.

11 What we have learned so far

  • The achievement side is not best described as a collection of unrelated probe scores. The current evidence supports a broad numeracy achievement score with probe/testlet effects rather than a large operational battery of separate ability scores.
  • A2 should remain the IRT development baseline. It is the model to beat before anything else can be considered for live scoring.
  • The legacy is_attempted field should be read as valid/scorable response, not as true exposure or effort. This is why the completion-sensitivity model (A3) is not a true reach model. A3 did show useful signal, but it did not consistently justify changing achievement scoring and remains research-only.
  • Response-state categories (A4) are useful as context and QC. They should not be counted as wrong answers inside the achievement likelihood.
  • Effective speed (T1b/T1c) is measurable under a clean data contract, but raw speed is not an achievement score and should not drive live risk bands.
  • The useful direction is profile-aware: protect slow-accurate students, flag rapid-risk patterns for audit, and distinguish completion-constrained profiles from low achievement.

12 Fairness, disclosure, and reproducibility safeguards

| Fairness / comparability risk | Method check | Public reporting rule |
| --- | --- | --- |
| Differential item functioning | DIF or item-parameter stability checks where subgroup data permit. | Suppress small cells |
| Differential prediction | Compare outcome relations across relevant groups where cell sizes permit. | Suppress small cells |
| Differential classification | Compare sensitivity, specificity, false negatives, false positives, and band movement by group where safe. | Suppress small cells |
| Profile-rate differences | Check whether speed/completion profiles concentrate by form, device, probe, school/class, or subgroup. | Suppress small cells |
| Device/mode effects | Audit whether timing or profile signals are driven by interface, modality, or platform behaviour. | No device-identifiable or school-identifiable reporting |
| School/class administration effects | Inspect administration clustering only with aggregation and suppression safeguards. | No school/class/teacher-identifiable reporting |
| Accessibility/accommodation | Check whether speed/profile interpretations would penalise students needing accommodations. | Use only where ethically and legally appropriate |

Small-cell suppression applies to fairness and subgroup summaries. Avoid school-, class-, teacher-, device-, or subgroup-identifiable summaries where N is too small.
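Small-cell suppression can be applied mechanically before any subgroup summary is published. The sketch below is a minimal illustration: the threshold `MIN_CELL_N = 10`, the function name, and the example groups are hypothetical, since this page does not state the minimum N. It also does not handle complementary suppression, where a suppressed cell could be recovered from published totals.

```python
from typing import Dict, Optional, Tuple

MIN_CELL_N = 10  # hypothetical threshold; the page does not state one


def suppress_small_cells(
    cells: Dict[str, Tuple[int, float]],
    min_n: int = MIN_CELL_N,
) -> Dict[str, Optional[float]]:
    """Return each group's statistic, or None where the group's N is below min_n."""
    return {
        group: (value if n >= min_n else None)
        for group, (n, value) in cells.items()
    }


# Invented subgroup false-negative rates as (N, rate) pairs.
false_negative_rates = {
    "group_a": (120, 0.08),
    "group_b": (7, 0.29),   # too few students to report under the assumed rule
}
published = suppress_small_cells(false_negative_rates)
print(published)
```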

| Model family | Required reproducibility metadata | Current public status |
| --- | --- | --- |
| M0 | Scoring convention version, denominator definition, band thresholds, extract date. | Pending documentation |
| A1/A2/A3 | Data extract date, Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, output artefact path. | Pending documentation |
| A4 | Response-state builder version, row inclusion rules, state definitions, extract date. | Pending documentation |
| T1b/T1c | RT data contract version, Stan/data builder files, exclusions, run ID, diagnostics artefact path. | Pending documentation |
| J0/J1/J2a/J2b | Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, diagnostic extraction artefact path. | Pending diagnostics |

Key claims in the final public version should be supported with short references to psychometric standards and response-time modelling sources, including the Standards for Educational and Psychological Testing, Wu et al. on validity/reliability/IRT, van der Linden on response-time and speed-accuracy modelling, rapid-guessing/testlet-dependence literature, and ENSSA screening design materials.

13 What would change the recommendation?

The master status table above lists what would change each model’s status individually. The table below covers cross-cutting scenarios that would change the overall recommendation.

| Evidence finding | Decision implication |
| --- | --- |
| A2 improves classification and precision near cut-points over M0 without fairness or comparability harms. | Consider operational evaluation pathway for A2. |
| A2-vs-M0 band movers are psychometrically plausible and not concentrated by form, device, school/class, probe, or subgroup. | Supports A2 transition case; otherwise investigate before promotion. |
| A3 improves fit but not external/classification evidence, or increases subgroup burden. | Keep A3 research-only. |
| A4 states vary strongly by form/device/administration. | Use A4 for QC/admin review only; do not student-report. |
| T1c speed improves false-negative detection without penalising slow-accurate students. | Consider guarded profile reporting after further validation and actionability checks. |
| J0/J1/J2a/J2b have persistent divergences, unstable correlations, no temporal generalisation, or no decision gain over simpler models. | Do not use joint models for decisions. |
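A basic sampler-diagnostics gate for the joint models could be written as a small check. The thresholds below (no divergences, E-BFMI of at least 0.3, max R-hat of at most 1.01) are commonly quoted rules of thumb and are assumptions here, not the project's documented gate; the class and function names are hypothetical. The example values are the J2a Year 1 figures reported in the master status table, with the reported E-BFMI treated as the minimum across chains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SamplerDiagnostics:
    divergences: int
    min_ebfmi: float
    max_rhat: float


def diagnostic_failures(diag: SamplerDiagnostics,
                        max_divergences: int = 0,
                        min_ebfmi: float = 0.3,
                        max_rhat: float = 1.01) -> List[str]:
    """List the basic checks a fit fails; an empty list means all passed."""
    failures = []
    if diag.divergences > max_divergences:
        failures.append(f"{diag.divergences} divergent transitions")
    if diag.min_ebfmi < min_ebfmi:
        failures.append(f"E-BFMI {diag.min_ebfmi} below {min_ebfmi}")
    if diag.max_rhat > max_rhat:
        failures.append(f"max R-hat {diag.max_rhat} above {max_rhat}")
    return failures


# J2a Year 1 values as reported in the master status table.
j2a_year1 = SamplerDiagnostics(divergences=544, min_ebfmi=0.029, max_rhat=12.39)
failures = diagnostic_failures(j2a_year1)
print(failures)
```

Under these assumed thresholds the J2a Year 1 fit fails all three checks. Passing such a gate would be necessary but not sufficient, because the joint models also need temporal generalisation and decision gain over simpler models.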

14 Path to operationalising speed responsibly

The objective is not to avoid speed forever but to operationalise it responsibly. A promoted speed-related output should be a guarded response-process profile, not a faster-is-better score.

| Gate | Promotion requirement |
| --- | --- |
| Construct | Profiles distinguish meaningful response processes, not artefacts of form, device, or scoring. |
| Data contract | Logging distinguishes seen/available, positive RT or focus-event proxy, responded, valid/scorable, correct, and timing stages. |
| Fairness | Profile rates and effects are checked by cohort, school/class where appropriate, subgroup proxies, device/mode, and administration conditions. |
| Stability | Profiles are stable across terms, cohorts, probe families, and reasonable thresholds. |
| Interpretation | Reports protect slow-accurate students and avoid implying faster is better. |
| Actionability | No score, profile, or flag is reported unless it maps to a defensible teacher action and has misuse warnings. |
| Monitoring | Live use has drift checks, fairness checks, override rules, and review cadence. |

Candidate profile labels for future validation:

  • Efficient-accurate: accurate and timely, but not rewarded merely for being fast.
  • Slow-accurate: accurate with slower response time; must be protected from speed penalties.
  • Rapid-risk: very rapid responding with low accuracy or suspicious response-process pattern.
  • Slow-low-accuracy: low accuracy plus slow/effortful response process.
  • Completion-constrained: evidence suggests items were not reached or process was interrupted.
  • Inconclusive: logging or pattern does not support a defensible profile.

15 Where to read more