flowchart TD
M0["<b>M0</b><br/>Raw-score benchmark"]:::baseline
A1["<b>A1</b><br/>IRT accuracy foundation"]:::superseded
A2["<b>A2</b><br/>Accuracy + probe effects<br/><i>IRT development baseline</i>"]:::baseline
A3["<b>A3</b><br/>Completion-sensitivity<br/><i>Research-only</i>"]:::research
A4["<b>A4</b><br/>Response-state audit<br/><i>QC/context only</i>"]:::research
T1b["<b>T1b</b><br/>Speed (pooled)"]:::research
T1c["<b>T1c</b><br/>Speed (family-specific)"]:::research
J0["<b>J0</b><br/>Joint accuracy–speed"]:::research
J1["<b>J1</b><br/>Joint + family speed"]:::research
J2["<b>J2a</b><br/>Legacy rapid-response<br/><i>Y1 failed diagnostics</i>"]:::failed
J2b["<b>J2b</b><br/>Planned family/item rapid thresholds<br/><i>not fit</i>"]:::research
M0 --> A1 --> A2
A2 --> A3
A2 --> A4
A2 --> T1b
A2 --> T1c
T1c --> J0 --> J1 --> J2 --> J2b
classDef baseline fill:#d6f0d8,stroke:#2f7a36,color:#1c3a1f;
classDef superseded fill:#eef0f2,stroke:#90969d,color:#444,stroke-dasharray: 3 3;
classDef research fill:#fdecc8,stroke:#a3741a,color:#3a2c0a;
classDef failed fill:#f6d3d0,stroke:#8b3a3a,color:#3a1414;
IRT Development: State of Play
Where the modelling journey is now, and how the pieces fit together
1 Executive summary
The screener is currently compared against M0, a simple raw-score benchmark: score earned across scorable responses divided by the scorable-response denominator. M0 is transparent, familiar, and easy to compute, but it is not assumed to be psychometrically best.
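As a minimal sketch of the M0 definition above, the following computes score earned over the scorable-response denominator. The function name, the use of `None` to mark a non-scorable response, and the per-item maximum in `possible` are illustrative assumptions, not the project's scoring code.

```python
from typing import Optional, Sequence


def m0_raw_score(earned: Sequence[Optional[float]],
                 possible: Sequence[float]) -> Optional[float]:
    """Score earned across scorable responses / scorable-response denominator.

    earned[i] is None when response i is not scorable (hypothetical encoding).
    possible[i] is the maximum score for item i (assumed 1 for dichotomous items).
    """
    scorable = [(e, p) for e, p in zip(earned, possible) if e is not None]
    denominator = sum(p for _, p in scorable)
    if denominator == 0:
        return None  # no scorable responses: undefined rather than zero
    return sum(e for e, _ in scorable) / denominator


# Two hypothetical students with identical answers on the first four items.
full_attempt = m0_raw_score([1, 0, 1, 1, 0, 0], [1] * 6)           # 3 / 6
partial_attempt = m0_raw_score([1, 0, 1, 1, None, None], [1] * 6)  # 3 / 4
print(full_attempt, partial_attempt)
```

The two students differ only in how many responses were scorable, yet their M0 values differ. That is the sense in which the denominator can confound achievement with completion or reach.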
The best IRT model developed to date is A2 (accuracy with probe/testlet effects). A2 is the IRT development baseline; it is not used for live scoring. The clearest finding from the achievement side is that a single broad numeracy score with probe effects is cleaner and more interpretable than a battery of separate ability scores (see Accuracy Modelling).
The response-process and speed work is not a dead end. Raw speed and raw completion are not safe as adjustments to achievement scores, but guarded response-process profiles may become useful after validity, fairness, stability, logging, and monitoring evidence gates are cleared.
The joint accuracy–response-time models (J0/J1/J2) are an active research line. J0 and J1 pass basic two-chain diagnostics and can be summarised as research evidence. The original J2 is now labelled J2a and treated as legacy sensitivity evidence only, because it uses a global RT <= 2s rapid flag. The J2a Foundation fit is interpretable as sensitivity evidence; the J2a Year 1 fit failed diagnostics and is not usable. The preferred next candidate is J2b, which uses family/item rapid thresholds, but it has not yet been fit. None of the joint models counts as decision evidence.
Operational implication: no change to live scoring or live bands.
- This page — State of Play for the model map, status, and evidence gates.
- Accuracy Modelling — evidence for one broad achievement score with probe/testlet effects.
- Number-Line PCM Policy — how continuous number-line accuracy is converted into ordered model categories.
- Omitted and Non-Scored Responses — what response records tell us, A3 and A4 findings.
- Response Time Modelling — RT data contract, T1b/T1c, J0/J1/J2a and planned J2b.
- Return here for operational position and evidence gates.
2 What the screener and IRT pipeline are doing
The numeracy screener contains several short probes. Students answer items that differ in content, format, and timing demands. Item response theory (IRT) is used to separate student achievement from item difficulty and from task-specific dependencies. The current development programme then asks whether response-process evidence — reach, completion, nonresponse states, and response time — can add useful context without contaminating the achievement score.
3 Method in brief
Each model branch tests a different measurement claim. The model-map flowchart shows how the branches relate; the master status table in Section 4 makes each branch’s purpose, current status, and decision explicit.
Across all branches, success is not defined by better fit or greater complexity. A model advances only if it improves the validity, precision, fairness, or instructional usefulness of screening decisions.
4 Master status table
This table replaces the earlier model-overview, model-purpose, current-judgements, and operational-position tables. The validity argument (key assumption + main threat) is kept as a separate table below.
| Code | Label | What it models | Status | Key threat | Decision | What would change it |
|---|---|---|---|---|---|---|
| M0 | Raw-score benchmark | Score earned divided by scorable-response denominator. No IRT calibration. | Established benchmark | Denominator confounds completion/reach. | Retain as comparison benchmark. | Documented denominator/band evidence that survives fairness and reach checks. |
| A1 | IRT accuracy foundation | IRT accuracy likelihood; separates student achievement from item difficulty. | Superseded by A2 | Ignores probe/testlet local dependence. | Not used; A2 supersedes. | — |
| A2 | Accuracy + probe effects | Adds probe/testlet random effects so local dependence is absorbed rather than mistaken for achievement. | Established (development) | Form effects, drift, unjustified band movement vs M0. | IRT development baseline; not live scoring. | Classification gain over M0 near cut-points without fairness or comparability harms. |
| A3 | Completion-sensitivity | Adds a latent valid/scorable-response dimension correlated with achievement. | Partial | Missingness mechanism ambiguous; mixes reach, blanks, rapid placeholders. | Research-only. | Clear mechanism + classification gain + no subgroup burden. |
| A4 | Response-state audit | Row-level response-state classification (valid, trailing nonreach, RT+ nonvalid, etc.). | Partial | Over-interpretation as cognition or exposure. | QC/context only. | Independent reach/exposure logging; states stable by form/device. |
| T1b | Speed-only (pooled) | Interval-censored lognormal RT model on valid positive-RT timed-math rows; pooled speed. | Pending | Speed entangled with completion, modality, device. | Research-only; under validation. | Diagnostics + fairness + decision value beyond achievement alone. |
| T1c | Speed-only (family) | As T1b, with separate speed factors per probe family. | Pending | Family-speed entanglement with valid-rate and trailing-rate (large). | Research-only; under validation. | Disentangling of speed from valid/trailing rates; family-speed reproducibility. |
| J0 | Joint accuracy–speed | Joint likelihood for accuracy and RT with shared person distribution. | Partial (research) | Diagnostic fragility; limited temporal generalisation. | Research evidence only; not decisions. | Decision-relevant gain over A2 + temporal generalisation. |
| J1 | Joint + family speed | Joint model with family-specific speed factors. | Partial (research) | As J0 plus family-speed identification. | Research evidence only; not decisions. | As J0 plus family-speed identification holding under sensitivity checks. |
| J2a | Legacy joint + rapid-response | Legacy J2 sensitivity model with global RT <= 2s rapid flag. | Legacy sensitivity; not ready (Y1) | Global RT <= 2s rapid rule is provisional; Y1 fit failed (544 divergences; E-BFMI 0.029; max R-hat 12.39). | Legacy sensitivity only; not decision evidence. | Only retained for comparability; not a promotion path unless rapid definition is replaced. |
| J2b | Planned family/item rapid-threshold model | Planned successor using family/item rapid thresholds; not yet fit. | Planned; not fit | Not yet fit; threshold and family specification must be validated. | Planned shadow candidate; no evidence yet. | Fit J2b with family/item thresholds, clean diagnostics, external/protection/fairness evidence beyond A2. |
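The J2a Year 1 diagnostics reported in the table can be checked against a simple gate. The sketch below is illustrative only: the thresholds (no divergences, E-BFMI at least 0.3, R-hat at most 1.01) are commonly used sampler conventions assumed here, not documented project thresholds, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SamplerDiagnostics:
    divergences: int
    e_bfmi: float
    max_r_hat: float


def diagnostic_failures(d: SamplerDiagnostics,
                        max_divergences: int = 0,
                        min_e_bfmi: float = 0.3,
                        max_r_hat: float = 1.01) -> List[str]:
    """Return the list of failed checks; an empty list means the gate is passed."""
    failures = []
    if d.divergences > max_divergences:
        failures.append(f"{d.divergences} divergent transitions")
    if d.e_bfmi < min_e_bfmi:
        failures.append(f"E-BFMI {d.e_bfmi} below {min_e_bfmi}")
    if d.max_r_hat > max_r_hat:
        failures.append(f"max R-hat {d.max_r_hat} above {max_r_hat}")
    return failures


# Values reported for the J2a Year 1 fit in the master status table.
j2a_year1 = SamplerDiagnostics(divergences=544, e_bfmi=0.029, max_r_hat=12.39)
for failure in diagnostic_failures(j2a_year1):
    print(failure)
```

Under these assumed thresholds the J2a Year 1 fit fails all three checks, which is consistent with its "not usable" status. Passing such a gate is necessary for interpretation but does not by itself make a model decision evidence.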
5 Five-domain evidence framework
| Domain | Main question | Examples of evidence |
|---|---|---|
| 1. Construct and interpretation | What does the model measure, and is that interpretation defensible? | Achievement vs completion vs speed; local dependence; response-process mechanism. |
| 2. Technical adequacy | Can the model be estimated and trusted for the stated purpose? | Model diagnostics, fit/PPC, item stability, reliability/precision near cut-points. |
| 3. Screening usefulness | Does it improve or preserve risk identification? | Sensitivity, specificity, false negatives, false positives, band movement/stability; AUC where useful. |
| 4. Fairness and comparability | Is the signal construct-relevant and stable across forms, cohorts, modes, devices, and subgroups? | DIF, differential prediction/classification, form effects, link error, mode/device effects, subgroup burden. |
| 5. Reporting and actionability | Can the result be explained and used safely by teachers? | Clear teacher action, misuse warnings, monitoring, and no faster-is-better interpretation. |
External outcomes such as PAT Maths, teacher ratings, and later outcomes are useful but imperfect validation anchors. ENSSA is designed as an early numeracy risk screener focused on foundational number sense and response-process evidence, whereas broad standardised achievement tests measure wider curriculum performance under different timing, instructional, and response-demand conditions. Model evaluation therefore uses converging evidence, not a single external gold standard.
6 Validity argument snapshot
| Claim | Intended use | Key assumption | Current status | Main threat | Decision |
|---|---|---|---|---|---|
| M0 provides a raw-score comparison benchmark | Comparison benchmark | Scorable-response denominator is a documented benchmark choice for current comparisons. | Current benchmark | Denominator confounds completion/reach. | Retain as benchmark |
| A2 supports broad achievement interpretation | IRT development baseline | Probe effects are nuisance structure, not separate reportable traits. | Partial / strongest IRT evidence | Local dependence, form effects, or unjustified band movement. | Use as development baseline |
| A3 should not alter achievement scoring | Research-only | Completion is separate from achievement unless decision evidence proves otherwise. | Not supported for scoring | Missingness mechanism ambiguous. | Research-only |
| A4 supports QC/context | Audit/context | Row states are process proxies, not traits. | Partial | Over-interpretation as cognition or exposure. | QC/context only |
| T1b/T1c estimate effective speed | Research-only profiles | RT signal is not mostly device/completion artefact. | Pending | Construct-irrelevant speed and modality effects. | Research-only |
| J0/J1/J2a/J2b improve decisions | Research-only | Joint model adds decision value beyond simpler models, passes diagnostics, and generalises beyond the current Term 3/4 validation slice. | Reviewed: J0/J1 research-only; J2a legacy only; J2b planned/not fit | Complexity, diagnostic fragility, and limited temporal generalisation. | Not decision evidence; J2a legacy only; J2b not yet fit |
7 Benchmark and classification logic
A high correlation between M0 and A2 is expected because both are based on the same scored item responses. It is not proof that A2 is better. The key question is where A2 disagrees with M0 and whether those disagreements are psychometrically and operationally justified: for example, whether movement is explained by item difficulty, probe/testlet dependence, form or item-mix differences, improved precision near cut-points, or removal of local dependence.
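One way to locate where A2 disagrees with M0 is to cross-tabulate band assignments and list the students who move. The sketch below assumes band labels per student are already available for both models; the student IDs, band assignments, and function name are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

BAND_ORDER = ["Very Low", "Low", "On Track"]


def band_movement(m0_bands: Dict[str, str],
                  a2_bands: Dict[str, str]) -> Tuple[Counter, List[tuple]]:
    """Cross-tabulate M0 vs A2 bands and list students whose band changes.

    Each mover is (student_id, m0_band, a2_band, shift), where a positive
    shift means A2 places the student in a higher band than M0.
    """
    table: Counter = Counter()
    movers = []
    for student_id in sorted(m0_bands):
        m0_band, a2_band = m0_bands[student_id], a2_bands[student_id]
        table[(m0_band, a2_band)] += 1
        if m0_band != a2_band:
            shift = BAND_ORDER.index(a2_band) - BAND_ORDER.index(m0_band)
            movers.append((student_id, m0_band, a2_band, shift))
    return table, movers


# Hypothetical band assignments for four students.
m0 = {"s1": "Low", "s2": "Low", "s3": "On Track", "s4": "Very Low"}
a2 = {"s1": "Low", "s2": "On Track", "s3": "On Track", "s4": "Low"}
table, movers = band_movement(m0, a2)
print(movers)
```

The list of movers is only the starting point. Each movement still needs one of the justifications named above, such as item difficulty, probe/testlet dependence, or improved precision near a cut-point.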
For a screener, model promotion requires decision-relevant improvement. Public reporting should focus on a compact classification set: sensitivity, specificity, false negatives, false positives, and band movement/stability; AUC may be included where useful. False negatives are a major validity risk because genuinely at-risk students may miss support, but false positives also matter because intervention resources are finite.
| Requirement | Promotion convention |
|---|---|
| Classification gain | Any gain must be practically meaningful, not just non-zero. |
| Specificity protection | Sensitivity gains must not create unacceptable over-flagging. |
| False-negative protection | Target risk groups must not experience increased missed-risk rates. |
| Band movement | Students moved between bands must show plausible criterion or psychometric evidence. |
| Fairness | No subgroup, form, device, or administration group should bear disproportionate burden. |
| Cut-point precision | Precision gains should matter near risk thresholds, not only on average. |
| Complexity | More complex models must justify added burden through decision value. |
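The compact classification set can be computed directly from screener flags and an outcome label. The sketch below is illustrative only: the flag vectors and the at-risk labels are invented, and no outcome labels or cut scores have been defined for this purpose yet.

```python
from typing import Dict, Sequence


def screening_metrics(flagged: Sequence[bool],
                      at_risk: Sequence[bool]) -> Dict[str, object]:
    """Sensitivity, specificity, and error counts for one set of screener flags."""
    if len(flagged) != len(at_risk):
        raise ValueError("flagged and at_risk must have the same length")
    pairs = list(zip(flagged, at_risk))
    tp = sum(f and r for f, r in pairs)
    fn = sum((not f) and r for f, r in pairs)
    fp = sum(f and (not r) for f, r in pairs)
    tn = sum((not f) and (not r) for f, r in pairs)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "false_negatives": fn,
        "false_positives": fp,
    }


# Hypothetical outcome labels and flags from a baseline and a candidate model.
at_risk = [True, True, True, False, False, False, False, False]
baseline_flags = [True, True, False, False, True, False, False, False]
candidate_flags = [True, True, True, False, True, True, False, False]

baseline = screening_metrics(baseline_flags, at_risk)
candidate = screening_metrics(candidate_flags, at_risk)
print(baseline)
print(candidate)
```

In this invented example the candidate removes the one false negative but adds a false positive, so sensitivity rises while specificity falls. Whether that trade is acceptable is exactly what the specificity-protection and false-negative-protection conventions are meant to decide.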
8 Known content gaps (next revision)
These are explicit content gaps still to be written into the public pages. Listing them here keeps the gap visible rather than hidden behind “Pending” cells.
- A2 specification block. Likelihood, parameterisation, and priors for the IRT development baseline, in the same style as the J0/J1/J2 block on the Response Time Modelling page.
- A3 specification block. Likelihood, parameterisation, priors, and a precise definition of the internal reliability-like index used to compare A3 against A2.
- Number-line PCM sensitivity evidence. Category occupancy and conclusion-stability tables across candidate number-line binning policies; see Number-Line PCM Policy.
- Cut scores, decision consistency, and test information near the cut-points, including the threshold-setting method.
- T1c entanglement headline. Estimated family speed correlates strongly with valid-rate (0.51–0.71) and trailing-rate (−0.63 to −0.79). This belongs near the top of the Response Time Modelling page, with a partial-correlation paragraph.
- J0/J1/J2 nesting diagram and a small J2 Year 1 diagnostic visual (divergences, R-hat, E-BFMI).
- J2a/J2b rapid-response threshold spec. J2a used the legacy \(r_{pi} = \mathbb{1}\{\mathrm{round}(\mathrm{rt}_{pi}) \le 2\,\mathrm{s}\}\) indicator. That global raw-second rule is now provisional only; the planned J2b uses family/item lower-tail thresholds and excludes or defers magnitude comparison.
- Term 3/4 pooling. One-sentence definition: speed/joint evidence is pooled at the person-administration level (student × administration event) across Terms 3 and 4, not a longitudinal Term 1→Term 3→Term 4 model.
- Anchor citations. Six inline citations to add: Yen (1984) at Q3; van der Linden (2007) at the lognormal RT model; Wang & Wilson (2005) at testlet/bifactor; Schnipke & Scrams (1997) / Wise & Kong (2005) at rapid-guessing; Standards for Educational and Psychological Testing (2014) at the validity argument framing.
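To make the rapid-response threshold gap listed above concrete, the sketch below contrasts the legacy global indicator with one possible family-level lower-tail rule. Several things are assumptions: the rounding convention (half-up here, whereas Python's built-in `round` rounds halves to even), the 10% quantile, the family names, and the response times. J2b has not been specified or fit, so this is not its definition.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def legacy_rapid_flag(rt_seconds: float, cutoff: int = 2) -> bool:
    """Legacy J2a-style indicator: rounded RT at or below a global cutoff.

    Uses half-up rounding (an assumption about the source rule).
    """
    return math.floor(rt_seconds + 0.5) <= cutoff


def family_lower_tail_thresholds(rows: Iterable[Tuple[str, float]],
                                 quantile: float = 0.1) -> Dict[str, float]:
    """Empirical lower-tail RT threshold per probe family (hypothetical rule)."""
    by_family = defaultdict(list)
    for family, rt in rows:
        by_family[family].append(rt)
    thresholds = {}
    for family, rts in by_family.items():
        rts.sort()
        index = max(0, math.ceil(quantile * len(rts)) - 1)
        thresholds[family] = rts[index]
    return thresholds


# Invented response times (seconds) for a fast family and a slow family.
fast = [0.6, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8]
slow = [1.5, 4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0, 12.0]
rows = [("subitising", rt) for rt in fast] + [("number_line", rt) for rt in slow]

thresholds = family_lower_tail_thresholds(rows)
legacy_fast = sum(legacy_rapid_flag(rt) for rt in fast)
family_fast = sum(rt <= thresholds["subitising"] for rt in fast)
print(thresholds, legacy_fast, family_fast)
```

In this invented data the global rule flags every response in the fast family as rapid, while the family-level rule flags only its lower tail. That is one illustration of why a single raw-second cutoff may not transfer across probe families.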
9 External and classification validation status
| Evidence | Validation role | Current status | Validation strength | Limitation |
|---|---|---|---|---|
| PAT Maths | Broad external maths achievement anchor | Pending | Pending | Broader construct, timing and instructional effects. |
| Teacher ratings | Ecological/instructional judgement | Pending | Pending | Subjective and may include classroom behaviour or expectations. |
| Later ENSSA waves | Longitudinal consistency within instrument family | Pending | Pending | Same instrument family; not fully independent. |
| Classification metrics | Screening decision utility | Pending | Pending | Requires cut-score/band definitions and outcome labels. |
| Band movement near thresholds | Operational impact and stability | Pending | Pending | Requires uncertainty near cut-points. |
| Fairness / subgroup checks | Construct-irrelevant variance and differential burden | Pending | Pending | Requires small-cell safeguards and appropriate subgroup data. |
10 Internal structure, comparability, and scale gates
| Gate | Required evidence | Current status |
|---|---|---|
| A2 internal structure | Dimensionality, local dependence before/after probe effects, probe/testlet variance, item fit, item parameter stability, and evidence that probe effects are nuisance rather than reportable subscores. | Partial |
| Form/term/cohort comparability | Anchor stability, item drift, form effects, term effects, link error, probe-family comparability, mode/device equivalence, and common-scale evidence. | Pending |
| Score/report scale | Theta orientation, any transformation to reporting scale, and comparability across forms/terms/year levels. | Pending documentation |
| Bands/cut scores | Definitions of Very Low, Low, On Track thresholds; threshold-setting method; uncertainty and decision consistency near cut-points. | Pending documentation |
Achievement band order, where bands are shown, should remain: Very Low → Low → On Track.
11 What we have learned so far
- The achievement side is not best described as a collection of unrelated probe scores. The current evidence supports a broad numeracy achievement score with probe/testlet effects rather than a large operational battery of separate ability scores.
- A2 should remain the IRT development baseline. It is the model to beat before anything else can be considered for live scoring.
- The legacy `is_attempted` field should be read as valid/scorable response, not as true exposure or effort. This is why the completion-sensitivity model (A3) is not a true reach model. A3 did show useful signal, but it did not consistently justify changing achievement scoring and remains research-only.
- Response-state categories (A4) are useful as context and QC. They should not be counted as wrong answers inside the achievement likelihood.
- Effective speed (T1b/T1c) is measurable under a clean data contract, but raw speed is not an achievement score and should not drive live risk bands.
- The useful direction is profile-aware: protect slow-accurate students, flag rapid-risk patterns for audit, and distinguish completion-constrained profiles from low achievement.
12 Fairness, disclosure, and reproducibility safeguards
| Fairness / comparability risk | Method check | Public reporting rule |
|---|---|---|
| Differential item functioning | DIF or item-parameter stability checks where subgroup data permit. | Suppress small cells |
| Differential prediction | Compare outcome relations across relevant groups where cell sizes permit. | Suppress small cells |
| Differential classification | Compare sensitivity, specificity, false negatives, false positives, and band movement by group where safe. | Suppress small cells |
| Profile-rate differences | Check whether speed/completion profiles concentrate by form, device, probe, school/class, or subgroup. | Suppress small cells |
| Device/mode effects | Audit whether timing or profile signals are driven by interface, modality, or platform behaviour. | No device-identifiable or school-identifiable reporting |
| School/class administration effects | Inspect administration clustering only with aggregation and suppression safeguards. | No school/class/teacher-identifiable reporting |
| Accessibility/accommodation | Check whether speed/profile interpretations would penalise students needing accommodations. | Use only where ethically and legally appropriate |
Small-cell suppression applies to fairness and subgroup summaries. Avoid school-, class-, teacher-, device-, or subgroup-identifiable summaries where N is too small.
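A minimal sketch of the suppression rule follows. The minimum cell size of 10, the group names, and the statistics are placeholders; the actual threshold is a policy decision that this page does not state.

```python
from typing import Dict, Optional


def suppress_small_cells(cells: Dict[str, Dict[str, Optional[float]]],
                         min_n: int = 10) -> Dict[str, Dict[str, object]]:
    """Blank every statistic, including n, for groups smaller than min_n."""
    released = {}
    for group, stats in cells.items():
        if stats["n"] < min_n:
            released[group] = {key: None for key in stats}
            released[group]["suppressed"] = True
        else:
            released[group] = dict(stats, suppressed=False)
    return released


# Invented subgroup summary for illustration.
summary = {
    "group_a": {"n": 42, "false_negative_rate": 0.12},
    "group_b": {"n": 4, "false_negative_rate": 0.50},
}
released = suppress_small_cells(summary)
print(released)
```

This handles primary suppression only. If group totals are also published, a suppressed cell can sometimes be recovered by subtraction, so a real release process would need complementary suppression as well.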
| Model family | Required reproducibility metadata | Current public status |
|---|---|---|
| M0 | Scoring convention version, denominator definition, band thresholds, extract date. | Pending documentation |
| A1/A2/A3 | Data extract date, Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, output artefact path. | Pending documentation |
| A4 | Response-state builder version, row inclusion rules, state definitions, extract date. | Pending documentation |
| T1b/T1c | RT data contract version, Stan/data builder files, exclusions, run ID, diagnostics artefact path. | Pending documentation |
| J0/J1/J2a/J2b | Stan file, run ID, seed, chains/iterations, inclusion/exclusion rules, diagnostic extraction artefact path. | Pending diagnostics |
Key claims in the final public version should be supported with short references to psychometric standards and response-time modelling sources, including the Standards for Educational and Psychological Testing, Wu et al. on validity/reliability/IRT, van der Linden on response-time and speed-accuracy modelling, rapid-guessing/testlet-dependence literature, and ENSSA screening design materials.
13 What would change the recommendation?
The master status table above lists what would change each model’s status individually. The table below covers cross-cutting scenarios that would change the overall recommendation.
| Evidence finding | Decision implication |
|---|---|
| A2 improves classification and precision near cut-points over M0 without fairness or comparability harms. | Consider operational evaluation pathway for A2. |
| A2-vs-M0 band movers are psychometrically plausible and not concentrated by form, device, school/class, probe, or subgroup. | Supports A2 transition case; otherwise investigate before promotion. |
| A3 improves fit but not external/classification evidence, or increases subgroup burden. | Keep A3 research-only. |
| A4 states vary strongly by form/device/administration. | Use A4 for QC/admin review only; do not student-report. |
| T1c speed improves false-negative detection without penalising slow-accurate students. | Consider guarded profile reporting after further validation and actionability checks. |
| J0/J1/J2a/J2b have persistent divergences, unstable correlations, no temporal generalisation, or no decision gain over simpler models. | Do not use joint models for decisions. |
14 Path to operationalising speed responsibly
The objective is not to avoid speed forever but to operationalise it responsibly. A promoted speed-related output should be a guarded response-process profile, not a faster-is-better score.
| Gate | Promotion requirement |
|---|---|
| Construct | Profiles distinguish meaningful response processes, not artefacts of form, device, or scoring. |
| Data contract | Logging distinguishes seen/available, positive RT or focus-event proxy, responded, valid/scorable, correct, and timing stages. |
| Fairness | Profile rates and effects are checked by cohort, school/class where appropriate, subgroup proxies, device/mode, and administration conditions. |
| Stability | Profiles are stable across terms, cohorts, probe families, and reasonable thresholds. |
| Interpretation | Reports protect slow-accurate students and avoid implying faster is better. |
| Actionability | No score, profile, or flag is reported unless it maps to a defensible teacher action and has misuse warnings. |
| Monitoring | Live use has drift checks, fairness checks, override rules, and review cadence. |
Candidate profile labels for future validation:
- Efficient-accurate: accurate and timely, but not rewarded merely for being fast.
- Slow-accurate: accurate with slower response time; must be protected from speed penalties.
- Rapid-risk: very rapid responding with low accuracy or suspicious response-process pattern.
- Slow-low-accuracy: low accuracy plus slow/effortful response process.
- Completion-constrained: evidence suggests items were not reached or process was interrupted.
- Inconclusive: logging or pattern does not support a defensible profile.
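The candidate labels above could be assigned by an ordered rule set along the following lines. This is a hypothetical sketch: the boolean inputs stand in for classifications that would themselves need validated thresholds, and the rule order is an assumption rather than an agreed design.

```python
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    EFFICIENT_ACCURATE = "Efficient-accurate"
    SLOW_ACCURATE = "Slow-accurate"
    RAPID_RISK = "Rapid-risk"
    SLOW_LOW_ACCURACY = "Slow-low-accuracy"
    COMPLETION_CONSTRAINED = "Completion-constrained"
    INCONCLUSIVE = "Inconclusive"


@dataclass(frozen=True)
class ProcessSummary:
    logging_adequate: bool        # data contract satisfied for this student
    completion_constrained: bool  # items not reached or process interrupted
    accurate: bool
    rapid: bool
    slow: bool


def assign_profile(s: ProcessSummary) -> Profile:
    """Ordered rules; anything unmatched falls back to Inconclusive."""
    if not s.logging_adequate:
        return Profile.INCONCLUSIVE
    if s.completion_constrained:
        return Profile.COMPLETION_CONSTRAINED
    if s.rapid and not s.accurate:
        return Profile.RAPID_RISK
    if s.accurate and s.slow:
        return Profile.SLOW_ACCURATE  # protected: no speed penalty attached
    if s.accurate:
        return Profile.EFFICIENT_ACCURATE  # same label whether or not rapid
    if s.slow:
        return Profile.SLOW_LOW_ACCURACY
    return Profile.INCONCLUSIVE


slow_but_right = ProcessSummary(True, False, accurate=True, rapid=False, slow=True)
print(assign_profile(slow_but_right).value)
```

Two design choices are worth noting. Accurate students receive the same label whether or not they are rapid, so speed alone earns nothing, and a low-accuracy pattern that is neither rapid nor slow is left as Inconclusive rather than forced into a profile.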
15 Where to read more
- Accuracy Modelling: deeper evidence for a single achievement score with testlet effects.
- Number-Line PCM Policy: how continuous number-line accuracy is converted into ordered model categories for GPCM.
- Omitted and Non-Scored Responses: what response records tell us, A3 sensitivity findings, and A4 response-state categories.
- Response Time Modelling: RT data contract, T1b/T1c, rapid-risk, slow-accurate protection, J0/J1/J2a, and planned J2b.
- IRT Model - Foundation 2025 Term 1: archival Foundation T1 calibration and methodology reference.