Response Time Modelling

Raw speed is not safe as an achievement adjustment — but guarded accuracy–speed profiles may be, once evidence gates are cleared

1 Summary

The screener records response times (RTs), but RTs are not automatically interpretable as mathematical fluency or effort. A response time can reflect item difficulty, interface modality, rounding/capping, whether the item was reached, whether the final response was scorable, administration conditions, and student strategy. This page explains the clean RT contract and the path from speed models to possible response-process profiles.

Speed is not ready to change live bands, but it is not being abandoned. The current evidence supports a guarded path: clean the response-time data contract, estimate effective speed only where timing is interpretable, protect slow-accurate students, audit rapid-risk patterns, and test whether joint accuracy–RT models can support stable response-process profiles. The active joint path is J0/J1/J2, now split conceptually into legacy J2a and planned J2b. The full run shows that J0/J1 are viable research models and that J2 Foundation is interpretable as a legacy rapid-response sensitivity result, but J2 Year 1 failed diagnostics and the global RT <= 2s rapid rule is now legacy/provisional only.

2 RT data contract — what is included and why

The clean speed likelihood uses a deliberately narrow response-time contract.

Include	Exclude	Reason
Timed-math probes	Number-line tasks	The construct is timed-math effective speed, not general platform speed.
Valid/scorable rows	Invalid/non-scorable rows	The RT likelihood should not mix valid work with the mechanisms that produce invalid rows.
Positive response time	Zero-RT QC rows	Zero RT is a QC edge case, not a response-time observation.
Rounded Janison RT treated as interval-censored	STPM/STDD speed anchors and legacy Year 1 Term 3 ASDD_2025/ASDD	These rows have different constructs or known modality/scoring risks.

Response time is modelled as a rounded / interval-censored lognormal process using lower and upper RT bounds supplied to the Stan data. A larger latent speed parameter means faster effective speed. Achievement remains separate unless a joint model is explicitly being tested. See Model specifications for the full formulae.

The Janison rounding convention still needs final platform confirmation before the interval-censoring interpretation is treated as final. If recorded seconds are floored, ceiled, rounded to nearest second, or exact elapsed integers, the correct interval differs.
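A minimal Python sketch of why the convention matters, assuming recorded RTs are whole seconds. The function name `rt_bounds` and the convention labels are hypothetical illustrations, not the project's pipeline; the bounds actually passed to Stan should follow whatever the platform confirms.

```python
from typing import Tuple


def rt_bounds(recorded_sec: int, convention: str) -> Tuple[float, float]:
    """Return (lower, upper) bounds in seconds for the true elapsed time.

    The convention labels are hypothetical names for the four cases
    discussed in the text; none is confirmed for Janison.
    """
    r = float(recorded_sec)
    if convention == "floor":
        # Recorded value is the elapsed time rounded down.
        return (r, r + 1.0)
    if convention == "ceil":
        # Recorded value is the elapsed time rounded up.
        return (max(r - 1.0, 0.0), r)
    if convention == "nearest":
        # Recorded value is the elapsed time rounded to the nearest second.
        return (max(r - 0.5, 0.0), r + 0.5)
    if convention == "exact":
        # Degenerate interval: a density, not a CDF difference, would be needed.
        return (r, r)
    raise ValueError(f"unknown convention: {convention}")


# The same recorded value implies a different interval under each convention.
for name in ("floor", "ceil", "nearest", "exact"):
    print(name, rt_bounds(3, name))
```

Under this sketch a recorded 3 s maps to (3, 4), (2, 3), (2.5, 3.5), or a point value, which is why the interval-censoring interpretation cannot be finalised before the convention is known.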

The latest speed/joint evidence is grounded mainly in pooled Term 3/4 person-administration data for the clean timed-math families. It is not a full Term 1/3/4 longitudinal speed-profile model. Term 1 evidence remains part of the broader project history, but temporal generalisation of speed profiles is still a future gate.

3 Speed-only model (pooled): T1b

The pooled speed model estimates effective speed from timed-math, valid/scorable, positive-RT rows after excluding legacy/problematic forms. It is under active validation only and must not influence live bands.

Year	cor(T1 FE speed, Stan speed)	cor(Stan speed, current achievement)	Interpretation
Foundation	0.923	0.125	Stable speed signal; weak relation to achievement.
Year 1	0.928	0.189	Stable speed signal; modest relation to achievement.

The weak-to-modest relationship with achievement is descriptive only. Observed speed-achievement correlations should not be used to infer a cognitive speed–ability relation unless item time-intensity, item difficulty, person sampling, and form design are accounted for. This argues against folding raw speed directly into the achievement score, but supports further response-process validation.

4 Speed-only model (family-specific): T1c

Family-specific speed models test whether one pooled speed dimension is enough. They show related but not identical speed signals across probe families. T1c is speed-only: family speed dimensions are not jointly modelled with achievement in this branch; the joint family-speed model enters at J1/J2.

Year	Family	N	cor(family speed, pooled speed)	cor(family speed, current achievement)	cor(family speed, valid rate)	cor(family speed, trailing rate)
Foundation	Match quantity	2525	0.755	0.157	0.511	-0.632
Foundation	Missing number	2491	0.859	0.124	0.548	-0.664
Year 1	Arithmetic	1366	0.862	0.255	0.714	-0.786
Year 1	Missing number	2558	0.926	0.190	0.621	-0.718

Family-speed estimates are related but not interchangeable:

Year	Family comparison	Person-speed correlation	Interpretation
Foundation	Missing number vs match quantity	0.433	Moderate relation: one pooled speed factor is probably too crude for these families.
Year 1	Arithmetic vs missing number	0.566	Moderate relation: arithmetic and missing-number speed share signal but are not identical.

These results support broad family-specific speed factors, not a separate speed factor for every small probe or item.

The strong association between estimated speed and valid/trailing row rates shows that the speed signal is not pure automaticity. It also reflects opportunity, completion, item-row dynamics, and possibly administration or modality effects. This is another reason speed should remain under validation and should not be used as a direct achievement adjustment.

5 Why family-specific speed, not family-specific achievement?

The current evidence supports a broad achievement score with probe/testlet effects, but allows speed to vary by task family. These are not contradictory claims.

Accuracy asks whether the student can solve the item. Response time asks how quickly the student produces a valid response under a specific task process and interface. Correctness across tasks can be driven by a broad numeracy achievement dimension, while time taken can vary more by format, motor demand, response modality, visual search, and strategy.

The Ability Structure Review supports one broad achievement score with local probe effects rather than separate operational achievement subscores. Family-specific achievement scores would need stronger evidence: enough items per family, reliable subscore precision, external validation, stability over time, fairness checks, and clear instructional interpretation.

Family-specific speed is safer to treat as response-process context. The current structure is therefore:

Achievement: one broad theta + item/probe/testlet effects
Speed: broad task-family speed factors for eligible timed-math families
Rapid-risk: separate rapid-response sensitivity/audit signal
A4: separate response-state/context layer outside achievement scoring

6 Which task families are covered?

The current clean speed and joint runs do not estimate a unique speed factor for every task. They cover broad eligible timed-math families with enough clean RT evidence.

Task family / probe type	Current speed treatment	Reason
Missing number	Included as timed-math speed family where eligible.	Timed task process with clean valid/scorable positive-RT rows.
Match quantity	Included as timed-math speed family where eligible.	Timed task process with clean valid/scorable positive-RT rows.
Arithmetic	Included for Year 1 arithmetic where eligible and not affected by legacy modality exclusions.	Timed task process, but modality/form caveats matter.
Number line	Excluded from main speed factor; treat RT as separate number-line response-process/context evidence if used.	Untimed/spatial estimation; time may reflect deliberation or clicking strategy rather than fluency.
Decomposition / DMT	Not included in the current clean joint speed evidence unless timing/data-contract checks later justify it.	Requires timing-contract, RT-distribution, and construct checks before inclusion.
Magnitude comparison	Not included in the current clean joint speed evidence unless timing/data-contract checks later justify it.	Requires timing-contract, RT-distribution, and construct checks before inclusion.

Untimed tasks can still have useful RT/context information, but that evidence should not be folded into the timed-math speed factor. For number line in particular, a better future construct is a separate response-style audit such as rapid-clicking versus deliberate estimation, not general effective speed.

7 Rapid-risk and slow-accurate audit cohorts

The descriptive profile flags are audit cohorts, not interventions or classifications. They test whether accuracy and RT together can separate qualitatively different response-process patterns.

Threshold definitions used for these audit cohorts should be treated as provisional. In particular, RT <= 2s is a legacy comparator, not an operational rapid-risk definition.

Flag	Accuracy condition	Speed / RT condition	Completion condition	Minimum evidence
Slow-accurate	High reached accuracy / upper achievement group	Slow speed / lower speed quintile	Sufficient valid RT evidence	Pending documentation
Fast-low-accuracy	Low reached accuracy / lower achievement group	Fast speed / upper speed quintile	Sufficient valid RT evidence	Pending documentation
Rapid-risk	Low accuracy or suspicious rapid pattern	Elevated rapid-response rate; legacy examples used RT <= 2s rows	Sufficient valid RT evidence	Pending documentation
High-trailing high-reached-accuracy	High reached accuracy	Not primarily speed-defined	High trailing proxy rate	Pending documentation
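The two quintile-based flags in the table can be sketched as follows. This is an illustration only: the function name `audit_flags`, the label strings, the use of quintiles for the achievement groups, and the `min_valid_rt` default are all assumptions, because the minimum-evidence rules are still pending documentation.

```python
from statistics import quantiles


def audit_flags(achievement, speed, n_valid_rt, min_valid_rt=5):
    """Label each person-administration with a provisional audit cohort.

    achievement, speed: per-person estimates (larger speed = faster).
    n_valid_rt: number of valid RT rows per person.
    min_valid_rt: hypothetical minimum-evidence rule, not a documented value.
    """
    # quantiles(..., n=5) returns the four cut points between quintiles.
    ach_cuts = quantiles(achievement, n=5)
    spd_cuts = quantiles(speed, n=5)
    ach_lo, ach_hi = ach_cuts[0], ach_cuts[-1]
    spd_lo, spd_hi = spd_cuts[0], spd_cuts[-1]

    labels = []
    for a, s, n in zip(achievement, speed, n_valid_rt):
        if n < min_valid_rt:
            labels.append("insufficient_evidence")
        elif a >= ach_hi and s <= spd_lo:
            labels.append("slow_accurate")        # upper achievement, lower speed
        elif a <= ach_lo and s >= spd_hi:
            labels.append("fast_low_accuracy")    # lower achievement, upper speed
        else:
            labels.append("unflagged")
    return labels
```

For simplicity the cut points here are computed over all rows, including those later marked as having insufficient evidence; a real audit would need to decide whether those rows should contribute to the quintiles.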

Year	Family	N person-admins	Slow-accurate	Fast-low-accuracy	Rapid-risk	High-trailing high-reached-accuracy
Foundation	Missing number	2491	190	205	479	215
Foundation	Match quantity	2525	157	237	184	156
Year 1	Missing number	2558	292	173	230	298
Year 1	Arithmetic	1366	80	86	105	91

7.1 Accuracy-speed surface

The key point: some high-achievement groups are not the fastest, and some rapid groups need audit rather than reward. A one-dimensional speed rule would misclassify both.

8 Slow-accurate protection

The key operational principle is that faster must not be treated as better. Slow-accurate students may be demonstrating the target achievement through a slower response process, so any profile system must protect them from being penalised for speed alone.

9 Joint accuracy–RT models: J0/J1/J2

The joint accuracy–response-time family (J0, J1, J2) tests whether achievement and speed should be estimated together. The full J0/J1/J2 run is now available and has been reviewed at the aggregate diagnostic level.

Model	Purpose	Diagnostic result	Current status	Can it change live bands now?
J0	Baseline joint accuracy–RT: one achievement and one speed dimension, bivariate normal.	Basic two-chain diagnostics acceptable in Foundation and Year 1: 0 divergences; max treedepth 6; min E-BFMI 0.596; scalar max R-hat 1.015.	Research summary available; not decision evidence.	No
J1	Extends J0 with family-specific speed dimensions.	Basic two-chain diagnostics acceptable in Foundation and Year 1: 0 divergences; max treedepth 6; min E-BFMI 0.676; scalar max R-hat 1.011.	Research summary available; not decision evidence.	No
J2	Extends J1 with a rapid-response indicator in the accuracy model.	Mixed: Foundation acceptable, but Year 1 failed diagnostics (544 divergences in one chain, E-BFMI 0.029, scalar max R-hat 12.39).	Not ready; Year 1 requires reparameterisation or model revision.	No

Run ID: m3-joint-accuracy-rt-j0j1j2-full-20260525T015526Z.

Joint models must beat simpler models on decision-relevant evidence. They do not win by being more complex or more theoretically complete. The current run supports research interpretation of J0/J1 only; it does not clear any joint model for reporting or live bands.

The immediate next modelling step is not to promote the old J2. It is now treated as J2a (legacy sensitivity only) because its rapid flag is the global RT <= 2s rule. A J2 Year 1 rerun is a comparability option only, not the main promotion gate. The preferred next candidate is J2b, which uses family/item rapid-threshold flags within the clean timed families. Residual-RT conditional accuracy is reserved for a later J3 candidate.

9.1 What the full run says so far

Model	Year	N person-admins	cor(theta, A2)	cor(speed, median RT)	cor(speed, rapid rate)	cor(speed, accuracy)	Interpretation status
J0	Foundation	2537	0.595	-0.804	0.722	-0.205	Research-only
J0	Year 1	2559	0.604	-0.757	0.758	-0.090	Research-only
J1	Foundation	2537	0.600	-0.792	0.676	-0.166	Research-only
J1	Year 1	2559	0.607	-0.757	0.736	-0.054	Research-only
J2	Foundation	2537	0.606	-0.793	0.674	-0.159	Foundation only; caution
J2	Year 1	2559	0.607	-0.747	0.720	-0.020	Not interpretable: failed diagnostics

The J0/J1 speed dimensions behave like response-time process estimates: they correlate strongly with median RT and rapid-response rate, but only weakly with accuracy. This supports the current governance position: speed is response-process evidence, not an achievement adjustment. The moderate J0/J1 correlations with A2 achievement (about 0.60) also show that the joint achievement scale is not simply a drop-in replacement for A2; disagreements would need separate validation before any reporting use.

For J2a, the Foundation rapid-response coefficient is strongly negative (about -1.04) under acceptable diagnostics, supporting a rapid-response sensitivity signal for that slice. The Year 1 rapid-response coefficient is also negative in the scalar summaries, but the Year 1 fit failed diagnostics and should not be interpreted. Treat old J2/J2a as legacy sensitivity evidence only; it does not define operational rapid-risk.

10 What is established, and what is not

Established so far	Not established yet
Joint accuracy–speed modelling is technically viable for J0/J1 and for J2 Foundation.	J2 Year 1 is not trustworthy, and old J2/J2a is not the preferred next model because its rapid rule is global RT <= 2s.
Achievement and speed are related but distinct dimensions.	That speed can safely change live achievement bands; until this is shown, speed must not change them.
The speed estimate behaves as expected: higher speed corresponds to shorter median RT and higher rapid-response rate.	External validity and classification utility of speed profiles beyond A2 are not yet established.
Family-specific speed is justified as response-process context for eligible timed-math families.	Operational cut rules for profiles are not final.
Legacy J2a Foundation shows a strong negative rapid-response effect, but the global RT <= 2s rule is provisional only.	Fairness and safety across cohorts, schools/classes, devices/admin contexts, and subgroups are not yet established.
Slow-accurate students exist and require protection from speed penalties.	Temporal generalisation from Term 3/4 to Term 1/3/4 is not yet established.

The operational direction is therefore: keep achievement as the separate scoring construct, keep A2 as the IRT achievement reference for model comparisons, and develop guarded accuracy–speed response-process profiles around it. M0 remains the raw-score comparison benchmark for live reporting.

11 J2b rapid-threshold decision

The current decision is to keep old J2 as J2a / legacy sensitivity evidence and use J2b as the preferred next candidate if this branch continues. J2b remains shadow_only_not_for_live_bands. It would use only the current clean timed families:

Foundation: missing number and match quantity;
Year 1: missing number and arithmetic under the existing clean data contract.

Magnitude comparison is excluded/deferred because it is a binary-choice special case with interval/discrete RT behaviour; ultra-fast responses can be accurate, and side-bias/distance-effect metadata are not yet available. Number line, decomposition, STPM/picture match, and STDD remain collateral/context evidence, not timed-speed-factor inputs.

The planned primary J2b flag is:

rapid_j2b = rt_sec <= min(item_p05_rt, family_p05_rt)

computed within the clean valid/scorable positive-RT timed-math contract. Accuracy-collapse evidence is used to validate or challenge the flag; it does not define the flag. Residual-log-RT thresholding is reserved for a later J3 candidate.
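The planned flag can be sketched in Python as below. The row layout (dicts with `item`, `family`, `rt_sec`), the helper names, and the choice of percentile method are assumptions for illustration; J2b has not been implemented, and the rows are assumed to have already passed the clean data contract.

```python
from collections import defaultdict
from statistics import quantiles


def p05(values):
    """5th percentile by linear interpolation (one of several possible conventions)."""
    return quantiles(values, n=20, method="inclusive")[0]


def flag_rapid_j2b(rows):
    """Return one boolean per row: rt_sec <= min(item_p05_rt, family_p05_rt).

    rows: iterable of dicts with hypothetical keys 'item', 'family', 'rt_sec'.
    Each item and each family needs at least two RT rows.
    """
    rows = list(rows)
    by_item = defaultdict(list)
    by_family = defaultdict(list)
    for row in rows:
        by_item[row["item"]].append(row["rt_sec"])
        by_family[row["family"]].append(row["rt_sec"])

    item_p05_rt = {item: p05(rts) for item, rts in by_item.items()}
    family_p05_rt = {fam: p05(rts) for fam, rts in by_family.items()}

    return [
        row["rt_sec"] <= min(item_p05_rt[row["item"]], family_p05_rt[row["family"]])
        for row in rows
    ]
```

Taking the minimum of the two thresholds means an intrinsically slow item cannot have responses flagged merely for being fast relative to that item when they are ordinary for the family, and vice versa.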

12 Model specifications

This page focuses on speed and joint accuracy–RT specifications. The A1/A2/A3 achievement-side specifications belong in the State of Play and Ability Structure Review. The joint models below use an A2-style achievement component, but the formulae here focus on the RT and joint-model additions.

12.1 T1b — Speed-only model (pooled)

Response time \(T_{pi}\) is represented in Stan by lower and upper bounds \((\ell_{pi}, u_{pi})\) and modelled as interval-censored. Current runs used the rounded-RT data contract, but the exact Janison convention still needs final platform verification before interpreting the bounds as floor, ceiling, nearest-second rounding, or exact elapsed integer seconds.

\[\Pr(\ell_{pi} < T_{pi} \le u_{pi}) = \Phi_{\text{LN}}\!\left(u_{pi};\, \mu_{pi},\, \sigma\right) - \Phi_{\text{LN}}\!\left(\ell_{pi};\, \mu_{pi},\, \sigma\right)\]

where \(\Phi_{\text{LN}}\) is the lognormal CDF and the location parameter is:

\[\mu_{pi} = \beta_0 + \beta_i - \text{speed}_p, \quad \text{speed}_p = \sigma_{\text{speed}} \cdot z_p\]

A larger \(\text{speed}_p\) shifts the RT distribution left — meaning faster expected response. \(\beta_i\) is an item time-intensity effect (mean-centred); larger values imply longer expected response time. \(\beta_0\) is the overall intercept.
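The interval-censored likelihood above can be checked numerically with a short Python sketch. The function names and the example values are illustrative assumptions, not the Stan code; a Stan implementation would more likely work on the log scale (for example with `lognormal_lcdf` and `log_diff_exp`) for numerical stability.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """Lognormal CDF with location mu and scale sigma on the log scale."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))


def interval_loglik(lower, upper, beta0, beta_i, speed_p, sigma):
    """Log-probability that the true RT falls in (lower, upper].

    mu = beta0 + beta_i - speed_p, so a larger speed_p lowers the location.
    """
    mu = beta0 + beta_i - speed_p
    prob = lognormal_cdf(upper, mu, sigma) - lognormal_cdf(lower, mu, sigma)
    return math.log(prob)


# Hypothetical values: intercept at the prior mean log(5), an average item,
# and one faster and one slower person observed in the (2, 3] second interval.
beta0, sigma = math.log(5.0), 0.5
fast = interval_loglik(2.0, 3.0, beta0, 0.0, speed_p=0.5, sigma=sigma)
slow = interval_loglik(2.0, 3.0, beta0, 0.0, speed_p=-0.5, sigma=sigma)
print(fast > slow)  # a short interval is more likely for the faster person
```

The comparison illustrates the sign convention: because speed enters the location with a minus sign, the faster person assigns more probability to short response intervals.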

Priors: \(z_p \sim \mathcal{N}(0,1)\); \(\;\beta_0 \sim \mathcal{N}(\log 5,\, 1)\); \(\;\sigma \sim \text{Exp}(1)\) with lower bound 0.05; \(\;\sigma_{\text{speed}},\, \sigma_{\text{item}} \sim \text{Exp}(1)\) with lower bound 0.01.

12.2 T1c — Speed-only model (family-specific)

As T1b, but with a separate speed factor \(\tau_{pf}\) for each probe family \(f\):

\[\mu_{pif} = \beta_0 + \beta_i - \tau_{pf}\]

Each family has its own residual SD \(\sigma_f\). The person speed vector \((\tau_{p1}, \dots, \tau_{pF})\) is not jointly modelled with achievement in T1c; joint modelling enters in J0/J1/J2.

12.3 J0 — Joint accuracy–speed (baseline)

Achievement \(\theta_p\) and speed \(\tau_p\) are drawn from a bivariate normal via the Stan non-centred parameterisation:

\[\begin{pmatrix}\theta_p \\ \tau_p\end{pmatrix} = \text{diag}(\boldsymbol{\sigma}_{\text{person}})\,\mathbf{L}_\Omega\,\mathbf{z}_p, \quad \mathbf{z}_p \sim \mathcal{N}_2(\mathbf{0},\, \mathbf{I})\]

\[\mathbf{L}_\Omega \sim \text{LKJ-Cholesky}(2), \quad \boldsymbol{\sigma}_{\text{person}} \sim \text{Exponential}(1)\]

This matches the Stan code using diag_pre_multiply(sigma_person, L_Omega).
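As a rough Python illustration of the non-centred construction (not the Stan code itself), the sketch below builds the 2x2 Cholesky factor of a correlation matrix from a single correlation `rho` and applies diag(sigma) L z. The function names, the fixed `rho`, and the scale values are hypothetical; in the model, `L_Omega` and `sigma_person` are estimated.

```python
import math
import random


def cholesky_corr_2d(rho):
    """Lower-triangular Cholesky factor of [[1, rho], [rho, 1]]."""
    return [[1.0, 0.0], [rho, math.sqrt(1.0 - rho * rho)]]


def person_params(z, sigma_person, L_omega):
    """Compute diag(sigma_person) @ L_omega @ z for one person."""
    dim = len(z)
    return tuple(
        sigma_person[r] * sum(L_omega[r][c] * z[c] for c in range(dim))
        for r in range(dim)
    )


def simulate_persons(n, sigma_person, rho, seed=0):
    """Draw n (theta, tau) pairs from independent standard-normal z."""
    rng = random.Random(seed)
    L_omega = cholesky_corr_2d(rho)
    return [
        person_params((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), sigma_person, L_omega)
        for _ in range(n)
    ]


draws = simulate_persons(20000, sigma_person=(1.0, 0.4), rho=0.3)
```

With enough draws, the sample correlation between the simulated theta and tau values should sit close to the chosen `rho`, and their standard deviations close to `sigma_person`, which is the property the non-centred form preserves while sampling only standard-normal z.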

The accuracy likelihood is:

\[\Pr(Y_{pi} = 1) = \text{logit}^{-1}(b_0 + \theta_p - b_i)\]

The RT likelihood is the same interval-censored lognormal as T1b, with \(\mu_{pi} = \beta_0 + \beta_i - \tau_p\) and a single residual SD \(\sigma_{\text{RT}}\).

Item effects are mean-centred: \(b_i = \sigma_b \cdot (\tilde{b}_i - \bar{\tilde{b}})\); \(\;\beta_i = \sigma_\beta \cdot (\tilde{\beta}_i - \bar{\tilde{\beta}})\).
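A small Python sketch of the centring step, with hypothetical raw values. The function name is illustrative; the point is only that the scaled, centred effects sum to zero, which leaves the intercepts \(b_0\) and \(\beta_0\) to carry the overall level.

```python
def centred_item_effects(raw_effects, scale):
    """Return scale * (raw - mean(raw)) for each raw item effect."""
    mean_raw = sum(raw_effects) / len(raw_effects)
    return [scale * (x - mean_raw) for x in raw_effects]


# Hypothetical raw item effects and scale (sigma_b or sigma_beta in the text).
b = centred_item_effects([0.2, -1.0, 0.5, 1.1], scale=0.8)
print(abs(sum(b)) < 1e-12)  # centred effects sum to zero up to rounding
```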

The posterior correlation \(\rho(\theta, \tau)\) is reported via the generated quantities block.

12.4 J1 — Joint accuracy–family speed

Extends J0 to \(F\) probe families. The person parameter vector is \((F+1)\)-dimensional:

\[(\theta_p,\; \tau_{p1},\; \dots,\; \tau_{pF})^\top \;\sim\; \mathcal{N}_{F+1}\!\left(\mathbf{0},\; \boldsymbol{\Sigma}\right)\]

\[\boldsymbol{\Sigma} = \text{diag}(\boldsymbol{\sigma})\,\boldsymbol{\Omega}\,\text{diag}(\boldsymbol{\sigma}), \quad \boldsymbol{\Omega}=\mathbf{L}_\Omega\mathbf{L}_\Omega^\top, \quad \mathbf{L}_\Omega \sim \text{LKJ-Cholesky}(2)\]

The non-centred draw is \(\mathbf{x}_p = \text{diag}(\boldsymbol{\sigma})\mathbf{L}_\Omega\mathbf{z}_p\); the Stan code implements the equivalent row-vector form with diag_pre_multiply(sigma_person, L_Omega).

The RT likelihood for item \(i\) in family \(f\) uses the family-specific speed \(\tau_{pf}\) and family-specific residual \(\sigma_f\):

\[\mu_{pif} = \beta_0 + \beta_i - \tau_{pf}\]

The accuracy likelihood is identical to J0.

12.5 J2a — Legacy joint accuracy–family speed + rapid-response check

Extends J1 with a binary rapid-response indicator \(r_{pi} \in \{0, 1\}\) in the accuracy model:

\[\Pr(Y_{pi} = 1) = \text{logit}^{-1}(b_0 + \theta_p - b_i + \gamma_{\text{rapid}} \cdot r_{pi})\]

\[\gamma_{\text{rapid}} \sim \mathcal{N}(0,\, 1)\]

The parameter \(\gamma_{\text{rapid}}\) tests whether accuracy is systematically different for very rapid responses under the legacy RT <= 2s flag. It does not alter the RT likelihood. A negative estimate means the legacy rapid flag is accuracy-negative on average in that data slice; it is not a causal estimate, not a student trait, not a full rapid-guessing model, and not an operational rapid-risk definition. J2b replaces the global raw-second flag with family/item thresholds if this branch proceeds.
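To give a sense of scale, the Python sketch below evaluates the J2a accuracy model with the approximate Foundation coefficient reported on this page (about -1.04). The baseline of zero on the logit scale is a hypothetical choice made only so the non-rapid probability is 0.5; the function names are illustrative.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct_j2a(b0, theta_p, b_i, gamma_rapid, rapid):
    """Probability of a correct response; rapid is the observed 0/1 indicator."""
    return inv_logit(b0 + theta_p - b_i + gamma_rapid * rapid)


# Hypothetical person-item pair sitting exactly at 50% when not rapid.
p_not_rapid = p_correct_j2a(0.0, 0.0, 0.0, gamma_rapid=-1.04, rapid=0)
p_rapid = p_correct_j2a(0.0, 0.0, 0.0, gamma_rapid=-1.04, rapid=1)
print(round(p_not_rapid, 2), round(p_rapid, 2))
```

For this illustrative baseline the modelled probability of a correct response falls from 0.5 to roughly 0.26 when the legacy rapid flag is set. This is a descriptive shift under the model, subject to all the caveats above, and applies only to the Foundation slice where diagnostics were acceptable.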

13 Priors and identification — audit reference

Model	Likelihood	Person parameters	Key identification constraint
T1b/T1c (speed-only, rounded lognormal)	Interval-censored lognormal RT.	speed_p = σ_speed · z_p; larger speed = faster.	Item time effects mean-centred; person speed centred by prior.
J0 (joint accuracy–RT)	Bernoulli-logit accuracy + interval-censored lognormal RT.	(θ_p, τ_p) from 2-dim Cholesky; larger τ = faster.	Item difficulty and time effects mean-centred.
J1 (joint accuracy–family speed)	As J0 with family-specific RT residual SD.	(θ_p, τ_p1, …, τ_pF) from (F+1)-dim Cholesky.	As J0; family speed factors share one multivariate/correlated prior with θ in J1/J2.
J2a (legacy joint + rapid-response check)	As J1 with rapid-response term in accuracy linear predictor.	Same as J1 plus γ_rapid in accuracy predictor.	As J1; rapid indicator is observed, not a person parameter.

14 Diagnostics and validation status

Detailed Bayesian checks belong in a technical appendix or internal diagnostics note. The public status is deliberately compact.

Model	Run status	Diagnostics status	Checks required before interpretation	Decision status
T1b	Completed	Partial / under review	PPC by item/probe and RT tail; prior sensitivity where needed; R-hat/ESS/divergences/treedepth/BFMI.	Research-only
T1c	Completed	Partial / under review	Family-specific PPC, RT tail checks, profile-rate checks, prior sensitivity where needed.	Research-only
J0	Completed	Acceptable basic full-run diagnostics	0 divergences, acceptable scalar R-hat and E-BFMI in full run; still needs PPC and decision validation.	Not decision evidence
J1	Completed	Acceptable basic full-run diagnostics	0 divergences, acceptable scalar R-hat and E-BFMI in full run; still needs family RT PPC and decision validation.	Not decision evidence
J2a	Completed	Mixed: Foundation acceptable; Year 1 failed	Year 1 failed: 544 divergences in one chain, E-BFMI 0.029, scalar max R-hat 12.39; legacy sensitivity only.	Legacy sensitivity only
J2b	Planned/not fit	Not run	Requires implementation and full diagnostics; planned family/item thresholds only.	Planned shadow candidate only

15 Candidate operational profiles

These are candidate labels for future validation. None are currently operational. No score, profile, or flag should be reported unless it maps to a defensible teacher action and includes misuse warnings.

Profile	Meaning	Before reporting
Efficient-accurate	Accurate and timely; not a reward for speed alone.	Show that this adds useful action beyond a strong achievement score; avoid rewarding speed for its own sake.
Slow-accurate	Accurate but slower; protected from speed penalties.	Validate protection rule and reporting language; specify what the teacher should do differently.
Rapid-risk	Very rapid response process with low accuracy or suspicious pattern; audit only until validated.	Validate thresholds, false-positive risk, and whether this is audit-only or an intervention signal.
Slow-low-accuracy	Low accuracy plus slow/effortful response process.	Validate interpretation against outcomes and teacher evidence; clarify support intensity.
Completion-constrained	Evidence suggests unreached or interrupted items, not simply wrong answers.	Validate logging/admin mechanisms and whether action is administration review, extra time, or instruction.
Inconclusive	Logging or pattern is insufficient for a defensible profile.	Define conservative default / no-label rule.

16 Prior iteration: Term 3 v5 R0 S1–S4 review

A prior S1–S4 joint speed-accuracy review from the March 2026 irt-joint-stan-pcm v5-t3 code path has been archived. It is useful historical background, but it is no longer the active modelling path and is partly superseded by the current M3 T1b/T1c/J0/J1/J2 line. Treat it as background only until the J0/J1/J2 diagnostics are reviewed and written up.