Joint Speed-Accuracy Model Review

Term 3 v5 R0 comparison of S1-S4, with S0 as benchmark only

1 Why this review matters

This page compares the candidate joint speed-accuracy models for the Term 3 v5 R0 run set. The aim is to choose the most psychometrically useful joint model of accuracy (theta) and speed (zeta).

S0 is included only as an accuracy-only benchmark. It is not part of the joint-model decision set.

2 What the candidate models mean

Model	Accuracy side	Speed side	Plain-language interpretation	Role in this review
S1	Common theta plus testlet effects	One latent speed factor using all RT-bearing probes	All speededness is absorbed into one general zeta dimension.	Joint-model comparator
S2	Common theta plus testlet effects	Baseline speed plus timed-math deviation components	Separates a student’s general speed tendency from extra timed-math speed effects.	Preferred joint-model candidate
S3	Common theta plus testlet effects	One latent speed factor using timed-math probes only	Uses one simpler overall speed dimension without decomposing speed into subcomponents.	Main sensitivity challenger
S4	Common theta plus testlet effects	Timed-math speed structure with added covariance complexity	Adds extra dependence structure on top of timed-math speed, making it the most complex option.	Over-specified alternative

Formally, all four candidate models estimate the same underlying accuracy trait (theta) and differ mainly in how they parameterise the latent speed process (zeta).

In this report, “timed-math” means the timed arithmetic and magnitude-comparison style probes, not the separate speed-anchor probes such as STPM (and in Year 1, not STDD either). So the key distinction is:

S1: one general speed factor built from all RT-bearing probes, including the speed anchors
S3: one speed factor built only from the timed-math RT probes, excluding the speed anchors
S2: all-RT model that additionally decomposes speed into baseline plus timed-math deviation
S4: timed-math-focused variant with extra covariance structure

At a high level, the shared accuracy side can be written as:

\[ \Pr(Y_{ij}=1) = \operatorname{logit}^{-1}(\theta_i + u_{it(j)} - b_j) \]

for binary items, where \(\theta_i\) is student \(i\)’s latent accuracy, \(u_{it(j)}\) is the testlet effect for the relevant item cluster, and \(b_j\) is item difficulty. For number-line items, the same latent accuracy enters a partial-credit measurement equation rather than a binary logit.

The models differ on the response-time side. Writing \(\log T_{ij}\) for log response time:

\[ \log T_{ij} \sim \mathcal{N}(\mu_{ij}^{(RT)}, \sigma_{g(j)}^2) \]

with different definitions of \(\mu_{ij}^{(RT)}\) by model:

S1: one-factor speed model

\[ \mu_{ij}^{(RT)} = \lambda_j - \zeta_i \]

S2: baseline speed plus timed-math deviation

\[ \mu_{ij}^{(RT)} = \lambda_j - \zeta_i^{(base)} - I_{\text{timed}}(j)\,\zeta_i^{(timed)} \]

S3: simpler one-factor speed model used as the main challenger

\[ \mu_{ij}^{(RT)} = \lambda_j - \zeta_i \]

S4: timed-math formulation with extra covariance structure layered on top of the S2-style decomposition

\[ \mu_{ij}^{(RT)} = \lambda_j - \zeta_i^{(base)} - I_{\text{timed}}(j)\,\zeta_i^{(timed)} \]

but with a richer covariance structure among the latent speed components.

S1 and S3 are both one-factor speed models, but S3 is the simpler operational challenger carried forward in this comparison.
S2 is the richer decomposition: it allows speed to be split into a general baseline component and an additional timed-math-specific deviation.
S4 extends the timed-math formulation with extra covariance structure, which makes it the most complex specification.

In practice, the main decision is whether the richer decomposition in S2 earns its keep relative to the simpler S3. The latest construct-validity pass suggests that S2 does recover a baseline/general-speed factor, but that the resulting timed-math speed score stays extremely close to S3 for almost all students.

3 Executive decision

Cohort	Winner	Mean theta SD	Teacher rho	Latest PAT rho	Verdict
Foundation	S2	0.437	0.392	0.213	Preferred joint model
Year 1	S2	0.398	0.489	0.515	Preferred joint model

S3 is the preferred model when the intended construct is timed-math speed.
S2 remains a useful broader alternative when explicit baseline-speed adjustment is a reporting goal.
S0 remains useful as a benchmark for how much joint modelling adds, but not as a candidate answer to the speed question.

4 Main comparison table

Cohort	Model	Role	Theta SD	Teacher rho	PAT rho (latest)	PAT rho (strict EOY)	Raw total rho	Speed-teacher rho	Speed-PAT rho	Max theta group R2	Max speed group R2	Verdict
Foundation	S1	Other joint model	0.439	0.386	0.217	0.290	0.676	0.217	0.142	0.026	0.025	Competitive but not preferred
Foundation	S2	Recommended	0.437	0.392	0.213	0.307	0.672	0.207	0.148	0.027	0.021	Preferred joint model
Foundation	S3	Sensitivity challenger	0.440	0.384	0.218	0.283	0.670	0.200	0.150	0.026	0.019	Closest challenger
Foundation	S4	Other joint model	0.445	0.373	0.201	0.269	0.634	0.153	0.065	0.027	0.015	Added complexity without payoff
Year 1	S1	Other joint model	0.398	0.490	0.518	0.650	0.815	0.284	0.261	0.047	0.017	Competitive but not preferred
Year 1	S2	Recommended	0.398	0.489	0.515	0.647	0.816	0.288	0.281	0.047	0.020	Preferred joint model
Year 1	S3	Sensitivity challenger	0.398	0.490	0.509	0.649	0.810	0.290	0.277	0.048	0.022	Closest challenger
Year 1	S4	Other joint model	0.402	0.485	0.505	0.653	0.793	0.263	0.278	0.048	0.032	Added complexity without payoff

5 Benchmark only: S0 reference

Cohort	Model	Theta SD	Teacher rho	PAT rho	Raw total rho	Bridge anchors
Foundation	S0	0.446	0.371	0.189	0.618	0
Year 1	S0	0.408	0.481	0.500	0.769	0

6 Plots

6.1 Core psychometric comparisons

6.2 Speed validity signal

7 Cohort-by-cohort summary

7.1 Foundation

S2 has the smallest mean theta uncertainty among the joint models.
S3 remains very close on the timed-math speed construct itself.
The practical difference between S3 and S2 is small: only 0.2% of Foundation students shift by two or more speed deciles.
For a cleaner timed-math interpretation, S3 is the simpler and more defensible choice.

7.2 Year 1

S2 again has the smallest mean theta uncertainty.
S3 is almost tied on person-score precision and remains highly similar to S2 on the timed-math speed score.
Only 0.3% of Year 1 students shift by two or more speed deciles when moving from S2 to S3.
That makes S3 the cleaner default unless explicit baseline-speed adjustment is required.

8 Construct-validity decision: S3 versus S2

8.1 Practical impact of choosing S3

Cohort	Timed-speed correlation	Mean abs speed diff	P95 abs speed diff	Mean abs percentile shift	2+ decile shifts	2+ quintile shifts
Foundation	0.994	0.016	0.044	1.910	0.2%	0.2%
Year 1	0.998	0.010	0.028	1.051	0.3%	0.1%

The timed-math speed scores from S3 and S2 are almost identical in practice. S2 identifies a real baseline/general-speed factor, but that factor changes student timed-math speed bands only rarely.

8.2 Final validity pass for S3

Cohort	Theta SD	Teacher rho	PAT rho (latest)	PAT rho (strict EOY)	Binary PPC misses	RT PPC misses	RT-accuracy misses	Max theta group R2	Max speed group R2
Foundation	0.440	0.384	0.218	0.283	4/7	0/5	3/5	0.026	0.019
Year 1	0.398	0.490	0.509	0.649	6/6	1/6	1/6	0.048	0.022

This final pass does not show a reason to reject S3 operationally. The remaining PPC and fit issues are shared broadly across models rather than uniquely concentrated in S3.

9 Structural fit

9.1 Posterior predictive checks

cohort	model	domain	subgroups	misses
Foundation	S2	Binary subgroup means	7	4
Foundation	S3	Binary subgroup means	7	4
Year 1	S2	Binary subgroup means	6	3
Year 1	S3	Binary subgroup means	6	6
Foundation	S2	RT subgroup means	6	1
Foundation	S3	RT subgroup means	5	0
Year 1	S2	RT subgroup means	8	0
Year 1	S3	RT subgroup means	6	1
Foundation	S2	RT-accuracy correlation	6	3
Foundation	S3	RT-accuracy correlation	5	3
Year 1	S2	RT-accuracy correlation	8	2
Year 1	S3	RT-accuracy correlation	6	1

Lower misses are better. S3 is slightly cleaner on some RT-focused checks, which supports the case for preferring the simpler timed-math construct when the practical difference from S2 is so small.

9.2 Item-fit summary

cohort	model	domain	items	underfit	severe
Foundation	S2	Binary items	179	13	6
Foundation	S3	Binary items	179	13	6
Year 1	S2	Binary items	293	25	8
Year 1	S3	Binary items	293	25	10
Foundation	S2	Number-line items	20	0	0
Foundation	S3	Number-line items	20	0	0
Year 1	S2	Number-line items	37	0	0
Year 1	S3	Number-line items	37	0	0

Cohort	Model	Mean \|z\|	Mean RMSE z
Foundation	S2	0.724	0.983
Foundation	S3	0.731	0.987
Year 1	S2	0.742	0.986
Year 1	S3	0.752	0.990

Binary item-fit issues are common across models, suggesting broad instrument-level misfit rather than a problem unique to one joint specification.

10 External validity

10.1 Teacher rating and PAT relationships

Cohort	Model	Teacher rho	PAT rho (latest)	PAT rho (strict EOY)	Raw total rho
Foundation	S1	0.386	0.217	0.290	0.676
Foundation	S2	0.392	0.213	0.307	0.672
Foundation	S3	0.384	0.218	0.283	0.670
Foundation	S4	0.373	0.201	0.269	0.634
Year 1	S1	0.490	0.518	0.650	0.815
Year 1	S2	0.489	0.515	0.647	0.816
Year 1	S3	0.490	0.509	0.649	0.810
Year 1	S4	0.485	0.505	0.653	0.793

The joint models all improve on the S0 benchmark for most external-validity slices. Among the joint candidates, S2 and S3 are the closest pair, and the current evidence does not show a large enough outcome-side gain to force the more complex S2 specification.

11 Fairness and proxy screens

Cohort	Model	Max theta group R2	Max speed group R2
Foundation	S1	0.026	0.025
Foundation	S2	0.027	0.021
Foundation	S3	0.026	0.019
Foundation	S4	0.027	0.015
Year 1	S1	0.047	0.017
Year 1	S2	0.047	0.020
Year 1	S3	0.048	0.022
Year 1	S4	0.048	0.032

The subgroup proxy screens are small overall. At this stage there is no strong evidence that the speed factors are mainly acting as proxies for demographic grouping variables.

12 Linking caveat

All reviewed R0 runs have n_bridge_anchors = 0. That means this report supports the within-Term-3 choice of a preferred joint model, but it does not establish anchor-based linking across forms or occasions.

13 Final recommendation

Prefer S3 when the target construct is timed-math speed and the goal is a transparent, operationally simple model.
Use S2 only if a separate baseline/general-speed adjustment is explicitly required for interpretation or reporting.
On current evidence, that adjustment has only a modest practical effect on student timed-math speed scores.
Keep S0 only as the benchmark reference, not as a candidate answer.