Accuracy-speed modelling

1. Intro

This report walks through how the joint accuracy + response-time (RT) models for Foundation and Year 1 were built, what each iteration tried, and how each variant performs against the gates we use to decide whether a model is fit for purpose. It is written for the internal IRT/data team and uses plain language to support review and decisions.

Glossary

Term	Plain meaning	Internal label
Achievement score	how well a student is doing on the math items	theta
Response pace	how quickly a student is responding, relative to typical	tau
Rapid response	unusually fast for that specific item	per-item/subtest p05 flag
Subtest type	the type of task, e.g. number line or magnitude comparison	`test_subgroup`
Benchmark accuracy-only model	broad achievement with subtest/testlet effects, no RT	A2

Headline summary

Two model designs are currently candidates: the per-subtest/item rapid-threshold joint model (J2b) and the expanded selected-subtest joint model (J3a conservative).

Foundation: passes the current diagnostics/stability checks at both 500/500 and 1000/1000 MCMC settings; broader external and movement checks are still being filled in.
Year 1 J2b: passes diagnostics/stability checks after a source correction (addition multi-choice + subtraction multi-choice + missing number).
Year 1 J3a conservative: 1000/1000 cisbox run complete and passes basic diagnostics/stability against corrected J2b.
J3a moderate (adds Magnitude Comparison RT): Foundation cisbox run complete and clean diagnostically, but it moves theta substantially vs conservative; keep exploratory. Year 1 moderate exploratory run is still in progress.
J3b hierarchical response pace: Foundation 100/100 smoke complete; promising, but needs longer-run diagnostics.
T2 speed-only counterparts: Foundation 100/100 smokes complete for separate tau, hierarchical tau, and tau + task timing.
J3a broad (adds true ASDD modality): smoke-tested only; held as a sensitivity / stress-test.

Current preferred internal candidate: Foundation J3a conservative; Year 1 J3a conservative now passes the completed 1000/1000 basic gate against corrected J2b. Achievement and response pace remain separate constructs: slower correct responses do not reduce achievement estimates.

Several validation gates remain open and are tracked in the scorecard in §12.

2. Distribution of response time

Before any modelling, look at the raw response-time evidence. The first plot shows RT distributions; the second shows accuracy against median RT.

3. Response-time modelling

Before combining response time with accuracy, what structure does response pace appear to have on its own?

This section supersedes the earlier response-time modelling notes in the IRT development page group. It focuses on a fast, auditable Term 3 speed-only comparison. Achievement estimates remain separate: these response-pace models do not change the current achievement anchor or achievement bands.

Three plausible shapes for response pace are being compared:

flowchart LR
  A["One overall pace<br/>per student"] ~~~ B["One pace<br/>per subtest"] ~~~ C["Overall pace<br/>+ subtest residual"]

3.1 Scope and data

The main construct is Term 3 timed-math response pace. ABR rows, practice rows, invalid/non-scorable rows, and non-positive response times are excluded. STPM/STDD are not pooled into the main timed-math construct; they are reserved for separate platform/interface response-pace context checks.

Analysis set	Included	Excluded	Why
Main timed-math set	time-limited math subtests	STPM/STDD, untimed tasks, ABR, invalid/non-positive RT	tests timed-math response pace
Non-math context set	STPM; STDD separately where available	math tasks	checks general interaction-speed context
Foundation linking sensitivity	F-A only, F-B only, MC0-20 only, family hierarchy	—	checks thin Foundation A/B linking

The report below focuses on interpretable modelling evidence. The main timed-math comparisons are shown for both Foundation and Year 1. STPM/STDD remain planned context checks after the one-subtest context specification is patched.

3.2 Subtest coverage and linking

Year	Cohort	Included timed-math subtests	Shared A/B link
Foundation	F-A	MQ1-10, MC0-20, MNC0-20	MC0-20
Foundation	F-B	MQ1-20, MC0-20, MNA0-20	MC0-20
Year 1	Y1-A	AAMC, ASMC, ASDD, MC0-100, MNA0-100	AAMC, ASMC, ASDD, MC0-100
Year 1	Y1-B	AAMC, ASMC, ASDD, MC0-100, MNC0-100	AAMC, ASMC, ASDD, MC0-100

Year 1 is strongly linked across A/B cohorts. Foundation is more thinly linked: MC0-20 is the only shared timed-math subtest, so Foundation pooled results need explicit form-split sensitivity checks.

3.3 Models fitted

Plain-English model	Statistical idea	Question answered
Overall timed-math pace	one student pace effect	Is one pace factor enough?
Separate subtest pace	student pace within each subtest	Do subtests need distinct pace estimates?
Hierarchical pace	overall pace + subtest residual pace	Can we keep both general and subtest-specific pace?
Overall pace + task timing	one pace plus task/testlet timing effects	Are differences mainly task-side rather than student-side?
STPM / STDD context	non-math interaction-speed pace	How much is general platform/interface pace?
Foundation linking sensitivities	F-A only, F-B only, MC0-20 only, family hierarchy	Is Foundation A/B pooling stable?

3.4 Pooled vs subtest-specific pace

This plot compares one overall timed-math pace estimate with subtest-specific pace estimates. Each dot is one student-administration × subtest combination.

How to read this

A strong upward slope means there is a general timed-math response-pace signal. Points that do not sit on the identity line mean the subtest-specific pace estimates still carry information beyond one overall pace.

3.5 Subtest-to-subtest correlation

The heatmap below uses the subtest-specific pace estimates from the separate-subtest model. Cells show Spearman rank correlations across student-administrations.

The Foundation heatmap shows moderate positive response-pace correlations among jointly observed timed-math subtests. MC0-20 is the key cross-form link, correlating with all other Foundation timed-math subtests. Blank cells are structural: those subtest pairs were not jointly observed because of the A/B form split. This supports an overall timed-math pace factor, but with explicit Foundation linking caveats and retained subtest residuals.

The Year 1 heatmap shows a more complete linked structure across the timed-math subtests. Most correlations are moderate and positive, so Year 1 has clearer evidence for a shared timed-math pace factor than Foundation. The correlations are still well below 1.00, which means subtest-specific pace differences remain meaningful; this supports the hierarchical structure rather than a single pooled pace alone.

How to read this

Higher correlations support a general timed-math pace factor. Variation across cells supports keeping subtest residuals rather than forcing all response pace into one undifferentiated estimate.

3.6 Hierarchical residual size

The hierarchical model separates response pace into an overall student component and subtest residual components. The chart compares their standard deviations.

The Foundation chart shows that the overall pace component is the largest source of response-pace variation, but each subtest also retains non-trivial residual pace variation. This means Foundation response time is not fully captured by one overall pace estimate. The modelling implication is an overall timed-math pace plus subtest residuals, with the Foundation linking caveat retained.

The Year 1 chart shows the same broad pattern: overall pace is a major component, but subtest residual pace remains visible across the timed-math subtests. Because Year 1 has stronger A/B linking, this gives cleaner support for the hierarchical response-pace structure than Foundation: one overall pace estimate plus retained subtest-specific residual pace.

How to read this

If residual standard deviations are non-trivial, subtest-specific pace is adding information beyond overall pace.

3.7 Pooled vs hierarchical vs task-timing comparison

This table summarises the key model-pair comparisons. For separate-subtest pace, the pooled comparison uses the student’s average separate-subtest pace; the hierarchical comparison uses person × subtest estimates.

Year	Model pair	Spearman	median \|diff\|	Read
Foundation	pooled vs separate-subtest mean	0.964	0.047	how close one overall pace is to separate-subtest pace
Foundation	hierarchical subtest pace vs separate subtest pace	0.981	0.027	does hierarchy preserve subtest signal
Foundation	pooled vs hierarchical overall pace	0.980	0.040	does hierarchy change overall pace
Foundation	task-timing vs hierarchical overall pace	0.972	0.042	can task timing replace subtest student pace
Year 1	pooled vs separate-subtest mean	0.977	0.036	how close one overall pace is to separate-subtest pace
Year 1	hierarchical subtest pace vs separate subtest pace	0.983	0.029	does hierarchy preserve subtest signal
Year 1	pooled vs hierarchical overall pace	0.987	0.027	does hierarchy change overall pace
Year 1	task-timing vs hierarchical overall pace	0.987	0.027	can task timing replace subtest student pace

How to read this

The model structure should be the simplest one that preserves real subtest pace differences without turning subtest variation into one undifferentiated pace estimate or into pure task-side timing.

3.8 Foundation A/B linking sensitivity

Foundation needs extra scrutiny because A/B linking is thin: MC0-20 is the only shared timed-math subtest. The sensitivities compare pooled A/B Foundation pace with form-specific and link-anchor estimates.

Sensitivity model	Compared to pooled A/B Foundation pace
F-A only pooled pace	rank agreement and shift
F-B only pooled pace	rank agreement and shift
MC0-20-only pace	link-anchor stability
Family-level hierarchy	structural sensitivity

Section conclusion

The Foundation sensitivity checks support pooled A/B response-pace modelling. Within each form, the form-only pace estimates produce essentially the same student ordering as the pooled Foundation estimate, and a family-level hierarchical sensitivity also preserves the pooled ordering closely. The MC0-20-only estimate, which uses the main shared A/B anchor, remains strongly aligned with the full timed-math pace estimate but is less identical, as expected for a single-subtest construct. Overall, Foundation pooled pace is stable enough for internal modelling evidence, while retaining an explicit caveat that cross-form linking relies heavily on MC0-20.

3.9 STPM / STDD context

STPM/STDD are modelled as non-math interaction-speed context checks, not as part of the primary timed-math pace construct.

STPM/STDD context pace is moderately correlated with timed-math response pace, suggesting a general response-tempo or interaction-speed component. However, the correlations are not high enough to treat non-math context pace as equivalent to timed-math pace. STPM/STDD therefore provide useful context and sensitivity evidence, but should not be pooled into the primary timed-math response-pace construct.

How to read this

High correlation would indicate a meaningful general interaction-speed component. Low correlation would indicate that timed-math pace is largely separate from general interaction speed. Either way, STPM/STDD remain context checks, not part of the primary timed-math construct.

3.10 Interim read

The current internal direction remains a hierarchical response-pace structure: one overall student pace plus subtest-specific residual pace, fitted to Term 3 selected timed-math data. Foundation evidence must be read with the A/B-linking caveat. Year 1 has stronger A/B linking. STPM/STDD remain context checks rather than part of the primary timed-math construct.

4. Iteration map

The remainder of the report walks through the joint accuracy + RT model iterations. Each node names the model in plain English, with the internal label in parentheses; arrows show what each iteration adds or changes.

flowchart TD
  A2["Accuracy-only benchmark (A2)<br/>broad achievement + testlet"]
  T1b["Pooled speed-only (T1b)"]
  T1c["Subtest-specific speed-only (T1c)"]
  J0["Joint accuracy + pooled RT (J0)"]
  J1["Joint accuracy + subtest RT (J1)"]
  J2a["+ global 2s rapid (J2a)<br/><i>superseded</i>"]
  J2b["+ per-subtest/item p05 rapid (J2b)"]
  J2c["+ residual log-RT (J2c)<br/><i>sensitivity</i>"]
  J3aC["Expanded selected subtests (J3a conservative)"]
  J3aM["+ Magnitude Comparison RT (J3a moderate)"]
  J3aB["+ ASDD modality (J3a broad)<br/><i>stress-test</i>"]

  A2 --> T1b --> T1c --> J0 --> J1
  J1 --> J2a
  J1 --> J2b
  J2b --> J2c
  J2b --> J3aC
  J3aC --> J3aM
  J3aC --> J3aB

4b. Reading Bayesian MCMC IRT diagnostics — brief primer

Every iteration below is fit with Bayesian Markov chain Monte Carlo (MCMC). Instead of a single best estimate per parameter, MCMC produces thousands of plausible draws from the posterior distribution. Diagnostics tell us whether those draws can be trusted.

Metric	Plain meaning	What “good” looks like	Why it matters
Chains	independent simulations (we use 4)	all 4 finish and agree	catches local stuck-ness
Warm-up / sampling iters	tuning vs kept draws per chain	e.g. 1000 / 1000	more = more precise posteriors
Divergences	sampler steps that hit numerical trouble	0 (target)	divergences mean posteriors may be biased
Max tree depth	how deep the sampler had to search per step	well below cap (15)	hitting the cap = inefficient, possibly biased
Min E-BFMI	energy-based exploration check	≥ 0.3 (want ≥ 0.5)	low = chains not exploring posterior well
R-hat (max)	agreement across chains per parameter	≤ 1.01	values > 1.01 mean chains disagree → not converged
ESS bulk (min)	effective independent draws per parameter	≥ ~400	low = posterior summaries are noisy
Finite lp	log-probability values are finite	yes	non-finite = something is broken

Quick reading rule. If every row of a per-iteration diagnostics table is green, treat the run as fit-for-interpretation. If anything fails, we stop and do not interpret the model for decision-making (it may still be reported for transparency).

MCMC settings shorthand used in this report.

"500/500" = 4 chains × 500 warm-up + 500 sampling iterations
"1k/1k"   = 4 chains × 1000 warm-up + 1000 sampling iterations

The student sample is unchanged across these settings; only the number of posterior draws changes.

5. Foundation model walk-through

Foundation is the clearest evidence base so far. The conservative selected-subtest path has passed the current stability checks, and the new speed-only / hierarchical response-pace smokes now give an initial check on whether the next hierarchical tau direction is sensible.

5.1 Foundation status at a glance

Plain label	Code	What it tests	Current status	Main reading
Accuracy-only benchmark	A2	protected achievement anchor	benchmark	remains the achievement reference
Pooled speed-only	T1b	one response-pace trait	background	useful, but too coarse as the only pace structure
Subtest-specific speed-only	T1c	separate subtest pace	adopted as design evidence	subtest pace retains information beyond pooled pace
Per-subtest/item rapid joint model	J2b	selected-subtest joint accuracy + RT	candidate	clean, interpretable baseline
Residual log-RT sensitivity	J2c	extra residual RT term	sensitivity only	clean, but not a replacement for J2b
Expanded selected-subtest joint model	J3a conservative	operational selected-subtest expansion	preferred internal candidate	stable against J2b and 500/1000 settings
Hierarchical response-pace joint model	J3b	overall pace + subtest residuals	100/100 smoke complete	promising; needs longer run and diagnostics
Speed-only J3 counterparts	T2a/T2b/T2c	RT-only counterparts to J3 structures	100/100 smokes complete	tau estimates are very similar across structures
Magnitude Comparison variant	J3a moderate	adds Magnitude Comparison RT	Foundation run complete; clean diagnostics but large movement vs conservative	exploratory, not a replacement decision yet

5.2 Foundation conservative stability evidence

Comparison	Metric	n	Pearson	Spearman	median \|diff\|	p95 \|diff\|	max \|diff\|
foundation_1000_vs_500	theta	2537	0.9999	0.9991	0.0119	0.0391	0.0845
foundation_1000_vs_500	tau_joint	2537	1.0000	1.0000	0.0015	0.0044	0.0110
foundation_1000_vs_500	tau_match_quantity	2537	0.9999	0.9999	0.0021	0.0062	0.0119
foundation_1000_vs_500	tau_missing_number	2537	1.0000	0.9999	0.0019	0.0058	0.0165
foundation_1000_vs_j2b	theta	2537	0.9982	0.9974	0.0343	0.1647	0.5367
foundation_1000_vs_j2b	tau_joint	2537	1.0000	0.9999	0.0015	0.0044	0.0121
foundation_1000_vs_j2b	tau_match_quantity	2537	0.9999	0.9999	0.0018	0.0056	0.0157
foundation_1000_vs_j2b	tau_missing_number	2537	1.0000	0.9999	0.0020	0.0060	0.0191

Read: Foundation J3a conservative is highly stable. The 1000/1000 run is almost identical to the 500/500 run for theta and tau. Compared with Foundation J2b, tau is essentially unchanged and theta remains strongly aligned. The theta differences are larger than the 500/1000 sampling differences because J3a expands the selected-subtest scope, not because the sampler is unstable.

5.3 Foundation J2b → J2c → J3a interpretation

J2b — per-subtest/item rapid rule

J2b replaces the earlier global rapid rule with a per-subtest/item empirical p05 rule:

Accuracy: logit P(Y_pi = 1) = a_i(theta_p - b_i) + testlet_p + gamma_family * rapid_pi
RT:       log(RT_pi)         = lambda_i - tau_{p,family(i)} + error
Rapid:   rapid_pi = 1{RT <= min(item p05, subtest p05)}

For Foundation, J2b is the selected-subtest baseline. It gives a clean response-process interpretation: unusually rapid responses have negative accuracy effects, while tau remains a separate response-pace estimate.

J2c — residual log-RT sensitivity

J2c completed cleanly and compared closely with J2b. Its role is sensitivity evidence: it asks whether adding residual logRT changes the interpretation. It did not provide a reason to replace J2b as the Foundation baseline.

Key available summary:

Quantity	Available result
Divergences	0 across 4 chains
Max treedepth	8
J2c vs J2b theta correlation	0.991
J2c vs J2b tau correlation	1.000
Decision	sensitivity evidence, not replacement

J3a conservative — expanded selected-subtest joint model

J3a conservative keeps the J2b likelihood idea and expands the selected-subtest scope using the corrected J3 data packs. Foundation has passed both the 500/500 and 1000/1000 gates.

Key available summary:

Gate	Foundation result
Divergences	0
Max treedepth	7
Min E-BFMI	0.547
1000 vs 500 theta Spearman	0.9991
1000 vs 500 p95 abs(theta shift)	0.039
1000 vs J2b theta Spearman	0.9974
Tau stability	Spearman about 0.9999

Finding: Foundation J3a conservative is the strongest current Foundation candidate for selected-subtest response-process evidence, pending remaining external/subgroup/movement review.

5.4 Foundation T2/J3b hierarchical response-pace smoke

The new T2/J3b smoke suite asks whether the next model should separate:

overall response pace
+
subtest-specific residual pace

model	chains	draws_per_chain	max_treedepth	min_ebfmi	lp_finite
J3b hierarchical joint	2	100	10	0.699	TRUE
T2a separate-subtest speed-only	2	100	10	0.535	TRUE
T2b hierarchical speed-only	2	100	10	0.854	TRUE
T2c tau + task timing speed-only	2	100	10	0.498	TRUE

Comparison	n	Pearson	Spearman	median \|diff\|	p95 \|diff\|
T2b overall vs T2a post-hoc mean tau	2537	0.9990	0.9990	0.0064	0.0242
T2c single tau vs T2a post-hoc mean tau	2537	0.9877	0.9891	0.0230	0.0926
J3b overall tau vs T2b speed-only overall tau	2537	0.9995	0.9993	0.0054	0.0172
J3b family-mean tau vs T2b speed-only family mean	2537	0.9995	0.9993	0.0054	0.0172
T2b family tau vs T2a tau: match_quantity	2537	0.9989	0.9987	0.0106	0.0341
J3b family tau vs T2b family tau: match_quantity	2537	0.9992	0.9988	0.0072	0.0238
T2b family tau vs T2a tau: missing_number	2537	0.9989	0.9989	0.0102	0.0411
J3b family tau vs T2b family tau: missing_number	2537	0.9990	0.9988	0.0073	0.0263

Read: the 100/100 smokes completed with 0 divergences and acceptable E-BFMI. They did hit max treedepth 10, so these are not final production-length runs. The substantive result is promising: hierarchical speed-only tau and separate-subtest speed-only tau are very highly aligned, and J3b joint hierarchical tau is very highly aligned with T2b speed-only hierarchical tau. That supports proceeding to a longer J3b/T2b run rather than dropping the hierarchical path.

5.5 Foundation current decision

Foundation conservative evidence is strong enough to keep J3a conservative as the current selected-subtest candidate. J3b/T2b is now the highest-priority next specification test because it can answer whether we can report/interpret response pace as:

one overall response-pace estimate
plus subtest-specific deviations

That would be easier to explain than a set of unrelated subtest taus, but it must pass longer-run diagnostics and movement checks first.

6. Year 1 model walk-through

Year 1 follows the same logic as Foundation, but the evidence base is more recent because the Year 1 J2b source scope was corrected.

6.1 Year 1 source correction

The earlier narrow Year 1 J2b specification excluded part of the intended arithmetic scope. The corrected J2b includes:

addition multi-choice
subtraction multi-choice
missing number

This corrected source scope is the Year 1 baseline for J3a validation.

6.2 Year 1 status at a glance

Plain label	Code	Current status	Main reading
Global 2s rapid rule	J2a	failed MCMC, not interpreted	retained as a flag only
Corrected per-subtest/item rapid	J2b corrected	passed	current Year 1 selected-subtest baseline
Expanded selected-subtest joint model	J3a conservative 500/500	passed against corrected J2b	highly stable
Expanded selected-subtest joint model	J3a conservative 1000/1000	cisbox run complete; basic gate passed	use as current Year 1 conservative candidate pending broader anchors/subgroups
Magnitude Comparison variant	J3a moderate	exploratory Year 1 run in progress	review only after moderate run completes
Hierarchical response-pace model	J3b Year 1	not launched	next after Foundation J3b review

6.3 Year 1 corrected J2b diagnostics

Available corrected Year 1 J2b evidence:

Gate	Result
Families	addition MC, subtraction MC, missing number
Chains	4
Draws per chain	800
Divergences	0
Max treedepth	8
Rapid effects	negative for all included families
Decision	corrected J2b adopted as Year 1 selected-subtest baseline

The rapid effects are in the expected direction: unusually rapid responses are associated with lower accuracy after accounting for achievement and item difficulty.

6.4 Year 1 J3a conservative validation to date

Comparison	Metric	n	Pearson	Spearman	median \|diff\|	p95 \|diff\|	max \|diff\|
year1_500_vs_corrected_j2b	theta	2562	0.9999	0.9998	0.0120	0.0401	0.0767
year1_500_vs_corrected_j2b	tau_joint	2562	1.0000	1.0000	0.0017	0.0052	0.0107
year1_500_vs_corrected_j2b	tau_addition_mc	2562	1.0000	1.0000	0.0021	0.0062	0.0137
year1_500_vs_corrected_j2b	tau_missing_number	2562	1.0000	0.9999	0.0020	0.0060	0.0109
year1_500_vs_corrected_j2b	tau_subtraction_mc	2562	0.9999	0.9999	0.0027	0.0086	0.0202
year1_j3a_conservative_1000_vs_corrected_j2b	theta_joint	2562	0.9999	0.9999	0.0097	0.0335	0.0708
year1_j3a_conservative_1000_vs_corrected_j2b	tau_joint	2562	1.0000	1.0000	0.0014	0.0042	0.0089
year1_j3a_conservative_1000_vs_corrected_j2b	tau_addition_mc	2562	1.0000	1.0000	0.0017	0.0050	0.0093
year1_j3a_conservative_1000_vs_corrected_j2b	tau_missing_number	2562	1.0000	1.0000	0.0016	0.0047	0.0089
year1_j3a_conservative_1000_vs_corrected_j2b	tau_subtraction_mc	2562	1.0000	0.9999	0.0023	0.0071	0.0161

Read: Year 1 J3a conservative 500/500 and 1000/1000 are both extremely stable against corrected J2b. The 1000/1000 run has 0 divergences, max treedepth 7, and min E-BFMI 0.615. Broader external-anchor and subgroup movement checks still remain.

6.5 Year 1 current decision

Year 1 corrected J2b remains the corrected selected-subtest baseline. Year 1 J3a conservative now passes the completed 1000/1000 basic gate against corrected J2b; broader external-anchor and subgroup movement checks still remain.

7. Cross-model performance vs benchmark

This section consolidates the current status across models. The benchmark remains A2/current achievement. Response-pace estimates are evaluated as response-process evidence, not as achievement.

7.1 Current model status table

Year level	Current strongest candidate	Stability status	External-anchor status	Current decision
Foundation	J3a conservative	passed 500/1000 and J2b comparisons	partial J2b anchors available; J3a broader anchors pending	preferred internal selected-subtest candidate
Year 1	J3a conservative 1k	J2b passed; J3a 500 and 1k basic gates passed	partial J2b anchors available; J3a broader anchors pending	current conservative candidate pending broader validation
Foundation hierarchical pace	J3b/T2b	100/100 smoke complete	no external-anchor interpretation yet	proceed to longer run
Year 1 hierarchical pace	J3b/T2b	not launched	not available	wait for Foundation hierarchical review

7.2 Available anchor heatmap

The heatmap below uses currently available J2b anchor correlations. It is included as a status snapshot, not as the final J3a/J3b validation table.

Read: theta should align with achievement anchors such as A2, PAT, and teacher ratings. Tau should align more strongly with response-process anchors such as observed RT and rapid-response rate, and only modestly with achievement anchors. That separation is the desired pattern.

7.3 Anchor sample-size context

Anchor	Foundation n	Year 1 n	Interpretation note
PAT score	176	252	linked subset only
PAT percentile	176	252	same linked subset as PAT score
Teacher rating	741	733	teacher survey returners only
Observed RT / rapid rate	model sample	model sample	response-process anchors from cleaned response data

Because PAT and teacher anchors are partial-linkage samples, they are useful checks but not the only validation gate.

8. Cross-cutting stability comparisons

8.1 Conservative J3a stability

Comparison	Metric	n	Spearman	median \|diff\|	p95 \|diff\|	max \|diff\|
foundation_1000_vs_500	theta	2537	0.9991	0.0119	0.0391	0.0845
foundation_1000_vs_500	tau_joint	2537	1.0000	0.0015	0.0044	0.0110
foundation_1000_vs_j2b	theta	2537	0.9974	0.0343	0.1647	0.5367
foundation_1000_vs_j2b	tau_joint	2537	0.9999	0.0015	0.0044	0.0121
year1_500_vs_corrected_j2b	theta	2562	0.9998	0.0120	0.0401	0.0767
year1_500_vs_corrected_j2b	tau_joint	2562	1.0000	0.0017	0.0052	0.0107
year1_j3a_conservative_1000_vs_corrected_j2b	theta_joint	2562	0.9999	0.0097	0.0335	0.0708
year1_j3a_conservative_1000_vs_corrected_j2b	tau_joint	2562	1.0000	0.0014	0.0042	0.0089
foundation_j3a_moderate_500_vs_conservative_1000	theta_joint	2537	0.7949	0.4229	1.2911	5.0118
foundation_j3a_moderate_500_vs_conservative_1000	tau_joint	2537	0.9214	0.0652	0.2122	0.4784

Read: sampling-setting stability is excellent for Foundation. Year 1 conservative is also excellent against corrected J2b at both 500/500 and 1000/1000. Foundation moderate is deliberately separated from conservative stability because adding Magnitude Comparison RT produced substantial theta movement and needs deeper review.

8.2 Rapid-rule sensitivities

Sensitivity	Foundation status	Year 1 status	Current reading
strict vs capped p05	stable in available checks	stable in corrected checks	rapid policy choice does not materially move theta/tau
rapid vs no-rapid	stable in available checks	stable in corrected checks	rapid term adds interpretation more than movement
global 2s vs p05	global rule superseded	global rule failed	per-subtest/item empirical p05 is retained

8.3 Hierarchical response-pace sensitivity

Comparison	Current smoke result	Reading
T2b overall vs T2a post-hoc mean tau	Spearman about 0.999	hierarchical overall tau matches the average of separate taus very closely
T2b family tau vs T2a family tau	Spearman about 0.999 by family	hierarchical model preserves subtest tau ordering
J3b overall tau vs T2b overall tau	Spearman about 0.999	adding accuracy/rapid terms did not materially change overall tau in the smoke
T2c tau-testlet vs T2a mean tau	Spearman about 0.989	task/testlet timing model is related but less identical; worth retaining as sensitivity

9. External / anchor validation

9.1 What is already available

Current anchor evidence is strongest for J2b and Year 1 corrected J2b. It shows the intended separation:

Quantity	Expected relationship	Current pattern
theta vs A2/PAT/teacher	positive achievement alignment	available J2b anchors are positive
tau vs observed RT / rapid rate	strong response-process alignment	available J2b anchors behave as expected
tau vs A2/PAT/teacher	modest, not dominant	available checks are modest compared with RT/process anchors

9.2 What is still pending

Pending validation	Why it matters	Blocking output
J3a conservative external anchors	confirms expanded selected-subtest model preserves achievement alignment	final J3a posterior extracts joined to PAT/teacher/A2
J3b/T2b external anchors	checks hierarchical tau interpretation	longer J3b/T2b run and posterior extract
Subgroup movement by school/class/student characteristics	ensures changes are not concentrated in specific groups	movement table with subgroup joins
Slow-but-accurate review	confirms pace remains separate from achievement	targeted movement and tau distribution checks

9.3 Current validation stance

The available anchor pattern is encouraging, but external validation is not closed for J3a/J3b. The report should therefore describe J3a conservative as a strong current candidate and J3b as a promising specification test, not as a completed reporting model.

10. Movement and protection

The central protection rule is unchanged:

A2/current achievement remains the protected achievement anchor.
Response pace is separate response-process evidence.
No response-pace model changes current achievement bands.

10.1 Available movement evidence

Comparison	n	Spearman	median abs theta shift	p95 abs theta shift	max abs theta shift
foundation_1000_vs_500	2537	0.9991	0.0119	0.0391	0.0845
foundation_1000_vs_j2b	2537	0.9974	0.0343	0.1647	0.5367
year1_500_vs_corrected_j2b	2562	0.9998	0.0120	0.0401	0.0767

Read: Foundation 500/1000 movement is tiny. Foundation J3a vs J2b movement is larger because the selected-subtest scope changes, but rank alignment remains high. Year 1 J3a 500 and 1000 comparisons against corrected J2b are both small.

10.2 Subgroup movement still required

Subgroup lens	Status	Requirement before broader use
Cohort A vs B	pending	compare median/p95 theta shift and large-mover rates
LBOTE	pending	same movement checks, suppress small cells
ATSI	pending	same checks, with small-cell caution
School size band	pending	ensure movement is not concentrated in small/large schools
Class size band	pending	same movement checks
Disability/supported adjustments	sensitivity appendix	only if source fields are approved and complete

10.3 Slow-but-accurate protection

Slow-but-accurate students should remain high on achievement if they are correct. In these models:

theta is driven by accuracy and item difficulty;
tau is a separate response-pace estimate;
slow responses are not penalised by the rapid-response term;
the rapid-response term only applies to unusually fast responses under the per-subtest/item p05 rule.

A targeted slow-but-accurate movement table is still needed before any non-internal interpretation.

11. Wright-map-style views

Wright-map-style views remain useful for communication, but they should be generated after the final candidate posterior extracts are frozen.

11.1 Accuracy view

The accuracy view will place:

student theta distribution
against
item difficulty / threshold locations

on the same achievement scale. This is the view that explains achievement.

11.2 Response-pace view

The response-pace view must be separate:

student tau distribution
against
item time-intensity locations

on a response-pace scale. It should not be overlaid on the achievement scale. This separation is important for avoiding any speed-as-achievement interpretation.

11.3 Status

View	Current status	Next step
Foundation J3a accuracy map	not yet generated	use final Foundation J3a posterior + item bank
Foundation J3a pace map	not yet generated	use selected-subtest tau + item time intensity
Year 1 J3a maps	pending final external/subgroup review	generate after Year 1 candidate review closes
J3b hierarchical maps	pending longer run	decide whether to show overall pace plus residuals

12. Gates and decisions scorecard + full lineage

12.1 Current gates scorecard

Model run	Diagnostics	Stability	External anchors	Source/scope	Movement	Overall decision
Foundation J2b	pass	pass	partial pass	pass	pass at available level	candidate baseline
Foundation J2c	pass	close to J2b	limited	pass	sensitivity only	not replacement
Foundation J3a conservative 1k/1k	pass	pass	pending	pass	overall movement available; subgroup pending	preferred internal candidate
Year 1 corrected J2b	pass	pass	partial pass	pass	pass at available level	candidate baseline
Year 1 J3a conservative 500/500	pass	pass vs corrected J2b	pending	pass	overall movement available	comparison evidence
Year 1 J3a conservative 1k/1k	pass	pass vs corrected J2b	pending	pass	overall movement available; subgroup pending	current conservative candidate
Foundation T2a/T2b/T2c 100/100	smoke pass with treedepth caution	pass directional comparisons	not applicable	pass	not a theta model	design evidence
Foundation J3b 100/100	smoke pass with treedepth caution	tau aligns with T2b	pending	pass	pending	proceed to longer run
J3a moderate	Foundation diagnostic pass; Year 1 running	Foundation movement large vs conservative	pending	pass	pending	exploratory only
Broad/untimed/speed-testlet combined variants	not launched	not available	not available	specification pending	pending	hold

12.2 Full lineage

display_label	internal_label	year_level	status	role	headline_result
J0/J1 early RT prototypes	J0/J1	Foundation and Year 1	completed legacy research	selected-family background only	Useful early exploration but not full-battery and superseded by family-aware J2b
J2 global rapid sensitivity	J2	Foundation	completed legacy sensitivity	not preferred selected-family model	Foundation usable as sensitivity but old global rapid rule is too crude; not full-battery
J2 Year 1	J2-Year1	Year 1	failed MCMC	not usable	544 divergences and poor E-BFMI/no reliable inference; not full-battery
Foundation J2b	J2b-Foundation	Foundation	clean full run	primary Foundation selected-family internal candidate	0 divergences; rapid effects negative; selected-family theta aligns with A2; tau aligns with RT/rapid behaviour
Year 1 J2b	Year1-J2b-strict-p05	Year 1	clean full run	primary Year 1 selected-family internal candidate	0 divergences; improved validation; family rapid effects negative; source/current definition stable
Year 1 J2b capped-p05 sensitivity	Year1-J2b-capped-p05	Year 1	clean full run	selected-family robustness sensitivity	Strict vs capped theta/tau essentially unchanged; gamma ordering stable
Year 1 rapid-vs-no-rapid comparison	Year1-J2b-no-rapid-comparison	Year 1	clean full run	selected-family parsimony/specification check	No-rapid gives essentially identical theta/tau and validation; rapid rows still much less accurate
Foundation J2c	J2c-Foundation	Foundation	clean full run	selected-family exploratory sensitivity	Residual-logRT effects are family-specific; theta close to Foundation J2b
Proposed selected-family speed/accuracy 3x3 grid	grid-prototype	Foundation and Year 1	in prototype	selected-family proposed reporting product prototype	Needs scope caveat plus grid counts/cell validation/fairness/uncertainty checks

13. What we can / cannot say

Can say now	Cannot yet say
Response time carries information beyond accuracy in selected timed subtests	Response time is an achievement signal
A single pooled response-pace estimate is too coarse as the only structure	There is a final reporting-ready response-pace score
Subtest-specific pace is justified as current design evidence	Faster responses are better
J3a conservative Foundation is stable against current gates	J3a/J3b are ready for broader reporting
Year 1 corrected J2b is stable after source correction	Year 1 J3a is complete before external-anchor and subgroup checks close
J3b/T2b smokes support continuing hierarchical response-pace tests	J3b has passed production-length diagnostics
Slow-but-accurate students are protected conceptually because pace is separate from achievement	Slow responses by themselves have a single approved interpretation

14. Next steps and open work

Continue monitoring Year 1 J3a moderate exploratory run.
Investigate Foundation J3a moderate movement before considering any Magnitude Comparison RT expansion.
Fill Year 1 J3a conservative external-anchor and subgroup movement tables.
Run a longer Foundation J3b/T2b hierarchical response-pace pilot, with treedepth mitigation if needed.
Decide whether T2c tau + task/testlet timing should remain a separate sensitivity path or become J3c.
Fill J3a external-anchor tables by joining posterior summaries to A2, PAT, teacher, observed RT, and rapid-rate anchors.
Build subgroup movement tables: cohort, LBOTE, ATSI, school size, and any approved support-adjustment fields.
Generate Wright-map-style achievement and response-pace views after the candidate posterior extracts are frozen.
Keep broad, untimed RT, and combined speed-testlet variants on hold until their specifications are explicitly approved.

15. Audit appendix

15.1 Run directory roots

studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/speed/
    t1b_rounded_lognormal_clean/                         # pooled speed-only background
    t1c_family_speed/                                    # subtest-specific speed-only background
    accuracy_speed_validation/joint_accuracy_rt_j2b/      # Foundation J2b
    accuracy_speed_validation/joint_accuracy_rt_j2b_y1_add_sub_mc_strict_p05/  # Year 1 corrected J2b
    accuracy_speed_validation/joint_accuracy_rt_j2c_foundation/  # J2c sensitivity
    j3_operational_accuracy_speed_design/j3a_conservative_pilot_stan/  # J3a conservative
    j3_operational_accuracy_speed_design/t2_speed_only_tau_stan/        # T2a/T2b/T2c smokes
    j3_operational_accuracy_speed_design/j3b_hierarchical_tau_stan/     # J3b smoke
    j3_operational_accuracy_speed_design/t2_j3b_extracted_summaries/    # extracted summaries for this report

15.2 Stan models

models/irt/irt-joint-stan-pcm/v4-longitudinal/stan/
    m3_joint_accuracy_rt_j2b_family_rapid.stan
    m3_joint_accuracy_rt_j2c_foundation_resid_logrt.stan
    m3_joint_accuracy_rt_j2b_family_no_rapid_ablation.stan
    m3_j3a_selected_timed_family_accuracy_rt.stan
    m3_j3b_hierarchical_tau_accuracy_rt.stan
    m3_t2a_speed_only_separate_tau.stan
    m3_t2b_speed_only_hierarchical_tau.stan
    m3_t2c_speed_only_tau_testlet.stan

15.3 Key scripts

models/irt/irt-joint-stan-pcm/v4-longitudinal/R/113_m3_j3a_expanded_timed_family_data_builder.R
models/irt/irt-joint-stan-pcm/v4-longitudinal/R/117_m3_j3a_conservative_pilot_runner.sh
models/irt/irt-joint-stan-pcm/v4-longitudinal/R/118_m3_j3a_stan_csv_diagnostics.R
models/irt/irt-joint-stan-pcm/v4-longitudinal/R/119_m3_j3a_extract_person_posteriors.py
models/irt/irt-joint-stan-pcm/v4-longitudinal/R/120_m3_j3a_variant_pilot_runner.sh
models/irt/irt-joint-stan-pcm/v4-longitudinal/R/121_m3_j3b_hierarchical_tau_runner.sh
studies/.../colleague_reporting/R_generate_accuracy_speed_report_figures.R
studies/.../colleague_reporting/R_generate_accuracy_speed_anchor_heatmap.R
/tmp/extract_t2_j3b_summaries.py  # extraction utility used to create local T2/J3b summary CSVs

15.4 Extracted summary files used by this report

studies/.../j3_operational_accuracy_speed_design/t2_j3b_extracted_summaries/
    m3_t2_j3b_100_100_diagnostics.csv
    m3_t2_j3b_100_100_scalar_means.csv
    m3_t2_j3b_foundation_comparisons.csv
    m3_t2_j3b_foundation_person_summaries.csv

15.5 Figure manifest

Figure / table	Script or source	Current status
Section 2 RT distributions	`R_generate_accuracy_speed_report_figures.R`	rendered
Section 2 accuracy × RT panels	`R_generate_accuracy_speed_report_figures.R`	rendered
Section 3 T1 scatterplots	`m3_t1c_family_person_speed_with_context...csv`	rendered inline
Section 3/5 T2/J3b comparison tables	`m3_t2_j3b_foundation_comparisons.csv`	rendered inline
Section 7 anchor heatmap	`R_generate_accuracy_speed_anchor_heatmap.R`	rendered
Section 8 stability table	`m3_j3a_conservative_validation_compare...csv`	rendered inline
Wright-map-style views	pending final posterior extracts	not yet generated

15.6 Current heavy-artifact location on cisbox

/data/numeracy-screening-models/irt/m3_joint_accuracy_rt/j3a_conservative_pilot/work/numeracy-screening/

Completed cisbox smoke runs used here:

j3_operational_accuracy_speed_design/t2_speed_only_tau_stan/t2-speedonly-foundation-cisbox-100-100-20260604T092931Z/
j3_operational_accuracy_speed_design/j3b_hierarchical_tau_stan/j3b-foundation-conservative-cisbox-100-100-20260604T093040Z/