Number line PCM vs continuous

1. Executive summary

The evidence currently supports keeping the testlet + partial-credit-style ordinal model (NL2) as the operational-compatible Number Line target. The continuous testlet model (NL4aTH) is technically viable and competitive at both Foundation and Year 1, but it does not yet improve validation or screening enough to offset its added operational complexity, so it does not replace the current target.

Key points

Existing .85/.95 ordinal categories remain defensible for operational modelling.
Continuous testlet modelling succeeded technically in both Foundation and Year 1.
The continuous testlet model (NL4aTH) correlates strongly with the testlet + partial-credit-style ordinal model (NL2) but moves some students meaningfully.
External validation and screening do not show a consistent cross-year advantage for the continuous challenger.
Signed-error and continuous analyses remain important for response-process validity and item design.

2. Full-battery variants compared in the main report

Variant	Label	Response used	Model type	Estimation framework	Testlet/local dependence
Full-battery raw accuracy benchmark	`FB0`	Raw accuracy aggregated across all eligible maths items/subtests	Benchmark, not IRT	Computed aggregate	No
Testlet + partial-credit-style ordinal model	`NL2`	Full battery with Number Line scored as `.85/.95` ordinal categories	Embedded ordinal/testlet model	`mirt` frequentist MML	Yes
Continuous testlet model	`NL4aTH`	Full battery with Number Line scored as continuous logit-scale accuracy	Stan logit-normal full-battery model	Stan MCMC	Yes

3. Model configurations and estimands

Full-battery raw accuracy benchmark (`FB0`)

`FB0` is not an IRT model. It averages raw item accuracy across maths items/subtests where `is_attempted == TRUE`, `is_practice == FALSE`, and `raw_score` is usable (not missing). Non-Number-Line items contribute binary accuracy; Number Line items contribute their continuous raw accuracy score.

\[ FB0_p = \frac{1}{n_p}\sum_{i \in \mathcal{I}_p} s_{pi} \]

where \(s_{pi}\) is the item-level score and \(\mathcal{I}_p\) is the set of attempted, non-practice items with non-missing `raw_score` for student administration \(p\).
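The benchmark can be sketched in a few lines. This is a minimal illustration, not the report's pipeline: the field names `is_attempted`, `is_practice` and `raw_score` follow the text, while the `student_admin` key, the function name and the toy rows are assumptions made for the example.

```python
import math
from collections import defaultdict


def fb0_scores(responses):
    """Mean raw accuracy per student administration over eligible items.

    `responses` is an iterable of dicts. `student_admin` is a hypothetical
    identifier for administration p; the other keys follow the text.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in responses:
        score = row.get("raw_score")
        usable = score is not None and not math.isnan(score)
        if row["is_attempted"] and not row["is_practice"] and usable:
            totals[row["student_admin"]] += score
            counts[row["student_admin"]] += 1
    # Administrations with no eligible items receive no FB0 score.
    return {p: totals[p] / counts[p] for p in counts}


# Toy rows (illustrative only): two eligible items, one practice item,
# and one unattempted item with a missing score.
rows = [
    {"student_admin": "A", "is_attempted": True, "is_practice": False, "raw_score": 1.0},
    {"student_admin": "A", "is_attempted": True, "is_practice": False, "raw_score": 0.9},
    {"student_admin": "A", "is_attempted": True, "is_practice": True, "raw_score": 0.0},
    {"student_admin": "A", "is_attempted": False, "is_practice": False, "raw_score": None},
]
print(fb0_scores(rows))
```

Only the two eligible rows enter the mean, so the practice item and the unattempted item do not pull the benchmark down.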

Testlet + partial-credit-style ordinal model (`NL2`)

NL2 is the current operational-compatible target. It is estimated with mirt using frequentist marginal maximum likelihood (MML). It uses the full battery, with non-Number-Line items as binary accuracy items and Number Line items converted to .85/.95 ordinal categories.

For Number Line responses:

\[ Y_{pi}=\begin{cases} 0, & r_{pi}<0.85,\\ 1, & 0.85 \le r_{pi}<0.95,\\ 2, & r_{pi}\ge 0.95. \end{cases} \]

The ordinal Number Line component is partial-credit-style: it treats the three ordered categories as increasing levels of response accuracy. In implementation this is GPCM/PCM-style rather than a strict textbook PCM.
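The category rule above can be written directly as a small function. This is an illustrative sketch; the function and argument names are assumptions, and only the .85 and .95 cut points come from the report.

```python
def number_line_category(r, lower=0.85, upper=0.95):
    """Map a continuous Number Line accuracy r to the ordinal category Y."""
    if r < lower:
        return 0
    if r < upper:
        return 1
    return 2


# Values exactly on a cut point fall into the higher category.
print([number_line_category(r) for r in (0.60, 0.85, 0.949, 0.95, 1.0)])
# prints [0, 1, 1, 2, 2]
```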

Feature	Testlet + partial-credit-style ordinal model (`NL2`)	Strict Rasch PCM
Response type	Ordered `.85/.95` Number Line categories plus binary non-Number-Line items	Ordered polytomous item categories
Item discrimination	Allows item calibration/discrimination in a GPCM-style form	Fixed/equal discrimination, typically slope 1
Local dependence	Includes person-by-testlet effects	Not included by default
Assessment scope	Embedded in the full maths battery	Usually fitted to the polytomous items unless extended
Purpose here	Operational full-battery ability estimate with ordinal Number Line scoring	A stricter sensitivity model that would test an additional Rasch constraint

Continuous testlet model (`NL4aTH`)

NL4aTH is a full-battery Stan MCMC model that treats Number Line responses as continuous accuracy values on the logit scale.

\[ r^*_{pi}=\operatorname{clip}_{[0.001,0.999]}(r_{pi}), \qquad \ell_{pi}=\operatorname{logit}(r^*_{pi}). \]

For Number Line items:

\[ \ell_{pi}\mid\theta_p,u_{p,k[i]},b_i,\sigma_{f[i]} \sim \mathcal{N}(\theta_p+u_{p,k[i]}-b_i,\sigma^2_{f[i]}). \]

The small interior clipping is only for the logit transform: exact 1 scores exist, and \(\operatorname{logit}(1)\) is infinite.
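The transform and the Number Line likelihood term can be sketched as follows. This is not the Stan code used in the report: the function names are hypothetical, and the reading of the symbols (ability, person-by-testlet effect, item location, residual SD) is an assumption based on the formula above.

```python
import math


def clipped_logit(r, eps=0.001):
    """Clip r to [eps, 1 - eps], then apply the logit transform."""
    r_star = min(max(r, eps), 1.0 - eps)
    return math.log(r_star / (1.0 - r_star))


def number_line_loglik(ell, theta, u, b, sigma):
    """Normal log-density of ell with mean theta + u - b and SD sigma."""
    mean = theta + u - b
    z = (ell - mean) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


# An exact score of 1 is mapped to logit(0.999) rather than infinity.
print(clipped_logit(1.0))
print(number_line_loglik(clipped_logit(0.9), theta=0.5, u=0.1, b=-1.0, sigma=1.2))
```

With the default clipping, every exact 1 receives the same finite value, which is one reason the boundary mass can influence the fit.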

Note

Because NL2 and NL4aTH use different Number Line response variables, this report does not compare them using AIC, BIC, or log-likelihood. The comparison uses agreement, movement, external validation, screening behaviour, subgroup movement, and operational burden.

4. Why ordinal binning can outperform a continuous model

A continuous Number Line model is attractive because the response is naturally continuous: a placement closer to the target contains more information than a placement farther away. In principle, modelling that continuous accuracy should preserve more measurement information than converting the score into categories.

In practice, the current evidence does not show a consistent operational advantage for the continuous testlet challenger over the ordinal testlet target. There are several plausible psychometric reasons for this.

Mechanism	Why it matters for Number Line scoring
Motor and interface noise	Very small differences in click location can reflect mouse/touch precision, screen size, device handling, or calibration noise rather than numeracy. Ordinal categories can deliberately ignore some of this fine-grained noise.
Boundary mass	Many responses are near-perfect or exactly perfect. Logit-scale continuous models require clipping exact 0/1 values, and the high-end boundary can dominate model behaviour.
Thresholded competence	External anchors such as PAT and teacher ratings may align more with “close enough / not close enough” competence than with every small spatial-error difference.
Signed-error structure	Absolute accuracy loses whether students systematically over-place or under-place numbers. A continuous absolute-accuracy model can still miss compression, endpoint, and target-location biases.
Heterogeneous error variance	Error variance likely differs by target value, bounded vs unbounded format, centred vs non-centred layout, term/form, and ability level. A simple continuous likelihood can be too smooth.
Local dependence and method effects	Number Line items share interface and strategy demands. The continuous challenger improved only after adding testlet structure, which suggests method effects matter at least as much as response scale.
Operational robustness	The `.85/.95` categories may act as a denoising rule: coarse enough to be stable, but still graded enough to preserve useful Number Line evidence.

This does not mean the current .85/.95 categories are psychometric truth. It means the continuous challenger has to clear an operational burden: it must improve validation, screening behaviour, fairness, score stability, and interpretability enough to justify the extra modelling complexity.

The next continuous-model iteration should therefore start with a boundary and signed-error audit before adding complexity. Candidate extensions include zero/one-inflated beta or logit-normal likelihoods, target and family/method effects, and compression/signed-error diagnostics. These should be introduced sequentially rather than all at once.
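A first pass at the boundary part of that audit could simply tabulate how much response mass sits at the exact boundaries or beyond the clipping limits. The sketch below is illustrative only: the function name and the toy values are assumptions, and it does not cover signed error, which would need target and placement data that this report does not describe.

```python
def boundary_audit(accuracies, eps=0.001):
    """Share of responses at exact 0, exact 1, and outside [eps, 1 - eps]."""
    n = len(accuracies)
    if n == 0:
        raise ValueError("no responses to audit")
    exact_zero = sum(1 for r in accuracies if r == 0.0)
    exact_one = sum(1 for r in accuracies if r == 1.0)
    clipped = sum(1 for r in accuracies if r < eps or r > 1.0 - eps)
    return {
        "share_exact_zero": exact_zero / n,
        "share_exact_one": exact_one / n,
        "share_clipped": clipped / n,
    }


# Toy accuracies (illustrative only).
print(boundary_audit([0.0, 1.0, 1.0, 0.9995, 0.5, 0.9, 0.2, 0.7]))
```

A large clipped share would suggest trying an inflated-boundary likelihood before other extensions.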

5. Score agreement among models

Selected scatter plots:

6. Movement relative to the current target

Movement of the continuous testlet model (NL4aTH) relative to the testlet + partial-credit-style ordinal model (NL2), Foundation:

Term	N	Median absolute z-difference	P90 absolute z-difference	Same-band rate	Moves higher	Moves lower
1	2,517	.354	.871	.728	.126	.147
3	1,449	.319	.780	.753	.115	.132
4	1,099	.378	.923	.701	.183	.116

Interpretation: the continuous testlet model creates meaningful but not wholesale movement. Term 4 has the lowest same-band rate (.701), and so the most band movement, with more upward (.183) than downward (.116) movement.

Movement of the continuous testlet model (NL4aTH) relative to the testlet + partial-credit-style ordinal model (NL2), Year 1:

Term	N	Median absolute z-difference	P90 absolute z-difference	Same-band rate	Moves higher	Moves lower
1	2,435	.346	.826	.725	.125	.150
3	1,510	.348	.858	.709	.142	.149
4	1,070	.309	.787	.708	.174	.119

Interpretation: Year 1 movement is similar in scale to Foundation. Same-band rates are about 71–73%, with Term 4 again showing more upward than downward movement.
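The movement columns can be computed along the following lines. This is a sketch under stated assumptions rather than the report's procedure: scores are z-standardised within the group being summarised, the band cut points `(-1.0, 0.0, 1.0)` are hypothetical placeholders because the report does not define its bands, and the 90th percentile uses the inclusive quantile method.

```python
import statistics
from bisect import bisect_right


def zscores(values):
    """Standardise values using the population SD of the group."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


def movement_summary(target, challenger, band_cuts=(-1.0, 0.0, 1.0)):
    """Summarise movement of challenger scores relative to target scores.

    `band_cuts` is a hypothetical set of z-score cut points.
    """
    z_t = zscores(target)
    z_c = zscores(challenger)
    abs_diff = [abs(c - t) for t, c in zip(z_t, z_c)]
    band_t = [bisect_right(band_cuts, z) for z in z_t]
    band_c = [bisect_right(band_cuts, z) for z in z_c]
    n = len(abs_diff)
    pairs = list(zip(band_t, band_c))
    return {
        "median_abs_z_diff": statistics.median(abs_diff),
        "p90_abs_z_diff": statistics.quantiles(abs_diff, n=10, method="inclusive")[-1],
        "same_band_rate": sum(c == t for t, c in pairs) / n,
        "moves_higher": sum(c > t for t, c in pairs) / n,
        "moves_lower": sum(c < t for t, c in pairs) / n,
    }


# Toy scores (illustrative only).
nl2 = [-1.3, -0.4, 0.1, 0.6, 1.4, -0.9, 0.2, 1.0]
nl4 = [-1.1, -0.7, 0.3, 0.2, 1.6, -0.2, 0.1, 1.2]
print(movement_summary(nl2, nl4))
```

The same-band, moves-higher and moves-lower rates sum to one by construction, which matches the tables above up to rounding.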

7. Wright map

Why this matters: the Year 1 full continuous testlet model is now complete, so the same person-item threshold view can be shown for both year levels.

8. External validation

Benchmark against EOY PAT scores

Each row compares a term-specific full-battery variant score with end-of-year PAT scaled score. Entries are Pearson correlations.

Year	Term	N	Full-battery raw benchmark (`FB0`)	Testlet + partial-credit-style ordinal model (`NL2`)	Continuous testlet model (`NL4aTH`)
Foundation	1	59	.365	.393	.260
Foundation	3	49	.306	.269	.367
Foundation	4	45	.415	.482	.401
Year 1	1	112	.460	.544	.495
Year 1	3	117	.517	.577	.587
Year 1	4	120	.410	.496	.478

Benchmark against teacher ratings

Each row compares a term-specific full-battery variant score with teacher rating. Entries are Spearman rank correlations because teacher rating is ordinal.

Year	Term	N	Full-battery raw benchmark (`FB0`)	Testlet + partial-credit-style ordinal model (`NL2`)	Continuous testlet model (`NL4aTH`)
Foundation	1	769	.458	.446	.430
Foundation	3	585	.418	.404	.358
Foundation	4	541	.378	.372	.364
Year 1	1	755	.498	.469	.512
Year 1	3	626	.439	.472	.464
Year 1	4	508	.441	.460	.435
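For reference, the two correlation types used in these tables can be sketched with the standard library alone. This is an illustrative implementation, not the code used to produce the tables; the function names are assumptions, and Spearman is computed here as the Pearson correlation of average ranks so that tied teacher ratings are handled.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (illustrative only): model scores and ordinal ratings with ties.
scores = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5]
ratings = [1, 2, 2, 3, 3, 5]
print(pearson(scores, ratings), spearman(scores, ratings))
```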

9. Screening classification

Note

For each variant, the external risk label is fixed first: EOY PAT risk means PAT percentile <=35; teacher-rating risk means teacher rating <=2. For each year and term, we then search across possible model-score thresholds and choose the threshold that gives the highest sensitivity while keeping specificity >= .80. In plain terms, among students not labelled at risk by the external anchor, at least 80% must remain unflagged by the model. This is not an 80th-percentile cutoff on PAT or on the model score, and these thresholds are for model comparison only, not operational cut scores.
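The threshold search described in this note can be sketched as follows. It is a minimal illustration under assumptions that the report does not state: lower model scores indicate higher risk, a student is flagged when their score is at or below the threshold, candidate thresholds are the observed scores, and ties in sensitivity are broken in favour of higher specificity. The function name and toy data are hypothetical.

```python
def best_threshold(scores, at_risk, min_specificity=0.80):
    """Pick the score threshold with maximum sensitivity subject to
    specificity >= min_specificity. Returns None if no threshold qualifies.
    """
    n_pos = sum(1 for r in at_risk if r)
    n_neg = len(at_risk) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one at-risk and one not-at-risk case")
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, r in zip(scores, at_risk) if s <= cut and r)
        fp = sum(1 for s, r in zip(scores, at_risk) if s <= cut and not r)
        sens = tp / n_pos
        spec = (n_neg - fp) / n_neg
        if spec < min_specificity:
            continue
        if best is None or (sens, spec) > (best["sensitivity"], best["specificity"]):
            best = {
                "threshold": cut,
                "sensitivity": sens,
                "specificity": spec,
                "false_negative_rate": 1.0 - sens,
            }
    return best


# Toy data (illustrative only): three at-risk students out of ten.
scores = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5]
at_risk = [True, True, False, True, False, False, False, False, False, False]
print(best_threshold(scores, at_risk))
```

In the toy data the search stops at the last threshold that keeps at least 80% of the not-at-risk students unflagged. The false negative rate reported in the tables is one minus the sensitivity at that threshold.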

Foundation

Anchor	Term	Variant	N	Risk N	AUC	Sensitivity	False negative rate
EOY PAT percentile <=35	1	`FB0`	59	21	.653	.524	.476
EOY PAT percentile <=35	1	`NL2`	59	21	.659	.476	.524
EOY PAT percentile <=35	1	`NL4aTH`	59	21	.645	.476	.524
EOY PAT percentile <=35	3	`FB0`	49	16	.769	.625	.375
EOY PAT percentile <=35	3	`NL2`	49	16	.688	.625	.375
EOY PAT percentile <=35	3	`NL4aTH`	49	16	.691	.500	.500
EOY PAT percentile <=35	4	`FB0`	45	18	.831	.833	.167
EOY PAT percentile <=35	4	`NL2`	45	18	.809	.611	.389
EOY PAT percentile <=35	4	`NL4aTH`	45	18	.761	.500	.500
Teacher rating <=2	1	`FB0`	769	101	.811	.644	.356
Teacher rating <=2	1	`NL2`	769	101	.813	.723	.277
Teacher rating <=2	1	`NL4aTH`	769	101	.812	.663	.337
Teacher rating <=2	3	`FB0`	585	73	.755	.616	.384
Teacher rating <=2	3	`NL2`	585	73	.755	.548	.452
Teacher rating <=2	3	`NL4aTH`	585	73	.746	.562	.438
Teacher rating <=2	4	`FB0`	541	55	.762	.527	.473
Teacher rating <=2	4	`NL2`	541	55	.756	.618	.382
Teacher rating <=2	4	`NL4aTH`	541	55	.780	.618	.382

Year 1

Anchor	Term	Variant	N	Risk N	AUC	Sensitivity	False negative rate
EOY PAT percentile <=35	1	`FB0`	72	31	.829	.806	.194
EOY PAT percentile <=35	1	`NL2`	72	31	.838	.742	.258
EOY PAT percentile <=35	1	`NL4aTH`	72	31	.809	.710	.290
EOY PAT percentile <=35	3	`FB0`	71	32	.829	.750	.250
EOY PAT percentile <=35	3	`NL2`	71	32	.835	.688	.312
EOY PAT percentile <=35	3	`NL4aTH`	71	32	.816	.656	.344
EOY PAT percentile <=35	4	`FB0`	69	33	.789	.667	.333
EOY PAT percentile <=35	4	`NL2`	69	33	.766	.515	.485
EOY PAT percentile <=35	4	`NL4aTH`	69	33	.744	.485	.515
Teacher rating <=2	1	`FB0`	755	163	.813	.644	.356
Teacher rating <=2	1	`NL2`	755	163	.804	.613	.387
Teacher rating <=2	1	`NL4aTH`	755	163	.818	.669	.331
Teacher rating <=2	3	`FB0`	626	132	.782	.629	.371
Teacher rating <=2	3	`NL2`	626	132	.799	.644	.356
Teacher rating <=2	3	`NL4aTH`	626	132	.798	.644	.356
Teacher rating <=2	4	`FB0`	508	95	.809	.674	.326
Teacher rating <=2	4	`NL2`	508	95	.814	.674	.326
Teacher rating <=2	4	`NL4aTH`	508	95	.804	.632	.368

10. Model-score precision

The NL2 values are frequentist mirt model-score standard errors; the NL4aTH values are Bayesian posterior theta SDs from Stan. Both describe uncertainty in the estimated latent ability, but they come from different estimation frameworks, so the comparison should be read as approximate model-score precision rather than a single definitive reliability coefficient.

Approximate reliability is computed as:

\[ 1 - \frac{\operatorname{mean}(\text{uncertainty}^2)}{\operatorname{var}(\hat\theta)} \]

where uncertainty is the mirt score SE for NL2 and posterior theta SD for NL4aTH. Higher approximate reliability is better; lower median uncertainty is better.
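The index can be computed in a few lines. This is a sketch only: the function name is hypothetical, and the use of the sample variance (n - 1 denominator) for the ability estimates is an assumption, since the report does not say which variance estimator was used.

```python
import statistics


def approximate_reliability(theta_hat, uncertainty):
    """1 - mean(uncertainty^2) / var(theta_hat).

    `uncertainty` holds per-person SEs (NL2) or posterior SDs (NL4aTH).
    """
    mean_sq = statistics.fmean([u * u for u in uncertainty])
    return 1.0 - mean_sq / statistics.variance(theta_hat)


# Toy values (illustrative only): variance of the estimates is 1 and every
# person has uncertainty 0.5, so the index is 1 - 0.25 = 0.75.
print(approximate_reliability([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
# prints 0.75
```

Because the denominator is the variance of the estimated scores, the index can fall sharply when score estimates are shrunken, which is one reason it is read here as approximate.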

Read: NL2 is generally at least as precise as NL4aTH, except Year 1 Term 3 where the continuous testlet model has slightly lower posterior uncertainty and a slightly higher approximate reliability index.

11. Technical diagnostics and model-detail panel

In this section, person counts are person-administrations. Diagnostic fields differ by variant: FB0 is a computed benchmark, NL2 is a frequentist mirt fit, and NL4aTH is a Stan fit.

Foundation

Full-battery raw accuracy benchmark (FB0)

Diagnostic	Foundation
Person-administrations	5,065
Median eligible items	71–80 across terms
Response used	Raw accuracy across eligible maths items/subtests
Model fit	None; computed benchmark

Testlet + partial-credit-style ordinal model (NL2)

Diagnostic	Foundation
Person-administrations	5,065
Items in embedded run	397
Binary non-Number-Line items	317
Number Line ordinal items	80
Testlets	18
Estimated parameters	637
Median ability SE	.500
Fit status	Converged

Continuous testlet model (NL4aTH)

Diagnostic	Foundation
Person-administrations	5,065
Binary non-Number-Line observations	281,195
Number Line observations	87,313
Items in Stan run	479
Testlets	18
Chains	2
Warmup / sample	800 / 800
Runtime	6.66 hours
Divergences	0
Treedepth hits	0
E-BFMI	.854
Max Rhat	1.014
Min bulk ESS	155

Year 1

Full-battery raw accuracy benchmark (FB0)

Diagnostic	Year 1
Person-administrations	5,015
Median eligible items	80–111 across terms
Response used	Raw accuracy across eligible maths items/subtests
Model fit	None; computed benchmark

Testlet + partial-credit-style ordinal model (NL2)

Diagnostic	Year 1
Person-administrations	5,015
Items in embedded run	479
Binary non-Number-Line items	345
Number Line ordinal items	134
Testlets	18
Estimated parameters	881
Median ability SE	.469
Fit status	Converged

Continuous testlet model (NL4aTH)

Diagnostic	Year 1
Person-administrations	5,015
Binary non-Number-Line observations	327,135
Number Line observations	154,422
Items in Stan run	560
Testlets	18
Chains	2
Warmup / sample	800 / 800
Runtime	15.64 hours
Divergences	0
Treedepth hits	0
E-BFMI	.941
Max Rhat	1.045
Min bulk ESS	27.5

12. Subgroup movement

Initial read (Foundation): most movement is not obviously alarming, but some subgroup/term cells, especially LBOTE yes in Term 1, show larger movement and should be reviewed cautiously given metadata missingness.

Initial read: Year 1 subgroup movement follows a similar pattern to Foundation, with most groups close to zero on average and small groups suppressed. Review LBOTE yes cells with caution because of metadata missingness.

13. Summary

Year	Model	Technical / precision	External validation	Screening classification	Overall read
Foundation	Testlet + partial-credit-style ordinal model (`NL2`)	Converged; median SE about .48–.55 across terms; approximate reliability .59–.73.	Strongest EOY PAT correlation in Terms 1 and 4; weaker in Term 3. Teacher-rating correlations are close to `NL4aTH`.	PAT <=35 screening is mixed; teacher-risk screening is competitive.	Remains the preferred operational-compatible model.
Foundation	Continuous testlet model (`NL4aTH`)	Stan fit is technically viable; no divergences or treedepth hits; median posterior SD about .55–.57; approximate reliability .36–.58.	Strongest EOY PAT correlation in Term 3 only; otherwise does not improve on `NL2`.	Does not improve PAT <=35 screening; teacher-risk screening is similar to `NL2`.	Viable research challenger, but no Foundation replacement case.
Year 1	Testlet + partial-credit-style ordinal model (`NL2`)	Converged; median SE about .41–.55 across terms; approximate reliability .67–.82.	Strongest or near-strongest EOY PAT correlation in Terms 1 and 4; close to `NL4aTH` in Term 3. Teacher-rating correlations are strongest in Terms 3 and 4.	Strong PAT <=35 AUC in Terms 1 and 3; teacher-risk screening is competitive.	Still the most defensible operational-compatible target.
Year 1	Continuous testlet model (`NL4aTH`)	Stan fit is technically viable; weak mixing is localised to testlet SDs; median posterior SD about .46–.50; approximate reliability .61–.71.	Strongest EOY PAT correlation in Term 3, but not Terms 1 or 4. Teacher-rating correlation is strongest in Term 1 only.	Slightly stronger teacher-risk AUC in Term 1 and equal sensitivity in Term 3; PAT <=35 sensitivity is not consistently better than `NL2`.	Credible research challenger, but gains are term- and anchor-specific.

Overall, the continuous testlet model (NL4aTH) is technically credible and informative, but it does not consistently outperform the testlet + partial-credit-style ordinal model (NL2) across validation, screening, precision, movement, subgroup movement, and operational burden. The evidence supports retaining NL2 as the operational-compatible target while keeping NL4aTH as a research challenger.
