Number line PCM vs continuous

Related report: Number Line probe response-process report

1. Executive summary

The evidence currently supports keeping the testlet + partial-credit-style ordinal model (NL2) as the operational-compatible Number Line target. The continuous testlet model (NL4aTH) is technically viable and competitive at both Foundation and Year 1, but its validation and screening results are not yet consistently better than NL2, so they do not justify the added modelling and operational complexity of replacing the current target.

Key points

  • Existing .85/.95 ordinal categories remain defensible for operational modelling.
  • Continuous testlet modelling succeeded technically in both Foundation and Year 1.
  • The continuous testlet model (NL4aTH) correlates strongly with the testlet + partial-credit-style ordinal model (NL2) but moves some students meaningfully.
  • External validation and screening do not show a consistent cross-year advantage for the continuous challenger.
  • Signed-error and continuous analyses remain important for response-process validity and item design.

2. Full-battery variants compared in the main report

| Variant | Label | Response used | Model type | Estimation framework | Testlet/local dependence |
| --- | --- | --- | --- | --- | --- |
| Full-battery raw accuracy benchmark | FB0 | Raw accuracy aggregated across all eligible maths items/subtests | Benchmark, not IRT | Computed aggregate | No |
| Testlet + partial-credit-style ordinal model | NL2 | Full battery with Number Line scored as .85/.95 ordinal categories | Embedded ordinal/testlet model | mirt frequentist MML | Yes |
| Continuous testlet model | NL4aTH | Full battery with Number Line scored as continuous logit-scale accuracy | Stan logit-normal full-battery model | Stan MCMC | Yes |

3. Model configurations and estimands

Full-battery raw accuracy benchmark (FB0)

FB0 is not an IRT model. It averages raw item accuracy across maths items/subtests where is_attempted == TRUE, is_practice == FALSE, and raw_score is usable (not missing). Non-Number-Line items contribute binary accuracy; Number Line items contribute their continuous raw accuracy score.

\[ FB0_p = \frac{1}{n_p}\sum_{i \in \mathcal{I}_p} s_{pi} \]

where \(s_{pi}\) is the item-level score, \(\mathcal{I}_p\) is the set of attempted, non-practice items with non-missing raw_score for student administration \(p\), and \(n_p = |\mathcal{I}_p|\) is the number of such items.
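The FB0 definition can be illustrated with a minimal Python sketch. The record layout (one dictionary per item response, with `None` standing in for a missing raw_score) and the function name `fb0_score` are assumptions made for illustration; only the three filter fields come from the definition above.

```python
from typing import Iterable, Optional


def fb0_score(items: Iterable[dict]) -> Optional[float]:
    """Mean raw score over attempted, non-practice items with a usable raw_score.

    Assumed (hypothetical) record layout: each item is a dict with the keys
    is_attempted, is_practice and raw_score; None marks a missing raw_score.
    """
    eligible = [
        item["raw_score"]
        for item in items
        if item["is_attempted"]
        and not item["is_practice"]
        and item["raw_score"] is not None
    ]
    if not eligible:
        return None  # no eligible items, so FB0 is undefined for this administration
    return sum(eligible) / len(eligible)


example = [
    {"is_attempted": True, "is_practice": False, "raw_score": 1.0},    # binary item, correct
    {"is_attempted": True, "is_practice": False, "raw_score": 0.0},    # binary item, incorrect
    {"is_attempted": True, "is_practice": False, "raw_score": 0.92},   # Number Line accuracy
    {"is_attempted": True, "is_practice": True, "raw_score": 1.0},     # practice item, excluded
    {"is_attempted": False, "is_practice": False, "raw_score": None},  # not attempted, excluded
]
print(round(fb0_score(example), 2))  # 0.64
```

Only the first three records are eligible, so the result is (1.0 + 0.0 + 0.92) / 3.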

Testlet + partial-credit-style ordinal model (NL2)

NL2 is the current operational-compatible target. It is estimated with mirt using frequentist marginal maximum likelihood (MML). It uses the full battery, with non-Number-Line items as binary accuracy items and Number Line items converted to .85/.95 ordinal categories.

For Number Line responses:

\[ Y_{pi}=\begin{cases} 0, & r_{pi}<0.85,\\ 1, & 0.85 \le r_{pi}<0.95,\\ 2, & r_{pi}\ge 0.95. \end{cases} \]

The ordinal Number Line component is partial-credit-style: it treats the three ordered categories as increasing levels of response accuracy. In implementation this is GPCM/PCM-style rather than a strict textbook PCM.
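The category rule above can be written as a small Python function. This is a sketch of the scoring rule only, not of the mirt model; the function name is hypothetical.

```python
def number_line_category(r: float) -> int:
    """Map a raw Number Line accuracy r to the .85/.95 ordinal category."""
    if r < 0.85:
        return 0
    if r < 0.95:
        return 1
    return 2


for r in (0.60, 0.85, 0.949, 0.95, 1.0):
    print(r, number_line_category(r))
```

Both cut points are inclusive on the lower side, matching the case definition: a score of exactly 0.85 falls in category 1 and a score of exactly 0.95 falls in category 2. Accuracy values computed upstream can sit a rounding error away from a cut point, so any rounding should be applied before categorisation.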

| Feature | Testlet + partial-credit-style ordinal model (NL2) | Strict Rasch PCM |
| --- | --- | --- |
| Response type | Ordered .85/.95 Number Line categories plus binary non-Number-Line items | Ordered polytomous item categories |
| Item discrimination | Allows item calibration/discrimination in a GPCM-style form | Fixed/equal discrimination, typically slope 1 |
| Local dependence | Includes person-by-testlet effects | Not included by default |
| Assessment scope | Embedded in the full maths battery | Usually fitted to the polytomous items unless extended |
| Purpose here | Operational full-battery ability estimate with ordinal Number Line scoring | A stricter sensitivity model that would test an additional Rasch constraint |

Continuous testlet model (NL4aTH)

NL4aTH is a full-battery Stan MCMC model that treats Number Line responses as continuous accuracy values on the logit scale.

\[ r^*_{pi}=\operatorname{clip}_{[0.001,0.999]}(r_{pi}), \qquad \ell_{pi}=\operatorname{logit}(r^*_{pi}). \]

For Number Line items:

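\[ \ell_{pi}\mid\theta_p,u_{p,k[i]},b_i,\sigma_{f[i]} \sim \mathcal{N}(\theta_p+u_{p,k[i]}-b_i,\sigma^2_{f[i]}). \]

Here \(\theta_p\) is the latent ability for student administration \(p\), \(u_{p,k[i]}\) is the person-by-testlet effect for the testlet \(k[i]\) containing item \(i\), \(b_i\) is the item location, and \(\sigma_{f[i]}\) is the residual standard deviation shared by items with the same index \(f[i]\).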

The small interior clipping is only for the logit transform: exact 1 scores exist, and \(\operatorname{logit}(1)\) is infinite.
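The transform and the Number Line likelihood term can be sketched in plain Python as follows. This illustrates the two formulas above and is not the Stan program itself; the function names and the example parameter values are assumptions.

```python
import math


def clipped_logit(r: float, lower: float = 0.001, upper: float = 0.999) -> float:
    """Clip r into [lower, upper], then apply the logit transform."""
    r_star = min(max(r, lower), upper)
    return math.log(r_star / (1.0 - r_star))


def number_line_log_density(ell: float, theta: float, u: float, b: float, sigma: float) -> float:
    """Normal log-density of ell with mean theta + u - b and SD sigma."""
    mu = theta + u - b
    z = (ell - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


# An exact score of 1 is pulled in to 0.999 before the transform, so it stays finite.
print(clipped_logit(1.0))   # roughly 6.91
print(clipped_logit(0.5))   # 0.0

# Hypothetical parameter values, for illustration only.
print(number_line_log_density(clipped_logit(0.9), theta=0.5, u=0.1, b=-1.0, sigma=1.2))
```

With these clipping bounds every exact 1 maps to the same value of about 6.91 on the logit scale, so a large share of perfect scores produces a point mass at the upper end of the transformed distribution.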

Note

Because NL2 and NL4aTH use different Number Line response variables, this report does not compare them using AIC, BIC, or log-likelihood. The comparison uses agreement, movement, external validation, screening behaviour, subgroup movement, and operational burden.

4. Why ordinal binning can outperform a continuous model

A continuous Number Line model is attractive because the response is naturally continuous: a placement closer to the target contains more information than a placement farther away. In principle, modelling that continuous accuracy should preserve more measurement information than converting the score into categories.

In practice, the current evidence does not show a consistent operational advantage for the continuous testlet challenger over the ordinal testlet target. There are several plausible psychometric reasons for this.

| Mechanism | Why it matters for Number Line scoring |
| --- | --- |
| Motor and interface noise | Very small differences in click location can reflect mouse/touch precision, screen size, device handling, or calibration noise rather than numeracy. Ordinal categories can deliberately ignore some of this fine-grained noise. |
| Boundary mass | Many responses are near-perfect or exactly perfect. Logit-scale continuous models require clipping exact 0/1 values, and the high-end boundary can dominate model behaviour. |
| Thresholded competence | External anchors such as PAT and teacher ratings may align more with “close enough / not close enough” competence than with every small spatial-error difference. |
| Signed-error structure | Absolute accuracy loses whether students systematically over-place or under-place numbers. A continuous absolute-accuracy model can still miss compression, endpoint, and target-location biases. |
| Heterogeneous error variance | Error variance likely differs by target value, bounded vs unbounded format, centred vs non-centred layout, term/form, and ability level. A simple continuous likelihood can be too smooth. |
| Local dependence and method effects | Number Line items share interface and strategy demands. The continuous challenger improved only after adding testlet structure, which suggests method effects matter at least as much as response scale. |
| Operational robustness | The .85/.95 categories may act as a denoising rule: coarse enough to be stable, but still graded enough to preserve useful Number Line evidence. |

This does not mean the current .85/.95 categories are psychometric truth. It means the burden of proof sits with the continuous challenger: it must improve validation, screening behaviour, fairness, score stability, and interpretability enough to justify the extra modelling complexity.

The next continuous-model iteration should therefore start with a boundary and signed-error audit before adding complexity. Candidate extensions include zero/one-inflated beta or logit-normal likelihoods, target and family/method effects, and compression/signed-error diagnostics. These should be introduced sequentially rather than all at once.

5. Score agreement among models

Selective scatter plots:


6. Movement relative to the current target

Foundation: movement of the continuous testlet model (NL4aTH) relative to the testlet + partial-credit-style ordinal model (NL2):

| Term | N | Median absolute z-difference | P90 absolute z-difference | Same-band rate | Moves higher | Moves lower |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2,517 | .354 | .871 | .728 | .126 | .147 |
| 3 | 1,449 | .319 | .780 | .753 | .115 | .132 |
| 4 | 1,099 | .378 | .923 | .701 | .183 | .116 |

Interpretation: the continuous testlet model creates meaningful but not wholesale movement. Term 4 has the lowest same-band rate / largest band movement and more upward than downward movement.

Year 1: movement of the continuous testlet model (NL4aTH) relative to the testlet + partial-credit-style ordinal model (NL2):

| Term | N | Median absolute z-difference | P90 absolute z-difference | Same-band rate | Moves higher | Moves lower |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2,435 | .346 | .826 | .725 | .125 | .150 |
| 3 | 1,510 | .348 | .858 | .709 | .142 | .149 |
| 4 | 1,070 | .309 | .787 | .708 | .174 | .119 |

Interpretation: Year 1 movement is similar in scale to Foundation. Same-band rates are about 71–73%, with Term 4 again showing more upward than downward movement.
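As a rough guide to how movement columns of this kind can be computed, here is a minimal Python sketch. It rests on several assumptions that this report does not state: scores are z-standardised within the comparison group using the population standard deviation, the 90th percentile uses linear interpolation, and bands are defined by the purely hypothetical z-score cut points in `band_cuts`. It also assumes the scores are not all identical.

```python
from bisect import bisect_right
from statistics import mean, median, pstdev, quantiles


def zscores(values):
    """Standardise values to mean 0 and SD 1 (population SD; an assumption)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def movement_summary(target, challenger, band_cuts=(-1.0, 0.0, 1.0)):
    """Summarise movement of challenger scores relative to target scores.

    band_cuts are hypothetical z-score cut points, not the report's bands.
    """
    z_t, z_c = zscores(target), zscores(challenger)
    abs_diff = [abs(c - t) for t, c in zip(z_t, z_c)]
    # Band index = number of cut points at or below the z-score.
    band_t = [bisect_right(band_cuts, z) for z in z_t]
    band_c = [bisect_right(band_cuts, z) for z in z_c]
    n = len(abs_diff)
    return {
        "median_abs_z_diff": median(abs_diff),
        "p90_abs_z_diff": quantiles(abs_diff, n=10, method="inclusive")[-1],
        "same_band": sum(a == b for a, b in zip(band_t, band_c)) / n,
        "moves_higher": sum(b > a for a, b in zip(band_t, band_c)) / n,
        "moves_lower": sum(b < a for a, b in zip(band_t, band_c)) / n,
    }


# Toy scores, for illustration only.
nl2 = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]
nl4 = [-1.0, -0.7, 0.4, 0.5, 1.1, 2.4]
print(movement_summary(nl2, nl4))
```

Under this reading the same-band, moves-higher, and moves-lower rates sum to 1, which matches the tables above up to rounding.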

7. Wright map

Why this matters: the Year 1 full continuous testlet model is now complete, so the same person-item threshold view can be shown for both year levels.

8. External validation

Benchmark against EOY PAT scores

Each row compares a term-specific full-battery variant score with end-of-year PAT scaled score. Entries are Pearson correlations.

| Year | Term | N | Full-battery raw benchmark (FB0) | Testlet + partial-credit-style ordinal model (NL2) | Continuous testlet model (NL4aTH) |
| --- | --- | --- | --- | --- | --- |
| Foundation | 1 | 59 | .365 | .393 | .260 |
| Foundation | 3 | 49 | .306 | .269 | .367 |
| Foundation | 4 | 45 | .415 | .482 | .401 |
| Year 1 | 1 | 112 | .460 | .544 | .495 |
| Year 1 | 3 | 117 | .517 | .577 | .587 |
| Year 1 | 4 | 120 | .410 | .496 | .478 |

Benchmark against teacher ratings

Each row compares a term-specific full-battery variant score with teacher rating. Entries are Spearman rank correlations because teacher rating is ordinal.

| Year | Term | N | Full-battery raw benchmark (FB0) | Testlet + partial-credit-style ordinal model (NL2) | Continuous testlet model (NL4aTH) |
| --- | --- | --- | --- | --- | --- |
| Foundation | 1 | 769 | .458 | .446 | .430 |
| Foundation | 3 | 585 | .418 | .404 | .358 |
| Foundation | 4 | 541 | .378 | .372 | .364 |
| Year 1 | 1 | 755 | .498 | .469 | .512 |
| Year 1 | 3 | 626 | .439 | .472 | .464 |
| Year 1 | 4 | 508 | .441 | .460 | .435 |

9. Screening classification

Note

For each variant, the external risk label is fixed first: EOY PAT risk means PAT percentile <=35; teacher-rating risk means teacher rating <=2. For each year and term, we then search across possible model-score thresholds and choose the threshold that gives the highest sensitivity while keeping specificity >= .80. In plain terms, among students not labelled at risk by the external anchor, at least 80% must remain unflagged by the model. This is not an 80th-percentile cutoff on PAT or on the model score, and these thresholds are for model comparison only, not operational cut scores.
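The threshold search described in the note can be sketched in Python as below. It assumes that lower model scores indicate higher risk, so a student is flagged when the score is at or below the threshold, and that candidate thresholds are the observed scores plus a "flag nobody" option. The function name and toy data are hypothetical.

```python
def best_threshold(scores, at_risk, min_specificity=0.80):
    """Return (threshold, sensitivity, specificity) with the highest sensitivity
    among thresholds whose specificity is at least min_specificity.

    Assumption: a student is flagged when score <= threshold.
    """
    n_pos = sum(bool(y) for y in at_risk)
    n_neg = len(at_risk) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one at-risk and one not-at-risk student")

    best = None
    # -inf flags nobody, which guarantees at least one feasible candidate.
    for t in [float("-inf")] + sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, at_risk) if y and s <= t)
        true_neg = sum(1 for s, y in zip(scores, at_risk) if not y and s > t)
        sensitivity = true_pos / n_pos
        specificity = true_neg / n_neg
        if specificity >= min_specificity and (best is None or sensitivity > best[1]):
            best = (t, sensitivity, specificity)
    return best


# Toy data, for illustration only.
scores = [-1.5, -1.0, -0.5, 0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5]
at_risk = [True, True, False, True, False, False, False, False, False, False]

threshold, sensitivity, specificity = best_threshold(scores, at_risk)
print(threshold, sensitivity, round(specificity, 3), "FNR:", 1.0 - sensitivity)
```

In the toy data a threshold of 0.0 flags all three at-risk students and one of the seven others, giving sensitivity 1.0 and specificity 6/7. The next candidate, 0.2, would drop specificity to 5/7, below the .80 floor. The false negative rate is 1 minus sensitivity, as in the tables that follow.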

Foundation

| Anchor | Term | Variant | N | Risk N | AUC | Sensitivity | False negative rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EOY PAT percentile <=35 | 1 | FB0 | 59 | 21 | .653 | .524 | .476 |
| EOY PAT percentile <=35 | 1 | NL2 | 59 | 21 | .659 | .476 | .524 |
| EOY PAT percentile <=35 | 1 | NL4aTH | 59 | 21 | .645 | .476 | .524 |
| EOY PAT percentile <=35 | 3 | FB0 | 49 | 16 | .769 | .625 | .375 |
| EOY PAT percentile <=35 | 3 | NL2 | 49 | 16 | .688 | .625 | .375 |
| EOY PAT percentile <=35 | 3 | NL4aTH | 49 | 16 | .691 | .500 | .500 |
| EOY PAT percentile <=35 | 4 | FB0 | 45 | 18 | .831 | .833 | .167 |
| EOY PAT percentile <=35 | 4 | NL2 | 45 | 18 | .809 | .611 | .389 |
| EOY PAT percentile <=35 | 4 | NL4aTH | 45 | 18 | .761 | .500 | .500 |
| Teacher rating <=2 | 1 | FB0 | 769 | 101 | .811 | .644 | .356 |
| Teacher rating <=2 | 1 | NL2 | 769 | 101 | .813 | .723 | .277 |
| Teacher rating <=2 | 1 | NL4aTH | 769 | 101 | .812 | .663 | .337 |
| Teacher rating <=2 | 3 | FB0 | 585 | 73 | .755 | .616 | .384 |
| Teacher rating <=2 | 3 | NL2 | 585 | 73 | .755 | .548 | .452 |
| Teacher rating <=2 | 3 | NL4aTH | 585 | 73 | .746 | .562 | .438 |
| Teacher rating <=2 | 4 | FB0 | 541 | 55 | .762 | .527 | .473 |
| Teacher rating <=2 | 4 | NL2 | 541 | 55 | .756 | .618 | .382 |
| Teacher rating <=2 | 4 | NL4aTH | 541 | 55 | .780 | .618 | .382 |

Year 1

| Anchor | Term | Variant | N | Risk N | AUC | Sensitivity | False negative rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EOY PAT percentile <=35 | 1 | FB0 | 72 | 31 | .829 | .806 | .194 |
| EOY PAT percentile <=35 | 1 | NL2 | 72 | 31 | .838 | .742 | .258 |
| EOY PAT percentile <=35 | 1 | NL4aTH | 72 | 31 | .809 | .710 | .290 |
| EOY PAT percentile <=35 | 3 | FB0 | 71 | 32 | .829 | .750 | .250 |
| EOY PAT percentile <=35 | 3 | NL2 | 71 | 32 | .835 | .688 | .312 |
| EOY PAT percentile <=35 | 3 | NL4aTH | 71 | 32 | .816 | .656 | .344 |
| EOY PAT percentile <=35 | 4 | FB0 | 69 | 33 | .789 | .667 | .333 |
| EOY PAT percentile <=35 | 4 | NL2 | 69 | 33 | .766 | .515 | .485 |
| EOY PAT percentile <=35 | 4 | NL4aTH | 69 | 33 | .744 | .485 | .515 |
| Teacher rating <=2 | 1 | FB0 | 755 | 163 | .813 | .644 | .356 |
| Teacher rating <=2 | 1 | NL2 | 755 | 163 | .804 | .613 | .387 |
| Teacher rating <=2 | 1 | NL4aTH | 755 | 163 | .818 | .669 | .331 |
| Teacher rating <=2 | 3 | FB0 | 626 | 132 | .782 | .629 | .371 |
| Teacher rating <=2 | 3 | NL2 | 626 | 132 | .799 | .644 | .356 |
| Teacher rating <=2 | 3 | NL4aTH | 626 | 132 | .798 | .644 | .356 |
| Teacher rating <=2 | 4 | FB0 | 508 | 95 | .809 | .674 | .326 |
| Teacher rating <=2 | 4 | NL2 | 508 | 95 | .814 | .674 | .326 |
| Teacher rating <=2 | 4 | NL4aTH | 508 | 95 | .804 | .632 | .368 |

10. Model-score precision

The NL2 values are frequentist mirt model-score standard errors; the NL4aTH values are Bayesian posterior theta SDs from Stan. Both describe uncertainty in the estimated latent ability, but they come from different estimation frameworks, so the comparison should be read as approximate model-score precision rather than a single definitive reliability coefficient.

Approximate reliability is computed as:

\[ 1 - \frac{\operatorname{mean}(\text{uncertainty}^2)}{\operatorname{var}(\hat\theta)} \]

where uncertainty is the mirt score SE for NL2 and posterior theta SD for NL4aTH. Higher approximate reliability is better; lower median uncertainty is better.
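The approximate reliability index can be computed directly from per-person estimates and their uncertainties. The Python sketch below follows the formula above; using the sample variance (n - 1 denominator) for the variance of the estimates is an assumption, as are the function name and the toy numbers.

```python
from statistics import mean, median, variance


def approximate_reliability(theta_hat, uncertainty):
    """1 - mean(uncertainty^2) / var(theta_hat).

    uncertainty holds per-person score SEs (NL2) or posterior SDs (NL4aTH).
    Assumption: var() is the sample variance with an n - 1 denominator.
    """
    if len(theta_hat) != len(uncertainty):
        raise ValueError("theta_hat and uncertainty must have the same length")
    error_variance = mean([u * u for u in uncertainty])
    return 1.0 - error_variance / variance(theta_hat)


# Toy values, for illustration only.
theta_hat = [-1.0, 0.0, 1.0]
uncertainty = [0.5, 0.5, 0.5]
print(approximate_reliability(theta_hat, uncertainty))  # 0.75
print(median(uncertainty))                              # 0.5
```

Because the index divides by the variance of the estimated scores, it can shift when the spread of those scores changes even if the per-person uncertainties stay the same. This is one reason to read it alongside the median uncertainty rather than on its own.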

Read: NL2 is generally at least as precise as NL4aTH, except Year 1 Term 3 where the continuous testlet model has slightly lower posterior uncertainty and a slightly higher approximate reliability index.

11. Technical diagnostics and model-detail panel

In this section, person counts are person-administrations. Diagnostic fields differ by variant: FB0 is a computed benchmark, NL2 is a frequentist mirt fit, and NL4aTH is a Stan fit.

Foundation

Full-battery raw accuracy benchmark (FB0)

| Diagnostic | Foundation |
| --- | --- |
| Person-administrations | 5,065 |
| Median eligible items | 71–80 across terms |
| Response used | Raw accuracy across eligible maths items/subtests |
| Model fit | None; computed benchmark |

Testlet + partial-credit-style ordinal model (NL2)

| Diagnostic | Foundation |
| --- | --- |
| Person-administrations | 5,065 |
| Items in embedded run | 397 |
| Binary non-Number-Line items | 317 |
| Number Line ordinal items | 80 |
| Testlets | 18 |
| Estimated parameters | 637 |
| Median ability SE | .500 |
| Fit status | Converged |

Continuous testlet model (NL4aTH)

| Diagnostic | Foundation |
| --- | --- |
| Person-administrations | 5,065 |
| Binary non-Number-Line observations | 281,195 |
| Number Line observations | 87,313 |
| Items in Stan run | 479 |
| Testlets | 18 |
| Chains | 2 |
| Warmup / sample | 800 / 800 |
| Runtime | 6.66 hours |
| Divergences | 0 |
| Treedepth hits | 0 |
| E-BFMI | .854 |
| Max Rhat | 1.014 |
| Min bulk ESS | 155 |

Year 1

Full-battery raw accuracy benchmark (FB0)

| Diagnostic | Year 1 |
| --- | --- |
| Person-administrations | 5,015 |
| Median eligible items | 80–111 across terms |
| Response used | Raw accuracy across eligible maths items/subtests |
| Model fit | None; computed benchmark |

Testlet + partial-credit-style ordinal model (NL2)

| Diagnostic | Year 1 |
| --- | --- |
| Person-administrations | 5,015 |
| Items in embedded run | 479 |
| Binary non-Number-Line items | 345 |
| Number Line ordinal items | 134 |
| Testlets | 18 |
| Estimated parameters | 881 |
| Median ability SE | .469 |
| Fit status | Converged |

Continuous testlet model (NL4aTH)

| Diagnostic | Year 1 |
| --- | --- |
| Person-administrations | 5,015 |
| Binary non-Number-Line observations | 327,135 |
| Number Line observations | 154,422 |
| Items in Stan run | 560 |
| Testlets | 18 |
| Chains | 2 |
| Warmup / sample | 800 / 800 |
| Runtime | 15.64 hours |
| Divergences | 0 |
| Treedepth hits | 0 |
| E-BFMI | .941 |
| Max Rhat | 1.045 |
| Min bulk ESS | 27.5 |

12. Subgroup movement

Initial read: most Foundation subgroup movement is not obviously alarming, but some subgroup/term cells, especially LBOTE yes in Term 1, show larger movement and should be reviewed cautiously given metadata missingness.

Initial read: Year 1 subgroup movement follows a similar pattern to Foundation, with most groups close to zero on average and small groups suppressed. Review LBOTE yes cells with caution because of metadata missingness.

13. Summary

| Year | Model | Technical / precision | External validation | Screening classification | Overall read |
| --- | --- | --- | --- | --- | --- |
| Foundation | Testlet + partial-credit-style ordinal model (NL2) | Converged; median SE about .48–.55 across terms; approximate reliability .59–.73. | Strongest EOY PAT correlation in Terms 1 and 4; weaker in Term 3. Teacher-rating correlations are close to NL4aTH. | PAT <=35 screening is mixed; teacher-risk screening is competitive. | Remains the preferred operational-compatible model. |
| Foundation | Continuous testlet model (NL4aTH) | Stan fit is technically viable; no divergences or treedepth hits; median posterior SD about .55–.57; approximate reliability .36–.58. | Strongest EOY PAT correlation in Term 3 only; otherwise does not improve on NL2. | Does not improve PAT <=35 screening; teacher-risk screening is similar to NL2. | Viable research challenger, but no Foundation replacement case. |
| Year 1 | Testlet + partial-credit-style ordinal model (NL2) | Converged; median SE about .41–.55 across terms; approximate reliability .67–.82. | Strongest or near-strongest EOY PAT correlation in Terms 1 and 4; close to NL4aTH in Term 3. Teacher-rating correlations are strongest in Terms 3 and 4. | Strong PAT <=35 AUC in Terms 1 and 3; teacher-risk screening is competitive. | Still the most defensible operational-compatible target. |
| Year 1 | Continuous testlet model (NL4aTH) | Stan fit is technically viable; weak mixing is localised to testlet SDs; median posterior SD about .46–.50; approximate reliability .61–.71. | Strongest EOY PAT correlation in Term 3, but not Terms 1 or 4. Teacher-rating correlation is strongest in Term 1 only. | Slightly stronger teacher-risk AUC in Term 1 and equal sensitivity in Term 3; PAT <=35 sensitivity is not consistently better than NL2. | Credible research challenger, but gains are term- and anchor-specific. |

Overall, the continuous testlet model (NL4aTH) is technically credible and informative, but it does not consistently outperform the testlet + partial-credit-style ordinal model (NL2) across validation, screening, precision, movement, subgroup movement, and operational burden. The evidence supports retaining NL2 as the operational-compatible target while keeping NL4aTH as a research challenger.