Number-Line PCM Policy

How continuous number-line accuracy is converted into ordered model categories

1 Executive summary

Number-line responses are different from most other screener items. Instead of a simple correct/incorrect outcome, each response receives a continuous accuracy score from 0 to 1. For the accuracy-structure models, those continuous scores are converted into ordered categories and fitted with a partial-credit / generalised partial-credit model (PCM/GPCM).

The current structure-comparison run uses a three-category policy:

  • 0: raw_score < 0.85
  • 1: 0.85 <= raw_score < 0.95
  • 2: raw_score >= 0.95

These are model categories, not teacher-facing bands. They should not be read as “wrong / nearly right / right” labels for students. Their purpose is to preserve graded number-line evidence while avoiding unstable, overly fine categories.

Earlier hierarchy-comparison code supported alternative policies, including a looser three-category policy and a five-category policy. The current headline structure conclusion — one broad accuracy-based numeracy score with probe/testlet effects — should be interpreted as conditional on the selected three-category policy, with alternative binning treated as sensitivity evidence.

2 Why number-line items need different treatment

Most screener items are binary: the response is scored as correct or incorrect. Number-line items are graded: a click close to the target contains more evidence than a click far away, even if neither is exactly correct.

| Response type | Raw evidence | Model treatment in this section |
|---|---|---|
| Binary probes | 0/1 correctness | Rasch binary item model |
| Number-line probes | Continuous accuracy from 0 to 1 | Ordered PCM/GPCM category after binning |
| Response time | Seconds / rounded seconds | Not part of the accuracy likelihood; handled in RT models |

The PCM/GPCM approach is a compromise. It avoids discarding graded evidence, but it also avoids treating the continuous score as if it were directly comparable to binary correctness inside the same marginal maximum-likelihood (MML) structure-comparison model.

3 Candidate binning policies

The repo contains three relevant number-line binning policies.

| Policy | Category cutpoints | Source | Current role |
|---|---|---|---|
| 3-category loose | 0: <0.80; 1: 0.80–<0.95; 2: >=0.95 | Earlier hierarchy-comparison module default (HIER_NL_SCHEME=3cat_B) | Sensitivity / historical |
| 3-category current | 0: <0.85; 1: 0.85–<0.95; 2: >=0.95 | Current structure_model_comparison_v4 configuration | Main policy for current Accuracy Modelling page |
| 5-category fine | 0: <0.80; 1: 0.80–<0.90; 2: 0.90–<0.95; 3: 0.95–<0.98; 4: >=0.98 | Earlier hierarchy-comparison coded sensitivity option | Sensitivity / coded option |

The current page reports the 3-category current policy because that is the policy used in the full-N robustness run supporting the Accuracy Modelling page.
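As a minimal sketch of how these policies map a continuous score to an ordered category, the cutpoints from the table can be applied with a sorted-threshold lookup. The function name and the policy keys other than 3cat_B are illustrative labels chosen for this example, not identifiers from the repo.

```python
from bisect import bisect_right

# Lower bounds of categories 1..K for each policy (cutpoints from the table).
# "3cat_B" is the scheme name used by the earlier module; the other two keys
# are hypothetical labels for this sketch.
NL_POLICIES = {
    "3cat_B": (0.80, 0.95),
    "3cat_current": (0.85, 0.95),
    "5cat_fine": (0.80, 0.90, 0.95, 0.98),
}


def bin_score(raw_score: float, policy: str = "3cat_current") -> int:
    """Return the ordered category for a raw accuracy score in [0, 1]."""
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError(f"raw_score must lie in [0, 1], got {raw_score}")
    # bisect_right counts the cutpoints <= raw_score, so a score exactly on a
    # cutpoint lands in the higher category (intervals are closed on the left).
    return bisect_right(NL_POLICIES[policy], raw_score)


for s in (0.79, 0.80, 0.85, 0.949, 0.95, 0.98):
    print(s, {p: bin_score(s, p) for p in NL_POLICIES})
```

The left-closed convention matters only for scores that sit exactly on a cutpoint, but it has to match the inequalities in the policy definition.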

4 Model specification

For a number-line item \(j\), the cleaned pipeline produces a continuous score \(s_{ij} \in [0, 1]\) for student/person-administration \(i\).

Under the current policy:

\[ y_{ij} = \begin{cases} 0 & s_{ij} < 0.85 \\ 1 & 0.85 \le s_{ij} < 0.95 \\ 2 & s_{ij} \ge 0.95 \end{cases} \]

The resulting ordered category \(y_{ij}\) is fitted with a GPCM-style item response function:

\[ P(Y_{ij}=k) \propto \exp\left(\sum_{c=1}^{k} a_j(\theta_i - b_j - d_{jc})\right) \]

where \(k \in \{0, 1, 2\}\) indexes the category, the sum is empty (equal to zero) for \(k = 0\), and:

  • \(\theta_i\) is the broad accuracy-based numeracy trait;
  • \(a_j\) is the estimated number-line discrimination parameter;
  • \(b_j\) is item location;
  • \(d_{jc}\) are ordered step thresholds.
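The response function above can be evaluated directly by normalising the exponentiated cumulative sums over categories. The sketch below assumes the parameterisation exactly as written, with category 0 given an empty sum; the function name and parameter values are made up for illustration and are not fitted estimates.

```python
import math


def gpcm_probs(theta: float, a: float, b: float, d: list[float]) -> list[float]:
    """Category probabilities P(Y = k), k = 0..len(d), for one item."""
    logits = [0.0]  # k = 0: the sum over steps is empty
    for d_c in d:
        # Each further category adds one step term a * (theta - b - d_c).
        logits.append(logits[-1] + a * (theta - b - d_c))
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative (made-up) parameters for a three-category number-line item.
for theta in (-1.0, 0.0, 1.0):
    probs = gpcm_probs(theta, a=1.2, b=0.0, d=[-0.5, 0.5])
    print(theta, [round(p, 3) for p in probs])
```

Setting \(a_j = 1\) for every item recovers the partial-credit (PCM) special case.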

Binary non-number-line items use a Rasch model in the structure-comparison analysis. The number-line GPCM is therefore the main place where item discrimination is estimated within the current accuracy-structure page.

5 How the current policy was judged

The cutpoints are a modelling policy. The relevant question is not whether a threshold is intuitively perfect, but whether the policy produces useful, stable, interpretable measurement evidence.

| Check | Question | Current status |
|---|---|---|
| Category occupancy | Are there enough responses in each category to estimate thresholds? | Pending public table |
| Category ordering | Do higher categories correspond to higher overall achievement? | Pending public table |
| Model stability | Do fitted models converge and produce sensible parameters? | Partial: current full-N structure run completed |
| Structure robustness | Does the winning structure change under plausible binning policies? | Partial: current run documented; older policy options exist |
| Interpretability | Can the categories be explained without turning them into teacher-facing labels? | Established for internal modelling use |

The strongest current evidence is indirect: the structure-comparison conclusion is stable and strong under the current policy, with M1-struct (one broad trait plus probe/testlet effects) preferred over separate sub-skill factors in the full-N robustness run. The remaining public-reporting gap is to show category occupancy and model-conclusion sensitivity side by side across candidate policies.

6 Cutpoint comparison graphic

The figure below shows how the candidate policies partition the 0–1 number-line accuracy scale. It is a policy schematic, not an empirical distribution.

7 Past runs and what they tell us

There has been number-line PCM/GPCM work already. The current gap is not that no number-line models have been tried; it is that the cutpoint evidence has not yet been distilled into a compact public-facing summary with category occupancy, diagnostics, and conclusion stability side by side.

7.1 Historical joint Stan PCM v2

The earliest joint Stan PCM runs include 2026_preflight_3catB, 2026_pilot_3catB, and 2026_full_3catB_2core. These used the looser 3cat_B number-line policy: <0.80, 0.80–<0.95, and >=0.95.

These runs are useful historical evidence that number-line responses were already being treated as ordered partial-credit evidence rather than forced into binary correctness. They are not a clean cutpoint-selection study: the provenance was backfilled later and the runs were part of broader joint-model development rather than a targeted comparison of cutpoint policies.

7.2 Hierarchy comparison v1

The hierarchy-comparison v1 run (1b0dc6f) also used the looser 3cat_B policy. It compared broad accuracy structures across Term 1 and Term 4, operational and stable panels, with binary items fitted as Rasch and number-line items fitted as GPCM.

The key result was structural rather than cutpoint-specific: the one-dimensional model with testlet/probe effects won by BIC in 12 out of 12 slice × year comparisons. This matters because it shows that under the older looser number-line policy, the broad-score-plus-probe-effects conclusion was already supported.

7.3 Structure comparison v4 initial run

The current Accuracy Modelling page is based on structure_model_comparison_v4, which moved to the current three-category policy: <0.85, 0.85–<0.95, and >=0.95.

The initial run (foundation_parallel_20260206) fitted the structure-comparison ladder across 12 Foundation units: single trait, single trait plus probe/testlet effects, and multidimensional sub-skill structures. M1-struct — the single trait plus probe/testlet effects model — won by BIC in 12 out of 12 completed units.

7.4 Structure comparison v4 full-N robustness run

The full-N robustness run (foundation_fulln_boot4_prod_20260209) used the same current number-line policy and focused on the four main pooled units. It completed the M1-struct vs M3-struct comparison and added school-level bootstrap stability checks.

Again, M1-struct won by BIC in 4 out of 4 main pooled units, and the bootstrap deltas all supported the same direction. This is the current strongest evidence that, under the selected number-line policy, the achievement structure is better represented as one broad score with probe/testlet effects than as separate latent sub-skill scores.
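For readers unfamiliar with the criterion, the sketch below shows how a BIC comparison between two fitted structures is read. It assumes the standard Schwarz definition; the source does not state which variant or sample-size convention the runs used, and the log-likelihoods, parameter counts, and sample size here are made up purely for illustration.

```python
import math


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Schwarz BIC = k * ln(n) - 2 * logLik; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# Hypothetical fit summaries, NOT results from any run described here.
fits = {
    "M1-struct": {"log_lik": -51200.0, "n_params": 420},
    "M3-struct": {"log_lik": -51150.0, "n_params": 460},
}
n_obs = 5000  # hypothetical number of person-administrations

bics = {name: bic(f["log_lik"], f["n_params"], n_obs) for name, f in fits.items()}
winner = min(bics, key=bics.get)
delta = bics["M3-struct"] - bics["M1-struct"]  # positive favours M1-struct
print(winner, round(delta, 1))
```

In this toy case the richer model has the higher likelihood, but not by enough to pay for its extra parameters, which is the pattern a BIC win for the simpler structure describes.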

7.5 Later A2 binary+number-line Stan benchmarks

The later A2 work includes binary-plus-number-line Stan implementation benchmarks (m3bench-a2binnl*, Foundation). These runs checked that the later A2 implementation could represent the binary and number-line likelihoods efficiently and consistently in Stan.

These are useful implementation and compute checks. They should not be over-read as evidence that one cutpoint policy is psychometrically best. Their role here is to show that number-line PCM/GPCM handling continued into the later A2 implementation pathway rather than being only an older mirt-era choice.

7.6 Compact run summary

| Run family | NL policy | Main question | Result | Role |
|---|---|---|---|---|
| Historical Stan PCM v2 | 3cat_B (0.80/0.95) | Can NL enter a PCM-style joint model? | Completed historical runs; partial/backfilled provenance | Context |
| Hierarchy v1 | 3cat_B (0.80/0.95) | Does broad score + testlets hold under older NL policy? | 12/12 comparisons support 1D + testlets | Sensitivity evidence |
| Structure v4 initial | Current (0.85/0.95) | Does broad score + testlets hold under current NL policy? | 12/12 completed units support M1-struct | Main evidence |
| Structure v4 full-N | Current (0.85/0.95) | Does the current-policy result replicate at full N? | 4/4 pooled units support M1-struct; bootstrap direction stable | Strongest robustness evidence |
| A2 binary+NL benchmarks | Binary+NL Stan variants | Can the later Stan A2 implementation carry the NL likelihood? | Completed implementation/compute checks | Implementation evidence |

The main conclusion is stronger than “we have not reported this yet”:

  • Historical 3cat_B runs used the looser <0.80 / 0.80–<0.95 / >=0.95 policy and already supported a broad-score-plus-probe-effects structure.
  • Current structure-comparison runs use <0.85 / 0.85–<0.95 / >=0.95 and again support M1-struct consistently.
  • Later Stan/A2 benchmark work confirms the implementation pathway for binary plus number-line likelihoods.
  • New number-line-specific checks now support the current three-category policy as the operational-compatible baseline, while keeping continuous and signed-error models as challengers.

8 Model ladder and updated evidence

The current conclusion is not “PCM is psychometrically true”. It is: use the current three-category PCM/GPCM as the operational-compatible baseline, then test whether richer models earn their added complexity.

| Code | Plain-English label | Current evidence | Decision |
|---|---|---|---|
| NL0 | Mean PAE/raw-score benchmark | Recovered student and family summaries. Simple, transparent, but no IRT uncertainty and loses signed error. | Keep as comparison benchmark. |
| NL1 | Current .85/.95 3-category GPCM | Number-line-only and embedded whole-battery GPCM fits both converged. Embedded full-battery fit used 397 Foundation items and 479 Year 1 items. | Main operational-compatible baseline. |
| NL2 | Current .85/.95 with testlet/method structure | Embedded current-policy structure evidence favours broad score + probe/testlet effects over separate sub-skill dimensions. Student-level NL2 scores were not separately extracted in the latest pass. | Preferred ordinal upgrade path, not a new live score. |
| NL3 | Ordinal sensitivity policies | Legacy 3-cat, coded 5-cat, consultant 4/5 PAE bands, and binary >=.95 all fitted as sensitivity checks; consultant PAE bands are sparse at item level. | Sensitivity only. |
| NL4 | Continuous absolute-error challenger | Embedded continuous prototype is highly correlated with the raw-score benchmark, especially in Year 1, but external validation did not beat NL1. | Research challenger, not promoted yet. |
| NL5 | Signed-error / click-location model | Signed-error checks show target-dependent bias and method/family patterns, especially BNL compression. | Response-process validation model. |

8.1 Threshold evidence

For the current Term 3/4 modelling scope, the category policy comparison is:

| Policy | Foundation category shares (%) | Year 1 category shares (%) | Read |
|---|---|---|---|
| Current 3-cat .85/.95 | 34.2 / 32.9 / 32.9 | 30.4 / 36.9 / 32.7 | Best operational-compatible balance. |
| Legacy 3-cat .80/.95 | 25.4 / 41.8 / 32.9 | 20.1 / 47.2 / 32.7 | More middle-heavy. |
| Existing coded 5-cat | 25.4 / 22.0 / 19.8 / 18.2 / 14.6 | 20.1 / 25.5 / 21.7 / 18.1 / 14.6 | Feasible sensitivity policy. |
| Consultant 4 PAE bands | 13.3 / 12.1 / 22.0 / 52.7 | 8.2 / 11.9 / 25.5 / 54.4 | Top-heavy; item-level sparsity. |
| Consultant 5 PAE bands | 13.3 / 12.1 / 22.0 / 19.9 / 32.8 | 8.2 / 11.9 / 25.5 / 21.7 / 32.7 | Interpretable but sparse at item level. |

The item-level sparsity check is the stronger reason not to start operationally with the consultant 4/5 PAE bands:

| Policy | Foundation cells <5% | Year 1 cells <5% | Empty item-categories |
|---|---|---|---|
| Current 3-cat .85/.95 | 5 | 11 | 0 |
| Legacy 3-cat .80/.95 | 6 | 13 | 0 |
| Existing coded 5-cat | 16 | 25 | 0 |
| Consultant 4 PAE bands | 32 | 53 | 1 |
| Consultant 5 PAE bands | 32 | 54 | 1 |
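The two checks in this subsection (pooled category shares and item-level sparsity) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: a "cell" is taken to be one item-by-category combination, a cell counts as sparse when fewer than 5% of that item's responses fall in it, and empty cells are also counted as sparse. The source does not say whether its "<5%" counts include empty cells. The function name and toy data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def occupancy(responses, cutpoints, sparse_below=0.05):
    """Summarise category occupancy for (item_id, raw_score) pairs.

    Returns (pooled category shares, number of item-category cells whose
    within-item share is below `sparse_below`, number of empty cells).
    """
    n_cat = len(cutpoints) + 1
    per_item = defaultdict(Counter)
    for item_id, score in responses:
        per_item[item_id][bisect_right(cutpoints, score)] += 1

    pooled = Counter()
    n_sparse = n_empty = 0
    for counts in per_item.values():
        n_item = sum(counts.values())
        pooled.update(counts)
        for k in range(n_cat):
            n_empty += counts[k] == 0
            n_sparse += counts[k] / n_item < sparse_below  # includes empty cells
    total = sum(pooled.values())
    return [pooled[k] / total for k in range(n_cat)], n_sparse, n_empty


# Toy data (not from the study): item "B" is top-heavy.
responses = [("A", s) for s in (0.50, 0.90, 0.96, 0.97)]
responses += [("B", 0.99)] * 20 + [("B", 0.90)]
shares, n_sparse, n_empty = occupancy(responses, cutpoints=(0.85, 0.95))
print([round(s, 3) for s in shares], n_sparse, n_empty)
```

Pooled shares can look balanced while individual items are sparse, which is why the item-level count is treated here as the stronger check.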

8.2 Response-process evidence

Number-line items have useful signal: median continuous item-rest correlations are about .40–.46 across BNL/UNL families. But the response process is not just ordinal accuracy. Signed-error slopes show systematic target-dependent bias. For example, Year 1 BNL signed-error slopes are around -.38 to -.44, consistent with compression/target-location effects.

This is why continuous and signed-error models remain formal challengers even though the current three-category policy is the baseline.
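To make the signed-error slope concrete, the sketch below regresses signed error on target position for synthetic clicks that are compressed toward the middle of the line. It assumes signed error is click minus target with both on a 0–1 scale; the source does not define the quantity or its scaling, and the data are synthetic, not study data.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic responses: clicks cover only 60% of the true spread around 0.5.
targets = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
clicks = [0.5 + 0.6 * (t - 0.5) for t in targets]

# Assumed definition: positive = overestimate, negative = underestimate.
signed_error = [c - t for c, t in zip(clicks, targets)]
slope = ols_slope(targets, signed_error)
print(round(slope, 3))
```

A negative slope means low targets are overestimated and high targets underestimated. A three-category accuracy score cannot express that pattern, because it discards the direction of the error.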

8.3 Embedded whole-battery comparison

The latest embedded comparison fitted the current .85/.95 ordinal model (NL1) inside the wider scored battery. This was not a Bayesian run; it was a marginal maximum-likelihood mirt fit with binary items and number-line GPCM items.

| Year level | Person-administrations | Items | Number-line ordinal items | Converged | Median SE |
|---|---|---|---|---|---|
| Foundation | 5,065 | 397 | 80 | Yes | 0.407 |
| Year 1 | 5,015 | 479 | 134 | Yes | 0.360 |

Agreement with the raw-score benchmark (NL0) was high but not perfect. NL1 correlated .93–.94 with NL0 in Foundation and .84–.89 in Year 1 across terms. The continuous embedded prototype (NL4) correlated .89–.93 with NL0 in Foundation and .94–.95 in Year 1. This means the continuous prototype is coherent, but mostly tracks the raw-score ordering rather than clearly improving on the ordinal model.

External validation also favoured retaining NL1 for now:

| Anchor | Foundation NL0 | Foundation NL1 | Foundation NL4 | Year 1 NL0 | Year 1 NL1 | Year 1 NL4 |
|---|---|---|---|---|---|---|
| PAT scaled score | .430 | .455 | .343 | .395 | .449 | .379 |
| Teacher rating | .508 | .508 | .487 | .496 | .544 | .473 |

These are converging anchors, not gold standards. But they do not support replacing the ordinal baseline with the continuous prototype at this stage.

9 Remaining evidence gates

The embedded comparison resolves the immediate question: the continuous prototype does not yet displace the current .85/.95 ordinal baseline. The remaining gates are now narrower:

  1. extract student-level scores from the NL2 testlet/method structure, rather than relying only on recovered structure-comparison evidence;
  2. fit a full mixed-response continuous model if we want to keep testing NL4 as more than a z-score prototype;
  3. develop NL5 signed-error/click-location as a response-process validity model, not a direct score replacement;
  4. continue fairness/form/chair checks, especially BNL vs UNLC vs UNLNC and Year 1 Term 1 repeated targets;
  5. evaluate uncertainty near any future operational cutpoints.

External anchors are converging evidence, not gold standards. PAT Maths cells below n = 50 should be suppressed or marked descriptive-only; cells from 50 to 99 should be treated as exploratory.
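The cell-size rule for PAT Maths cells can be written as a small lookup. The two thresholds come from the text; the function name, the status labels, and the treatment of cells with n of 100 or more as reportable are assumptions made for this sketch.

```python
def pat_cell_status(n: int) -> str:
    """Classify an external-validation cell by its sample size."""
    if n < 0:
        raise ValueError("cell size cannot be negative")
    if n < 50:
        return "suppress_or_descriptive_only"
    if n < 100:
        return "exploratory"
    # Assumption: the source does not name a status for n >= 100.
    return "reportable"


for n in (12, 49, 50, 99, 100, 640):
    print(n, pat_cell_status(n))
```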

10 Final lock-in comparison

A final lock-in pass compared the serious number-line variants under one common protocol: raw-score benchmark (NL0), simple ordinal GPCM (NL1), the default ability + probe/testlet ordinal model (NL2), the continuous prototype (NL4), and signed-error diagnostics (NL5).

The operational target is now NL2: the current .85/.95 number-line categories inside the default ability + probe/testlet model. This is the model challengers must beat.

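| Model | Year level | Person-administrations | Items | NL ordinal items | Testlets | Converged | Median SE |
|---|---|---|---|---|---|---|---|
| NL1 | Foundation | 5,065 | 397 | 80 | | Yes | .407 |
| NL1 | Year 1 | 5,015 | 479 | 134 | | Yes | .360 |
| NL2 | Foundation | 5,065 | 397 | 80 | 18 | Yes | .500 |
| NL2 | Year 1 | 5,015 | 479 | 134 | 18 | Yes | .469 |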

The higher NL2 standard errors are expected because the model absorbs probe/testlet dependence rather than treating every item response as fully independent evidence.

External validation remains broadly supportive of the ordinal path:

| Anchor | Foundation NL0 | Foundation NL1 | Foundation NL2 | Foundation NL4 | Year 1 NL0 | Year 1 NL1 | Year 1 NL2 | Year 1 NL4 |
|---|---|---|---|---|---|---|---|---|
| PAT scaled score | .430 | .455 | .440 | .343 | .395 | .449 | .460 | .379 |
| Teacher rating | .508 | .508 | .500 | .487 | .496 | .544 | .534 | .473 |

Screening checks at approximately 80% specificity also support keeping the ordinal path. In Year 1 PAT-risk validation, NL2 had the strongest AUC among compared models (.743) and the highest sensitivity at fixed specificity. Foundation results were closer, but the continuous result is not enough to promote NL4 because it remains a prototype rather than a full continuous IRT model.
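The two screening quantities mentioned above can be sketched as follows. This is a minimal illustration, assuming a risk score where higher means more at risk (for example, a negated ability estimate) and a binary at-risk flag from the anchor. The source does not describe how its AUC or fixed-specificity threshold was computed, and the data and function names here are made up.

```python
import math


def auc(risk, at_risk):
    """P(random at-risk case outranks random not-at-risk case); ties count half."""
    pos = [r for r, y in zip(risk, at_risk) if y]
    neg = [r for r, y in zip(risk, at_risk) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_at_specificity(risk, at_risk, target_spec=0.80):
    """Flag cases above the lowest threshold whose specificity >= target_spec."""
    pos = [r for r, y in zip(risk, at_risk) if y]
    neg = sorted(r for r, y in zip(risk, at_risk) if not y)
    # Smallest count of not-at-risk cases that must stay unflagged; the small
    # epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(target_spec * len(neg) - 1e-9))
    threshold = neg[k - 1]
    specificity = sum(r <= threshold for r in neg) / len(neg)
    sensitivity = sum(r > threshold for r in pos) / len(pos)
    return sensitivity, specificity, threshold


# Toy data (not study data): five not-at-risk and four at-risk students.
risk = [0.1, 0.2, 0.3, 0.4, 0.5, 0.45, 0.6, 0.7, 0.9]
at_risk = [0, 0, 0, 0, 0, 1, 1, 1, 1]
area = auc(risk, at_risk)
sens, spec, thr = sensitivity_at_specificity(risk, at_risk, target_spec=0.80)
print(round(area, 3), sens, spec, thr)
```

Fixing specificity before comparing sensitivity keeps the comparison between scoring models fair, since each model is held to the same false-positive rate.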

11 Decision statement

Lock the selected three-category policy (<0.85, 0.85–<0.95, >=0.95) inside the default ability + probe/testlet model (NL2) as the operational-compatible number-line treatment for the next scoring version. This treatment preserves graded number-line evidence, avoids the sparsity seen in finer PAE bands, and handles local dependence better than the simple ordinal baseline (NL1). The categories are not teacher-facing achievement bands, and the policy is not a claim that 0.85 and 0.95 are operational cut scores.

Do not declare PCM/GPCM the final psychometric answer for all time. Number-line responses are continuous spatial-estimation responses, and signed-error evidence shows meaningful response-process structure. Continuous and signed-error models should remain formal challengers. However, the current lock-in evidence does not justify replacing the default ordinal testlet model (NL2) with the continuous prototype or signed-error diagnostic model.

12 See also