Numeracy Screening Test Reports

Timed subtest scoring policy

Scoring timed subtests: how should unreached items count?

Last updated 13 June 2026, 05:59 PM Sydney time

What we are investigating

ENSSA needs a fluency-based measure that helps identify students who may need extra support. For timed subtests, that means the score should reflect not only whether students answer correctly, but also how far they progress through the intended item set within the time limit.

Normally, omitted responses are treated as missing. For this timed-performance purpose, however, we need to test whether trailing unreached items (items after a student’s last response when time runs out) should instead count as zero credit.

The key distinction is between accuracy on only the items a student reached and performance over the full timed item set.

  • Accuracy on reached items can look high for a student who answered most reached items correctly but reached very few items.
  • A timed score, by contrast, should distinguish that pattern from a student who was just as accurate on reached items but progressed further through the subtest.
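
The distinction can be illustrated with a minimal sketch. The two students, the 30-item subtest length and the function names are hypothetical and are not taken from the ENSSA pipeline; the point is only that the two quantities use different denominators.

```python
def reached_accuracy(correct: int, reached: int) -> float:
    """Correct responses divided by the items the student reached."""
    if reached <= 0:
        raise ValueError("reached must be positive")
    return correct / reached


def timed_score(correct: int, total_items: int) -> float:
    """Correct responses divided by the full timed item set."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return correct / total_items


TOTAL_ITEMS = 30  # hypothetical subtest length

# Two hypothetical students, both fully accurate on the items they reached.
student_a = {"correct": 5, "reached": 5}
student_b = {"correct": 20, "reached": 20}

for name, s in (("A", student_a), ("B", student_b)):
    acc = reached_accuracy(s["correct"], s["reached"])
    score = timed_score(s["correct"], TOTAL_ITEMS)
    print(f"student {name}: reached accuracy {acc:.2f}, timed score {score:.2f}")
```

Both students have the same reached-item accuracy, but only the timed score reflects that student B progressed four times as far.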

What the response patterns show

Most students stop well before the end of timed subtests. The median student completes between 10% and 32% of items depending on the subtest. Skipped items inside the subtest are negligible: the skipped-within-subtest rate stays below 1% at every item position in every plotted subtest.

Nonresponse by item position, Term 3

The table below shows why reached-item accuracy is not a timed-performance score: students are highly accurate on the items they reach, but they reach only a small share of the intended item set. The final column treats the full timed subtest as the denominator, so unreached trailing items add no correct responses.

| Year | Subtest | N students | Median completion rate | Mean accuracy on reached items | Mean score: correct / total items |
| --- | --- | --- | --- | --- | --- |
| Foundation | MC0-20 | 1443 | 0.317 | 0.916 | 0.295 |
| Foundation | MNA0-20 | 648 | 0.267 | 0.859 | 0.239 |
| Foundation | MNC0-20 | 785 | 0.233 | 0.848 | 0.192 |
| Foundation | MQ1-10 | 789 | 0.267 | 0.930 | 0.244 |
| Foundation | MQ1-20 | 655 | 0.167 | 0.828 | 0.133 |
| Year 1 | AAMC | 1495 | 0.225 | 0.910 | 0.231 |
| Year 1 | ASDD | 703 | 0.100 | 0.768 | 0.098 |
| Year 1 | ASMC | 789 | 0.233 | 0.767 | 0.199 |
| Year 1 | MC0-100 | 1503 | 0.317 | 0.909 | 0.295 |
| Year 1 | MNA0-100 | 867 | 0.225 | 0.898 | 0.200 |
| Year 1 | MNC0-100 | 633 | 0.267 | 0.929 | 0.265 |

In the scatterplot below, points far above the diagonal are students whose reached items were mostly correct but who reached relatively few items — exactly the cases where reached-item accuracy and the full timed score disagree.

Accuracy on reached items against timed item-set score, Term 3

Many students have summed response time below the timed limit even though they did not finish all items. That matters mainly for correct-per-minute: raw sum score is not affected because it does not use response time as a denominator. In these data, there are very few cases where students with few correct items get unusually high correct-per-minute scores. The main point is simpler: correct-per-minute from summed rt_sec is harder to interpret than the raw number correct.
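
To make the interpretability concern concrete, here is a minimal sketch of a correct-per-minute calculation from summed per-item response times. The function name and both students are hypothetical; only the rt_sec field name comes from the data described above.

```python
from typing import Sequence


def correct_per_minute(n_correct: int, rt_sec: Sequence[float]) -> float:
    """Correct items per minute, using summed item response time as the denominator."""
    total_sec = sum(rt_sec)
    if total_sec <= 0:
        raise ValueError("summed response time must be positive")
    return 60.0 * n_correct / total_sec


# Hypothetical student who answers three items quickly and then stops responding.
quick_stop = correct_per_minute(3, [5.0, 7.0, 8.0])

# Hypothetical student who works steadily for most of the time limit.
steady = correct_per_minute(12, [9.0] * 12 + [2.0])

print(f"quick_stop: 3 correct, {quick_stop:.2f} per minute")
print(f"steady: 12 correct, {steady:.2f} per minute")
```

In this constructed case the student with 3 correct items gets the higher rate, because the summed response time ignores the unused part of the time limit. Such cases are rare in the Term 3 data, but the example shows why the raw number correct is the easier quantity to interpret.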

Histogram of total response time, Term 3 timed subtests
Total response time against correct items per minute, Term 3 timed subtests

Scoring methodology

We computed scores five different ways to separate two questions: how accurate students were on the items they reached, and how much credit they earned over the full timed item set.

Raw sum score (A): correct items

Total credit earned across the intended timed item set. The denominator is a scoring convention, not a claim that every item was displayed on screen.

Raw accuracy on reached (B): correct / reached

Correct responses divided by reached items. Useful diagnostic context, but not a timed-performance score.

Logit accuracy on reached (C): IRT over reached items

A logit-scale item-response score using reached items only; unreached items are treated as missing.

Logit accuracy trailing-zero (D): IRT with unreached = 0

A logit-scale item-response score where trailing unreached items are scored zero; skipped and leading gaps remain missing. This is the policy candidate.

Joint accuracy-reach expected credit (F): Σ P(reach) × P(correct | reached)

A Bayesian joint model of item reach and correctness if reached. It reports expected credit over the intended timed item set.
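
The expected-credit sum can be sketched as follows. This shows the arithmetic only, not the joint Stan model: the per-item probabilities are hypothetical placeholders for quantities that the fitted model would supply, and the function name is an assumption.

```python
from typing import Sequence


def expected_credit(
    p_reach: Sequence[float], p_correct_given_reach: Sequence[float]
) -> float:
    """Sum over items of P(reach) * P(correct | reached)."""
    if len(p_reach) != len(p_correct_given_reach):
        raise ValueError("both sequences must have one entry per item")
    return sum(r * c for r, c in zip(p_reach, p_correct_given_reach))


# Hypothetical four-item subtest: reach probability falls with item position.
p_reach = [1.0, 0.9, 0.5, 0.1]
p_correct = [0.9, 0.9, 0.8, 0.8]

credit = expected_credit(p_reach, p_correct)
proportion = credit / len(p_reach)  # share of the timed item set

print(f"expected credit {credit:.2f} of {len(p_reach)} items ({proportion:.3f})")
```

Dividing by the number of items gives a proportion of the timed item set, which appears to be the scale of the F means in the results tables (roughly 0.15 to 0.30).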

There are two main kinds of missing data. Trailing unreached items are items after the student’s last valid response; for timed-performance scoring, these count as zero credit. Skipped, leading, or invalid rows do not show how far the student progressed, so they are treated as missing.
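
A minimal sketch of this rule, assuming responses are held in item-position order as 1 (correct), 0 (incorrect) or None (no valid response). The representation and the function name are illustrative, and the handling of a student with no valid response at all is an assumption of this sketch, not a documented policy.

```python
from typing import List, Optional


def apply_trailing_zero(responses: List[Optional[int]]) -> List[Optional[int]]:
    """Score trailing unreached items as 0; leave every other gap missing (None)."""
    valid_positions = [i for i, r in enumerate(responses) if r is not None]
    if not valid_positions:
        # Assumption: with no valid response, nothing marks progress, so all stay missing.
        return list(responses)
    last_valid = valid_positions[-1]
    return [
        0 if (r is None and i > last_valid) else r
        for i, r in enumerate(responses)
    ]


# Leading gap, one skipped item, then two trailing unreached items.
raw = [None, 1, None, 0, 1, None, None]
scored = apply_trailing_zero(raw)
print(scored)
```

Only the two gaps after the last valid response become zeros; the leading gap and the skipped item inside the subtest stay missing.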

The tornado plot summarises the reached-only C score against the trailing-zero D score. Each bar is D minus C for one year-level × subtest: positive values favour D on that metric. Reliability and PSI are construct-specific model diagnostics; the Spearman columns are directly comparable correlations.
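
As a rough sketch of how one set of bars is built, the snippet below computes D minus C for the reliability column, using the Term 3 values reported in this document. The variable and function names are hypothetical. As noted above, reliability is a construct-specific diagnostic, so these differences describe each model's precision rather than a like-for-like gain.

```python
# (subtest, reliability for C, reliability for D), copied from the Term 3 fit tables.
RELIABILITY = [
    ("MC0-20", 0.626, 0.939),
    ("MNA0-20", 0.465, 0.834),
    ("MNC0-20", 0.602, 0.876),
    ("MQ1-10", 0.414, 0.824),
    ("MQ1-20", 0.525, 0.628),
    ("AAMC", 0.541, 0.906),
    ("ASDD", 0.594, 0.862),
    ("ASMC", 0.671, 0.874),
    ("MC0-100", 0.636, 0.943),
    ("MNA0-100", 0.437, 0.824),
    ("MNC0-100", 0.469, 0.907),
]


def d_minus_c(rows):
    """Return (subtest, D - C) pairs, largest absolute difference first."""
    diffs = [(name, round(d - c, 3)) for name, c, d in rows]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)


bars = d_minus_c(RELIABILITY)
for name, diff in bars:
    print(f"{name:10s} {diff:+.3f}")
```

Every difference is positive, ranging from 0.103 (MQ1-20) to 0.438 (MNC0-100).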

C versus D tornado plot across Term 3 timed subtests

Results by year level × subtest

Results are reported separately for each year-level and subtest combination: the Foundation subtests come first, followed by the Year 1 subtests. Each block keeps the scoring-policy evidence within one year-level and subtest context.

Magnitude comparison 0–20 · Term 3 · MC0-20_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students almost identically; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 17.672 | 6.554 | 1.000 | 0.112 (n=192) | 0.243 (n=581) |
| Raw accuracy on reached (B) | 0.916 | 0.146 | 0.406 | 0.158 (n=190) | 0.264 (n=578) |
| Logit accuracy on reached (C) | -0.623 | 1.363 | 0.593 | 0.176 (n=190) | 0.302 (n=578) |
| Logit accuracy trailing-zero (D) | -0.161 | 2.705 | 0.997 | 0.087 (n=190) | 0.238 (n=578) |
| Joint accuracy-reach expected credit (F) | 0.301 | 0.074 | 0.994 | 0.080 (n=190) | 0.230 (n=578) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.
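
The reason rank agreement is the right comparison across scales can be shown with a small sketch: Spearman correlation depends only on the ordering of students, so a count and a logit that order students identically correlate at 1 even though their means and SDs differ. The implementation below is a plain-Python illustration with hypothetical scores, not the code used to produce the tables.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical raw counts and a monotone logit-style rescaling of them.
raw_counts = [3, 7, 7, 12, 18]
logit_like = [-2.1, -0.4, -0.4, 0.9, 2.6]
rho = spearman(raw_counts, logit_like)
print(f"Spearman: {rho:.3f}")
```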

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.626 | item-response | 1.151 | | 1.185 | 36 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.939 | item-response | 0.659 | | 4.105 | 38 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.904 | posterior proxy | 0.024 | 0.023 | 3.085 | 60 | | 0 | 1.011 | 578 | 1099 | 0.176 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.
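
The PSI column is not defined in this report. The tabulated values are, however, consistent with the score SD divided by the mean SE (or mean posterior SD for F). The check below uses the Magnitude comparison 0–20 figures; it records an observation about the published numbers, not a documented definition, and the function name is hypothetical.

```python
# (score, SD, mean SE or mean posterior SD, reported PSI) for MC0-20_2025, Term 3.
ROWS = [
    ("C", 1.363, 1.151, 1.185),
    ("D", 2.705, 0.659, 4.105),
    ("F", 0.074, 0.024, 3.085),
]


def separation_ratio(sd: float, mean_se: float) -> float:
    """Score spread relative to average measurement error."""
    if mean_se <= 0:
        raise ValueError("mean_se must be positive")
    return sd / mean_se


for score, sd, mean_se, reported in ROWS:
    ratio = separation_ratio(sd, mean_se)
    print(f"{score}: SD / mean SE = {ratio:.3f}, reported PSI = {reported:.3f}")
```

The small gaps between the computed and reported values are within what rounding of the inputs to three decimals would produce.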

Missing number (ascending) 0–20 · Term 3 · MNA0-20_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students very similarly (Spearman 0.979). The joint model shows some divergence; see the key contrasts and diagnostics below.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 7.157 | 3.087 | 1.000 | 0.367 (n=26) | 0.265 (n=241) |
| Raw accuracy on reached (B) | 0.859 | 0.179 | 0.205 | 0.458 (n=26) | 0.204 (n=240) |
| Logit accuracy on reached (C) | -0.523 | 1.313 | 0.705 | 0.575 (n=26) | 0.310 (n=240) |
| Logit accuracy trailing-zero (D) | -0.035 | 2.093 | 0.979 | 0.377 (n=26) | 0.249 (n=240) |
| Joint accuracy-reach expected credit (F) | 0.257 | 0.056 | 0.962 | 0.293 (n=26) | 0.252 (n=240) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.465 | item-response | 1.301 | | 1.009 | 25 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.834 | item-response | 0.860 | | 2.434 | 28 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.750 | posterior proxy | 0.032 | 0.031 | 1.742 | 30 | | 0 | 1.009 | 667 | 943 | 0.072 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Missing number (choice) 0–20 · Term 3 · MNC0-20_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students very similarly; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 5.766 | 3.318 | 1.000 | 0.212 (n=163) | 0.354 (n=337) |
| Raw accuracy on reached (B) | 0.848 | 0.276 | 0.555 | 0.225 (n=160) | 0.313 (n=328) |
| Logit accuracy on reached (C) | -1.586 | 1.447 | 0.886 | 0.212 (n=160) | 0.360 (n=328) |
| Logit accuracy trailing-zero (D) | -0.056 | 2.574 | 0.967 | 0.144 (n=160) | 0.363 (n=328) |
| Joint accuracy-reach expected credit (F) | 0.217 | 0.055 | 0.956 | 0.148 (n=160) | 0.376 (n=328) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.602 | item-response | 1.437 | | 1.007 | 21 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.876 | item-response | 0.916 | | 2.809 | 22 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.737 | posterior proxy | 0.032 | 0.032 | 1.693 | 30 | | 0 | 1.011 | 766 | 1286 | -0.118 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Match quantity 1–10 · Term 3 · MQ1-10_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students almost identically; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 7.305 | 2.661 | 1.000 | 0.248 (n=165) | 0.248 (n=339) |
| Raw accuracy on reached (B) | 0.930 | 0.125 | 0.184 | 0.131 (n=164) | 0.079 (n=336) |
| Logit accuracy on reached (C) | -1.188 | 1.106 | 0.656 | 0.201 (n=164) | 0.203 (n=336) |
| Logit accuracy trailing-zero (D) | 0.001 | 2.217 | 0.989 | 0.244 (n=164) | 0.242 (n=336) |
| Joint accuracy-reach expected credit (F) | 0.262 | 0.040 | 0.970 | 0.230 (n=164) | 0.244 (n=336) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.414 | item-response | 1.490 | | 0.743 | 18 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.824 | item-response | 0.929 | | 2.386 | 21 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.675 | posterior proxy | 0.027 | 0.027 | 1.449 | 30 | | 0 | 1.012 | 634 | 1397 | -0.234 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Match quantity 1–20 · Term 3 · MQ1-20_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students almost identically (Spearman 0.990). The joint model shows some divergence; see the key contrasts and diagnostics below.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 3.977 | 1.909 | 1.000 | 0.258 (n=27) | 0.284 (n=243) |
| Raw accuracy on reached (B) | 0.828 | 0.237 | 0.331 | 0.066 (n=27) | 0.222 (n=241) |
| Logit accuracy on reached (C) | -0.828 | 1.242 | 0.663 | 0.055 (n=27) | 0.286 (n=241) |
| Logit accuracy trailing-zero (D) | 0.013 | 1.457 | 0.990 | 0.258 (n=27) | 0.304 (n=241) |
| Joint accuracy-reach expected credit (F) | 0.148 | 0.021 | 0.957 | 0.351 (n=27) | 0.309 (n=241) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.525 | item-response | 1.436 | | 0.866 | 17 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.628 | item-response | 0.926 | | 1.573 | 22 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.471 | posterior proxy | 0.022 | 0.020 | 0.973 | 30 | | 0 | 1.011 | 1234 | 1428 | -0.501 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Arithmetic facts — addition (multiple choice) · Term 3 · AADD_2025-NEW

Logit accuracy trailing-zero (D) and Raw sum score (A) order students very similarly; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 9.227 | 4.122 | 1.000 | 0.320 (n=296) | 0.399 (n=624) |
| Raw accuracy on reached (B) | 0.910 | 0.159 | 0.330 | 0.183 (n=296) | 0.213 (n=624) |
| Logit accuracy on reached (C) | -1.156 | 1.378 | 0.726 | 0.305 (n=296) | 0.365 (n=624) |
| Logit accuracy trailing-zero (D) | -0.196 | 2.754 | 0.976 | 0.330 (n=296) | 0.398 (n=624) |
| Joint accuracy-reach expected credit (F) | 0.247 | 0.061 | 0.965 | 0.318 (n=296) | 0.404 (n=624) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.541 | item-response | 1.384 | | 0.995 | 27 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.906 | item-response | 0.834 | | 3.303 | 30 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.827 | posterior proxy | 0.028 | 0.027 | 2.212 | 40 | | 0 | 1.010 | 457 | 827 | 0.062 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Arithmetic facts — subtraction · Term 3 · ASDD_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) show some ordering differences worth noting (Spearman 0.902, the lowest agreement among the Term 3 timed subtests). The joint model shows some divergence; see the key contrasts and diagnostics below.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 2.949 | 3.244 | 1.000 | 0.117 (n=147) | 0.245 (n=263) |
| Raw accuracy on reached (B) | 0.768 | 0.312 | 0.569 | 0.372 (n=99) | 0.292 (n=202) |
| Logit accuracy on reached (C) | -0.810 | 1.505 | 0.813 | 0.438 (n=98) | 0.309 (n=202) |
| Logit accuracy trailing-zero (D) | -0.046 | 2.624 | 0.902 | 0.511 (n=99) | 0.360 (n=202) |
| Joint accuracy-reach expected credit (F) | 0.171 | 0.060 | 0.886 | 0.542 (n=99) | 0.385 (n=202) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.594 | item-response | 1.456 | | 1.034 | 17 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.862 | item-response | 0.988 | | 2.655 | 24 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.749 | posterior proxy | 0.034 | 0.034 | 1.744 | 30 | | 0 | 1.019 | 440 | 1175 | 0.305 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Arithmetic facts — subtraction (multiple choice) · Term 3 · ASDD_2025_NEW

Logit accuracy trailing-zero (D) and Raw sum score (A) order students very similarly; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 5.981 | 3.828 | 1.000 | 0.311 (n=149) | 0.419 (n=361) |
| Raw accuracy on reached (B) | 0.767 | 0.297 | 0.608 | 0.258 (n=149) | 0.262 (n=361) |
| Logit accuracy on reached (C) | -0.623 | 1.546 | 0.754 | 0.290 (n=149) | 0.323 (n=361) |
| Logit accuracy trailing-zero (D) | -0.063 | 2.253 | 0.978 | 0.293 (n=149) | 0.412 (n=361) |
| Joint accuracy-reach expected credit (F) | 0.218 | 0.076 | 0.973 | 0.292 (n=149) | 0.408 (n=361) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.671 | item-response | 1.239 | | 1.248 | 29 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.874 | item-response | 0.826 | | 2.729 | 29 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.797 | posterior proxy | 0.038 | 0.037 | 2.008 | 30 | | 0 | 1.011 | 442 | 783 | 0.030 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Magnitude comparison 0–100 · Term 3 · MC0-100_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students almost identically; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 17.717 | 6.652 | 1.000 | 0.299 (n=296) | 0.324 (n=624) |
| Raw accuracy on reached (B) | 0.909 | 0.137 | 0.542 | 0.317 (n=296) | 0.368 (n=621) |
| Logit accuracy on reached (C) | -0.580 | 1.383 | 0.695 | 0.356 (n=296) | 0.364 (n=621) |
| Logit accuracy trailing-zero (D) | -0.223 | 2.804 | 0.996 | 0.297 (n=296) | 0.322 (n=621) |
| Joint accuracy-reach expected credit (F) | 0.302 | 0.078 | 0.994 | 0.293 (n=296) | 0.316 (n=621) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.636 | item-response | 1.136 | | 1.217 | 36 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.943 | item-response | 0.659 | | 4.257 | 39 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.908 | posterior proxy | 0.025 | 0.024 | 3.158 | 60 | | 0 | 1.011 | 584 | 1217 | 0.270 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Missing number (ascending) 0–100 · Term 3 · MNA0-100_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students very similarly; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 7.980 | 3.069 | 1.000 | 0.296 (n=245) | 0.448 (n=424) |
| Raw accuracy on reached (B) | 0.898 | 0.143 | 0.240 | 0.201 (n=244) | 0.299 (n=423) |
| Logit accuracy on reached (C) | -0.584 | 1.030 | 0.534 | 0.288 (n=244) | 0.417 (n=423) |
| Logit accuracy trailing-zero (D) | -0.000 | 1.885 | 0.977 | 0.272 (n=244) | 0.439 (n=423) |
| Joint accuracy-reach expected credit (F) | 0.216 | 0.039 | 0.963 | 0.271 (n=244) | 0.423 (n=423) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.437 | item-response | 1.279 | | 0.806 | 29 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.824 | item-response | 0.790 | | 2.385 | 31 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.708 | posterior proxy | 0.024 | 0.023 | 1.578 | 40 | | 0 | 1.013 | 1017 | 1333 | -0.189 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.

Missing number (choice) 0–100 · Term 3 · MNC0-100_2025

Logit accuracy trailing-zero (D) and Raw sum score (A) order students almost identically; the joint model agrees and adds no material external-alignment gain.

| Score | Mean | SD | Sumscore Spearman | PAT Spearman | Teacher-rating Spearman |
| --- | --- | --- | --- | --- | --- |
| Raw sum score (A) | 7.937 | 3.806 | 1.000 | 0.444 (n=51) | 0.307 (n=200) |
| Raw accuracy on reached (B) | 0.929 | 0.173 | 0.348 | 0.234 (n=51) | 0.292 (n=200) |
| Logit accuracy on reached (C) | -2.051 | 1.196 | 0.877 | 0.414 (n=51) | 0.332 (n=200) |
| Logit accuracy trailing-zero (D) | -0.233 | 2.967 | 0.989 | 0.437 (n=51) | 0.303 (n=200) |
| Joint accuracy-reach expected credit (F) | 0.285 | 0.068 | 0.982 | 0.435 (n=51) | 0.295 (n=200) |
C and D Wright maps
C and D test characteristic curves
C and D item fit MSQ chart

Means and SDs are on each score’s own scale; comparisons across rows should use the rank-agreement and alignment columns.

Score-variant correlation matrix

Model diagnostics

| Score | Reliability | Reliability type | Mean SE / posterior SD | Median posterior SD | PSI / posterior PSI proxy | Items used in fit | Fit note | Divergent transitions | Max R-hat | Min bulk ESS | Min tail ESS | Reach–accuracy trait correlation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logit accuracy on reached (C) | 0.469 | item-response | 1.450 | | 0.825 | 24 | tam_ok | | | | | |
| Logit accuracy trailing-zero (D) | 0.907 | item-response | 0.904 | | 3.283 | 24 | tam_ok | | | | | |
| Joint accuracy-reach expected credit (F) | 0.799 | posterior proxy | 0.034 | 0.033 | 2.009 | 30 | | 0 | 1.013 | 838 | 1396 | -0.131 |

The F reliability figure is a posterior-variance proxy, not the same metric as TAM reliability.


What the evidence supports

Trailing unreached items dominate nonresponse and concentrate at the end of each timed subtest. Skipped items inside the subtest are rare. That pattern supports treating trailing unreached items as part of timed performance rather than as ordinary missing data.

Scoring trailing unreached items as zero is therefore not a penalty; it is a description of how far the student progressed within the time limit. For a timed score, the denominator should be the timed item set, not only the items the student reached.

The operational scoring model should be Logit accuracy trailing-zero (D): trailing unreached items count as zero, while skipped or leading gaps remain missing. When subtests are modelled in isolation, D orders students almost identically to Raw sum score (A), but adds what a raw count cannot:

  • item-difficulty weighting,
  • standard errors and reliability evidence, and
  • a linking basis for comparing terms and subtests.

Nonetheless, Raw sum score (A) is not necessarily wrong. It is the transparent observed baseline, and because A and D agree so closely, reporting raw sum score instead would be defensible for ENSSA.

Joint accuracy-reach expected credit (F) fitted cleanly (0 divergent transitions across the 11 cells; worst R-hat 1.019) and reproduces D’s ordering almost exactly (median rank correlation 0.989). Its precision proxy sits below D’s reliability (medians 0.75 against 0.87, on different metrics), and external alignment is effectively level: the median difference against D is -0.002 for the latest PAT assessment and 0.002 for teacher ratings. Applied one subtest at a time, the more complex model lands in the same place as the simpler policy. By Occam’s razor, D is the better operational choice at the subtest level.
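
The medians quoted above can be reproduced from the rounded values in the per-subtest tables, as in the sketch below. The original figures may have been computed from unrounded values, so exact agreement at three decimals is an observation rather than a guarantee; the variable and function names are hypothetical.

```python
from statistics import median

# Values copied from the Term 3 tables, in this subtest order:
# MC0-20, MNA0-20, MNC0-20, MQ1-10, MQ1-20, AAMC, ASDD, ASMC, MC0-100, MNA0-100, MNC0-100
PAT_D = [0.087, 0.377, 0.144, 0.244, 0.258, 0.330, 0.511, 0.293, 0.297, 0.272, 0.437]
PAT_F = [0.080, 0.293, 0.148, 0.230, 0.351, 0.318, 0.542, 0.292, 0.293, 0.271, 0.435]
TEACHER_D = [0.238, 0.249, 0.363, 0.242, 0.304, 0.398, 0.360, 0.412, 0.322, 0.439, 0.303]
TEACHER_F = [0.230, 0.252, 0.376, 0.244, 0.309, 0.404, 0.385, 0.408, 0.316, 0.423, 0.295]
REL_D = [0.939, 0.834, 0.876, 0.824, 0.628, 0.906, 0.862, 0.874, 0.943, 0.824, 0.907]
REL_F = [0.904, 0.750, 0.737, 0.675, 0.471, 0.827, 0.749, 0.797, 0.908, 0.708, 0.799]


def median_difference(f_values, d_values):
    """Median of the per-subtest F minus D differences, rounded to 3 decimals."""
    diffs = [f - d for f, d in zip(f_values, d_values)]
    return round(median(diffs), 3)


print("PAT, median F - D:", median_difference(PAT_F, PAT_D))
print("Teacher rating, median F - D:", median_difference(TEACHER_F, TEACHER_D))
print("Median precision, F vs D:", median(REL_F), median(REL_D))
```

The individual PAT differences range from about -0.08 to 0.09, so the near-zero median hides some subtest-level variation, notably in the cells with small PAT samples.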

The remaining question is across subtests. Analyses so far model each timed subtest in isolation. A global hierarchical reach–accuracy model pooling all timed subtests could borrow strength across subtests and report reach and accuracy as separate traits. That global F-style model may prove more reliable or valid than a global version of D.

Next steps:

  1. Compare global D and global F-style scoring.
  2. Compare testlet and no-testlet specifications.
  3. Compare hierarchical multidimensional and unidimensional specifications.

Appendix

Score definitions and legacy column names

  • presented_ columns in older CSVs refer to the timed item-set denominator; they do not imply on-screen display. Preferred name: full_form_denominator_.
  • presented_zero and all_presented_zero_sensitivity are legacy names for the timed item-set zero-credit policy and the all-nonresponse-zero sensitivity (variant E).
  • modelled_expected_full_form_score is a legacy name for expected credit over the timed item set (the F score).
  • F_stage1_exp_credit is the superseded two-step approximation; Joint accuracy-reach expected credit (F) is the joint Stan model reported here.

Logit score fit detail (Term 3)

| Year | Subtest | Item set ID | Score | N students | Items in subtest | Items used in fit | Reliability | Fit note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Foundation | MC0-20 | MC0-20_2025 | Logit accuracy on reached (C) | 1433 | 38 | 36 | 0.626 | tam_ok |
| Foundation | MC0-20 | MC0-20_2025 | Logit accuracy trailing-zero (D) | 1433 | 60 | 38 | 0.939 | tam_ok |
| Foundation | MNA0-20 | MNA0-20_2025 | Logit accuracy on reached (C) | 647 | 30 | 25 | 0.465 | tam_ok |
| Foundation | MNA0-20 | MNA0-20_2025 | Logit accuracy trailing-zero (D) | 647 | 30 | 28 | 0.834 | tam_ok |
| Foundation | MNC0-20 | MNC0-20_2025 | Logit accuracy on reached (C) | 771 | 28 | 21 | 0.602 | tam_ok |
| Foundation | MNC0-20 | MNC0-20_2025 | Logit accuracy trailing-zero (D) | 771 | 30 | 22 | 0.876 | tam_ok |
| Foundation | MQ1-10 | MQ1-10_2025 | Logit accuracy on reached (C) | 783 | 23 | 18 | 0.414 | tam_ok |
| Foundation | MQ1-10 | MQ1-10_2025 | Logit accuracy trailing-zero (D) | 783 | 30 | 21 | 0.824 | tam_ok |
| Foundation | MQ1-20 | MQ1-20_2025 | Logit accuracy on reached (C) | 652 | 24 | 17 | 0.525 | tam_ok |
| Foundation | MQ1-20 | MQ1-20_2025 | Logit accuracy trailing-zero (D) | 652 | 30 | 22 | 0.628 | tam_ok |
| Year 1 | AAMC | AADD_2025-NEW | Logit accuracy on reached (C) | 1489 | 35 | 27 | 0.541 | tam_ok |
| Year 1 | AAMC | AADD_2025-NEW | Logit accuracy trailing-zero (D) | 1489 | 40 | 30 | 0.906 | tam_ok |
| Year 1 | ASDD | ASDD_2025 | Logit accuracy on reached (C) | 490 | 24 | 17 | 0.594 | tam_ok |
| Year 1 | ASDD | ASDD_2025 | Logit accuracy trailing-zero (D) | 490 | 30 | 24 | 0.862 | tam_ok |
| Year 1 | ASMC | ASDD_2025_NEW | Logit accuracy on reached (C) | 784 | 30 | 29 | 0.671 | tam_ok |
| Year 1 | ASMC | ASDD_2025_NEW | Logit accuracy trailing-zero (D) | 784 | 30 | 29 | 0.874 | tam_ok |
| Year 1 | MC0-100 | MC0-100_2025 | Logit accuracy on reached (C) | 1496 | 39 | 36 | 0.636 | tam_ok |
| Year 1 | MC0-100 | MC0-100_2025 | Logit accuracy trailing-zero (D) | 1496 | 60 | 39 | 0.943 | tam_ok |
| Year 1 | MNA0-100 | MNA0-100_2025 | Logit accuracy on reached (C) | 861 | 33 | 29 | 0.437 | tam_ok |
| Year 1 | MNA0-100 | MNA0-100_2025 | Logit accuracy trailing-zero (D) | 861 | 40 | 31 | 0.824 | tam_ok |
| Year 1 | MNC0-100 | MNC0-100_2025 | Logit accuracy on reached (C) | 630 | 30 | 24 | 0.469 | tam_ok |
| Year 1 | MNC0-100 | MNC0-100_2025 | Logit accuracy trailing-zero (D) | 630 | 30 | 24 | 0.907 | tam_ok |

Stage-1 expected credit (superseded)

The Stage-1 score approximated expected credit over the timed item set in two steps before the joint model was fitted. It is retained here for context only.

| Year | Subtest | Item set ID | N | Completion rate | Accuracy on reached items | Timed item-set score | Stage-1 expected credit | Reached–item-set gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Year 1 | MNA0-100 | MNA0-100_2025 | 867 | 0.224 | 0.898 | 0.200 | 0.191 | 0.699 |
| Foundation | MQ1-20 | MQ1-20_2025 | 655 | 0.170 | 0.828 | 0.133 | 0.129 | 0.695 |
| Foundation | MQ1-10 | MQ1-10_2025 | 789 | 0.263 | 0.930 | 0.244 | 0.232 | 0.687 |
| Year 1 | AAMC | AADD_2025-NEW | 1495 | 0.253 | 0.910 | 0.231 | 0.221 | 0.680 |
| Year 1 | ASDD | ASDD_2025 | 703 | 0.123 | 0.768 | 0.098 | 0.083 | 0.670 |
| Year 1 | MNC0-100 | MNC0-100_2025 | 633 | 0.288 | 0.929 | 0.265 | 0.250 | 0.664 |
| Foundation | MNC0-20 | MNC0-20_2025 | 785 | 0.228 | 0.848 | 0.192 | 0.180 | 0.655 |
| Foundation | MC0-20 | MC0-20_2025 | 1443 | 0.319 | 0.916 | 0.295 | 0.284 | 0.621 |
| Foundation | MNA0-20 | MNA0-20_2025 | 648 | 0.278 | 0.859 | 0.239 | 0.228 | 0.621 |
| Year 1 | MC0-100 | MC0-100_2025 | 1503 | 0.321 | 0.909 | 0.295 | 0.285 | 0.614 |
| Year 1 | ASMC | ASDD_2025_NEW | 789 | 0.256 | 0.767 | 0.199 | 0.189 | 0.568 |

Additional descriptive views

Term 3 score contrast by subtest
Mean accuracy on reached items (grey), timed item-set score (blue), and Stage-1 expected credit (amber) by subtest.
Term 3 response-process profile by subtest
Response-process composition by subtest. Trailing-unreached rows are the dominant type of nonresponse.

Global composites

Person-level composites formed by averaging each score’s within-subtest standardised value across the available timed subtests. The subtest-level agreement between the raw, trailing-zero and joint-model scores also holds at the composite level; these composites are the bridge to the planned global hierarchical reach–accuracy model.
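
A minimal sketch of how such a composite could be formed, assuming scores are keyed by student within each subtest. The data, the names and the use of the sample SD are assumptions of this sketch; the report does not state which SD convention was used.

```python
from statistics import mean, stdev
from typing import Dict, List

Scores = Dict[str, float]  # student id -> score on one subtest


def standardise(scores: Scores) -> Scores:
    """Z-score one subtest's scores (sample SD; an assumption of this sketch)."""
    values = list(scores.values())
    centre, spread = mean(values), stdev(values)
    return {sid: (value - centre) / spread for sid, value in scores.items()}


def global_composite(by_subtest: Dict[str, Scores]) -> Scores:
    """Average each student's z-scores over the subtests they actually sat."""
    collected: Dict[str, List[float]] = {}
    for scores in by_subtest.values():
        for sid, z in standardise(scores).items():
            collected.setdefault(sid, []).append(z)
    return {sid: mean(zs) for sid, zs in collected.items()}


# Hypothetical scores; student s3 has no score on the second subtest.
by_subtest = {
    "subtest_x": {"s1": 2.0, "s2": 4.0, "s3": 6.0},
    "subtest_y": {"s1": 10.0, "s2": 20.0},
}
composite = global_composite(by_subtest)
for sid in sorted(composite):
    print(f"{sid}: {composite[sid]:+.3f}")
```

Averaging over available subtests means students contribute composites based on different numbers of subtests, which is one reason a pooled hierarchical model may handle the same data more gracefully. The sketch needs at least two students with non-identical scores per subtest, otherwise the SD is undefined or zero.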

Global composite correlation matrix
Foundation, Term 3: global composite correlations across the five reported scores.
Global composite correlation matrix
Year 1, Term 3: global composite correlations across the five reported scores.

Subtest score eligibility

The wider subtest audit recommends reviewed, family-specific scoring decisions rather than automatic inclusion of every subtest lookup.

| Recommendation | Count |
| --- | --- |
| insufficient data | 35 |
| eligible with monitoring | 12 |
| exclude from global score | 9 |
| exclude pending revision | 9 |
| collateral context only | 7 |

Artifacts

  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_by_subtest_person_internal.csv (internal person-level)
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_global_scores_internal.csv (internal person-level composites)
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_subtest_summary.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_subtest_scorecard.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_external_validation.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_decision_a_modelled_scorecard.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_decision_b_raw_vs_irt_scorecard.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_decision_summary.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_subtest_plot_index.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_a_to_f_global_correlations.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/timed_variant_reach_completion_term3.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/f_joint_term3_comparison_summary.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/f_joint_term3_diagnostics.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/f_joint_term3_reliability_proxy.csv
  • studies/enssa-v2-response-process-program-2026-t2/results/round2/m3/model_iteration_evaluation/subtest_irt_audit/tables/f_joint_term3_decision_readout.csv