Outcome Pack Comparison: Preparing Benchmark Data for Screener Prediction

Pack A (Strict) vs Pack B (Full Coverage)

Author: ENSSA Analytics Team
Published: February 20, 2026

1 Purpose and Overview

An outcome pack is a scored dataset containing latent attainment estimates (η) and risk probabilities for each student, derived from PAT scores and teacher achievement ratings. It serves as the benchmark target for screener-based prediction models.

Two outcome packs are compared here:

Pack  Scope          Description
A     Strict         Model fit restricted to students in schools with type-gated PAT; includes PAT-backed students plus rating-only students in those same schools.
B     Full coverage  Model fit uses all students with any outcome evidence: type-gated PAT plus all teacher ratings, regardless of school PAT availability.

Both packs use the same Bayesian joint measurement model and scoring pipeline; they differ only in which students enter the model fit.

1.1 PAT Score vs Provided Percentile

Before comparing packs, we examine a data quality issue that motivates the anchor-truth policy used in Section 6. Each dot is one PAT record. If percentiles were correctly assigned, dots would form a smooth upward curve. Where they scatter or reverse direction, the percentile mapping is unreliable — meaning we cannot simply use “percentile ≤ 25” to define who is truly below the 25th percentile.

PAT scaled score vs provider-reported percentile, faceted by year level and PAT type. Non-monotonic regions are visible in several panels, particularly PAT Early Years.
Note

For PAT Early Years in Foundation (Oct–Dec administration window), the relationship between scaled score and provider-reported percentile shows non-monotonic regions, indicating internal inconsistency in the percentile mapping. This motivates the hybrid anchor-truth policy described in Section 6.
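
The monotonicity check behind this note can be automated. The sketch below is a minimal example assuming a pandas DataFrame with hypothetical column names scaled_score, provider_percentile, pat_type, and year_level; it flags any PAT type × year level group whose provider percentiles fail to rise with the scaled score.

```python
import pandas as pd

def flag_non_monotonic(pat: pd.DataFrame) -> pd.DataFrame:
    """Flag PAT type x year level groups whose provider percentile
    does not increase monotonically with the scaled score."""
    def group_is_monotonic(g: pd.DataFrame) -> bool:
        # Average the percentile at each distinct scaled score first,
        # so ties at the same score do not trigger false alarms.
        by_score = (g.groupby("scaled_score")["provider_percentile"]
                      .mean()
                      .sort_index())
        return bool(by_score.is_monotonic_increasing)

    return (pat.groupby(["pat_type", "year_level"])
               .apply(group_is_monotonic)
               .rename("percentile_monotonic")
               .reset_index())
```

Groups returning False here correspond to the scattered or reversed regions visible in the figure above.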

2 Model Specification — Pack A vs Pack B

2.1 Type Gate

The type gate restricts which PAT types enter the model. PAT Maths Adaptive is excluded because of norm-frame incompatibility with the joint model’s measurement equation.

Field               Value
Included PAT types  PAT Early Years, PAT Maths (non-adaptive), PAT type missing
Excluded PAT types  PAT Maths Adaptive
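
In code, a type gate of this kind reduces to a one-line filter. The column name pat_type and the literal type labels below are assumptions for the sketch, not the pipeline's actual encodings.

```python
import pandas as pd

INCLUDED_TYPES = {"PAT Early Years", "PAT Maths (non-adaptive)"}

def apply_type_gate(pat: pd.DataFrame) -> pd.DataFrame:
    """Keep records whose PAT type is included or missing;
    drop PAT Maths Adaptive (norm-frame incompatible)."""
    keep = pat["pat_type"].isin(INCLUDED_TYPES) | pat["pat_type"].isna()
    return pat.loc[keep]
```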

2.2 Joint Measurement Model

Both packs use the same Bayesian model — a single latent construct η (end-of-year maths attainment, in PAT scale-score units) measured through two instruments:

PAT measurement equation (students with PAT scores):

\[\text{PAT}_{ij} \sim \text{Normal}\!\left(\eta_i + \delta_{\text{type}[j]},\; \sqrt{\sigma^2_{\text{pat}} + \sigma^2_{\text{time}} \cdot \Delta t_{ij}^2 + \text{SE}^2_{\text{cond}[ij]}}\right)\]

Teacher rating equation (students with achievement ratings):

\[\text{Rating}_i \sim \text{OrderedLogistic}(\alpha + \lambda \, \eta_i, \; \tau)\]

Prior on student attainment:

\[\eta_i \sim \text{Normal}(\mu_{\text{yr}},\; \sigma_\eta)\]

In plain terms:

  • η is the quantity being estimated: each student’s underlying maths attainment at end of year, expressed in PAT scale-score units (~50–150).
  • δ_type captures systematic offsets between PAT products (e.g. PAT Early Years vs PAT Maths non-adaptive); different tests can give slightly different readings of the same ability.
  • σ_time · Δt inflates uncertainty for PAT scores administered further from end of year: a test taken in August is less informative about December attainment than one taken in November.
  • SE_cond is the conditional standard error reported by ACER for each PAT record, i.e. known measurement noise baked into the likelihood.
  • λ (lambda) links the teacher rating scale to the PAT scale: it sets how much the rating propensity moves per scale-score unit of attainment.
  • τ (tau) is the vector of ordered thresholds defining the boundaries between rating categories 1–5.

The model specification is identical. The only difference is which students are included in the fit:

  • Pack A restricts to students in schools that have type-gated PAT, so the model sees both PAT and rating evidence from the same school context.
  • Pack B includes all students with any outcome evidence, adding rating-only students from non-PAT schools.

Pack B’s broader sample may improve precision for the rating pathway but could introduce school-level confounding if rating cultures differ systematically between PAT-backed and non-PAT schools.
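
For readers who want the equations above in executable form, the following is a minimal PyMC sketch of the joint model. The synthetic data, priors, and variable names are illustrative assumptions; the production pipeline's parameterisation is not shown in this report. (lam stands in for λ because lambda is reserved in Python.)

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)

# Synthetic stand-in data: shapes only; real inputs come from the pack build.
n_students, n_years, n_types = 60, 2, 2
year_of = rng.integers(0, n_years, n_students)     # year-level index per student
pat_student = rng.integers(0, n_students, 45)      # student index per PAT record
pat_type = rng.integers(0, n_types, 45)            # PAT product index per record
delta_t = rng.uniform(0.0, 150.0, 45)              # days to end-of-year reference
se_cond = rng.uniform(2.0, 6.0, 45)                # ACER conditional SE per record
pat_score = rng.normal(100.0, 15.0, 45)            # scaled scores
rating = rng.integers(1, 6, n_students)            # teacher ratings 1-5

with pm.Model() as joint_model:
    # Prior on latent attainment, centred on a year-level mean.
    mu_yr = pm.Normal("mu_yr", mu=100.0, sigma=20.0, shape=n_years)
    sigma_eta = pm.HalfNormal("sigma_eta", sigma=15.0)
    eta = pm.Normal("eta", mu=mu_yr[year_of], sigma=sigma_eta, shape=n_students)

    # PAT measurement equation: product offset plus three variance components.
    delta_type = pm.Normal("delta_type", mu=0.0, sigma=5.0, shape=n_types)
    sigma_pat = pm.HalfNormal("sigma_pat", sigma=10.0)
    sigma_time = pm.HalfNormal("sigma_time", sigma=0.1)
    pat_sd = pm.math.sqrt(sigma_pat**2 + (sigma_time * delta_t) ** 2 + se_cond**2)
    pm.Normal("pat_obs", mu=eta[pat_student] + delta_type[pat_type],
              sigma=pat_sd, observed=pat_score)

    # Teacher rating equation: ordered logistic on alpha + lambda * eta.
    alpha = pm.Normal("alpha", mu=0.0, sigma=5.0)
    lam = pm.HalfNormal("lam", sigma=1.0)
    tau = pm.Normal("tau", mu=np.linspace(-2.0, 2.0, 4), sigma=2.0, shape=4,
                    transform=pm.distributions.transforms.ordered,
                    initval=np.linspace(-2.0, 2.0, 4))
    pm.OrderedLogistic("rating_obs", eta=alpha + lam * eta,
                       cutpoints=tau, observed=rating - 1)  # categories 0-4

    idata = pm.sample(1000, tune=1000, chains=2)
```

Pack A vs Pack B then amounts to fitting this same model on two different subsets of students.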

2.3 Eligibility Breakdown (Pack A)

Eligibility reason                 n     %
No_outcome_data                    5561  69.8%
Rating_only_non_pat_school         1813  22.7%
Both                               431   5.4%
Rating_only_in_pat_backed_school   114   1.4%
PAT_only                           52    0.7%

2.4 Composition Summary

Pack  n     Foundation  Year 1  Inconclusive  % inconclusive
A     597   281         316     138           23.1%
B     2410  1166        1244    232           9.6%

Inconclusive is a governance policy, not a model output. A student is flagged inconclusive when:

  • Their posterior standard error exceeds the 95th percentile threshold, or
  • Their PAT timing distance exceeds 150 days from the end-of-year reference.

These students receive η estimates but are excluded from operational risk classification.
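
A minimal sketch of this governance flag, assuming hypothetical columns eta_se (posterior standard error) and pat_days_from_ref (signed days from the end-of-year reference):

```python
import pandas as pd

def flag_inconclusive(scores: pd.DataFrame, max_days: float = 150.0) -> pd.Series:
    """Apply the inconclusive governance flag to a scored pack."""
    se_cut = scores["eta_se"].quantile(0.95)   # 95th-percentile SE threshold
    too_uncertain = scores["eta_se"] > se_cut
    # NaN timing (rating-only students) compares False, so those rows
    # are never flagged on timing alone.
    too_distant = scores["pat_days_from_ref"].abs() > max_days
    return too_uncertain | too_distant
```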

2.5 Source Composition

Source       Pack A  Pack B
Both         431     431
PAT only     52      52
Rating only  114     1927

3 Distributions

3.1 Latent Attainment (η_z)

This shows the spread of estimated attainment within each year level. η_z is a z-score: 0 means the ACER national average; negative values are below average. Faceting by Foundation and Year 1 avoids pooled-shape artefacts and makes Pack A vs Pack B comparisons interpretable within each cohort.

Density of η_z (ACER norm frame) by pack, separated by year level.

3.2 Tail Probability: P(η < P25)

For each student, the model produces a probability of being below the 25th percentile. Most students cluster near 0% (unlikely to be at risk) or near 100% (very likely at risk). Students in between are the “borderline zone” where classification is least certain.

Distribution of P(η below P25) by pack.

3.3 Uncertainty by Source

Rating-only students (no PAT score) have higher uncertainty than students with PAT data. Students with both sources (“Both”) tend to have the lowest uncertainty. This plot shows why having PAT data matters for precision — and why Pack B’s additional rating-only students come with a cost in confidence.

Posterior standard error (η_se) by model-consistent source flag, faceted by pack.

3.4 PAT Timing Distance

PAT tests taken further from the end-of-year reference date carry more uncertainty because students may have grown in the interim. This histogram shows how many days separated each PAT administration from the reference date. A tight cluster near zero is ideal.

Days from PAT administration to end-of-year reference date (PAT-backed students only).

4 Coherence Checks

4.1 Teacher Rating vs ηz

If the model is sensible, students rated higher by their teacher should also have higher estimated attainment. The boxplots should step upward from rating 1 to rating 5. Overlap between adjacent boxes is expected — ratings are coarse.

η_z by teacher rating (1–5), faceted by pack. Higher ratings should correspond to higher η_z.

4.2 Spearman Correlations

Spearman rank correlations quantify the monotonic agreement between teacher ratings and model estimates. Values near −1 or +1 indicate strong agreement; near 0 indicates no relationship. A negative correlation with P(η < P25) is expected — higher-rated students should have lower risk probability.

Pack  n     ρ (rating, η_z)  ρ (rating, P25)
A     545   0.535            -0.452
B     2358  0.673            -0.616
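
These correlations are straightforward to reproduce. A sketch assuming hypothetical column names teacher_rating, eta_z, and p_below_p25 on the scored pack:

```python
import pandas as pd
from scipy.stats import spearmanr

def coherence_correlations(pack: pd.DataFrame) -> dict:
    """Spearman agreement between teacher ratings and model outputs."""
    rows = pack.dropna(subset=["teacher_rating", "eta_z", "p_below_p25"])
    rho_eta, _ = spearmanr(rows["teacher_rating"], rows["eta_z"])
    rho_p25, _ = spearmanr(rows["teacher_rating"], rows["p_below_p25"])
    return {"n": len(rows),
            "rho_rating_eta_z": rho_eta,
            "rho_rating_p25": rho_p25}
```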

4.3 ACER Risk Band Distribution

Students are classified into High, Moderate, or Low risk using ACER national norms. This table shows how many students fall into each band under each pack.

ACER risk band  Pack A  Pack B
High            3       3
Moderate        17      26
Low             577     2381

5 Pack A vs Pack B Sensitivity (Overlap Students)

Pack A and Pack B are separate model fits. To assess sensitivity to the sample composition, we compare estimates for the overlapping student set (students in both packs). High correlations and low MAE indicate that the choice between Pack A and Pack B has little practical effect on individual student estimates. Systematic differences would appear as points deviating from the dashed diagonal.

5.1 Correlation and Agreement

Metric          Value
n overlap       597
Pearson (η̂)    0.9986
Spearman (η̂)   0.9984
MAE (η̂)        0.277
Pearson (P25)   0.9962
Spearman (P25)  0.9971
MAE (P25)       0.0098
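
The agreement metrics above can be computed with a small helper. The sketch below assumes two aligned arrays of estimates for the same overlap students:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def overlap_agreement(a: np.ndarray, b: np.ndarray) -> dict:
    """Pearson, Spearman, and MAE between Pack A and Pack B estimates."""
    return {
        "n": len(a),
        "pearson": pearsonr(a, b)[0],
        "spearman": spearmanr(a, b)[0],
        "mae": float(np.mean(np.abs(a - b))),
    }

# Applied twice: once to the eta-hat estimates and once to P(eta < P25).
```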

5.2 η̂ Scatter (A vs B)

Each dot is one student. Points close to the dashed line received similar η̂ estimates under both packs. Systematic departures from the diagonal would suggest the additional students in Pack B are pulling estimates in a particular direction.

η̂ under Pack A fit vs Pack B fit for overlap students. Dashed line = perfect agreement.

5.3 P(η < P25) Scatter (A vs B)

Same as above but for the risk probability. Points in the upper-left or lower-right corners would indicate students whose risk classification flips between packs.

P(η < P25) under Pack A vs Pack B for overlap students.

5.4 η̂ Difference Distribution

The distribution of Pack A minus Pack B η̂ values. A distribution centred on zero with small spread means the two packs produce practically interchangeable estimates.

Distribution of η̂(Pack A) − η̂(Pack B) for overlap students.

6 PAT-Anchor Benchmark

This section evaluates how well the outcome pack’s risk probabilities align with an external anchor: PAT-derived “below P25” status. This is the primary validation available before screener-level external validation data arrives.

6.1 Why PAT Percentile as “Truth” Is Non-Trivial

PAT Early Years percentiles show internal incoherence — the provided percentile does not always increase monotonically with the scaled score (see Section 1.1). This means we cannot naively use the provider-reported percentile to define “truly below P25” for all students.

The hybrid anchor-truth policy resolves this:

PAT type × year level                 Anchor truth method  Rationale
PAT Maths (non-adaptive), Year 1      Percentile ≤ 25      Monotonic score-percentile relationship
PAT Maths (non-adaptive), Foundation  Scale-cut fallback   Sparse data; percentile unreliable
PAT Early Years, Foundation           Scale-cut fallback   Non-monotonic percentile mapping
PAT Early Years, Year 1               Scale-cut fallback   Cross-year norm-frame issues

Row-level safeguard: Even where percentile is used as anchor truth, PAT Maths records with empirical residual > 15 fall back to scale-cut to guard against individual anomalous mappings.
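
Put together, the policy can be expressed as a per-record rule. In this sketch the column names, type labels, and the scale_cut mapping (year-level-specific η cutpoints) are hypothetical:

```python
import pandas as pd

def anchor_truth_below_p25(row: pd.Series, scale_cut: dict,
                           resid_limit: float = 15.0) -> bool:
    """Hybrid anchor-truth rule: is this record 'truly below P25'?"""
    use_percentile = (
        row["pat_type"] == "PAT Maths (non-adaptive)"
        and row["year_level"] == "year1"
        and row["empirical_residual"] <= resid_limit   # row-level safeguard
    )
    if use_percentile:
        return row["provider_percentile"] <= 25
    # Scale-cut fallback: compare the eta estimate to a year-level cutpoint.
    return row["eta_hat"] < scale_cut[row["year_level"]]
```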

6.2 Anchor Truth Provenance

The table below shows how many students had their “truth” status determined by each method. Because PAT Early Years dominates this cohort and its percentiles are unreliable, most students fall back to the scale-cut method.

Truth source breakdown (all PAT-anchored students)

Anchor truth source                 n    %
percentile                          142  38.2%
scale_cut_fallback_frame            229  61.6%
scale_cut_fallback_impossible_pair  1    0.3%

Truth source breakdown (calibration subset used for threshold fitting)

Anchor truth source       n    %
percentile                28   16.4%
scale_cut_fallback_frame  143  83.6%
Note

Most anchor truth comes from the scale-cut fallback because PAT Early Years — the dominant PAT type in this cohort — has unreliable percentile mappings. The scale-cut method uses year-level-specific η cutpoints instead of provider percentiles.

6.3 Confusion Matrix vs PAT Anchor Truth

The table below shows classification performance at each candidate operating threshold, stratified by year level. The model’s P(η < P25) is thresholded and compared against PAT-anchored truth. Read the table as follows: at threshold 0.6, if the model says a student has a ≥60% probability of being below P25, how often does PAT anchor truth confirm that? Precision = of those flagged, how many truly are below P25. Recall = of those truly below P25, how many the model caught.

Confusion matrix: P(η < P25) thresholded against PAT anchor truth, by year level

Threshold  Year level  n    TP  FP  FN  TN  Precision  Recall  Specificity
0.5        foundation  108  13  3   11  81  0.812      0.542   0.964
0.5        year1       63   8   0   7   48  1.000      0.533   1.000
0.6        foundation  108  9   0   15  84  1.000      0.375   1.000
0.6        year1       63   8   0   7   48  1.000      0.533   1.000
0.7        foundation  108  5   0   19  84  1.000      0.208   1.000
0.7        year1       63   8   0   7   48  1.000      0.533   1.000
0.8        foundation  108  3   0   21  84  1.000      0.125   1.000
0.8        year1       63   3   0   12  48  1.000      0.200   1.000
Operational threshold summary across all students

Threshold  n operational  n high-flag  % high-flag  n review  % review  n PAT-anchored  Precision (PAT)  Recall (PAT)
0.5        1657           288          17.4%        0         0.0%      171             0.875            0.538
0.6        1657           85           5.1%         203       12.3%     171             1.000            0.436
0.7        1657           68           4.1%         220       13.3%     171             1.000            0.333
0.8        1657           12           0.7%         276       16.7%     171             1.000            0.154
Tip

The selected operating threshold of 0.6 achieves precision = 1.00 (no false positives among PAT-anchored students) with recall = 0.44 (catches 44% of PAT-confirmed below-P25 students). This conservative operating point reflects the governance priority: avoid mislabelling students as at-risk when external evidence does not support it.
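
The per-threshold metrics in these tables follow directly from confusion-matrix counts. A minimal sketch, assuming an array of model probabilities and a boolean array of PAT-anchored truth:

```python
import numpy as np

def threshold_metrics(p_below: np.ndarray, truth: np.ndarray, thr: float) -> dict:
    """Confusion-matrix summary at one candidate operating threshold."""
    flag = p_below >= thr
    tp = int(np.sum(flag & truth))
    fp = int(np.sum(flag & ~truth))
    fn = int(np.sum(~flag & truth))
    tn = int(np.sum(~flag & ~truth))
    return {
        "threshold": thr, "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```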

6.4 Calibration View

Does the model’s stated probability track observed reality? Below, students are binned by their predicted P(η < P25) and we plot the observed proportion with anchor truth confirming below-P25 status. If the model is well-calibrated, points should lie on the dashed diagonal — e.g., among students the model gives a 40% risk probability, roughly 40% should actually be below P25 according to anchor truth. Points above the diagonal mean the model underestimates risk; below means it overestimates.

Calibration plot: predicted P(η < P25) bins vs observed anchor-truth below-P25 rate. Perfect calibration follows the dashed diagonal.
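
A calibration table of this kind can be built by binning predictions and averaging outcomes within each bin. A sketch under the same assumptions as the threshold-metrics example above:

```python
import numpy as np
import pandas as pd

def calibration_table(p_below: np.ndarray, truth: np.ndarray,
                      n_bins: int = 10) -> pd.DataFrame:
    """Mean predicted P(eta < P25) vs observed anchor-truth rate per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # digitize maps each probability to a bin index; clip keeps p = 1.0 in
    # the top bin rather than overflowing past it.
    idx = np.clip(np.digitize(p_below, edges) - 1, 0, n_bins - 1)
    df = pd.DataFrame({"bin": idx, "pred": p_below,
                       "obs": truth.astype(float)})
    return (df.groupby("bin")
              .agg(n=("obs", "size"),
                   mean_pred=("pred", "mean"),
                   obs_rate=("obs", "mean"))
              .reset_index())
```

Plotting mean_pred against obs_rate reproduces the calibration view: well-calibrated bins sit on the dashed diagonal.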

7 Appendix: Pack A Posterior Diagnostics

These plots visualise sampled posterior densities for student-level η from the Pack A Bayesian fit. They serve as qualitative diagnostics for posterior shape, separation, and uncertainty — not as a primary reporting view.

7.1 By Source Flag

Each translucent curve is one student’s posterior density — the model’s belief about where that student’s true attainment lies. Wider curves mean more uncertainty. Compare across panels: rating-only posteriors (purple) are typically wider than PAT-backed posteriors (orange/green).

7.2 By ACER Risk Band

Same idea, but students are grouped by ACER risk band. High-risk posteriors (red) should cluster at lower η values; Low-risk posteriors (green) should cluster higher. Overlap between bands in the same year level indicates the borderline zone.