IRT Model - Foundation 2025 Term 1

Data Quality, Methodology, and Instrument Insights

Published

February 10, 2026

Executive Summary

This report reviews Foundation Term 1 numeracy screening data, first with model-free exploratory analyses, then with the joint IRT + RT model for accuracy and timed fluency.

Data highlights:

  • Students: 2,558 Foundation students remain in the cleaned Term 1 responses after filters (practice and ABR excluded); 2,517 are included in the fitted joint model (which requires modelled accuracy responses and/or timed RTs).
  • Coverage (modelled students): strong per-student coverage (median ~60 accuracy items, ~52 timed RTs).
  • Item exposure: moderate inequality (Gini ~0.3–0.4); 36 items with <20 responses flagged for parameter instability.
  • RT patterns: incorrect responses are faster than correct on timed-math items, consistent with rapid guessing on items beyond ability.
  • Fast-wrong signal: substantial fast-wrong excess on MC0-20 and MNC0-20 in particular — exceeds what a lognormal RT model predicts, indicating a rapid-guess subpopulation.
  • 60s cap rates: STPM shows the highest rate of hitting the 60-second ceiling, reflecting pause/disengage behaviour.

Model findings:

  • Model convergence: clean MCMC (0 divergences; max Rhat ~1.01; E-BFMI min ~0.65).
  • Construct validity: θ aligns strongly with observed accuracy; τ_base aligns with STPM RT; τ_math_total aligns with timed-math RT; τ_reg_adj is ~independent of STPM speed.
  • Low-information rates: 2.4% of students flagged, driven mainly by low counts rather than unstable posteriors.
  • Item QC flags: 36 low-exposure items (<20 responses) and 3/30 NL items with disordered PCM steps (one per NL subgroup).
  • Accuracy model fit (PPC): strong for binary items (0/184 items outside 90% PI). Total score density matches posterior replications; the mean/SD table shows a small downward shift in observed vs predicted, motivating NL category PPCs and a scoring-parity audit.
  • RT model fit (PPC): systematic misfit at the extremes — (a) log-RT p99 tails underpredicted; (b) STPM 60s cap rate exceeds the model’s 90% PI; (c) fast-wrong rates for MC0-20 and MNC0-20 are much higher than predicted. This v3 run uses all timed RTs; next-iteration speed modelling should be probe-aware (e.g., correct-only RT for probes with strong rapid-guess evidence).

1 Data & Cohorts

1.1 Student Counts

Student counts by cohort
exam_group_cohort n_students
F-A 1320
F-B 1238

Total Foundation students: 2558

Note: the fitted joint model includes a subset of these students (those with modelled accuracy responses and/or timed RT). See Model Diagnostics → Input coverage for the model-included counts.

1.2 Subtest overview

Below are short descriptions of each Foundation subtest (paired A/B forms noted where applicable):

Subtest Description
BNL0-20 Bounded number line; place a target number on a 0-20 line with endpoints.
UNLC0-20 / UNLNC0-20 Unbounded number line; place a target using 0 and a unit marker (chairs vs no-chairs variants).
MC0-20 Magnitude comparison; choose the larger of two numbers.
MNC0-20 / MNA0-20 Missing number; MNC = choose the missing number from options, MNA = judge if a sequence is ascending.
MQ1-10 / MQ1-20 Match quantity; match quantity representations to numerals (timed).
DMT5 / DMT10 Decomposition; part-whole hidden quantity, choose how many are hidden (untimed).
STPM Speeded pattern matching; tap the matching picture (baseline speed).
Subtest inventory by cohort with item counts, names, and modality flags
Cohort test_subgroup subtest_name n_items is_timed typical_distractor_count
A BNL0-20 Bounded Number Line 0-20 10 FALSE NA
A DMT5 Decomposition to 5 10 FALSE 2
A MC0-20 Magnitude Comparison 0-20 38 TRUE 4
A MNC0-20 Missing Number Choice 0-20 30 TRUE 4
A MQ1-10 Match Quantity 1-10 30 TRUE 4
A STPM Speeded Pattern Matching 20 TRUE 2
A UNLNC0-20 Unbounded Number Line 0-20 (no chairs) 10 FALSE NA
B BNL0-20 Bounded Number Line 0-20 10 FALSE NA
B DMT10 Decomposition to 10 10 FALSE 2
B MC0-20 Magnitude Comparison 0-20 44 TRUE 4
B MNA0-20 Missing Number Ascending 0-20 30 TRUE 4
B MQ1-20 Match Quantity 1-20 30 TRUE 4
B STPM Speeded Pattern Matching 20 TRUE 2
B UNLC0-20 Unbounded Number Line 0-20 (chairs) 10 FALSE NA

1.2.1 Data Filters & Inclusion Criteria

This analysis uses Foundation year-level Term 1 data only, excluding practice items (is_practice == TRUE) and abridged-version responses (ABR; is_abr == TRUE).

Important distinction: The dataset contains placeholder rows with is_attempted == FALSE for many items. This does NOT necessarily mean the item was presented to the student. We use gap/tail classification as a proxy for not-reached status:

  • Gap missingness: is_attempted == FALSE where question_no ≤ max(question_no) among attempted items within a subtest
  • Tail missingness: is_attempted == FALSE where question_no > max(question_no) among attempted items (likely not reached)

Exploratory data analysis (sections 2–9) uses both attempted and not-attempted rows for missingness analysis. Performance metrics (accuracy, fluency) use attempted rows only.
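
The gap/tail proxy can be expressed compactly. The sketch below is a minimal illustration assuming a responses data frame with student_id, test_subgroup, question_no, and is_attempted columns; the column names are illustrative rather than the pipeline's actual schema.

```r
library(dplyr)

# Classify not-attempted placeholder rows as "gap" (skipped within the
# attempted range) or "tail" (after the last attempted item, likely not reached).
classify_missingness <- function(responses) {
  responses %>%
    group_by(student_id, test_subgroup) %>%
    mutate(
      max_attempted_q = suppressWarnings(max(question_no[is_attempted], na.rm = TRUE)),
      missing_type = case_when(
        is_attempted                                                 ~ "attempted",
        is.finite(max_attempted_q) & question_no <= max_attempted_q ~ "gap",
        TRUE                                                         ~ "tail"
      )
    ) %>%
    ungroup()
}
```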

1.3 Subtest Order

Typical subtest administration order by cohort (rank based on median attempted_at)
test_subgroup F-A F-B
BNL0-20 1 1
STPM 2 2
MQ1-10 3 NA
MC0-20 4 4
DMT5 5 NA
MNC0-20 6 NA
UNLNC0-20 7 NA
MQ1-20 NA 3
DMT10 NA 5
MNA0-20 NA 6
UNLC0-20 NA 7

2 Student Coverage and Missingness

2.1 Item Exposure & Missingness

2.1.1 Exposure Inequality (Lorenz Curve & Gini Coefficient)

The Lorenz curve shows inequality in item exposure. The dashed diagonal represents perfect equality (all items attempted by the same number of students). The further the curve bows below this line, the greater the inequality. The Gini coefficient quantifies this (0 = perfect equality, 1 = maximum inequality). Higher values indicate some items receive disproportionately more attempts than others.
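
A minimal sketch of the Gini computation from per-item attempt counts, using the standard discrete formula on ordered values (the vector name n_attempts is illustrative):

```r
# Gini coefficient of item exposure: 0 = all items attempted equally,
# values near 1 = attempts concentrated on a few items.
gini_exposure <- function(n_attempts) {
  x <- sort(n_attempts)
  n <- length(x)
  # Standard discrete Gini formula on the ordered values
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Example with made-up counts: a few rarely-seen items raise the Gini
gini_exposure(c(1200, 1150, 900, 40, 5))
```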

3 Response Time Analysis

RT Preprocessing for EDA vs Model

  • EDA summaries: Response times are capped at 0.5–60 seconds for per-response summaries in this exploratory data analysis section to provide robust descriptive statistics.
  • Model preprocessing: Timed items use a 0.5–60 second cap, while untimed items use a 0.5–180 second cap (sensitivity runs only; baseline model excludes untimed RTs).
  • Rationale: Prevents very long untimed RTs from dominating model estimation while allowing exploratory analysis of untimed response patterns.

3.1 RT Distributions

3.1.1 Total Response Time (per student)

3.1.2 Total Time by Subtest (per student)

3.2 Item-Level RT Diagnostics

This plot shows the relationship between what percentage of students in each cohort attempted each item (x-axis) and the median response time (y-axis). Items with very low exposure (near 0%) may have unreliable RT estimates. Items with unusually high or low RT compared to peers in the same subtest warrant investigation for clarity or technical issues.

3.2.1 RT Deltas: Incorrect vs Correct (Timed vs Untimed)

Pooled across 2025 terms (T1/T3/T4).

3.2.2 Correct vs Incorrect RT (Timed Math) by Subtest

The relationship between response time (RT) and correctness is probe-dependent.

3.2.3 Diagnosing fast-wrong responses

Rapid responses (defined as ≤ 1 second) paired with incorrect answers can indicate rapid guessing or UI/technical issues. These can inflate estimated speed by adding very short response times that are unlikely to reflect genuine fluent responding.

We computed subtest-level fast–wrong and fast–correct rates for timed, non-speed-test maths subtests (within Term 1 × cohort × subtest) and pooled counts by subtest (response-weighted), reporting fast–wrong excess (fast–wrong minus fast–correct) as the primary diagnostic. Results indicate subtest-dependent response behaviour, consistent with differences in item format or interaction demands.
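
A minimal sketch of this diagnostic, assuming per-response columns test_subgroup, rt_seconds, and is_correct (names illustrative) and the ≤ 1 second fast threshold defined above:

```r
library(dplyr)

# Fast-wrong excess per subtest: fast-wrong rate minus fast-correct rate.
fast_wrong_excess <- function(responses, fast_threshold = 1) {
  responses %>%
    filter(!is.na(rt_seconds)) %>%
    group_by(test_subgroup) %>%
    summarise(
      n_responses       = n(),
      fast_wrong_rate   = mean(rt_seconds <= fast_threshold & !is_correct),
      fast_correct_rate = mean(rt_seconds <= fast_threshold & is_correct),
      .groups = "drop"
    ) %>%
    mutate(fast_wrong_excess = fast_wrong_rate - fast_correct_rate) %>%
    arrange(desc(fast_wrong_excess))
}
```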

In this v3 run, the RT likelihood uses all timed RTs; these diagnostics are intended to inform a probe-aware RT inclusion policy for the next iteration (e.g., model correct-only RT for probes with high fast-wrong rates and large fast-wrong excess).

4 Student Performance (Model-Free)

5 Subtest Results

Subtest-level summary statistics by cohort
test_subgroup exam_group_cohort n_students median_items median_time_min median_accuracy median_fluency
BNL0-20 F-A 1290 10 1.98 0.86 3.81
BNL0-20 F-B 1196 10 2.03 0.86 3.73
DMT10 F-B 1197 10 2.27 0.30 1.26
DMT5 F-A 1288 10 1.74 0.61 2.82
MC0-20 F-A 1289 16 0.83 0.93 17.50
MC0-20 F-B 1197 16 0.83 0.92 16.80
MNA0-20 F-B 1172 6 0.88 0.83 5.66
MNC0-20 F-A 1245 5 0.88 0.83 4.29
MQ1-10 F-A 1288 14 1.80 0.96 7.07
MQ1-20 F-B 1201 7 1.78 0.78 3.13
STPM F-A 1295 20 1.53 1.00 12.39
STPM F-B 1209 20 1.52 1.00 12.39
UNLC0-20 F-B 1178 10 1.33 0.81 5.50
UNLNC0-20 F-A 1280 10 1.20 0.81 6.14

5.0.1 Items Attempted per Subtest

5.0.2 Total Time per Subtest

5.0.3 Mean RT per Item per Subtest

6 IRT Model Overview

The joint model estimates accuracy and timed responding as separate but correlated latent traits, and produces a derived residual for analytic work.

6.1 What the model estimates

Construct Description Interpretation
θ (theta) Accuracy on numeracy items Higher = more accurate
τ_base Baseline timed responding speed anchored by STPM Higher = faster (lower RT)
τ_math_total Absolute timed-math speed (headline fluency) Higher = faster (lower RT)
τ_reg_adj Regression-adjusted timed-math speed (residual after accounting for τ_base) Higher = faster than expected given baseline speed
Important: Interpreting sign and size

RT is modelled on the log scale as log(RT) = lambda_item - tau_person + noise.

  • Higher tau = faster.
  • A 0.3 increase in tau corresponds to roughly exp(-0.3) ≈ 0.74× RT (about 25% faster), holding item time-intensity constant.
Note: Notation update

To align with common conventions, speed will be denoted with zeta rather than tau in future updates. Tau is typically reserved for partial credit model step parameters.

6.1.1 Response coding

Binary items (0/1)

Binary items (MC, MNA, MNC, MQ, DMT) use a Rasch (1PL) model with difficulty parameter b (discrimination fixed to 1).

Number-line items (3-category PCM)

NL items record continuous accuracy (0–1) and are discretised into 3 ordered categories:

  • Category 0: raw_score < 0.80
  • Category 1: 0.80 ≤ raw_score < 0.95
  • Category 2: raw_score ≥ 0.95

These are modelled using a partial credit model (PCM) with difficulty b and two step thresholds (step₁, step₂).
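
A minimal sketch of the discretisation rule, assuming raw_score holds the continuous 0–1 number-line accuracy:

```r
# Map continuous NL accuracy (0-1) to ordered PCM categories 0/1/2.
nl_category <- function(raw_score) {
  cut(raw_score,
      breaks = c(-Inf, 0.80, 0.95, Inf),
      labels = c(0L, 1L, 2L),
      right  = FALSE)   # [0.80, 0.95) -> category 1, etc.
}

# raw_score < 0.80 -> 0; 0.80 <= raw_score < 0.95 -> 1; raw_score >= 0.95 -> 2
as.integer(as.character(nl_category(c(0.60, 0.82, 0.97))))  # 0 1 2
```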

6.1.2 RT preprocessing

Timed RTs are preprocessed with:

  • Floor: 0.5 seconds
  • Cap: 60 seconds for timed items
  • Transformation: log(rt_adj)
  • Untimed items: Excluded from the RT likelihood entirely. Sensitivity runs used a 0.5–180 second cap; the baseline model uses no untimed RTs.

Only STPM (baseline) and timed math items contribute to the RT likelihood.
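
A minimal sketch of the floor/cap/log preprocessing for timed RTs, with flags matching the floored/capped counts reported in the diagnostics section (function and column names are illustrative):

```r
# Floor at 0.5 s, cap at 60 s, then log-transform; also return QC flags.
preprocess_timed_rt <- function(rt_seconds, floor_s = 0.5, cap_s = 60) {
  rt_adj <- pmin(pmax(rt_seconds, floor_s), cap_s)
  data.frame(
    rt_adj     = rt_adj,
    log_rt     = log(rt_adj),
    is_floored = rt_seconds < floor_s,
    is_capped  = rt_seconds > cap_s
  )
}

preprocess_timed_rt(c(0.3, 2.4, 75))
```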

6.2 Model structure

6.2.1 Accuracy component (Rasch + PCM)

Binary items: \[P(Y_{ij} = 1 | \theta_i, b_j) = \text{logit}^{-1}(\theta_i - b_j)\]

Parameters:

  • \(Y_{ij}\): response for student \(i\) on item \(j\) (1 = correct, 0 = incorrect)
  • \(\theta_i\): accuracy trait for student \(i\)
  • \(b_j\): difficulty for item \(j\)

NL items (PCM): \[P(Y_{ij} = k | \theta_i, b_j, \text{step}_{j,\cdot}) \propto \exp\left(\sum_{c=1}^k (\theta_i - b_j - \text{step}_{j,c})\right)\]

Parameters:

  • \(Y_{ij}\): ordered category for student \(i\) on item \(j\) (0, 1, 2)
  • \(k\): category index
  • \(b_j\): difficulty for item \(j\)
  • \(\text{step}_{j,c}\): step threshold \(c\) for item \(j\)
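
A minimal sketch of the implied response probabilities for both accuracy sub-models, written in plain R rather than the Stan implementation:

```r
# Rasch (1PL) probability of a correct binary response.
p_rasch <- function(theta, b) plogis(theta - b)

# PCM category probabilities for a 3-category NL item with difficulty b
# and step thresholds step (length 2); categories are 0, 1, 2.
p_pcm <- function(theta, b, step) {
  # Cumulative sums of (theta - b - step_c); category 0 has an empty sum (= 0).
  eta <- c(0, cumsum(theta - b - step))
  exp(eta) / sum(exp(eta))
}

p_rasch(theta = 0.5, b = -0.2)                     # single binary item
p_pcm(theta = 0.5, b = 0.1, step = c(-0.4, 0.4))   # probabilities of categories 0/1/2
```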

6.2.2 Response-time component

For each RT observation:

\[\log(RT_{ij}) \sim \text{Normal}(\lambda_j - \tau_{eff,i}, \sigma_{rt,g})\]

Parameters:

  • \(RT_{ij}\): response time for student \(i\) on item \(j\) (seconds)
  • \(\lambda_j\): item time-intensity parameter
  • \(\tau_{eff,i}\): effective speed for student \(i\), determined by item \(j\)'s RT subgroup (see below)
  • \(\sigma_{rt,g}\): RT noise SD for subgroup \(g\)
  • \(g\): RT subgroup (speed anchor vs timed math)

where:

  • \(\tau_{eff,i} = \tau_{base,i}\) for STPM anchor items
  • \(\tau_{eff,i} = \tau_{math\_total,i}\) for timed math items
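
A minimal sketch of how a single log-RT is generated under this component, with illustrative parameter values; the effective speed is τ_base for STPM anchor items and τ_math_total for timed-math items:

```r
# Simulate one log-RT given item time-intensity lambda, the student's
# effective speed tau_eff, and the subgroup residual SD sigma_rt.
simulate_log_rt <- function(lambda, tau_eff, sigma_rt) {
  rnorm(1, mean = lambda - tau_eff, sd = sigma_rt)
}

set.seed(1)
# STPM anchor item: effective speed is tau_base (illustrative values)
exp(simulate_log_rt(lambda = 1.1, tau_eff = 0.3, sigma_rt = 0.35))   # RT in seconds
# Timed-math item: effective speed is tau_math_total
exp(simulate_log_rt(lambda = 1.8, tau_eff = -0.2, sigma_rt = 0.45))
```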

6.2.3 Correlation structure

Latent traits are jointly modelled with a multivariate normal correlation structure on a standardised latent vector. Cohort linking for θ uses group-specific mean/SD (Cohort A fixed to 0/1).

6.2.4 Priors (summary)

Parameter Prior Role
z_person Normal(0, 1) Non-centred latent factors (θ_z, τ_math, τ_base)
L_corr LKJ(2) Cholesky factor of 3×3 latent correlation matrix
sigma_tau Normal⁺(0, 1) Speed component SDs [2]: math, baseline
theta_mu_free Normal(0, 1) Cohort mean accuracy (Cohort A fixed to 0)
theta_sd_free Normal⁺(1, 0.5) Cohort SD accuracy (Cohort A fixed to 1)
b Normal(0, 1.5) Item difficulty
step_raw Normal(0, 1.5) PCM step parameters (mean-centred within item)
lambda Normal(0, 1.0) Item time-intensity
sigma_rt Normal⁺(0, 0.5) RT residual SD per subgroup

Normal⁺ denotes a normal prior truncated below at zero (the parameter is declared with <lower=0> in Stan). The model uses a non-centred parameterisation: raw latent factors z_person are standard normal, then scaled by diag(sigma_trait) * L_corr to obtain correlated (θ_z, τ_math, τ_base). Cohort A is anchored at mean = 0, SD = 1 for θ; Cohort B's mean and SD are estimated freely. Step parameters are mean-centred within each item; step ordering is checked post-fit but not enforced during sampling.
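
A minimal sketch of the non-centred construction in plain R (the actual transform lives in the Stan model; the scales and correlation matrix below are illustrative):

```r
set.seed(42)
n_students <- 5

# Raw latent factors: one standard-normal 3-vector per student
# (theta_z, tau_math, tau_base).
z_person <- matrix(rnorm(n_students * 3), nrow = 3)

# Illustrative scales (theta_z scale fixed at 1) and a correlation matrix.
sigma_trait <- c(1.0, 0.6, 0.5)
R <- matrix(c(1.0, 0.2, 0.3,
              0.2, 1.0, 0.7,
              0.3, 0.7, 1.0), nrow = 3)
L_corr <- t(chol(R))   # lower-triangular Cholesky factor, as in the LKJ prior

# Correlated latent traits: diag(sigma_trait) %*% L_corr %*% z
traits <- t(diag(sigma_trait) %*% L_corr %*% z_person)
colnames(traits) <- c("theta_z", "tau_math", "tau_base")
traits
```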

7 Model Results

This section reports latent scores for Foundation students. For operational use, the recommended pair is θ (accuracy) and τ_math_total (timed-math speed). Use τ_base as contextual baseline speed and treat τ_reg_adj as an analytic residual rather than the headline speed score.

7.1 Latent Score Distributions

7.2 Uncertainty Distributions

7.3 Scale precision and reliability

Posterior SDs provide a conditional SEM-style view of precision along the θ scale. The plot below bins students by θ and summarises uncertainty within each bin.

Marginal reliability from posterior SDs (θ)
Cohort var_theta mean_se2 marginal_reliability
Cohort A 0.554 0.076 0.880
Cohort B 0.377 0.065 0.853
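
One common formulation consistent with the table above computes marginal reliability as the variance of the θ point estimates divided by that variance plus the mean squared posterior SD. A minimal sketch (column names illustrative):

```r
# Marginal reliability from posterior means and SDs of theta:
# rel = var(posterior means) / (var(posterior means) + mean(posterior SD^2)).
marginal_reliability <- function(theta_mean, theta_sd) {
  var_theta <- var(theta_mean)
  mean_se2  <- mean(theta_sd^2)
  var_theta / (var_theta + mean_se2)
}

# Reproduces the table values, e.g. Cohort A: 0.554 / (0.554 + 0.076) ~= 0.88
0.554 / (0.554 + 0.076)
```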

7.4 Construct checks (latent vs observed)

These checks verify that latent scores align with direct, model-free summaries from the raw data.

The expected patterns appear: θ tracks accuracy, τ_base tracks STPM speed, τ_math_total tracks timed-math speed, and τ_reg_adj is largely independent of baseline speed.

7.5 Correlation plots (latent scores)

The three panels show the expected construct pattern. θ vs τ_math_total (pooled r = 0.22) shows a positive correlation — more accurate students tend to respond faster on timed math items. τ_base vs τ_math_total (pooled r = 0.77) shows a strong positive correlation, confirming a general speed factor: students fast on baseline pattern matching are also fast on timed math. θ vs τ_base (pooled r = 0.36) shows a weaker relationship, indicating that accuracy is largely separable from baseline motor speed — supporting discriminant validity of the accuracy construct.

7.6 Low-Information Flagging Rates

Students are flagged as low-information when they have too few responses or unusually wide posterior uncertainty:

  • Low accuracy items: n_acc_total < 20
  • Low timed-math RT count: n_timed_math_rt < 10
  • Low STPM RT count: n_speed_rt < 10
  • Low θ precision: theta_sd > 0.7
  • Low τ_reg_adj precision: tau_reg_adj_sd > 0.7
  • Low info (any): flagged on any of the above

High rates indicate thin data, not necessarily poor model fit. Thresholds are configurable in the model pipeline; defaults are shown here.
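
A minimal sketch of the flagging logic with the default thresholds listed above, assuming a per-student summary data frame with the named count and SD columns:

```r
library(dplyr)

flag_low_information <- function(students,
                                 min_acc_items = 20, min_math_rt = 10,
                                 min_speed_rt = 10, max_sd = 0.7) {
  students %>%
    mutate(
      low_acc_items     = n_acc_total < min_acc_items,
      low_timed_math_rt = n_timed_math_rt < min_math_rt,
      low_speed_rt      = n_speed_rt < min_speed_rt,
      low_theta_prec    = theta_sd > max_sd,
      low_tau_reg_prec  = tau_reg_adj_sd > max_sd,
      low_info_any      = low_acc_items | low_timed_math_rt | low_speed_rt |
                          low_theta_prec | low_tau_reg_prec
    )
}
```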

Low-information flagging rates by cohort
exam_group_cohort n_students n_low_acc_items n_low_timed_math_rt n_low_speed_rt n_low_theta_precision n_low_tau_reg_precision n_low_info_any pct_low_acc_items pct_low_timed_math_rt pct_low_speed_rt pct_low_theta_precision pct_low_tau_reg_precision pct_low_info_any
F-A 1302 10 18 18 2 0 29 0.8 1.4 1.4 0.2 0 2.2
F-B 1215 14 30 10 0 0 32 1.2 2.5 0.8 0.0 0 2.6

7.7 Prior Predictive Checks

The prior predictive check verifies that the model’s priors are weakly informative — they should generate plausible data without concentrating on pathological regions of the parameter space. We simulate from the prior distributions using pure R (no Stan refit), drawing synthetic students and items to check the implied ranges.

The prior predictive distributions cover sensible ranges without pathological concentration. Item response probabilities span the full 0–1 range, student mean accuracies centre near 0.5 with reasonable spread, and prior-implied RTs cover the plausible range of observed response times. The priors are weakly informative — they express mild expectations about parameter scale without constraining the posterior to a narrow region.
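
A minimal sketch of a prior predictive draw for the binary (Rasch) accuracy component, using the priors summarised in section 6.2.4 with synthetic students and items (no Stan refit); the student and item counts are illustrative:

```r
set.seed(7)
n_students <- 500; n_items <- 40

# One prior draw of student accuracy and item difficulty (Cohort-A-style anchoring).
theta <- rnorm(n_students, 0, 1)     # theta anchored at mean 0, sd 1
b     <- rnorm(n_items, 0, 1.5)      # item difficulty prior

# Prior-implied response probabilities and simulated responses
p <- plogis(outer(theta, b, "-"))                   # n_students x n_items
y <- matrix(rbinom(length(p), 1, p), nrow = n_students)

# Checks: probabilities span 0-1, student mean accuracy centres near 0.5
range(p); mean(rowMeans(y))
```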

7.8 Posterior Predictive Checks

For each of 200 posterior draws (subsampled from the full MCMC output), we simulate replicated data from the fitted model and compare summaries of replicated data to the observed data. This is the standard Bayesian posterior predictive check (Gelman et al., 2013).
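
A minimal sketch of one such comparison, here for the mean total score, assuming y_rep is a draws × students matrix of replicated totals and total_obs is the observed vector (names illustrative):

```r
# Posterior predictive check for the mean total score:
# compare the observed mean to the 90% interval of replicated means.
ppc_mean_total <- function(total_obs, y_rep) {
  rep_means <- apply(y_rep, 1, mean)   # one mean per posterior draw
  c(observed  = mean(total_obs),
    post_mean = mean(rep_means),
    lower90   = unname(quantile(rep_means, 0.05)),
    upper90   = unname(quantile(rep_means, 0.95)))
}
```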

Note: How to read PPC plots
  • Grey lines/bars show what the fitted model predicts across replicated datasets (90% posterior predictive interval).
  • Dots/blue curves show the observed data.
  • If the observed statistic falls outside the predictive interval, the model does not reproduce that feature of the data (or the PPC computation needs auditing). A few misses are expected when many checks are run; systematic misses are the main concern.

7.8.1 Total score distribution

7.8.2 Item calibration

7.8.3 Summary statistics

Posterior predictive check: observed vs replicated summary statistics (90% PI from 200 draws)
Statistic Observed Posterior mean 90% PI lower 90% PI upper
Mean total score 49.03 49.00 48.82 49.19
SD total score 17.55 17.62 17.46 17.79

The posterior predictive checks compare observed summaries to summaries computed on replicated datasets drawn from the fitted model’s posterior. The total score density overlay is a global shape check (replicated distributions in grey vs observed in blue). The item calibration plot is an in-sample internal consistency check for binary items: it asks whether the model recovers each item’s observed proportion-correct within posterior uncertainty.

The mean/SD table is a coarse diagnostic. If the observed mean/SD fall outside the predictive interval, first audit that the observed and replicated scores use identical inclusion rules and scoring (NL discretisation/indexing is a common source of PPC mismatch); persistent shifts then suggest mild global misfit rather than item-specific problems.

7.8.4 RT component checks

These checks compare observed log-RT summaries against posterior predictive replicates for the RT likelihood (STPM + timed math). Log-RT summaries are computed on the same floored/capped scale used for model fitting.

Subtest log-RT quantiles

Interpretation: p50 reflects the typical (median) response time within each subtest; p90 and p99 focus on the slow tail (including pauses). Dots outside the grey interval indicate the fitted RT model does not reproduce that part of the distribution. Interval width varies with the number of RT observations and posterior uncertainty.

RT cap rate by subtest

Interpretation: this plot summarises how often responses land at the 60s ceiling. If the observed dot sits above the model’s predictive interval, the data contain more pause/timeout behaviour than the lognormal RT likelihood can generate. This is most consequential for STPM because it anchors τ_base.

Item-level RT calibration

Interpretation: each point is an item’s observed mean log-RT (x) compared to the model-predicted mean log-RT (y), with a 90% predictive interval. Dots far from the diagonal or outside the interval flag items whose time intensity is not well captured (interpret only for items with n ≥ 20).

7.8.5 Joint speed–accuracy checks

These checks evaluate whether the joint model reproduces fast-wrong and fast-correct rates for timed-math subtests (Term 1 only), using posterior predictive simulations of both accuracy and RT.

Interpretation: these panels compare observed fast-correct and fast-wrong rates (≤ 1s) to model-implied intervals. Large observed fast-wrong rates above the predictive interval suggest a rapid-guess / accidental-tap process that this v3 RT likelihood does not represent. This is a primary motivation for probe-aware RT modelling (e.g., correct-only RT for affected probes) in the next iteration.

8 Model Diagnostics

8.1 Input coverage (Foundation)

Model input coverage summary (Foundation)
year_level n_rows n_persons n_acc_items n_acc_bin n_acc_nl n_rt_items n_rt_obs
foundation 200822 2517 214 106479 46060 184 131978

8.2 Sampler diagnostics

MCMC sampler diagnostics summary
Metric Value
Divergences 0.000
Treedepth hit rate 0.000
Max Rhat (key params) 1.009
Min ESS (key params) 407.122
Max Rhat (theta) 1.004
Min ESS (theta) 5273.110
Max Rhat (tau_base) 1.004
Min ESS (tau_base) 5326.209
Max Rhat (tau_math_total) 1.004
Min ESS (tau_math_total) 5778.137
Max Rhat (item params) 1.007
Min ESS (item params) 557.667
Min E-BFMI 0.654

Diagnostics are clean (0 divergences; Rhat ≈ 1; strong ESS; healthy E-BFMI). Note that the JSON stores QC metadata from the QC run environment; the actual fit used 1500 warmup + 1500 sampling per chain.

8.3 Traceplots

Traceplots for key population-level parameters across all chains. Healthy mixing appears as overlapping, stationary traces with no trends or stuck regions.

8.4 Prior vs posterior overlap (key hyperparameters)

These overlays check whether the data meaningfully update the priors for core scale and linking parameters. Posteriors that closely match priors indicate weak identification.

8.5 RT preprocessing (floor/cap)

RT preprocessing: flooring and capping by timed subtest
test_subgroup n_rt n_included n_floored n_capped p_floored p_capped
MC0-20 40402 40402 4 1 0.0001 0.0000
MNA0-20 7758 7758 1 1 0.0001 0.0001
MNC0-20 7445 7445 0 2 0.0000 0.0003
MQ1-10 17809 17809 0 9 0.0000 0.0005
MQ1-20 10281 10281 0 50 0.0000 0.0049
STPM 48283 48283 0 285 0.0000 0.0059

Flooring is negligible and capping is rare, concentrated where expected (STPM has the largest share of long pauses).

8.6 Low-information flags

Low-information rates (Foundation) from model outputs
year_level low_info_n low_info_rate low_acc_items_rate low_timed_math_rt_rate low_speed_rt_rate low_theta_precision_rate low_tau_reg_precision_rate
foundation 61 0.024 0.01 0.019 0.011 0.001 0

Low-information flags are rare (about 2.4%) and are driven by low item/RT counts rather than unstable posterior uncertainty.

8.7 Item-level QC notes

8.7.1 Low-exposure items

Low-exposure accuracy items (n < 20) summary
n_items n_low_n pct_low_n obs_low_n obs_total obs_share
214 36 16.822 269 152539 0.002
Lowest-exposure items (n < 20); item parameters are unstable
item_id test_subgroup n_total b_mean b_sd
MC0-20__MC0-20_044 MC0-20 1 -0.506 1.327
MNC0-20__MNC0-20_030 MNC0-20 1 -0.782 1.285
MNC0-20__MNC0-20_028 MNC0-20 2 0.973 1.207
MNC0-20__MNC0-20_026 MNC0-20 2 1.091 1.202
MNA0-20__MNA0-20_029new MNA0-20 2 -0.148 1.136
MNC0-20__MNC0-20_029 MNC0-20 2 -1.651 1.133
MNA0-20__MNA0-20_030new MNA0-20 2 -0.155 1.103
MNC0-20__MNC0-20_027 MNC0-20 3 1.190 1.138
MNA0-20__MNA0-20_027new MNA0-20 3 0.230 1.006
MNC0-20__MNC0-20_025 MNC0-20 3 -0.989 0.995
MNA0-20__MNA0-20_028new MNA0-20 3 -0.778 0.993
MC0-20__MC0-20_042 MC0-20 4 -1.300 1.127
MC0-20__MC0-20_041 MC0-20 4 -1.281 1.102
MC0-20__MC0-20_039 MC0-20 4 -1.278 1.098
MC0-20__MC0-20_043 MC0-20 4 -1.281 1.090

These low-n items represent a tiny share of all accuracy observations, so they have limited impact on student scores but should be filtered in any item-level ranking or review.

8.7.2 Number-line step ordering

NL step ordering: percentage of items with step₁ < step₂
n_items n_ordered pct_ordered
30 27 90
NL items with disordered steps (step₁ > step₂)
item_id test_subgroup step1_step_mean step2_step_mean
BNL0-20__BNL0-20_008new-copy BNL0-20 0.156 -0.156
UNLC0-20__UNLnc0-20_005-copy UNLC0-20 0.209 -0.209
UNLNC0-20__UNLnc0-20_005-copy UNLNC0-20 0.671 -0.671

For 3-category PCM, ordered steps are expected. A small subset of NL items show disordered steps; these are targeted candidates for review rather than a threat to overall model stability.

8.8 Pending item review

The following item-level review tasks remain outstanding:

  • Outfit / infit statistics: Compute and review item fit indices to identify items performing worse than expected under the Rasch/PCM model.
  • Differential item functioning (DIF): Test whether items function equivalently across cohorts A and B.
  • Anchor item residuals: Assess stability of shared items used for cohort linking.

The flags identified in this report — low-exposure items (n < 20) and NL step disorder — are review flags, not automatic retirement triggers. They identify items to inspect for scoring, coding, or form assignment issues before any change to the instrument.

9 Lessons Learnt & Implications for Next Iteration

This section collects findings from the Foundation Term 1 review that inform future modelling decisions. It is intended as a working reference for the next calibration cycle.

9.1 What worked well

  • Accuracy model convergence: 0 divergences, healthy ESS across all parameters, max Rhat ~1.01, E-BFMI well above warning thresholds. The Rasch (1PL) + PCM specification is well-identified for this dataset.
  • Construct validity: θ aligns strongly with observed accuracy; τ_base tracks STPM speed; τ_math_total tracks timed-math speed; τ_reg_adj is approximately independent of baseline speed. The latent traits behave as intended.
  • Binary item calibration: 0/184 items outside the 90% posterior predictive interval for observed proportion correct. In-sample calibration is tight.
  • Prior calibration: posteriors are narrower than priors for all key hyperparameters, indicating that the data are informative and the priors are not unduly constraining estimates.

9.2 What needs attention

  • Fast-wrong responses not modelled: v3 includes all attempted RTs. Observed fast-wrong rates exceed PPC intervals for timed-math subtests (MC0-20, MNC0-20). Speed scores for students with many fast-wrong responses may be biased upward (faster apparent speed from rapid guesses rather than fluent retrieval).
  • STPM 60-second cap rate: the observed cap rate far exceeds the model’s 90% PI. Capped RTs are treated as exact observations (log(60)), which biases τ_base downward (slower apparent speed) for affected students.
  • Heavy RT tails: p99 quantiles are systematically underpredicted — the lognormal residual is too thin-tailed at the extremes, even where the bulk of the distribution fits well.
  • PPC plumbing matters: PPCs are only as reliable as their indexing/aggregation. Keep Stan matrix reshaping consistent with Stan’s column-major order, and ensure group aggregation preserves ordering (avoid rowsum(..., reorder = FALSE) when results are later aligned to numeric IDs).
  • PPC total score definition: the current PPC sums binary 0/1 accuracy scores with NL category scores (0/1/2). This is a valid internal consistency check but does not correspond to an operational reporting metric. Future PPCs should consider separate accuracy-only and NL-only checks to avoid conflating different scoring scales.

9.3 Recommendations for next iteration

9.3.1 Immediate (v4 changes)

  1. Correct-only RT filtering for flagged subtests. Apply a probe-aware RT inclusion policy for timed-math subtests with high fast-wrong rates/excess (e.g., MC0-20, MNC0-20). This targets the primary source of RT misfit without discarding accuracy data.
  2. Document the RT filtering policy as a formalised pre-model data step with auditable thresholds (e.g., percentage of fast-wrong responses triggering correct-only mode for a subtest), and lock the resulting policy table per run_id.
  3. Add speed reliability flags for reporting/operations (e.g., min number of correct RTs, STPM cap hits, fast-wrong excess) so speed scores can be downweighted or withheld when not informative.

9.3.2 Medium-term (model extensions)

  1. Censored or mixture model for the 60-second cap. Right-censoring is the cleanest fix for cap-rate misfit: it tells the model "this student's true RT is at least 60s" rather than "exactly 60s". A mixture model is more complex but could target both p99 and cap-rate misfit simultaneously. A minimal likelihood sketch follows this list.
  2. Heavier-tailed RT residual (Student-t with df = 5–10) if p99 underprediction persists after correct-only filtering. The lognormal assumption is adequate for the bulk of the distribution but fails at the extremes.
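
A minimal sketch of the right-censoring idea from item 1, written as a plain-R log-likelihood contribution for a single timed RT (illustrative, not the Stan implementation): a capped observation contributes the probability of exceeding the cap rather than a point density at log(60).

```r
# Log-likelihood contribution of one timed RT under the log-RT model
# with right-censoring at the 60 s cap.
#   mu    = lambda_item - tau_person (mean of log RT)
#   sigma = RT residual SD on the log scale
loglik_rt_censored <- function(rt, mu, sigma, cap_s = 60) {
  if (rt >= cap_s) {
    # Censored: we only know the true RT is >= cap_s
    pnorm(log(cap_s), mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE)
  } else {
    # Observed exactly: normal density of log RT, matching the model's log-RT likelihood
    dnorm(log(rt), mean = mu, sd = sigma, log = TRUE)
  }
}

loglik_rt_censored(rt = 60,  mu = 1.5, sigma = 0.5)   # capped observation
loglik_rt_censored(rt = 4.2, mu = 1.5, sigma = 0.5)   # ordinary observation
```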

9.3.3 Deferred

  1. Item-fit, DIF, and local dependence checks remain outstanding (see Pending Item Review above). These are necessary before any high-stakes use of item parameters.
  2. Cross-validate PPC total score definition — investigate whether any observed mean/SD shift persists after PPC plumbing audits. The total score density overlay is the more robust check; mean/SD deviations may be a definitional artefact of mixing binary and ordinal scoring scales rather than genuine model misfit.

10 Appendix: Reproducibility

10.1 Run Metadata

Run metadata for reproducibility
Field Value
Run ID t1_2025_joint_flu_v1
Model root models/irt/irt_joint_stan_pcm
R Version 4.5.2
cmdstanr Version 0.9.0
Report Generated 2026-02-03 01:49:23.856953

10.2 File Paths

Input file paths relative to project root
File Path
Cleaned responses data/processed/cleaned_responses/cleaned_responses.parquet
Student scores models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/final/students/term1_joint_scores_foundation.csv
Item parameters models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/final/items/term1_joint_item_params_foundation.csv
Model diagnostics (sampler) models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/qc/sampler_diagnostics_foundation.json
Model summary models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/qc/term1_joint_summary_foundation.csv
Low-information summary models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/qc/term1_joint_low_info_foundation.csv
RT preprocessing summary models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/qc/term1_rt_preproc_foundation.csv
Stan data (foundation) models/irt/irt_joint_stan_pcm/outputs/runs/t1_2025_joint_flu_v1/intermediate/term1_joint_stan_data_foundation.rds

10.3 Known Quirks & Limitations

  • UNLC/UNLNC variants: Form-confounded (chair vs no-chair), NOT true anchors
  • Reach analysis: Deferred; gap/tail classification is a proxy only
  • RT censoring: Not modelled; 60s cap may truncate valid slow responses
  • NL step disorder: A small subset of NL items show disordered PCM steps; these items should be reviewed for scoring/calibration issues
  • Low-n items: Parameter estimates for items with <20 responses are unstable and should be filtered from item-level rankings
  • Untimed RT: Excluded from baseline model; 180s cap used in sensitivity runs only