Unidimensional or Multidimensional? Comparing IRT Structural Models

Evidence for a single-ability model with testlet effects

1 Why this matters

Before we can score students on this screener, we need to answer a foundational question: are we measuring one ability or several?

If students’ performance on number-line, magnitude comparison, and other tasks is driven by a single underlying numeracy ability, we should report one overall score. If there are genuinely distinct sub-skills — spatial number sense vs. symbolic comparison, say — then separate profile scores could give teachers more actionable information.

This analysis compares three structural models head-to-head to determine which best describes the data.

2 Three candidate models

The three models represent increasingly complex accounts of what drives student responses.

Schematic path diagrams for the three candidate models. Circles represent latent traits; rectangles represent observed item responses.

| Model | Structure | Wins if... |
|---|---|---|
| M0: Unidimensional | Single latent trait drives all items | No meaningful local dependence or sub-skill structure exists |
| M1: Testlet (bifactor) | One general trait plus testlet factors for probe clusters | Most variation is one trait, but items within a probe are more similar than expected |
| M3: Multidimensional | Multiple correlated latent factors (one per domain) | Distinct sub-skills are stable enough to improve fit and produce usable scores |

2.1 Model equations

All three models share the same item response functions — Rasch for binary items and the generalised partial credit model (GPCM) for number-line items — and differ only in how they parameterise the latent trait(s).

Shared item response functions. For binary items:

\[P(Y_{ij} = 1 \mid \theta_i, b_j) = \text{logit}^{-1}(\theta_i - b_j)\]

For number-line items (3 ordered categories, \(k \in \{0,1,2\}\)):

\[P(Y_{ij} = k \mid \theta_i, b_j, d_{j,\cdot}) \propto \exp\!\left(\sum_{c=1}^{k} (\theta_i - b_j - d_{j,c})\right)\]

where \(Y_{ij}\) is the response of student \(i\) on item \(j\), \(b_j\) is item difficulty, and \(d_{j,c}\) are step thresholds.
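
The two response functions above can be evaluated directly. The following is a minimal sketch, not the fitting code used for this analysis; the function names and the numeric values are illustrative assumptions.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """P(Y = 1 | theta, b) = inverse-logit(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def partial_credit_probs(theta: float, b: float, d: list[float]) -> list[float]:
    """Category probabilities for k = 0..len(d); d[c - 1] is step threshold d_{j,c}."""
    # Cumulative sums of (theta - b - d_c); the empty sum for k = 0 is 0.
    exponents = [0.0]
    for d_c in d:
        exponents.append(exponents[-1] + (theta - b - d_c))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(exponents)
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values, for illustration only.
p_binary = rasch_prob(theta=0.5, b=0.5)  # ability equals difficulty
p_numberline = partial_credit_probs(theta=0.0, b=0.0, d=[-0.5, 0.5])
```

When ability equals difficulty the binary probability is one half, and the three number-line category probabilities always sum to one because the proportionality constant is the sum of the exponentiated terms.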

M0 — Unidimensional. A single latent trait \(\theta_i\) enters the response functions above directly. No additional structure.

M1 — Testlet (bifactor). A general trait \(\theta_i\) loads on every item. Each probe cluster \(t\) adds a testlet-specific factor \(s_{t,i}\), orthogonal to \(\theta\):

\[\text{logit}\,P(Y_{ij} = 1) = a_j\,\theta_i + a_{S,j}\,s_{t(j),i} - b_j\]

where \(a_j\) and \(a_{S,j}\) are the general and specific loadings. Testlet factors absorb local dependence among items within the same probe without splitting the general ability dimension.

M3 — Multidimensional (correlated factors). Each item loads on a single domain factor \(\theta_{f,i}\):

\[\text{logit}\,P(Y_{ij} = 1) = a_j\,\theta_{f(j),i} - b_j\]

Factor correlations \(\text{Cor}(\theta_f, \theta_g)\) are freely estimated. There is no general factor — all covariation between domains is captured through pairwise correlations.
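
To make the structural difference concrete, the sketch below evaluates the M1 and M3 linear predictors for a binary item. It is an illustration under assumed parameter values, not the estimation code; all function names and numbers are hypothetical.

```python
import math

def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_m1(theta: float, s_t: float, a: float, a_s: float, b: float) -> float:
    """M1: general trait plus the specific factor of the item's testlet."""
    return inv_logit(a * theta + a_s * s_t - b)

def p_m3(theta_f: float, a: float, b: float) -> float:
    """M3: only the item's own domain factor enters the predictor."""
    return inv_logit(a * theta_f - b)

# Two students with the same general ability but opposite standing on one
# testlet factor (illustrative values).
p_low = p_m1(theta=0.0, s_t=-1.0, a=1.0, a_s=0.8, b=0.0)
p_high = p_m1(theta=0.0, s_t=1.0, a=1.0, a_s=0.8, b=0.0)

# With a = 1 and a_s = 0 the M1 predictor collapses to theta - b.
p_reduced = p_m1(theta=0.0, s_t=1.0, a=1.0, a_s=0.0, b=0.0)
p_domain = p_m3(theta_f=0.0, a=1.0, b=0.0)
```

The first two values differ even though the general ability is identical, which is how a testlet factor absorbs within-probe similarity without splitting the general dimension.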

3 How subtests map to dimensions

M3 with a separate factor for every task failed to converge on the main pooled units, so we collapsed number-line tasks (BNL, UNLC, UNLNC) into a single NL factor and magnitude comparison & missing number tasks (MC, MNA, MNC) into a single SYM_MAG factor. Match quantity and decomposition each keep their own factor. Because each exam group is administered one MQ variant and one DMT variant, a cohort-level unit has 4 factors (NL, SYM_MAG, MQ, DMT); pooled units that combine both exam groups have 6. The diagram below shows all tasks grouped by test family.

Mapping of tasks to M3 latent factors. Number-line and magnitude comparison & missing number tasks collapse into shared NL and SYM_MAG factors; remaining tasks each keep their own factor. Each exam group sees one MQ variant and one DMT variant, giving 4 factors per cohort.
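
The collapsed mapping can be written down as a simple lookup. This is a sketch only: the NL and SYM_MAG task codes come from the text above, while the variant names `MQ_A`, `MQ_B`, `DMT_A`, and `DMT_B` are hypothetical placeholders for the match quantity and decomposition variants.

```python
# Task code -> M3 factor. Variant names for MQ and DMT are hypothetical.
TASK_TO_FACTOR = {
    "BNL": "NL", "UNLC": "NL", "UNLNC": "NL",
    "MC": "SYM_MAG", "MNA": "SYM_MAG", "MNC": "SYM_MAG",
    "MQ_A": "MQ_A", "MQ_B": "MQ_B",
    "DMT_A": "DMT_A", "DMT_B": "DMT_B",
}

SHARED_TASKS = ["BNL", "UNLC", "UNLNC", "MC", "MNA", "MNC"]

def factors_for(tasks: list[str]) -> list[str]:
    """Distinct factors needed for the tasks administered in one analysis unit."""
    return sorted({TASK_TO_FACTOR[task] for task in tasks})

# One exam group sees one MQ variant and one DMT variant.
cohort_factors = factors_for(SHARED_TASKS + ["MQ_A", "DMT_A"])
# A pooled unit combines both exam groups.
pooled_factors = factors_for(SHARED_TASKS + ["MQ_A", "DMT_A", "MQ_B", "DMT_B"])
```

Under these assumptions a cohort-level unit needs 4 factors and a pooled unit needs 6, matching the counts given above.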

4 What the evidence shows

We evaluated the three models across 12 analysis units (2 terms × 2 sample definitions × 3 cohort scopes), using BIC as the primary selection criterion, residual dependence (Q3) as a secondary check, and school-level bootstrap resampling to test robustness.

4.1 BIC comparison

The analysis was run across 12 units representing every combination of term (Term 1 and Term 4), sample definition (“All students” = every student with at least 5 attempted items in that term; “Stable schools” = only students from the 39 schools present in both terms), and cohort scope (Foundation A, Foundation B, or all foundation students pooled).

The Bayesian Information Criterion penalises model complexity: a lower BIC means the model explains the data better relative to its number of free parameters.
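
As a reminder of how the penalty works, here is a minimal sketch of the standard BIC formula, k ln(n) − 2 ln L. The log-likelihoods, parameter counts, and sample size are made-up numbers, and whether n counts students or item responses is an assumption this report does not state.

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Standard BIC: parameter penalty minus twice the maximised log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical models fitted to the same 500 observations.
simple = bic(log_likelihood=-1000.0, n_params=10, n_obs=500)
complex_same_fit = bic(log_likelihood=-1000.0, n_params=20, n_obs=500)
complex_better_fit = bic(log_likelihood=-950.0, n_params=20, n_obs=500)
```

Extra parameters raise BIC unless they buy enough log-likelihood: here the ten additional parameters cost about 62 points, so the more complex model is preferred only in the third case, where the fit improves by 100 points on the −2 ln L scale.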

BIC values across all 12 analysis units. M1 (testlet/bifactor) achieves the lowest BIC in every unit.

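M1 wins in every unit. Across all 12 analysis units (both terms, both sample definitions, all cohort scopes) the testlet model achieves the lowest BIC. The multidimensional model (M3) has a BIC only for the 8 cohort-level units; it is not available for the 4 pooled units. In 6 of those 8 units the unidimensional model (M0) ranks second and M3 last. In the two Term 4 Foundation B units, M3 edges out M0 by about 28 BIC points but still trails M1 by about 1,374.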

4.2 Head-to-head BIC differences

To quantify the margin of victory, we examine pairwise BIC differences between models. A negative value in “M1 vs M0” means M1 fits better; a positive value in “M3 vs M1” means M1 fits better than M3.

Pairwise BIC differences by analysis unit. Left panel: M1 consistently beats M0. Right panel: M3 consistently loses to M1.

The testlet model beats the unidimensional model in every unit, by roughly 1,200 to 3,600 BIC points, which is consistent with local dependence among items within a probe. The multidimensional model performs worse than the testlet model in every cohort-level unit where it was fit, by roughly 1,400 to 7,900 BIC points.
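
The sign conventions can be checked against a few rows of the BIC table in the technical appendix. The sketch below is illustrative; the helper name is an assumption, and the BIC values are copied from that table.

```python
# BIC values copied from the technical appendix; None marks a missing M3 fit.
bic_by_unit = {
    "Term 1 — Foundation A": {"M0": 87094.6, "M1": 83634.9, "M3": 91584.3},
    "Term 1 — Foundation B": {"M0": 85936.3, "M1": 83469.7, "M3": 88127.7},
    "Term 1 — All students": {"M0": 104043.8, "M1": 100457.6, "M3": None},
}

def pairwise_differences(row: dict) -> dict:
    """'M1 vs M0' = BIC(M1) - BIC(M0); 'M3 vs M1' = BIC(M3) - BIC(M1)."""
    m1_vs_m0 = round(row["M1"] - row["M0"], 1)
    m3_vs_m1 = None if row["M3"] is None else round(row["M3"] - row["M1"], 1)
    return {"M1 vs M0": m1_vs_m0, "M3 vs M1": m3_vs_m1}

differences = {unit: pairwise_differences(row) for unit, row in bic_by_unit.items()}
```

For Term 1 — Foundation A this gives −3,459.7 for M1 vs M0 and 7,949.4 for M3 vs M1, so both comparisons favour M1.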

4.3 Bootstrap stability

Could these results be an artefact of the particular school sample? To check, we resampled schools with replacement, planning 30 replicates per slice, and refit both M1 and M3 on each bootstrap sample. Not every replicate was usable: the counts were 30 and 28 for the two Term 1 slices and 17 and 14 for the two Term 4 slices. The slices range from approximately 1,100 to 2,500 students; stable-school slices draw from the 39 schools present in both terms.

| Slice | N students | N observations | N schools |
|---|---|---|---|
| Term 1 — All students | 2,514 | 152,068 | |
| Term 1 — Stable schools | 1,186 | 71,289 | 39 |
| Term 4 — All students | 1,098 | 68,371 | |
| Term 4 — Stable schools | 1,076 | 67,039 | 39 |

Distribution of ΔBIC (M3 − M1) across the usable bootstrap replicates in each slice (14 to 30 of the 30 planned). All distributions sit entirely above zero, confirming M1’s advantage is robust.

Each ridge shows the distribution of ΔBIC across that slice's usable replicates. All four distributions sit entirely above zero: even the most conservative replicate (smallest ΔBIC) favours M1 by over 1,000 BIC points. The wider spread for Term 4 — All students reflects the smaller sample (1,098 students vs 2,514 in Term 1) and, plausibly, the smaller number of usable replicates (17 of 30).
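
The resampling step can be sketched as a cluster bootstrap over schools. This is a minimal illustration under stated assumptions, not the production pipeline: the toy school roster, the function names, and the ΔBIC values are all hypothetical, and the model refit on each replicate is left out.

```python
import random

def resample_schools(students_by_school: dict, rng: random.Random) -> dict:
    """One replicate: draw schools with replacement, keeping all their students."""
    school_ids = sorted(students_by_school)
    drawn = rng.choices(school_ids, k=len(school_ids))
    replicate = {}
    for draw_index, school in enumerate(drawn):
        # Relabel each draw so a school drawn twice counts as two clusters.
        replicate[f"{school}#{draw_index}"] = list(students_by_school[school])
    return replicate

def percentile(values: list, q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

# Hypothetical roster: three schools with a handful of students each.
toy_roster = {"school_1": ["s1", "s2"], "school_2": ["s3"], "school_3": ["s4", "s5", "s6"]}
rng = random.Random(2026)
replicates = [resample_schools(toy_roster, rng) for _ in range(30)]

# Placeholder ΔBIC (M3 − M1) values; in practice each would come from
# refitting both models on one replicate.
toy_delta_bic = [1200.0, 1350.0, 1500.0, 1650.0, 1800.0]
summary = {
    "p10": percentile(toy_delta_bic, 10),
    "median": percentile(toy_delta_bic, 50),
    "p90": percentile(toy_delta_bic, 90),
}
```

Resampling whole schools rather than individual students keeps the within-school clustering intact in every replicate, which is the point of a school-level bootstrap.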

4.4 Residual dependence (Q3)

The Q3 statistic is the correlation between the model residuals of an item pair, so it measures local dependence left unexplained after accounting for the latent structure. Absolute values closer to zero indicate that the model has captured more of the correlation structure.
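
For readers who want the mechanics, the sketch below computes Q3 for an item pair as the Pearson correlation of residuals (observed minus model-expected score) and then summarises a set of Q3 values. The responses, the constant expected scores of 0.5, and the function names are illustrative assumptions; the 0.2 cut-off is the threshold named in the decision protocol in the technical appendix.

```python
import math

def pearson(x: list, y: list) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def q3(obs_j: list, obs_k: list, exp_j: list, exp_k: list) -> float:
    """Correlation between the residuals of items j and k across students."""
    res_j = [o - e for o, e in zip(obs_j, exp_j)]
    res_k = [o - e for o, e in zip(obs_k, exp_k)]
    return pearson(res_j, res_k)

def summarise(q3_values: list, threshold: float = 0.2) -> dict:
    """Mean |Q3| and the proportion of item pairs with |Q3| above the threshold."""
    magnitudes = [abs(v) for v in q3_values]
    return {
        "mean_abs_q3": sum(magnitudes) / len(magnitudes),
        "prop_above": sum(m > threshold for m in magnitudes) / len(magnitudes),
    }

# Toy data: six students, expected score 0.5 on every item (hypothetical).
expected = [0.5] * 6
item_a = [1, 1, 0, 0, 1, 0]
item_b = [1, 1, 0, 0, 1, 0]  # identical residual pattern to item_a
item_c = [0, 0, 1, 1, 0, 1]  # mirror-image residual pattern
q3_ab = q3(item_a, item_b, expected, expected)
q3_ac = q3(item_a, item_c, expected, expected)
toy_summary = summarise([0.1, -0.3, 0.25, 0.05])
```

In real use the expected scores come from the fitted model at each student's estimated ability, so they vary by student and item rather than being constant.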

| Scope | Model | Units | Median mean abs(Q3) | Median prop abs(Q3) > 0.2 |
|---|---|---|---|---|
| By cohort | M0: Unidimensional | 8 | 0.1051 | 0.1424 |
| By cohort | M1: Testlet (bifactor) | 8 | 0.0940 | 0.1146 |
| By cohort | M3: Multidimensional | 8 | 0.1200 | 0.1804 |
| All students | M0: Unidimensional | 4 | 0.1130 | 0.1573 |
| All students | M1: Testlet (bifactor) | 4 | 0.1028 | 0.1355 |

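M1 achieves the lowest residual dependence in both scopes: it absorbs local probe effects that M0 leaves unexplained. M3, which was evaluated only in the by-cohort scope, leaves more residual dependence than either alternative (median mean abs(Q3) of 0.1200, against 0.0940 for M1 and 0.1051 for M0).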

5 Conclusion

5.1 M1 (testlet/bifactor) is the preferred structural model

The testlet model wins consistently: it achieves the lowest BIC in all 12 analysis units, demonstrates the lowest residual dependence, and its advantage over both competitors is stable under school-level bootstrap resampling.

Practical implication: the screener should be scored using a single general ability estimate (θ) with testlet effects absorbed into the model. Multidimensional profile scores — separate sub-skill scores for number-line vs. symbolic comparison, for instance — are not viable at current item counts because the multidimensional model fits worse and factor-level precision is insufficient.

Looking ahead: as the item pool grows in future assessment cycles, it will be worth revisiting whether distinct sub-skill profiles become feasible. For now, no profile-level scores should be reported to teachers.

6 Technical appendix

| Unit | BIC M0 | BIC M1 | BIC M3 | Δ (M0 − best) | Δ (M1 − best) | Δ (M3 − best) | Winner |
|---|---|---|---|---|---|---|---|
| Term 1 — Foundation A | 87,094.6 | 83,634.9 | 91,584.3 | 3,459.7 | 0 | 7,949.4 | M1_1d_testlets |
| Term 1 — Foundation B | 85,936.3 | 83,469.7 | 88,127.7 | 2,466.6 | 0 | 4,658.0 | M1_1d_testlets |
| Term 1 — Foundation A (stable) | 38,089.5 | 36,788.2 | 39,947.5 | 1,301.2 | 0 | 3,159.2 | M1_1d_testlets |
| Term 1 — Foundation B (stable) | 44,334.2 | 43,145.2 | 45,216.9 | 1,189.0 | 0 | 2,071.7 | M1_1d_testlets |
| Term 4 — Foundation A | 33,577.8 | 32,184.3 | 34,653.1 | 1,393.6 | 0 | 2,468.9 | M1_1d_testlets |
| Term 4 — Foundation B | 34,125.4 | 32,723.3 | 34,097.5 | 1,402.2 | 0 | 1,374.3 | M1_1d_testlets |
| Term 4 — Foundation A (stable) | 31,965.5 | 30,622.2 | 32,906.2 | 1,343.3 | 0 | 2,284.0 | M1_1d_testlets |
| Term 4 — Foundation B (stable) | 34,125.4 | 32,723.3 | 34,097.5 | 1,402.2 | 0 | 1,374.3 | M1_1d_testlets |
| Term 1 — All students | 104,043.8 | 100,457.6 | NA | 3,586.2 | 0 | NA | M1_1d_testlets |
| Term 1 — Stable schools | 82,192.0 | 79,592.7 | NA | 2,599.3 | 0 | NA | M1_1d_testlets |
| Term 4 — All students | 67,401.5 | 64,532.3 | NA | 2,869.2 | 0 | NA | M1_1d_testlets |
| Term 4 — Stable schools | 65,799.3 | 62,973.6 | NA | 2,825.6 | 0 | NA | M1_1d_testlets |

| Unit | Model | Mean abs(Q3) | Max abs(Q3) | Prop abs(Q3) > 0.2 |
|---|---|---|---|---|
| Term 1 — Foundation A | M0: Unidimensional | 0.0953 | 0.9828 | 0.1257 |
| Term 1 — Foundation A | M1: Testlet (bifactor) | 0.0856 | 0.9977 | 0.1145 |
| Term 1 — Foundation A | M3: Multidimensional | 0.1220 | 1.0000 | 0.1949 |
| Term 1 — Foundation A (stable) | M0: Unidimensional | 0.0979 | 1.0000 | 0.1228 |
| Term 1 — Foundation A (stable) | M1: Testlet (bifactor) | 0.0890 | 0.7938 | 0.1050 |
| Term 1 — Foundation A (stable) | M3: Multidimensional | 0.1216 | 0.7943 | 0.1829 |
| Term 4 — Foundation A | M0: Unidimensional | 0.1047 | 0.9870 | 0.1437 |
| Term 4 — Foundation A | M1: Testlet (bifactor) | 0.0931 | 0.9792 | 0.1092 |
| Term 4 — Foundation A | M3: Multidimensional | 0.1258 | 0.9880 | 0.1841 |
| Term 4 — Foundation A (stable) | M0: Unidimensional | 0.1074 | 0.9964 | 0.1471 |
| Term 4 — Foundation A (stable) | M1: Testlet (bifactor) | 0.0958 | 0.9943 | 0.1146 |
| Term 4 — Foundation A (stable) | M3: Multidimensional | 0.1265 | 0.9974 | 0.1825 |
| Term 1 — Foundation B | M0: Unidimensional | 0.1017 | 1.0000 | 0.1335 |
| Term 1 — Foundation B | M1: Testlet (bifactor) | 0.0952 | 0.8873 | 0.1274 |
| Term 1 — Foundation B | M3: Multidimensional | 0.1073 | 0.9454 | 0.1503 |
| Term 1 — Foundation B (stable) | M0: Unidimensional | 0.1124 | 0.9958 | 0.1645 |
| Term 1 — Foundation B (stable) | M1: Testlet (bifactor) | 0.1068 | 0.9619 | 0.1529 |
| Term 1 — Foundation B (stable) | M3: Multidimensional | 0.1184 | 0.9769 | 0.1784 |
| Term 4 — Foundation B | M0: Unidimensional | 0.1054 | 1.0000 | 0.1424 |
| Term 4 — Foundation B | M1: Testlet (bifactor) | 0.0940 | 0.9554 | 0.1147 |
| Term 4 — Foundation B | M3: Multidimensional | 0.1030 | 0.9880 | 0.1369 |
| Term 4 — Foundation B (stable) | M0: Unidimensional | 0.1054 | 1.0000 | 0.1424 |
| Term 4 — Foundation B (stable) | M1: Testlet (bifactor) | 0.0940 | 0.9554 | 0.1147 |
| Term 4 — Foundation B (stable) | M3: Multidimensional | 0.1030 | 0.9880 | 0.1369 |
| Term 1 — All students | M0: Unidimensional | 0.1268 | 1.0000 | 0.1940 |
| Term 1 — All students | M1: Testlet (bifactor) | 0.1180 | 1.0000 | 0.1808 |
| Term 1 — Stable schools | M0: Unidimensional | 0.1125 | 1.0000 | 0.1542 |
| Term 1 — Stable schools | M1: Testlet (bifactor) | 0.1049 | 1.0000 | 0.1445 |
| Term 4 — All students | M0: Unidimensional | 0.1125 | 1.0000 | 0.1561 |
| Term 4 — All students | M1: Testlet (bifactor) | 0.0997 | 0.9941 | 0.1237 |
| Term 4 — Stable schools | M0: Unidimensional | 0.1136 | 1.0000 | 0.1584 |
| Term 4 — Stable schools | M1: Testlet (bifactor) | 0.1007 | 0.9938 | 0.1266 |

| Slice | Planned reps | Usable reps | Median ΔBIC | P10 ΔBIC | P90 ΔBIC |
|---|---|---|---|---|---|
| Term 1 — All students | 30 | 30 | 2,951.0 | 2,726.8 | 3,379.8 |
| Term 1 — Stable schools | 30 | 28 | 1,308.0 | 1,130.8 | 1,524.9 |
| Term 4 — All students | 30 | 17 | 1,635.0 | 1,194.8 | 1,743.6 |
| Term 4 — Stable schools | 30 | 14 | 1,339.9 | 1,190.2 | 1,585.2 |
| All slices | 120 | NA | 1,536.3 | 1,183.4 | 3,062.1 |

Positive values throughout are consistent with M1 being preferred over M3 in every usable bootstrap replicate. The P10 column shows that even the 10th-percentile replicate favours M1 by more than 1,100 BIC points in every slice.

Three-model comparison run

  • Run ID: foundation_parallel_20260206
  • Source path: ../../analysis/structure_model_comparison_v4/outputs/runs/foundation_parallel_20260206

Full-N robustness run (bootstrap)

  • Run ID: foundation_fulln_boot4_prod_20260209
  • Source path: ../../analysis/structure_model_comparison_v4/outputs/runs/foundation_fulln_boot4_prod_20260209

| Run | Status | Converged | Count |
|---|---|---|---|
| Full-N robustness | ok | TRUE | 8 |
| Three-model comparison | ok | TRUE | 32 |
| Three-model comparison | timed_out | FALSE | 4 |

The structural model selection follows a layered decision protocol:

  1. Primary — BIC (Bayesian Information Criterion): the model with the lowest BIC across the majority of analysis units is preferred. BIC penalises free parameters, guarding against overfitting.

  2. Secondary — Residual dependence (Q3): Yen’s Q3 statistic measures unexplained pairwise item dependence after conditioning on the latent structure. Lower mean |Q3| and a lower proportion of item pairs exceeding the threshold (|Q3| > 0.2) indicate better local-dependence absorption.

  3. Robustness — School-level bootstrap: schools are resampled with replacement (30 replicates per slice) and both M1 and M3 are refit. If the BIC preference reverses in a meaningful proportion of replicates, the result is considered unstable.

  4. Profile gate (secondary): even if a multidimensional model won on fit, profile scores are only reported to teachers if factor-level measurement precision exceeds minimum thresholds (determinacy > 0.80 for teacher-facing scores, > 0.70 for internal use). This gate is currently not met for any M3 factor.
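
The first and last steps of the protocol above can be sketched as two small checks. This is an illustration, not the project's decision code: the function names, the unit labels, and the determinacy values are hypothetical, and requiring every factor to clear the threshold is an assumption about how the gate is applied.

```python
from collections import Counter

TEACHER_MIN_DETERMINACY = 0.80   # threshold stated in the protocol
INTERNAL_MIN_DETERMINACY = 0.70  # threshold stated in the protocol

def bic_preferred(winner_by_unit: dict):
    """Model with the lowest BIC in a strict majority of units, else None."""
    model, wins = Counter(winner_by_unit.values()).most_common(1)[0]
    return model if wins > len(winner_by_unit) / 2 else None

def profile_gate(determinacy_by_factor: dict) -> dict:
    """Assumed rule: every factor must exceed the threshold for a pass."""
    values = list(determinacy_by_factor.values())
    return {
        "teacher_profile_pass": bool(values)
        and all(v > TEACHER_MIN_DETERMINACY for v in values),
        "internal_profile_pass": bool(values)
        and all(v > INTERNAL_MIN_DETERMINACY for v in values),
    }

# Hypothetical inputs, for illustration only.
preferred = bic_preferred({"unit_1": "M1", "unit_2": "M1", "unit_3": "M3"})
gate_fail = profile_gate({"NL": 0.75, "SYM_MAG": 0.66, "MQ": 0.61, "DMT": 0.68})
gate_pass = profile_gate({"NL": 0.85, "SYM_MAG": 0.83})
```

Keeping the gate separate from the BIC check mirrors the protocol: a model can win on fit and still be barred from producing teacher-facing profile scores.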

“Teacher profile pass” and “Internal profile pass” indicate whether M3’s factor-level precision exceeds minimum thresholds for reporting profile scores (determinacy > 0.80 for teacher-facing, > 0.70 for internal use). All units show FALSE — no M3 factor achieves sufficient precision for profile scores.

| Unit | BIC winner | BIC value | Teacher profile pass | Internal profile pass |
|---|---|---|---|---|
| Term 1 — Foundation A | M1_1d_testlets | 83,634.9 | FALSE | FALSE |
| Term 1 — Foundation A (stable) | M1_1d_testlets | 36,788.2 | FALSE | FALSE |
| Term 4 — Foundation A | M1_1d_testlets | 32,184.3 | FALSE | FALSE |
| Term 4 — Foundation A (stable) | M1_1d_testlets | 30,622.2 | FALSE | FALSE |
| Term 1 — Foundation B | M1_1d_testlets | 83,469.7 | FALSE | FALSE |
| Term 1 — Foundation B (stable) | M1_1d_testlets | 43,145.2 | FALSE | FALSE |
| Term 4 — Foundation B | M1_1d_testlets | 32,723.3 | FALSE | FALSE |
| Term 4 — Foundation B (stable) | M1_1d_testlets | 32,723.3 | FALSE | FALSE |
| Term 1 — All students | M1_1d_testlets | 100,457.6 | FALSE | FALSE |
| Term 1 — Stable schools | M1_1d_testlets | 79,592.7 | FALSE | FALSE |
| Term 4 — All students | M1_1d_testlets | 64,532.3 | FALSE | FALSE |
| Term 4 — Stable schools | M1_1d_testlets | 62,973.6 | FALSE | FALSE |