---
type: claim
domain: ai-alignment
description: Larger, more capable models show MORE random, unpredictable failures on hard tasks than smaller models, suggesting capability gains worsen alignment auditability in the relevant regime
confidence: experimental
source: Anthropic Research, ICLR 2026, empirical measurements across model scales
created: 2026-03-30
attribution:
  extractor:
    handle: theseus
  sourcer:
    handle: anthropic-research
    context: Anthropic Research, ICLR 2026, empirical measurements across model scales
supports:
  - frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase
reweave_edges:
  - frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase|supports|2026-04-03
sourced_from: inbox/archive/ai-alignment/2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence.md
---

# Capability scaling increases error incoherence on difficult tasks, inverting the expected relationship between model size and behavioral predictability

The counterintuitive finding: as models scale up and overall error rates drop, the COMPOSITION of the remaining errors shifts toward higher variance (incoherence) on difficult tasks. The marginal errors that persist in larger models are less systematic and harder to predict than the errors of smaller models. The proposed mechanism is that harder tasks require longer reasoning traces, and longer traces amplify the dynamical-system character of transformers rather than their optimizer-like behavior.

This has direct implications for alignment strategy: you cannot assume that scaling to more capable models will make behavioral auditing easier or more reliable. On the hardest tasks, where alignment matters most, scaling may make auditing HARDER because failures become less patterned. That challenges the implicit assumption in much alignment work that capability improvements and alignment improvements move together. The data suggest they may diverge: more capable models may simultaneously be better at solving problems AND worse at failing predictably.
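To make the "composition of errors" claim concrete, here is a minimal sketch of the bias/variance decomposition it implies. Everything below is an illustrative assumption, not a measurement or method from the source: we imagine repeated scored attempts by each model at the same hard tasks, treat the per-task mean error as the systematic (predictable) component, and treat the cross-attempt variance as the incoherent component.

```python
import numpy as np

def decompose_errors(attempts: np.ndarray) -> dict:
    """attempts: (n_tasks, n_samples) array of per-attempt error scores in [0, 1]."""
    per_task_mean = attempts.mean(axis=1)   # systematic, predictable component
    per_task_var = attempts.var(axis=1)     # incoherent, run-to-run component
    bias_sq = float((per_task_mean ** 2).mean())
    variance = float(per_task_var.mean())
    total = bias_sq + variance              # per-task mean squared error, averaged
    return {
        "total_error": total,
        "bias_share": bias_sq / total,      # how patterned the failures are
        "variance_share": variance / total, # the "incoherence" share
    }

# Toy numbers (assumptions, not data): the small model is worse overall but
# consistent; the large model has lower error but is noisy on hard tasks.
rng = np.random.default_rng(0)
small = rng.normal(loc=0.6, scale=0.1, size=(50, 20)).clip(0, 1)
large = rng.normal(loc=0.2, scale=0.3, size=(50, 20)).clip(0, 1)
print("small:", decompose_errors(small))
print("large:", decompose_errors(large))
```

Under the claim, `variance_share` should grow with scale on hard tasks even as `total_error` falls, which is exactly the regime where an auditor's model of "typical failure" loses traction.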


Relevant Notes:

Topics: