---
type: claim
domain: ai-alignment
description: Larger models show more random, unpredictable failures on hard problems than smaller models, inverting the naive expectation that intelligence improvements aid oversight
confidence: experimental
source: Anthropic Research, ICLR 2026, empirical testing across Claude Sonnet 4, o3-mini, o4-mini
created: 2026-03-30
attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "anthropic-research"
context: "Anthropic Research, ICLR 2026, empirical testing across Claude Sonnet 4, o3-mini, o4-mini"
---
# More capable AI models exhibit increasing error incoherence on difficult tasks, meaning capability gains in frontier regimes worsen rather than improve alignment auditability
The Hot Mess paper's most counterintuitive finding is that capability scaling has opposite effects depending on task difficulty. On easy tasks, larger models show reduced incoherence (more predictable behavior). But on hard tasks, precisely where oversight matters most, larger and more capable models exhibit *increased* incoherence compared to smaller models.
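
As a concrete illustration, here is a minimal sketch of how error incoherence could be estimated, assuming it is operationalized as the variance of per-task correctness across repeated samples; the paper's exact metric may differ, and `model.sample`, `task.prompt`, and `task.answer` are hypothetical stand-ins:

```python
# Minimal sketch, under the assumption that incoherence = variance of
# per-task correctness across repeated samples (not necessarily the
# paper's metric). `model`, `task.prompt`, and `task.answer` are
# hypothetical stand-ins.
from statistics import mean

def incoherence(model, tasks, n_samples=20):
    """Mean per-task outcome variance: 0.0 = deterministic behavior
    (always right or always wrong), 0.25 = maximally incoherent."""
    per_task = []
    for task in tasks:
        correct = [model.sample(task.prompt) == task.answer
                   for _ in range(n_samples)]
        p = mean(correct)             # empirical success rate on this task
        per_task.append(p * (1 - p))  # Bernoulli variance of the outcome
    return mean(per_task)

# The claim predicts that, as models scale, this value falls on easy
# buckets and rises on hard ones:
#   for bucket in ("easy", "hard"):
#       print(bucket, incoherence(model, tasks_by_difficulty[bucket]))
```
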
This directly challenges the assumption underlying much alignment work: that capability improvements will make models more coherent and therefore more auditable. Instead, the data suggests that in the frontier regime (hard tasks, long reasoning traces), capability gains may actively worsen our ability to predict and prevent failures.

The mechanism appears to be that as models become more capable, they explore larger solution spaces on difficult problems, and the variance in their exploration strategies increases faster than their ability to converge on correct answers. This creates a regime where 'smarter' means 'less predictable' rather than 'more reliable.'
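
A toy model (my construction, not the paper's) shows how this regime can arise even without modeling exploration explicitly: if each attempt succeeds with probability `sigmoid(capability - difficulty)`, then rising capability drives easy-task outcome variance toward zero while pushing hard-task success rates from near 0 (coherently wrong) toward 0.5, where behavior is maximally unpredictable:

```python
# Toy model of the easy/hard crossover; numbers are illustrative only.
import math

def success_rate(capability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(capability - difficulty)))

def error_incoherence(capability: float, difficulty: float) -> float:
    p = success_rate(capability, difficulty)
    return p * (1.0 - p)  # Bernoulli variance of the outcome

for c in (1.0, 2.0, 3.0):  # increasing capability
    easy = error_incoherence(c, difficulty=0.0)
    hard = error_incoherence(c, difficulty=4.0)
    print(f"capability={c}: easy={easy:.3f} hard={hard:.3f}")
# capability=1.0: easy=0.197 hard=0.045  <- hard failures still predictable
# capability=3.0: easy=0.045 hard=0.197  <- hard failures near-maximally incoherent
```
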
This has profound implications for scalable oversight strategies that assume capability improvements will eventually make AI behavior more transparent and auditable. If the opposite is true in the relevant regime, then oversight methods must be designed to handle increasing unpredictability as a structural feature, not a temporary limitation.
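
One hedged sketch of what such a design could look like: treat self-disagreement across resamples as a routing signal and escalate unstable cases to human audit, rather than assuming capable models converge. The function names and threshold below are illustrative assumptions, not a method from the paper:

```python
# Hypothetical oversight routing: incoherence is treated as a structural
# feature, so audit effort scales with observed self-disagreement.
# `model.sample` and the threshold are illustrative assumptions.
from collections import Counter

def route(model, prompt, n_samples=8, agreement_threshold=0.75):
    answers = [model.sample(prompt) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / n_samples >= agreement_threshold:
        return ("auto_accept", top_answer)  # stable enough to audit cheaply
    return ("human_review", answers)        # incoherent: do not assume predictability
```
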
---
Relevant Notes:
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]
- [[scalable oversight degrades rapidly as capability gaps grow]]
Topics:
- [[_map]]