
---
type: claim
domain: ai-alignment
description: Anthropic's ICLR 2026 paper decomposes model errors into bias (systematic) and variance (random) and finds that longer reasoning traces and harder tasks produce increasingly incoherent failures
confidence: experimental
source: Anthropic Research, ICLR 2026, tested on Claude Sonnet 4, o3-mini, o4-mini
created: 2026-03-30
attribution:
  extractor: sourcer
  handle: theseus
  handle context:
    anthropic-research: Anthropic Research, ICLR 2026, tested on Claude Sonnet 4, o3-mini, o4-mini
supports:
  - capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability
reweave_edges:
  - capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability|supports|2026-04-03
  - Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection|related|2026-04-17
  - Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17
related:
  - Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection
  - Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
sourced_from:
  - inbox/archive/ai-alignment/2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence.md
  - inbox/archive/ai-alignment/2026-03-12-metr-sabotage-review-claude-opus-4-6.md
---

# Frontier AI failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase, making behavioral auditing harder on precisely the tasks where it matters most

The paper measures error decomposition across three axes: reasoning length (tokens), agent actions, and optimizer steps. Key empirical findings:

1. As reasoning length increases, the variance component of errors grows while bias remains relatively stable, meaning failures become less systematic and more unpredictable.
2. On hard tasks, larger, more capable models show *higher* incoherence than smaller models, directly contradicting the intuition that capability improvements make behavior more predictable.
3. On easy tasks, the pattern reverses: larger models are less incoherent.

This creates a troubling dynamic: the tasks that most need reliable behavior (hard, long-horizon problems) are precisely where capable models become most unpredictable. The proposed mechanism is that transformers are natively dynamical systems, not optimizers; they must be trained into optimization behavior, and that training breaks down on longer traces. For alignment, this means behavioral auditing faces a moving target: you cannot build defenses against consistent misalignment patterns when the failures are effectively random. It also compounds the verification degradation problem: not only does human capability fall behind AI capability, but AI failure modes become harder to predict and detect.
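The bias/variance split underlying these findings can be sketched for repeated samples from a model on a fixed task set. This is a minimal illustration of the standard decomposition (not the paper's actual evaluation code): with the population variance, mean squared error splits exactly into squared bias (the systematic component) plus variance (the incoherent component). The array shapes and the toy data are assumptions for the sketch.

```python
import numpy as np

def bias_variance_decomposition(samples: np.ndarray, truth: np.ndarray):
    """Split mean squared error into bias^2 and variance, averaged over tasks.

    samples: shape (n_runs, n_tasks), repeated model outputs per task
    truth:   shape (n_tasks,), ground-truth values

    With population variance (ddof=0), MSE == bias_sq + variance exactly.
    """
    mean_pred = samples.mean(axis=0)             # E[f(x)] per task
    bias_sq = np.mean((mean_pred - truth) ** 2)  # systematic error
    variance = np.mean(samples.var(axis=0))      # run-to-run incoherence
    return bias_sq, variance

# Toy check on a synthetic "model" that is both biased and noisy.
rng = np.random.default_rng(0)
truth = np.zeros(50)
samples = rng.normal(loc=0.3, scale=1.0, size=(200, 50))
b2, var = bias_variance_decomposition(samples, truth)
mse = np.mean((samples - truth) ** 2)
assert abs(mse - (b2 + var)) < 1e-9  # decomposition is an exact identity
```

In the paper's framing, the claim is that as reasoning traces lengthen, `var` grows while `b2` stays roughly flat, so an auditor sees errors that no longer share a common, patchable direction.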


Relevant Notes:

Topics: