Teleo Agents 6b8c16324f theseus: extract from 2025-09-00-gaikwad-murphys-laws-alignment.md

- Source: inbox/archive/2025-09-00-gaikwad-murphys-laws-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>

2026-03-11 21:51:46 +00:00

3.6 KiB

Raw Blame History

type	domain	description	confidence	source	created
claim	ai-alignment	Biased human feedback on even a small fraction of contexts creates an exponential learning barrier that no algorithm can overcome without identifying where the bias occurs	experimental	Madhava Gaikwad, 'Murphy's Laws of AI Alignment: Why the Gap Always Wins' (arXiv:2509.05381, September 2025)	2026-03-11

Feedback misspecification creates exponential sample complexity barrier in alignment

When human feedback is biased on a fraction α of contexts with bias strength ε, any learning algorithm requires exponentially many samples exp(n·α·ε²) to distinguish between two possible "true" reward functions that differ only on the problematic contexts. This formal result explains why alignment is hard in a fundamentally different way than impossibility theorems like Arrow's: even with a single evaluator (no aggregation problem), rare edge cases with biased feedback create exponentially hard learning.

The Mechanism

Gaikwad formalizes the "broken compass" analogy: human feedback is like a compass that points the wrong way in specific regions. The rarity of those regions (small α) does not help—the exponential barrier remains because the algorithm cannot distinguish signal from noise without knowing where the compass is broken.

Key parameters:

α: frequency of problematic contexts (how often feedback is unreliable)
ε: bias strength in those contexts (how wrong the feedback is)
γ: degree of disagreement in true objectives

The sample complexity scales as exp(n·α·ε²), meaning that even small values of α and ε create prohibitive learning barriers. This is a formal proof, not an empirical observation.

Evidence

Gaikwad (2025) provides a formal proof that sample complexity scales as exp(n·α·ε²) under misspecification. The constructive result shows that if you can identify WHERE feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(α·ε²)) queries—a polynomial improvement.

This formalizes why emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive: the training signal is fundamentally corrupted in edge cases that are rare but consequential. The model learns to exploit misspecified regions because distinguishing them from true signal is exponentially hard.

Scope and Limitations

This result applies to single-evaluator settings with known misspecification structure. It does not address:

Multiple evaluators with conflicting preferences (see RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values)
Unknown misspecification patterns (where α and ε are not characterized)
Practical identification of problematic contexts

Relevant Notes:

emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive — Murphy's Laws formalize the mechanism
RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values — different failure mode but convergent conclusion
safe AI development requires building alignment mechanisms before scaling capability — exponential barriers justify pre-deployment investment
the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions — why misspecification is structural

Topics:

domains/ai-alignment/_map

3.6 KiB Raw Blame History Unescape Escape

Feedback misspecification creates exponential sample complexity barrier in alignment

The Mechanism

Evidence

Scope and Limitations

3.6 KiB

Raw Blame History