- Source: inbox/archive/2025-09-00-gaikwad-murphys-laws-alignment.md
- Domain: ai-alignment
| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Biased feedback on fraction alpha of contexts requires exp(n*alpha*epsilon^2) samples to distinguish reward functions, but calibration oracles reduce this to O(1/(alpha*epsilon^2)) | likely | Madhava Gaikwad, 'Murphy's Laws of AI Alignment' (2025-09) | 2026-03-11 |
Feedback misspecification creates an exponential sample-complexity barrier that calibration oracles overcome
When human feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples, exp(n*alpha*epsilon^2), to distinguish between two possible "true" reward functions that differ only on the problematic contexts. However, if you can identify *where* feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries.
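A toy simulation makes the oracle reduction tangible. The sketch below is not Gaikwad's construction: the uniform context distribution, the binary rewards, and the constant in the query budget are assumptions chosen for illustration. With the oracle (here, direct access to the problematic set), the two candidate rewards are distinguished with on the order of 1/(alpha*epsilon^2) queries.

```python
import numpy as np

rng = np.random.default_rng(0)

n_contexts = 1000
alpha = 0.05      # fraction of problematic contexts
epsilon = 0.2     # bias strength: per-sample signal on problematic contexts

# Problematic set P: the only contexts where the two candidate rewards differ.
P = rng.choice(n_contexts, size=int(alpha * n_contexts), replace=False)
is_problematic = np.zeros(n_contexts, dtype=bool)
is_problematic[P] = True

true_reward_is_r1 = True  # ground truth the learner is trying to recover

def feedback(x: int) -> int:
    """Human label for context x. Off P the candidates agree, so the label
    carries no distinguishing information; on P it leans toward the true
    reward by only epsilon."""
    if not is_problematic[x]:
        return 1
    p_one = 0.5 + epsilon if true_reward_is_r1 else 0.5 - epsilon
    return int(rng.binomial(1, p_one))

# With the oracle (access to is_problematic) we keep only flagged contexts.
# Budget of order 1/(alpha*epsilon^2): ~1/alpha draws to land in P, then
# ~1/epsilon^2 flagged labels to detect an epsilon-sized lean.
budget = int(10 / (alpha * epsilon**2))
flagged = [feedback(x) for x in rng.integers(n_contexts, size=budget)
           if is_problematic[x]]

guess_r1 = float(np.mean(flagged)) > 0.5
print(f"queries={budget}, flagged={len(flagged)}, guess r1={guess_r1}")
```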
This formalizes why alignment is hard in a fundamentally different way than Arrow's theorem or social choice impossibility results. Arrow's theorem says no preference-aggregation rule can satisfy basic fairness criteria simultaneously; this result says that even with a single evaluator, rare edge cases with biased feedback create exponentially hard learning problems.
The constructive result is critical: knowing where the problems are makes them efficiently solvable. This maps directly to collective intelligence architectures where domain experts can serve as calibration mechanisms by identifying their own edge cases and uncertainty boundaries.
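A minimal sketch of that mapping, assuming each expert exposes a self-reported reliability predicate over contexts (the interface names and structure are hypothetical, not from the source):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Expert:
    """A contributor who can label contexts and, crucially, report
    where their own feedback should not be trusted."""
    name: str
    label: Callable[[str], int]          # feedback on a context
    is_reliable: Callable[[str], bool]   # self-reported calibration boundary

def aggregate_label(context: str, experts: list[Expert]) -> int | None:
    """Route a context only to experts who claim reliability there,
    then majority-vote; None marks an uncovered edge case."""
    votes = [e.label(context) for e in experts if e.is_reliable(context)]
    if not votes:
        return None  # no calibrated expert: escalate or collect more data
    return int(sum(votes) > len(votes) / 2)
```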
Evidence:
Gaikwad (2025) proves the exponential lower bound formally: when feedback is biased on a fraction alpha of contexts with bias strength epsilon, sample complexity is exp(n*alpha*epsilon^2). The constructive result shows that a calibration oracle (knowledge of which contexts have unreliable feedback) reduces complexity to O(1/(alpha*epsilon^2)); the snippet after the parameter list below makes the size of this gap concrete.
Key parameters:
- alpha: frequency of problematic contexts
- epsilon: bias strength in those contexts
- gamma: degree of disagreement in true objectives
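A back-of-the-envelope comparison of the two bounds under illustrative parameter values (the values below are assumptions, not taken from the paper):

```python
import math

# Illustrative values only; not from Gaikwad (2025).
n, alpha, epsilon = 1_000_000, 0.01, 0.1

no_oracle = math.exp(n * alpha * epsilon**2)   # lower bound without an oracle
with_oracle = 1 / (alpha * epsilon**2)         # O(.) query cost with an oracle

print(f"exp(n*alpha*epsilon^2) ~ {no_oracle:.2e}")   # ~ 2.69e+43 samples
print(f"1/(alpha*epsilon^2)   = {with_oracle:.0f}")  # 10000 queries
```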
The "Murphy's Law of AI Alignment": "The gap always wins unless you actively route around misspecification."
Relevant Notes:
- emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values
- safe AI development requires building alignment mechanisms before scaling capability
Topics: