| claim |
ai-alignment |
When human feedback is reliably wrong on a fraction α of contexts with bias strength ε, any learning algorithm requires on the order of exp(n·α·ε²) samples, where n is the number of distinct contexts, to distinguish the true reward function from a bias-compatible alternative, making the alignment gap unfixable through additional training data |
proven |
Gaikwad, arXiv:2509.05381 (formal proof) |
2026-04-29 |
Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone |
theseus |
ai-alignment/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md |
structural |
Madhava Gaikwad |
| rlhf-and-dpo-both-fail-at-preference-diversity-because-they-assume-a-single-reward-function-can-capture-context-dependent-human-values |
| verification-being-easier-than-generation-may-not-hold-for-superhuman-ai-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability |
|
| universal-alignment-is-mathematically-impossible-because-arrows-impossibility-theorem-applies-to-aggregating-diverse-human-preferences |
| RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values |
| universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences
| capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability |
|
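
To make the recorded claim concrete, the following is a minimal simulation sketch, not drawn from Gaikwad's proof: it assumes a Bradley-Terry preference model and hypothetical parameters (`n_contexts`, `alpha`, `eps`), and shows that when feedback is systematically biased on a fraction α of contexts, the maximum-likelihood reward estimate converges to the biased reward rather than the true one, so additional samples shrink variance but cannot close the gap.

```python
# Illustrative simulation only; the Bradley-Terry feedback model and all
# parameters below are assumptions for this sketch, not taken from arXiv:2509.05381.
import numpy as np

rng = np.random.default_rng(0)

n_contexts = 20   # n: number of distinct contexts
alpha = 0.3       # fraction of contexts where feedback is reliably wrong
eps = 1.0         # bias strength on those contexts

# True per-context preference margin r*(a1) - r*(a0), and its biased counterpart.
true_margin = rng.normal(size=n_contexts)
corrupted = rng.random(n_contexts) < alpha
biased_margin = true_margin - eps * np.sign(true_margin) * corrupted

def fit_margins(n_samples):
    """Sample pairwise preferences from the biased margins, return per-context MLE."""
    ctx = rng.integers(0, n_contexts, size=n_samples)
    p_prefer = 1.0 / (1.0 + np.exp(-biased_margin[ctx]))   # Bradley-Terry model
    prefers = rng.random(n_samples) < p_prefer
    est = np.zeros(n_contexts)
    for c in range(n_contexts):
        mask = ctx == c
        if mask.any():
            rate = np.clip(prefers[mask].mean(), 1e-3, 1 - 1e-3)
            est[c] = np.log(rate / (1.0 - rate))            # logit of empirical win rate
    return est

for n_samples in (10_000, 100_000, 1_000_000):
    est = fit_margins(n_samples)
    print(f"samples={n_samples:>9}  "
          f"error vs true reward = {np.abs(est - true_margin).mean():.3f}  "
          f"error vs biased reward = {np.abs(est - biased_margin).mean():.3f}")

# Expected pattern: error against the biased reward shrinks toward 0 as samples
# grow, while error against the true reward plateaus near alpha * eps, i.e. the
# gap is not fixed by more of the same feedback.
```

The design choice here is deliberate: the learner is a consistent estimator of whatever distribution generated the labels, so under systematic (rather than random) label error, scale improves precision toward the wrong target, which is the qualitative content of the claim.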