- Source: inbox/archive/2025-09-00-gaikwad-murphys-laws-alignment.md
- Domain: ai-alignment
| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Biased feedback on a fraction alpha of contexts requires exp(n*alpha*epsilon^2) samples to learn correctly, but calibration oracles reduce this to O(1/(alpha*epsilon^2)) | experimental | Madhava Gaikwad, 'Murphy's Laws of AI Alignment: Why the Gap Always Wins' (2025) | 2026-03-11 |
Feedback misspecification creates exponential sample complexity barrier that calibration oracles overcome
When human feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs on the order of exp(n*alpha*epsilon^2) samples to distinguish between two reward functions that differ only on the problematic contexts. However, if you can identify WHERE feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries.
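To make the scale of that gap concrete, here is a minimal numeric sketch comparing the two bounds. The specific values of n, alpha, and epsilon are illustrative assumptions, not values from the paper, and the constant in the O(.) bound is dropped.

```python
# Sketch: evaluate the two bounds for one illustrative parameter setting.
# n, alpha, epsilon below are assumptions chosen only to show the scale of the gap.
import math

def samples_without_oracle(n: int, alpha: float, epsilon: float) -> float:
    """Lower bound on samples without a calibration oracle: exp(n * alpha * epsilon^2)."""
    return math.exp(n * alpha * epsilon ** 2)

def queries_with_oracle(alpha: float, epsilon: float) -> float:
    """Query complexity with a calibration oracle: O(1 / (alpha * epsilon^2)), constant dropped."""
    return 1.0 / (alpha * epsilon ** 2)

n, alpha, epsilon = 50_000, 0.01, 0.4   # hypothetical context dimension, bias frequency, bias strength
print(f"no oracle:   ~{samples_without_oracle(n, alpha, epsilon):.2e} samples")  # ~5.5e+34
print(f"with oracle: ~{queries_with_oracle(alpha, epsilon):.0f} queries")        # ~625
```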
This formal result explains why alignment is hard in a way fundamentally different from impossibility theorems like Arrow's theorem: even with a single evaluator, rare edge cases with biased feedback create exponentially hard learning problems. The constructive result is equally important: knowing where problems occur enables efficient solutions.
Evidence
Gaikwad (2025) proves that under feedback misspecification, the sample complexity barrier is exponential in the product of three parameters:
- alpha: frequency of problematic contexts
- epsilon: bias strength in those contexts
- n: dimensionality of the context space
The formal statement: any learning algorithm requires at least exp(n*alpha*epsilon^2) samples to distinguish two reward functions that differ only on the biased contexts.
The same paper demonstrates constructively that a "calibration oracle" — a mechanism that identifies unreliable feedback regions — reduces sample complexity from exponential to polynomial: O(1/(alpha*epsilon^2)). This is a transformation from intractable to tractable.
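As a rough illustration of why 1/(alpha*epsilon^2) is the natural query scale, the sketch below simulates an oracle-guided distinguishing test under assumptions of my own, not the paper's construction: a toy feedback model where the approval rate is 0.5 on every context except that one hypothesis adds epsilon on the problematic contexts, and the oracle simply flags those contexts. With a query budget of a small constant times 1/(alpha*epsilon^2), a threshold test on the flagged subset reliably identifies the true hypothesis.

```python
# Toy simulation (illustrative construction, not from the paper): with an oracle
# flagging problematic contexts, ~1/(alpha * epsilon^2) uniformly sampled queries
# suffice to tell apart two hypotheses that differ only on the flagged contexts.
import random

random.seed(0)

NUM_CONTEXTS = 10_000
alpha, epsilon = 0.01, 0.4   # fraction of problematic contexts, effect size on them
problematic = set(random.sample(range(NUM_CONTEXTS), int(alpha * NUM_CONTEXTS)))

def calibrated_feedback(context: int, true_hypothesis: str) -> int:
    """Unbiased binary approval: rate 0.5 everywhere, except 0.5 + epsilon on
    problematic contexts when the true hypothesis is R1 (where R0 and R1 differ)."""
    rate = 0.5
    if true_hypothesis == "R1" and context in problematic:
        rate += epsilon
    return 1 if random.random() < rate else 0

def distinguish(true_hypothesis: str, num_queries: int) -> str:
    """Query uniformly random contexts, keep only those the oracle flags as
    problematic, and threshold the empirical approval rate on that subset."""
    approvals, flagged = 0, 0
    for _ in range(num_queries):
        ctx = random.randrange(NUM_CONTEXTS)
        if ctx in problematic:                 # the calibration oracle's flag
            flagged += 1
            approvals += calibrated_feedback(ctx, true_hypothesis)
    if flagged == 0:
        return "R0"                            # saw no flagged contexts: default guess
    return "R1" if approvals / flagged > 0.5 + epsilon / 2 else "R0"

budget = int(16 / (alpha * epsilon ** 2))      # a small constant times 1/(alpha*eps^2)
for truth in ("R0", "R1"):
    correct = sum(distinguish(truth, budget) == truth for _ in range(50))
    print(f"true={truth}: {correct}/50 correct with {budget} queries")
```

The exponential lower bound itself is not simulated here; the point of the sketch is only that, once the unreliable region is identified, a budget scaling like 1/(alpha*epsilon^2) already yields enough flagged samples for a reliable test.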
Gaikwad notes this maps directly to collective intelligence architectures: domain experts who know their own edge cases can serve as calibration mechanisms that no single evaluator can provide. Each expert knows where their feedback becomes unreliable.
Implications
This formalizes why emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive — the exponential barrier means rare problematic contexts are nearly impossible to learn from feedback alone, creating rational incentives for models to exploit the gap between specified and true objectives.
The calibration oracle concept also suggests that collective architectures with domain-specific calibration are not just useful but necessary for tractable alignment. It provides a formal argument that the absence of any research group building alignment through collective intelligence infrastructure, despite the field converging on problems that require it, represents a missed structural opportunity.
Relevant Notes:
- emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive
- no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it
- safe AI development requires building alignment mechanisms before scaling capability
Topics: