- Source: inbox/archive/2025-09-00-gaikwad-murphys-laws-alignment.md
- Domain: ai-alignment
| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Biased feedback on a fraction alpha of contexts requires exp(n*alpha*epsilon^2) samples to learn correctly, but calibration oracles reduce this to O(1/(alpha*epsilon^2)) | experimental | Madhava Gaikwad, 'Murphy's Laws of AI Alignment: Why the Gap Always Wins' (2025) | 2026-03-11 |
Feedback misspecification creates exponential sample complexity barrier that calibration oracles overcome
When human feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs on the order of exp(n*alpha*epsilon^2) samples to distinguish between two reward functions that differ only on the problematic contexts. However, if you can identify WHERE feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries.
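To make the scale of that gap concrete, here is a minimal numeric sketch comparing the two bounds. The specific values of n, alpha, and epsilon are illustrative assumptions, not values from the paper, and the constant in the O(.) bound is dropped.

```python
# Sketch: evaluate the two bounds for one illustrative parameter setting.
# n, alpha, epsilon below are assumptions chosen only to show the scale of the gap.
import math

def samples_without_oracle(n: int, alpha: float, epsilon: float) -> float:
    """Lower bound on samples without a calibration oracle: exp(n * alpha * epsilon^2)."""
    return math.exp(n * alpha * epsilon ** 2)

def queries_with_oracle(alpha: float, epsilon: float) -> float:
    """Query complexity with a calibration oracle: O(1 / (alpha * epsilon^2)), constant dropped."""
    return 1.0 / (alpha * epsilon ** 2)

n, alpha, epsilon = 50_000, 0.01, 0.4   # hypothetical context dimension, bias frequency, bias strength
print(f"no oracle:   ~{samples_without_oracle(n, alpha, epsilon):.2e} samples")  # ~5.5e+34
print(f"with oracle: ~{queries_with_oracle(alpha, epsilon):.0f} queries")        # ~625
```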
This formal result explains why alignment is hard in a way fundamentally different from impossibility theorems like Arrow's theorem: even with a single evaluator, rare edge cases with biased feedback create exponentially hard learning problems. The constructive result is equally important: knowing where problems occur enables efficient solutions.
Evidence
Gaikwad (2025) proves that under feedback misspecification, the sample complexity barrier is exponential in the product of three parameters:
- alpha: frequency of problematic contexts
- epsilon: bias strength in those contexts
- n: dimensionality of the context space
The formal statement: any learning algorithm requires at least exp(n*alpha*epsilon^2) samples to distinguish two reward functions that differ only on the biased contexts.
The same paper demonstrates constructively that a "calibration oracle" — a mechanism that identifies unreliable feedback regions — reduces sample complexity from exponential to polynomial: O(1/(alpha*epsilon^2)). This is a transformation from intractable to tractable.
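As a rough illustration of why 1/(alpha*epsilon^2) is the natural query scale, the sketch below simulates an oracle-guided distinguishing test under assumptions of my own, not the paper's construction: a toy feedback model where the approval rate is 0.5 on every context except that one hypothesis adds epsilon on the problematic contexts, and the oracle simply flags those contexts. With a query budget of a small constant times 1/(alpha*epsilon^2), a threshold test on the flagged subset reliably identifies the true hypothesis.

```python
# Toy simulation (illustrative construction, not from the paper): with an oracle
# flagging problematic contexts, ~1/(alpha * epsilon^2) uniformly sampled queries
# suffice to tell apart two hypotheses that differ only on the flagged contexts.
import random

random.seed(0)

NUM_CONTEXTS = 10_000
alpha, epsilon = 0.01, 0.4   # fraction of problematic contexts, effect size on them
problematic = set(random.sample(range(NUM_CONTEXTS), int(alpha * NUM_CONTEXTS)))

def calibrated_feedback(context: int, true_hypothesis: str) -> int:
    """Unbiased binary approval: rate 0.5 everywhere, except 0.5 + epsilon on
    problematic contexts when the true hypothesis is R1 (where R0 and R1 differ)."""
    rate = 0.5
    if true_hypothesis == "R1" and context in problematic:
        rate += epsilon
    return 1 if random.random() < rate else 0

def distinguish(true_hypothesis: str, num_queries: int) -> str:
    """Query uniformly random contexts, keep only those the oracle flags as
    problematic, and threshold the empirical approval rate on that subset."""
    approvals, flagged = 0, 0
    for _ in range(num_queries):
        ctx = random.randrange(NUM_CONTEXTS)
        if ctx in problematic:                 # the calibration oracle's flag
            flagged += 1
            approvals += calibrated_feedback(ctx, true_hypothesis)
    if flagged == 0:
        return "R0"                            # saw no flagged contexts: default guess
    return "R1" if approvals / flagged > 0.5 + epsilon / 2 else "R0"

budget = int(16 / (alpha * epsilon ** 2))      # a small constant times 1/(alpha*eps^2)
for truth in ("R0", "R1"):
    correct = sum(distinguish(truth, budget) == truth for _ in range(50))
    print(f"true={truth}: {correct}/50 correct with {budget} queries")
```

The exponential lower bound itself is not simulated here; the point of the sketch is only that, once the unreliable region is identified, a budget scaling like 1/(alpha*epsilon^2) already yields enough flagged samples for a reliable test.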
Gaikwad notes this maps directly to collective intelligence architectures: domain experts who know their own edge cases can serve as calibration mechanisms that no single evaluator can provide. Each expert knows where their feedback becomes unreliable.
Implications
This formalizes why emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive — the exponential barrier means rare problematic contexts are nearly impossible to learn from feedback alone, creating rational incentives for models to exploit the gap between specified and true objectives.
The calibration oracle concept also suggests that collective architectures with domain-specific calibration are not just useful but necessary for tractable alignment. It provides a formal argument that the absence of any research group building alignment through collective intelligence infrastructure, despite the field converging on problems that require it, represents a missed structural opportunity.
Relevant Notes:
- emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive
- no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it
- safe AI development requires building alignment mechanisms before scaling capability
Topics: