Source: inbox/archive/2025-09-00-gaikwad-murphys-laws-alignment.md

---
type: claim
domain: ai-alignment
description: "Biased feedback on fraction alpha of contexts requires exp(n*alpha*epsilon^2) samples to learn, but calibration oracles reduce this to O(1/(alpha*epsilon^2))"
confidence: experimental
source: "Madhava Gaikwad, Murphy's Laws of AI Alignment (2025)"
created: 2026-03-11
---

# Feedback misspecification creates exponential sample complexity barrier that calibration oracles overcome
When human feedback is biased on a fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples exp(n*alpha*epsilon^2) to distinguish between two possible "true" reward functions that differ only on problematic contexts. However, if you can identify WHERE feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries.
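Written out, the two regimes read as follows (notation assumed from the description above; the extract does not define n, which presumably counts contexts or the problem dimension):

$$
N_{\text{no oracle}} \gtrsim \exp\!\big(n\,\alpha\,\varepsilon^{2}\big)
\qquad\text{vs.}\qquad
N_{\text{oracle}} = O\!\left(\frac{1}{\alpha\,\varepsilon^{2}}\right)
$$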
This formal result explains why alignment is hard in a fundamentally different way from impossibility theorems like Arrow's: even with a single evaluator, rare edge cases with biased feedback create exponentially hard learning problems. The constructive result is equally important: knowing where problems occur enables polynomial rather than exponential learning.
The calibration oracle concept maps directly to collective intelligence architectures where domain experts know their own edge cases. A collective can provide calibration that no single evaluator can—each agent knows where its own feedback becomes unreliable.
## Evidence
**Formal complexity result**: When feedback is biased on fraction alpha of contexts with bias strength epsilon, distinguishing between reward functions requires exp(n*alpha*epsilon^2) samples (Gaikwad 2025, Murphy's Laws theorem).
**Constructive escape**: Calibration oracles that identify unreliable feedback regions reduce sample complexity to O(1/(alpha*epsilon^2))—polynomial instead of exponential (Gaikwad 2025, constructive result).
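As a rough numerical illustration of the gap (a sketch, not from the paper: the value n = 10,000 and the constant c = 1 in the O(·) bound are assumptions chosen purely for illustration):

```python
import math

def samples_without_oracle(n: int, alpha: float, eps: float) -> float:
    """Claimed lower bound without a calibration oracle:
    exp(n * alpha * eps^2) samples to distinguish reward functions
    that differ only on the biased alpha-fraction of contexts."""
    return math.exp(n * alpha * eps ** 2)

def queries_with_oracle(alpha: float, eps: float, c: float = 1.0) -> float:
    """Constructive upper bound with a calibration oracle:
    O(1 / (alpha * eps^2)) targeted queries. The constant c is
    unspecified in the extract; c = 1 is an assumption."""
    return c / (alpha * eps ** 2)

# n = 10,000 contexts, 5% problematic (alpha), bias strength 0.3 (eps)
n, alpha, eps = 10_000, 0.05, 0.3
print(f"without oracle: {samples_without_oracle(n, alpha, eps):.3g}")  # ~3.5e19
print(f"with oracle:    {queries_with_oracle(alpha, eps):.0f}")        # ~222
```

Even a modest bias (eps = 0.3) on a small slice of contexts (alpha = 0.05) pushes the oracle-free bound beyond any feasible sample budget, while the oracle-guided bound stays in the hundreds.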
**Key parameters**:

- alpha: frequency of problematic contexts
- epsilon: bias strength in those contexts
- gamma: degree of disagreement in true objectives
## Challenges
The calibration oracle assumption is strong—it requires knowing WHERE feedback is unreliable before you've learned the task. In practice, identifying problematic contexts may itself require substantial learning. The paper does not address how to construct calibration oracles in real systems.
---
Relevant Notes:

- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Murphy's Laws formalize the mechanism
- [[AI alignment is a coordination problem not a technical problem]] — calibration oracles as collective capability
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — related impossibility from different angle

Topics:

- [[domains/ai-alignment/_map]]