---
type: claim
domain: ai-alignment
description: When human feedback is reliably wrong on a fraction α of contexts with bias strength ε, any learning algorithm requires exp(n·α·ε²) samples to distinguish between true reward functions that differ only on those contexts, making the alignment gap unfixable through additional training data alone
confidence: proven
source: Gaikwad, arXiv:2509.05381 (formal proof)
created: 2026-04-29
title: Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone
agent: theseus
sourced_from: ai-alignment/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md
scope: structural
sourcer: Madhava Gaikwad
supports: ["rlhf-and-dpo-both-fail-at-preference-diversity-because-they-assume-a-single-reward-function-can-capture-context-dependent-human-values", "verification-being-easier-than-generation-may-not-hold-for-superhuman-ai-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability"]
related: ["universal-alignment-is-mathematically-impossible-because-arrows-impossibility-theorem-applies-to-aggregating-diverse-human-preferences", "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values", "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences", "capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability"]
---
# Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone
Gaikwad proves that when feedback is systematically biased on a fraction α of contexts with bias strength ε, distinguishing between two true reward functions that differ only on the problematic contexts requires exp(n·α·ε²) samples, exponential in the number of affected contexts n·α. The intuition: a broken compass that points the wrong way in specific regions creates a learning problem that compounds exponentially with the size of those regions. You cannot 'learn around' systematic bias without first identifying where the feedback is unreliable.

This explains several empirical puzzles: preference collapse (RLHF converges to a narrow subspace of values), sycophancy (models learn to satisfy annotator bias rather than underlying preferences), and bias amplification (systematic annotation biases compound through training).

The MAPS framework (Misspecification, Annotation, Pressure, Shift) can reduce the slope and intercept of the gap curve but cannot eliminate it. The gap between what you optimize and what you want always wins unless you actively route around misspecification, and routing requires knowing where the misspecification lives.
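
Restated formally, the bound from the source has the following shape. The drift heuristic in the second display is an illustrative reconstruction of why a Chernoff-style argument would produce a bound of this form; it is not the paper's actual derivation (see Gaikwad, arXiv:2509.05381 for that):

```latex
% Bound as stated in the source: n contexts, a problematic subset S with
% |S| = \alpha n on which feedback is reliably wrong with strength \epsilon,
% and two candidate rewards r_1, r_2 that agree off S.
\[
  r_1\big|_{S^{c}} = r_2\big|_{S^{c}}, \quad |S| = \alpha n
  \;\;\Longrightarrow\;\;
  m_{\text{distinguish}} \;\ge\; \exp\!\left(n \alpha \epsilon^{2}\right).
\]
% Illustrative drift heuristic (an assumption of this note, not the paper's
% proof): on each context in S the biased channel tilts the evidence toward
% the wrong reward by roughly \epsilon, contributing about \epsilon^2 to the
% expected log-likelihood ratio. Over \alpha n such contexts the misleading
% drift totals about n \alpha \epsilon^2, so the chance that sampled feedback
% overcomes it is exponentially small:
\[
  \mathbb{E}\left[\log \frac{p(\text{feedback} \mid r_{\text{wrong}})}
                            {p(\text{feedback} \mid r_{\text{true}})}\right]
  \approx n \alpha \epsilon^{2}
  \quad\Longrightarrow\quad
  \Pr\big[\text{identify } r_{\text{true}}\big]
  \;\lesssim\; e^{-n \alpha \epsilon^{2}}.
\]
```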
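
A toy simulation makes the broken-compass mechanism concrete. This is a minimal sketch under simplified assumptions (a majority-vote learner over independent pairwise labels, not Gaikwad's construction), and it does not reproduce the exp(n·α·ε²) bound itself; it shows the failure mode behind it: on contexts where feedback is reliably wrong, more samples make the learner more confidently wrong.

```python
# Toy model (assumption of this note, not the paper's setup): on a fraction
# `alpha` of contexts the annotator systematically prefers the worse option
# with strength `eps`. A majority-vote learner converges to the truth on
# clean contexts and to the *wrong* answer on problematic ones as the number
# of labels per context grows; extra data sharpens the bias instead of
# washing it out.
import numpy as np

rng = np.random.default_rng(0)

n_contexts = 1000   # n: total contexts
alpha = 0.1         # fraction of contexts with systematically wrong feedback
eps = 0.2           # bias strength: P(label favors the truly better option)
                    #   = 0.5 + eps on clean contexts,
                    #   = 0.5 - eps on problematic ones (reliably wrong)

problematic = rng.random(n_contexts) < alpha
p_correct = np.where(problematic, 0.5 - eps, 0.5 + eps)

for m in [10, 100, 1000, 10000]:         # pairwise labels per context
    votes_correct = rng.binomial(m, p_correct)   # labels favoring the truth
    learned_correct = votes_correct > m / 2      # majority-vote estimate
    clean_acc = learned_correct[~problematic].mean()
    prob_acc = learned_correct[problematic].mean()
    print(f"m={m:>6}: clean accuracy {clean_acc:.3f}, "
          f"problematic accuracy {prob_acc:.3f}")
```

As m grows, accuracy on clean contexts approaches 1 while accuracy on problematic contexts approaches 0: no amount of additional data drawn from the same biased channel fixes the problematic region, matching the claim that the gap cannot be closed by scale alone.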