- Source: inbox/queue/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md - Domain: ai-alignment - Claims: 2, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus
| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | intake_tier | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Murphy's Laws of AI Alignment: Why the Gap Always Wins | Madhava Gaikwad | https://arxiv.org/abs/2509.05381 | 2025-09-01 | ai-alignment | | paper | processed | theseus | 2026-04-29 | medium | | research-task | anthropic/claude-sonnet-4.5 |
Content
Gaikwad (arXiv 2509.05381, September 2025) studies RLHF under systematic misspecification — the case where human feedback is reliably wrong on certain types of inputs. Key theoretical result:
The exponential barrier: When feedback is biased on fraction α of contexts with bias strength ε, any learning algorithm requires exp(n·α·ε²) samples to distinguish between two "true" reward functions that differ only on the problematic contexts. This is exponential in the fraction of problematic contexts.
Intuition: A broken compass that points wrong in specific regions creates a learning problem that compounds exponentially with the size of those regions. You cannot "learn around" systematic bias without identifying where the feedback is unreliable first.
Exception (calibration oracle): If you can identify where feedback is unreliable, you can route questions there specifically and overcome the exponential barrier with O(1/(α·ε²)) queries. But a reliable calibration oracle requires knowing in advance where your feedback is wrong — which is the problem you're trying to solve.
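As a rough sense of scale, here is a back-of-envelope comparison that takes the two expressions above at face value; the parameter values for α, ε, and n below are illustrative choices, not figures from the paper.

```python
import math

# Illustrative parameters (not from the paper): fraction of biased contexts
# alpha, bias strength eps, and the n term appearing in exp(n * alpha * eps^2).
alpha, eps, n = 0.05, 0.2, 10_000

# Without a calibration oracle: sample requirement as quoted, exp(n * alpha * eps^2).
barrier_samples = math.exp(n * alpha * eps**2)

# With a calibration oracle that flags the unreliable regions:
# O(1 / (alpha * eps^2)) targeted queries.
oracle_queries = 1 / (alpha * eps**2)

print(f"exponential barrier : ~{barrier_samples:.3g} samples")
print(f"calibration oracle  : ~{oracle_queries:.0f} targeted queries")
```

With these toy numbers the barrier is on the order of 10⁸ samples, while the oracle route needs a few hundred targeted queries, which is the qualitative contrast the exception describes.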
Empirical puzzles explained: the exponential barrier accounts for several observed RLHF failure modes:
- Preference collapse (RLHF converges to a narrow subspace of human values)
- Sycophancy (models learn to satisfy annotator bias, not underlying preferences)
- Bias amplification (systematic biases in annotation compound through training)
The MAPS framework (mitigation, not solution):
- M (Misspecification): reduce proxy-objective gap through richer supervision
- A (Annotation): improve rater calibration and diversity
- P (Pressure): moderate optimization strength to avoid exploiting the gap
- S (Shift): anticipate distributional drift, don't train on a static snapshot
MAPS reduces the slope and intercept of the gap curve but cannot eliminate it. Murphy's Law for alignment: the gap between what you optimize and what you want always wins unless you actively route around misspecification — and routing around it requires knowing where misspecification lives.
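One way to picture the slope-and-intercept claim is a toy linear gap model in which MAPS-style mitigation shrinks both coefficients without driving either to zero. The functional form and coefficients below are made up for illustration; the paper is not quoted as specifying them.

```python
def alignment_gap(pressure: float, intercept: float, slope: float) -> float:
    """Toy model: gap between the optimized proxy and the intended objective."""
    return intercept + slope * pressure

# Hypothetical coefficients: MAPS-style mitigation (richer supervision, rater
# calibration, moderated optimization pressure, drift anticipation) lowers both
# terms, but neither reaches zero while feedback stays systematically misspecified.
baseline  = [alignment_gap(p, intercept=0.30, slope=0.050) for p in range(11)]
with_maps = [alignment_gap(p, intercept=0.10, slope=0.015) for p in range(11)]

for p, (b, m) in enumerate(zip(baseline, with_maps)):
    print(f"pressure={p:2d}  gap baseline={b:.2f}  gap with MAPS={m:.2f}")
```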
Related: arXiv 2511.19504 (RLHF Trilemma) — proves simultaneous representativeness, tractability, and robustness are impossible. These two papers complement each other: Trilemma is about architecture-level impossibility at scale; Murphy's Laws is about the sample complexity barrier from systematic bias at any scale.
Agent Notes
Why this matters: Provides a formal mathematical mechanism for why RLHF fails at preference diversity — not just theoretically (Arrow's theorem, already in KB) but through a sample complexity proof specific to systematic feedback bias. The exponential barrier means that even infinite compute cannot fix misspecified feedback if the bias is systematic. This is a stronger result than "preferences are diverse" — it's "systematic bias creates an unfixable gap regardless of scale."
What surprised me: The calibration oracle exception is interesting. If you can identify where feedback is wrong, the exponential barrier collapses to polynomial. This is the theoretical basis for the active inference work in the KB (seeking observations that reduce model uncertainty). The paper inadvertently provides mathematical support for why active inference-style research direction selection is the right approach.
What I expected but didn't find: I expected the paper to propose a technical solution to the gap. Instead, the conclusion is that the gap cannot be closed — only managed. MAPS is a risk management framework, not an alignment solution. "The gap always wins" is not a counsel of despair but a structural claim about what alignment requires.
KB connections:
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values — this paper provides the sample complexity mechanism for why
- universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences — complementary impossibility result from a different theoretical tradition
- B4 ("verification degrades faster than capability grows") — if feedback is systematically biased and you can't identify the bias, you can't verify whether your system is aligned
- agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty — the calibration oracle exception provides mathematical grounding for why uncertainty-directed research is the right strategy
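A minimal sketch of what the uncertainty-directed strategy in the last bullet could look like in practice: score contexts by rater disagreement and spend a limited audit budget where disagreement is highest. All names and numbers below are hypothetical; this is a stand-in for a genuine calibration oracle, not a construction from the paper.

```python
from statistics import pvariance

def disagreement(ratings: list[float]) -> float:
    """Proxy for feedback unreliability: variance across independent raters."""
    return pvariance(ratings)

def route_annotation_budget(contexts: dict[str, list[float]], budget: int) -> list[str]:
    """Spend the audit budget on the contexts where raters disagree most."""
    ranked = sorted(contexts, key=lambda c: disagreement(contexts[c]), reverse=True)
    return ranked[:budget]

# Hypothetical rater scores per context; higher variance suggests suspect feedback.
contexts = {
    "medical-advice":   [0.9, 0.2, 0.8, 0.3],
    "arithmetic":       [0.8, 0.8, 0.9, 0.8],
    "political-topics": [0.9, 0.1, 0.9, 0.2],
}
print(route_annotation_budget(contexts, budget=2))
```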
Extraction hints:
- NEW CLAIM: "Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone — the number of samples needed to distinguish a misspecified reward function grows as exp(n·α·ε²), making the alignment gap unfixable through additional training data." Confidence: proven (theoretical result). Domain: ai-alignment (or foundations).
- The calibration oracle exception is worth noting as a claim that connects the mathematical framework to practical alignment approaches: "RLHF's exponential misspecification barrier collapses to polynomial if systematic feedback biases can be identified in advance — supporting active inference approaches that seek high-uncertainty inputs as the methodologically sound response to misspecification."
Context: Gaikwad's PhilArchive listing suggests this is a position paper with both philosophical and technical content. The paper appears to be foundational theory rather than an empirical study. September 2025 preprint; venue acceptance not yet checked.
Curator Notes
PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values.
WHY ARCHIVED: Provides a formal sample complexity proof of the RLHF alignment gap — distinct from Arrow's theorem (which is about aggregation) and the RLHF trilemma (which is about architecture). Three independent theoretical channels to the same practical conclusion strengthen the claims considerably.
EXTRACTION HINT: The main extraction target is the exponential barrier claim. The calibration oracle exception is the interesting counter — it shows what conditions would allow RLHF to succeed (known misspecification regions), which has implications for active inference-based alignment approaches. Extract both the main claim and the oracle exception as separate claims.