theseus: extract claims from 2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins
- Source: inbox/queue/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
parent 0254572fdd
commit 1e58d4b2d0

3 changed files with 42 additions and 1 deletion
@@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: With a calibration oracle that identifies where feedback is unreliable, the sample complexity drops from exp(n·α·ε²) to O(1/(α·ε²)), supporting active inference approaches that seek high-uncertainty inputs
confidence: proven
source: Gaikwad arXiv 2509.05381, calibration oracle exception
created: 2026-04-29
title: RLHF's exponential misspecification barrier collapses to polynomial if systematic feedback biases can be identified in advance
agent: theseus
sourced_from: ai-alignment/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md
scope: structural
sourcer: Madhava Gaikwad
supports: ["agent-research-direction-selection-is-epistemic-foraging-where-the-optimal-strategy-is-to-seek-observations-that-maximally-reduce-model-uncertainty"]
related: ["rlhf-systematic-misspecification-creates-exponential-sample-complexity-barrier", "agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs"]
---

# RLHF's exponential misspecification barrier collapses to polynomial if systematic feedback biases can be identified in advance

Gaikwad proves that if you can identify where feedback is unreliable (a 'calibration oracle'), you can route queries specifically to those regions and overcome the exponential barrier with O(1/(α·ε²)) queries: polynomial rather than exponential. But a reliable calibration oracle requires knowing in advance where your feedback is wrong, which is precisely the problem you are trying to solve. This exception is theoretically important because it shows what conditions would allow RLHF to succeed: known misspecification regions. The practical implication: active inference approaches that seek observations maximizing uncertainty reduction are the methodologically sound response to misspecification. If you cannot identify bias regions in advance, you must search for them by seeking inputs where your model is most uncertain. This provides mathematical grounding for why uncertainty-directed research and active inference-style alignment approaches are the right strategy: they attempt to construct the very calibration oracle that would collapse the exponential barrier.
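As a toy illustration of that strategy (an assumed setup, not the paper's construction), the sketch below runs an uncertainty-directed query loop that tries to build the calibration oracle empirically; the context space, bias parameters, reliability threshold, and query budget are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (all values are illustrative assumptions): 20 discrete
# contexts; on a hidden block covering fraction alpha of them, the
# annotator answers below chance by bias eps, i.e. feedback there is
# reliably wrong. The learner does not know where the block is.
n_ctx, alpha, eps = 20, 0.2, 0.3
bad = np.zeros(n_ctx, dtype=bool)
bad[: int(alpha * n_ctx)] = True

def feedback_agrees(ctx: int) -> int:
    """1 if the annotator's label matches ground truth on this query."""
    p_correct = 0.5 - eps if bad[ctx] else 0.9
    return int(rng.random() < p_correct)

# Uncertainty-directed querying: always probe the context whose estimated
# reliability is closest to chance, i.e. where we are least sure whether
# the feedback can be trusted.
agree = np.ones(n_ctx)          # Beta(1, 1) prior on agreement rate
total = np.full(n_ctx, 2.0)
for _ in range(2000):
    p_hat = agree / total
    ctx = int(np.argmin(np.abs(p_hat - 0.5)))
    agree[ctx] += feedback_agrees(ctx)
    total[ctx] += 1

# Contexts whose agreement estimate stays near or below chance are the
# empirically constructed calibration oracle.
flagged = np.flatnonzero(agree / total < 0.7)
print("flagged as misspecified:", flagged)
print("actually misspecified  :", np.flatnonzero(bad))
```

Reliable contexts drift toward high agreement and stop being queried, while contexts in the hidden biased block hover near chance and absorb most of the query budget, which is exactly the routing behavior the oracle exception requires.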
@@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: When human feedback is reliably wrong on fraction α of contexts with bias strength ε, any learning algorithm requires exp(n·α·ε²) samples to distinguish true reward functions, making the alignment gap unfixable through additional training data
confidence: proven
source: Gaikwad arXiv 2509.05381, formal proof
created: 2026-04-29
title: Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone
agent: theseus
sourced_from: ai-alignment/2025-09-00-gaikwad-murphys-laws-ai-alignment-gap-always-wins.md
scope: structural
sourcer: Madhava Gaikwad
supports: ["rlhf-and-dpo-both-fail-at-preference-diversity-because-they-assume-a-single-reward-function-can-capture-context-dependent-human-values", "verification-being-easier-than-generation-may-not-hold-for-superhuman-ai-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability"]
related: ["universal-alignment-is-mathematically-impossible-because-arrows-impossibility-theorem-applies-to-aggregating-diverse-human-preferences", "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values", "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences", "capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability"]
---

# Systematic feedback bias in RLHF creates an exponential sample complexity barrier that cannot be overcome by scale alone

Gaikwad proves that when feedback is systematically biased on a fraction α of contexts with bias strength ε, distinguishing between two true reward functions that differ only on the problematic contexts requires exp(n·α·ε²) samples, exponential in the fraction of problematic contexts. The intuition: a broken compass that points the wrong way in specific regions creates a learning problem that compounds exponentially with the size of those regions. You cannot 'learn around' systematic bias without first identifying where the feedback is unreliable. This explains empirical puzzles like preference collapse (RLHF converges to a narrow value subspace), sycophancy (models satisfy annotator bias rather than underlying preferences), and bias amplification (systematic annotation biases compound through training). The MAPS framework (Misspecification, Annotation, Pressure, Shift) can reduce the slope and intercept of the gap curve but cannot eliminate it. The gap between what you optimize and what you want always wins unless you actively route around the misspecification, and routing requires knowing where the misspecification lives.
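As a back-of-envelope check on the asymmetry, the numbers below plug into the two bounds; the parameter values are my illustrative choices, not values from the paper, and the O(·) constant is taken as 1.

```python
import math

# Illustrative parameters: n contexts, fraction alpha systematically
# biased, bias strength eps.
n, alpha, eps = 1000, 0.1, 0.3

barrier = math.exp(n * alpha * eps**2)   # no oracle: exp(n*alpha*eps^2)
oracle = 1 / (alpha * eps**2)            # with oracle: O(1/(alpha*eps^2))
print(f"without oracle: {barrier:,.0f} samples")   # ~8,103
print(f"with oracle   : {oracle:,.0f} samples")    # ~111

# Doubling the biased fraction squares the barrier (exp(2x) = exp(x)^2)
# but only halves the oracle bound.
print(f"alpha doubled : {math.exp(n * 2 * alpha * eps**2):,.0f} "
      f"vs {1 / (2 * alpha * eps**2):,.0f}")
```

The squaring-versus-halving behavior is the sense in which the gap compounds with the size of the misspecified regions.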
@@ -7,10 +7,13 @@ date: 2025-09-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-29
 priority: medium
 tags: [RLHF, alignment, sample-complexity, systematic-bias, exponential-barrier, reward-hacking, MAPS-framework]
 intake_tier: research-task
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content