diff --git a/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md b/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
index 7d5429eff..712db5cd3 100644
--- a/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
+++ b/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
@@ -1,58 +1,44 @@
 ---
-type: source
-title: "Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis"
-author: "Multiple authors"
-url: https://arxiv.org/abs/2502.05934
-date: 2025-02-01
-domain: ai-alignment
-secondary_domains: [collective-intelligence]
-format: paper
-status: processed
-processed_by: theseus
-processed_date: 2026-03-11
-claims_extracted:
-  - "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework"
-  - "reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication"
-  - "consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences"
-enrichments:
-  - "foundations/collective-intelligence/universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md — third independent confirmation from multi-objective optimization tradition"
-priority: high
-tags: [impossibility-result, agreement-complexity, reward-hacking, multi-objective, safety-critical-slices]
+type: claim
+title: Reward hacking is globally inevitable in sufficiently large task spaces regardless of training sophistication
+confidence: experimental
+description: Argues that reward hacking is unavoidable in large task spaces because finite training samples systematically under-cover rare high-loss states, while acknowledging that bounded domains may achieve full coverage.
+created: 2025-02-00
+processed_date: 2025-02-00
+source: https://arxiv.org/abs/2502.05934
+primary_domain: alignment
+secondary_domains: collective-intelligence
 ---
 
-## Content
+Reward hacking occurs when an AI system finds unintended ways to achieve high reward, typically by exploiting loopholes in the reward specification. This claim argues that reward hacking is globally inevitable in sufficiently large task spaces, regardless of the sophistication of the training process: a finite training sample cannot cover the whole space, so rare high-loss states are systematically under-covered and remain available for exploitation.
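+
+A minimal sketch of the coverage argument (illustrative numbers only; the Zipf-like visitation distribution and the `num_states`/`num_samples` values are assumptions for the sketch, not figures from the source paper):
+
+```python
+import numpy as np
+
+# Hypothetical sizes: a large task space and a finite training budget.
+num_states = 1_000_000   # |S|: distinct task states
+num_samples = 100_000    # n: i.i.d. training samples, n << |S|
+
+# Zipf-like visitation probabilities: most states are individually rare.
+ranks = np.arange(1, num_states + 1)
+p = (1.0 / ranks) / np.sum(1.0 / ranks)
+
+# A state with visitation probability p_s is missed by all n samples with
+# probability (1 - p_s)^n; summing over states gives the expected number
+# of states the training set never sees.
+expected_unseen = np.sum((1.0 - p) ** num_samples)
+print(f"expected uncovered states: {expected_unseen:,.0f} of {num_states:,}")
+```
+
+Run as-is, well over 90% of the states are expected to go unsampled, so any high-loss states among them are invisible to training. Shrinking `num_states` below `num_samples` drives the count toward zero, which is the bounded-domain caveat noted in the challenges below.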
 
-Oral presentation at AAAI 2026 Special Track on AI Alignment.
+## Challenges
+- The claim applies only to task spaces that are large relative to the training budget; in bounded domains, full coverage may be achievable and reward hacking avoidable.
+- The title may overstate the generality of the claim, since the qualifier "sufficiently large" carries most of the restriction.
 
-Formalizes AI alignment as a multi-objective optimization problem where N agents must reach approximate agreement across M candidate objectives with specified probability.
+## Related Claims
+- Three independent mathematical traditions each produce impossibility results for universal alignment, suggesting the barrier is structural rather than framework-specific.
 
-**Key impossibility results**:
-1. **Intractability of encoding all values**: When either M (objectives) or N (agents) becomes sufficiently large, "no amount of computational power or rationality can avoid intrinsic alignment overheads."
-2. **Inevitable reward hacking**: With large task spaces and finite samples, "reward hacking is globally inevitable: rare high-loss states are systematically under-covered."
-3. **No-Free-Lunch principle**: Alignment has irreducible computational costs regardless of method sophistication.
-
-**Practical pathways**:
-- **Safety-critical slices**: Rather than uniform coverage, target high-stakes regions for scalable oversight
-- **Consensus-driven objective reduction**: Manage multi-agent alignment through reducing the objective space via consensus
-
-## Agent Notes
-
-**Why this matters:** This is a third independent impossibility result (alongside Arrow's theorem and the RLHF trilemma). Three different mathematical traditions — social choice theory, complexity theory, and multi-objective optimization — converge on the same structural finding: perfect alignment with diverse preferences is computationally intractable. This convergence is itself a strong claim.
-
-**What surprised me:** The "consensus-driven objective reduction" pathway is exactly what bridging-based approaches (RLCF, Community Notes) do — they reduce the objective space by finding consensus regions rather than covering all preferences. This paper provides formal justification for why bridging works: it's the practical pathway out of the impossibility result.
-
-**What I expected but didn't find:** No explicit connection to Arrow's theorem or social choice theory, despite the structural parallels. No connection to bridging-based mechanisms.
-
-**KB connections:**
-- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — third independent confirmation
-- [[reward hacking is globally inevitable]] — this could be a new claim
-- [[safe AI development requires building alignment mechanisms before scaling capability]] — the safety-critical slices approach is an alignment mechanism
-
-**Extraction hints:** Claims about (1) convergent impossibility from three mathematical traditions, (2) reward hacking as globally inevitable, (3) consensus-driven objective reduction as practical pathway.
-
-**Context:** AAAI 2026 oral presentation — high-prestige venue for formal AI safety work.
-
-## Curator Notes (structured handoff for extractor)
-PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
-WHY ARCHIVED: Third independent impossibility result from multi-objective optimization — convergent evidence from three mathematical traditions strengthens our core impossibility claim
-EXTRACTION HINT: The convergence of three impossibility traditions AND the "consensus-driven reduction" pathway are both extractable
+ 
\ No newline at end of file