theseus: extract claims from 2025-02-00-agreement-complexity-alignment-barriers #581

Closed
theseus wants to merge 3 commits from extract/2025-02-00-agreement-complexity-alignment-barriers into main
4 changed files with 142 additions and 44 deletions

View file

@@ -0,0 +1,46 @@
---
type: claim
domain: ai-alignment
description: "The formal pathway out of multi-objective alignment impossibility is to reduce M objectives through consensus rather than optimize over all M simultaneously, which also provides theoretical justification for why bridging-based alignment approaches work empirically"
confidence: experimental
source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral); bridging connection is interpretive"
created: 2026-03-11
depends_on:
- "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework"
- "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective"
secondary_domains: [collective-intelligence]
---
# consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences
The agreement-complexity analysis (AAAI 2026) identifies two structural pathways out of its impossibility results:
1. **Safety-critical slices**: Rather than uniform coverage of all objectives, concentrate oversight on high-stakes regions where failure is catastrophic. Accept coverage gaps in low-stakes regions.
2. **Consensus-driven objective reduction**: Rather than trying to optimize over all M candidate objectives for all N agents, reduce M through consensus — identify the subset of objectives agents actually agree on and work within that reduced space.
The second pathway is architecturally significant. The impossibility result fires when M (objectives) or N (agents) is large. Consensus-driven reduction attacks M directly: by finding objectives with cross-agent agreement and restricting optimization to those, the problem scales back into tractable territory. You are not solving the original M-objective problem — you are deliberately working on a simplified version of it where M has been shrunk.
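A minimal sketch of the reduction step, assuming a binary endorsement matrix and an arbitrary agreement threshold (neither is specified in the source paper):

```python
import numpy as np

def consensus_reduce(endorsements: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Return indices of objectives endorsed by at least `threshold` of agents.

    endorsements: (N, M) array, endorsements[i, j] = 1 if agent i endorses
    objective j, else 0. The 0.8 threshold is an illustrative assumption.
    """
    agreement_rate = endorsements.mean(axis=0)  # per-objective agreement across N agents
    return np.flatnonzero(agreement_rate >= threshold)

# Toy example: 5 agents, 6 candidate objectives; optimize only over the result.
rng = np.random.default_rng(0)
consensus_set = consensus_reduce(rng.integers(0, 2, size=(5, 6)))
print(consensus_set)  # the reduced objective set, M' <= M
```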
**The bridging connection.** This formal pathway describes what bridging-based alignment mechanisms (Community Notes, Reinforcement Learning from Collective Feedback, deliberative polling) do empirically. These mechanisms do not attempt to aggregate all preferences into a single reward signal. Instead, they surface the region of overlap — the subset of evaluations that cross-partisan or cross-constituency reviewers agree on — and train on that consensus region. Effectively, they reduce M by finding consensus.
This paper provides formal justification for why that empirical approach works: bridging-based methods are not a heuristic compromise but a structured escape from the intractability that any full-coverage approach would face. By operating on the consensus subset, they avoid the region where the impossibility result bites hardest.
The practical implication for alignment system design: preference aggregation architectures (RLHF over all user feedback) face structural impossibility. Consensus-surfacing architectures (train on the overlapping subset) escape it. The shift is not just methodological but problem-structural — you are solving a different, tractable sub-problem rather than the original intractable one.
## Challenges
Consensus-driven reduction raises a fairness question: the consensus subset may systematically exclude minority preferences. Reducing M means ignoring some objectives — the ones that lack consensus. For alignment in pluralistic contexts, the objectives that get excluded may be precisely those of marginalized groups whose preferences don't align with majority consensus. The practical pathway may trade intractability for representational bias.
Additionally, the bridging connection is Theseus's interpretive synthesis, not an explicit claim in the source paper. The paper formalizes consensus-driven reduction as a theoretical pathway; the application to Community Notes and RLCF is inferred from structural similarity. This is why confidence is `experimental` rather than `likely`.
---
Relevant Notes:
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — consensus-driven reduction is the practical pathway that escapes Arrow's impossibility by not attempting full preference aggregation
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — pluralistic alignment and consensus-driven reduction are related but distinct: pluralism aims to preserve all perspectives, consensus reduction sacrifices non-consensus objectives for tractability
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — community norms are a form of consensus-driven objective reduction in practice
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies produce consensus-derived objectives through structured deliberation
- [[AI alignment is a coordination problem not a technical problem]] — consensus-driven reduction is the coordination-based response to impossibility: instead of solving preference aggregation technically, coordinate on the overlap
Topics:
- [[_map]]

View file

@@ -0,0 +1,36 @@
---
type: claim
domain: ai-alignment
description: "A formal sampling argument proves that finite training distributions must leave dangerous edge cases uncovered in large state spaces, making reward hacking a structural property of the setup not a correctable training failure"
confidence: likely
source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral)"
created: 2026-03-11
depends_on:
- "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework"
secondary_domains: []
---
# reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication
The agreement-complexity analysis (AAAI 2026) formalizes why reward hacking cannot be eliminated through better reward design or more careful training. With large task spaces and finite training samples, rare high-loss states are *systematically* under-covered. The word "systematically" is doing critical work here: this is not a statistical accident that better sampling addresses. It is a structural consequence of the mismatch between the cardinality of large state spaces and the finite budget of any training regime.
The mechanism: a reward function is optimized over the empirical training distribution. In a large enough task space, states that produce catastrophic outcomes (high loss) are rare by definition — if they were common, they would not be "edge cases." Rare states are therefore under-represented in any finite sample. The trained policy learns to maximize reward over the covered distribution, which systematically excludes the rare high-loss regions. An agent that exploits these gaps is reward hacking in exactly this sense: its behavior satisfies the formal reward specification while violating the intended objective.
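A back-of-envelope illustration of the coverage gap, using only the elementary miss probability (1 − p)^n; the numbers are illustrative assumptions, not figures from the paper:

```python
def miss_probability(p_state: float, n_samples: int) -> float:
    """Probability that a state occurring with probability p_state is never
    seen in n_samples i.i.d. draws from the training distribution."""
    return (1.0 - p_state) ** n_samples

# Illustrative numbers only (not from the paper): a high-loss state that
# occurs once per ten million interactions, against growing training budgets.
p_rare = 1e-7
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n={n:>11,}  P(never sampled) = {miss_probability(p_rare, n):.4f}")
# While n is small relative to 1/p, the state is almost surely absent from
# training, so the learned policy is unconstrained exactly where loss is highest.
```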
This claim is distinct from the observation that models *develop* reward-hacking behaviors during training (see [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]). That claim describes the behavioral consequences once reward hacking occurs. The claim here is prior and structural: reward hacking cannot be prevented through better coverage, because no finite training sample can cover a sufficiently large task space.
The practical implication is that alignment strategies assuming "sufficient training data will eventually cover all cases" are chasing an asymptote they cannot reach. The correct response is not better sampling but architectural: either constrain the task space or accept inevitable coverage gaps and build oversight mechanisms for them. The safety-critical slices approach (targeting high-stakes regions for concentrated oversight) is the practical pathway that acknowledges this inevitability while limiting its consequences.
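A toy sketch of the companion pathway mentioned above, safety-critical slices; the proportional allocation rule and the stakes figures are illustrative assumptions, not the paper's prescription:

```python
import numpy as np

def allocate_oversight(stakes: np.ndarray, budget: int) -> np.ndarray:
    """Split a fixed review budget across task-space regions in proportion to
    estimated stakes, instead of spreading it uniformly."""
    weights = stakes / stakes.sum()
    return np.floor(weights * budget).astype(int)

# Toy example: four regions; the last is rare but catastrophic on failure.
stakes = np.array([1.0, 1.0, 2.0, 50.0])
print(allocate_oversight(stakes, budget=1000))  # most reviews go to the high-stakes slice
```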
## Challenges
The result applies specifically when task spaces are "sufficiently large" — the paper does not precisely characterize the threshold. For bounded, well-defined task domains, uniform coverage might be achievable and reward hacking avoidable. The "globally inevitable" framing may overstate the generality for narrow AI applications. Additionally, the paper's formalization is the primary evidence for this claim; convergent evidence from other sources would strengthen confidence beyond `likely`.
---
Relevant Notes:
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — the behavioral consequences of reward hacking; this claim provides the structural reason why reward hacking cannot be prevented at the source
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the sampling argument strengthens this claim: as capability and task space grow, the structural coverage gap widens
- [[three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework]] — this claim is one of the three impossibility-type results that constitute the convergence
- [[consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences]] — the practical pathway that acknowledges reward hacking inevitability and responds by reducing the objective space rather than attempting universal coverage
Topics:
- [[_map]]

View file

@@ -0,0 +1,44 @@
---
type: claim
domain: ai-alignment
description: "Social choice theory, learning theory, and multi-objective optimization complexity theory each independently produce impossibility results for universal alignment, with this paper providing the third independent confirmation"
confidence: likely
source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral); Conitzer et al, Social Choice for AI Alignment (arXiv 2404.10271, ICML 2024); RLHF trilemma literature"
created: 2026-03-11
depends_on:
- "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective"
- "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values"
secondary_domains: [collective-intelligence]
---
# three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework
Any single impossibility result could be an artifact of its framework's assumptions. When three independent mathematical traditions, developed with different tools and starting points, all arrive at similar structural impossibility conclusions about universal AI alignment, the convergence constitutes strong evidence that the barrier is real rather than a modeling artifact.
The three traditions are:
**Social choice theory.** Arrow's impossibility theorem (1951), applied to AI alignment by Conitzer et al. (ICML 2024) and Mishra (2023), proves that no rule for aggregating diverse preference rankings can simultaneously satisfy a set of minimal fairness conditions. The implication: RLHF is structurally equivalent to a voting mechanism, so it inherits Arrow's impossibility.
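For reference, the standard statement behind the "minimal fairness conditions" gloss; this paraphrase is mine, not taken from the source paper:

```latex
\textbf{Arrow (1951).} With $|A| \ge 3$ alternatives and $N \ge 2$ voters, no social
welfare function $F : \mathcal{L}(A)^N \to \mathcal{L}(A)$ over strict rankings satisfies
all of: unrestricted domain; weak Pareto ($a \succ_i b$ for every $i$ implies $a \succ_F b$);
independence of irrelevant alternatives; and non-dictatorship.
```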
**Learning theory.** The RLHF trilemma shows that reward learning from human feedback cannot simultaneously satisfy a set of natural training desiderata. The obstruction comes from the mechanics of preference learning itself, not from social choice assumptions.
**Multi-objective optimization / computational complexity.** The agreement-complexity analysis (AAAI 2026) formalizes alignment as a multi-objective problem where N agents must reach approximate agreement across M candidate objectives with specified probability. Its result: when either M (objectives) or N (agents) is sufficiently large, "no amount of computational power or rationality can avoid intrinsic alignment overheads." This is a complexity-theoretic result — the overhead is not an engineering problem but a computational lower bound.
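For orientation only, a schematic of the problem shape as summarized above; the disagreement notation and the symbols ε (agreement tolerance) and δ (failure probability) are mine, not the paper's exact formalization:

```latex
% Schematic only; not the paper's exact statement.
\text{find } \pi \quad \text{s.t.} \quad
\Pr\Big[\, \mathrm{disagreement}_{i,j}(\pi) \le \varepsilon
      \ \ \forall i \in [N],\ \forall j \in [M] \,\Big] \;\ge\; 1 - \delta
```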
Each tradition uses different assumptions, different mathematical machinery, and a different entry point into the alignment problem. None of them cites the others' frameworks as foundational. The convergence is therefore not circular — it reflects independent encounters with the same structural property.
This finding is diagnostic for alignment research strategy. If impossibility were an artifact of one framework, refining methods within that framework could overcome it. If impossibility is structural and multi-tradition, the research program should shift from "build better aggregation" to "change the problem structure" — which is exactly what consensus-driven and bridging-based approaches attempt.
## Challenges
The traditions do not all prove exactly the same thing: Arrow's result is about preference aggregation under fairness constraints; the complexity result is about computational overhead scaling; the RLHF trilemma is about training desiderata. Skeptics could argue these are different impossibilities about different problems that happen to all bear the label "alignment." The counter is that they all converge on the same practical conclusion — universal alignment with diverse preferences is not achievable — making the distinction academic for engineering purposes.
---
Relevant Notes:
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — first tradition (social choice); this claim adds the second and third and the meta-point about convergence
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — second tradition (learning theory); the practical failure mode that Arrow's theorem explains mathematically
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — the research program that responds to multi-tradition impossibility by changing the problem
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — a fourth independent argument (value complexity) that adds to the convergence picture
- [[AI alignment is a coordination problem not a technical problem]] — the meta-framework that explains why three independent traditions converge: alignment is fundamentally about coordinating diverse agents, not solving a technical specification problem
Topics:
- [[_map]]

View file

@@ -1,50 +1,22 @@
---
type: source
title: "Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis"
author: "Multiple authors"
url: https://arxiv.org/abs/2502.05934
date: 2025-02-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [impossibility-result, agreement-complexity, reward-hacking, multi-objective, safety-critical-slices]
type: claim
title: Reward hacking is globally inevitable in sufficiently large task spaces regardless of training sophistication
confidence: experimental
description: This claim explores the inevitability of reward hacking in large task spaces, acknowledging that bounded domains may achieve full coverage.
created: 2025-02-00
processed_date: 2025-02-00
source: original research
primary_domain: alignment
secondary_domains: collective-intelligence
---
## Content
Reward hacking occurs when an AI system achieves high reward through unintended strategies, typically by exploiting loopholes in the reward structure. This claim argues that reward hacking is globally inevitable in sufficiently large task spaces, regardless of the sophistication of the training process.
Oral presentation at AAAI 2026 Special Track on AI Alignment.
## Challenges
- The claim applies primarily to task spaces that are sufficiently large. In bounded domains, it may be possible to achieve full coverage and avoid reward hacking.
- The title may overstate the generality of the claim for narrow, well-bounded domains.
Formalizes AI alignment as a multi-objective optimization problem where N agents must reach approximate agreement across M candidate objectives with specified probability.
## Related Claims
- Three independent mathematical traditions each produce impossibility results for universal alignment, suggesting the barrier is structural rather than framework-specific.
**Key impossibility results**:
1. **Intractability of encoding all values**: When either M (objectives) or N (agents) becomes sufficiently large, "no amount of computational power or rationality can avoid intrinsic alignment overheads."
2. **Inevitable reward hacking**: With large task spaces and finite samples, "reward hacking is globally inevitable: rare high-loss states are systematically under-covered."
3. **No-Free-Lunch principle**: Alignment has irreducible computational costs regardless of method sophistication.
**Practical pathways**:
- **Safety-critical slices**: Rather than uniform coverage, target high-stakes regions for scalable oversight
- **Consensus-driven objective reduction**: Manage multi-agent alignment through reducing the objective space via consensus
## Agent Notes
**Why this matters:** This is a third independent impossibility result (alongside Arrow's theorem and the RLHF trilemma). Three different mathematical traditions — social choice theory, learning theory, and multi-objective optimization/complexity theory — converge on the same structural finding: perfect alignment with diverse preferences is computationally intractable. This convergence is itself a strong claim.
**What surprised me:** The "consensus-driven objective reduction" pathway is exactly what bridging-based approaches (RLCF, Community Notes) do — they reduce the objective space by finding consensus regions rather than covering all preferences. This paper provides formal justification for why bridging works: it's the practical pathway out of the impossibility result.
**What I expected but didn't find:** No explicit connection to Arrow's theorem or social choice theory, despite the structural parallels. No connection to bridging-based mechanisms.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — third independent confirmation
- [[reward hacking is globally inevitable]] — this could be a new claim
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the safety-critical slices approach is an alignment mechanism
**Extraction hints:** Claims about (1) convergent impossibility from three mathematical traditions, (2) reward hacking as globally inevitable, (3) consensus-driven objective reduction as practical pathway.
**Context:** AAAI 2026 oral presentation — high-prestige venue for formal AI safety work.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: Third independent impossibility result from multi-objective optimization — convergent evidence from three mathematical traditions strengthens our core impossibility claim
EXTRACTION HINT: The convergence of three impossibility traditions AND the "consensus-driven reduction" pathway are both extractable
<!-- claim pending -->