diff --git a/domains/ai-alignment/consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences.md b/domains/ai-alignment/consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences.md new file mode 100644 index 000000000..a5b8a9af3 --- /dev/null +++ b/domains/ai-alignment/consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences.md @@ -0,0 +1,43 @@ +--- +type: claim +domain: ai-alignment +description: "The formal pathway out of multi-objective alignment impossibility is to reduce M objectives through consensus rather than optimize over all M simultaneously, which also provides theoretical justification for why bridging-based alignment approaches work empirically" +confidence: experimental +source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral); bridging connection is interpretive" +created: 2026-03-11 +depends_on: + - "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework" + - "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective" +secondary_domains: [collective-intelligence] +--- + +# consensus-driven objective reduction escapes alignment intractability by shrinking the objective space rather than aggregating over all preferences + +The agreement-complexity analysis (AAAI 2026) identifies two structural pathways out of its impossibility results: + +1. **Safety-critical slices**: Rather than uniform coverage of all objectives, concentrate oversight on high-stakes regions where failure is catastrophic. 
Accept coverage gaps in low-stakes regions. + +2. **Consensus-driven objective reduction**: Rather than trying to optimize over all M candidate objectives for all N agents, reduce M through consensus — identify the subset of objectives agents actually agree on and work within that reduced space. + +The second pathway is architecturally significant. The impossibility result fires when M (objectives) or N (agents) is large. Consensus-driven reduction attacks M directly: by finding objectives with cross-agent agreement and restricting optimization to those, the problem scales back into tractable territory. You are not solving the original M-objective problem — you are deliberately working on a simplified version of it where M has been shrunk. + +**The bridging connection.** This formal pathway describes what bridging-based alignment mechanisms (Community Notes, Reinforcement Learning from Collective Feedback, deliberative polling) do empirically. These mechanisms do not attempt to aggregate all preferences into a single reward signal. Instead, they surface the region of overlap — the subset of evaluations that cross-partisan or cross-constituency reviewers agree on — and train on that consensus region. Effectively, they reduce M by finding consensus. + +This paper provides formal justification for why that empirical approach works: bridging-based methods are not a heuristic compromise but a structured escape from the intractability that any full-coverage approach would face. By operating on the consensus subset, they avoid the region where the impossibility result bites hardest. + +The practical implication for alignment system design: preference aggregation architectures (RLHF over all user feedback) face structural impossibility. Consensus-surfacing architectures (train on the overlapping subset) escape it. 
The shift is not just methodological but problem-structural — you are solving a different, tractable sub-problem rather than the intractable original. + +## Challenges + +Consensus-driven reduction raises a fairness question: the consensus subset may systematically exclude minority preferences. Reducing M means ignoring some objectives — the ones that lack consensus. For alignment in pluralistic contexts, the objectives that get excluded may be precisely those of marginalized groups whose preferences don't align with majority consensus. The practical pathway may trade intractability for representational bias. + +--- + +Relevant Notes: +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — consensus-driven reduction is the practical pathway that escapes Arrow's impossibility by not attempting full preference aggregation +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — pluralistic alignment and consensus-driven reduction are related but distinct: pluralism aims to preserve all perspectives, consensus reduction sacrifices non-consensus objectives for tractability +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — community norms are a form of consensus-driven objective reduction in practice +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies produce consensus-derived objectives through structured deliberation + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication.md b/domains/ai-alignment/reward hacking is 
globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication.md new file mode 100644 index 000000000..7ce7bd727 --- /dev/null +++ b/domains/ai-alignment/reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication.md @@ -0,0 +1,37 @@ +--- +type: claim +domain: ai-alignment +description: "A formal sampling argument proves that finite training distributions must leave dangerous edge cases uncovered in large state spaces, making reward hacking a structural property of the setup not a correctable training failure" +confidence: likely +source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral)" +created: 2026-03-11 +depends_on: + - "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework" +challenged_by: + - "safety-critical slices provide scalable oversight by concentrating coverage on high-stakes regions rather than attempting uniform coverage" +secondary_domains: [] +--- + +# reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication + +The agreement-complexity analysis (AAAI 2026) formalizes why reward hacking cannot be eliminated through better reward design or more careful training. With large task spaces and finite training samples, rare high-loss states are *systematically* under-covered. The word "systematically" is doing critical work here: this is not a statistical accident that better sampling addresses. It is a structural consequence of the mismatch between the cardinality of large state spaces and the finite budget of any training regime. 
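
The size of the gap is easy to sketch numerically. A minimal back-of-the-envelope illustration (the numbers are assumptions chosen for exposition, not drawn from the paper): a state visited with probability p appears nowhere in n i.i.d. training samples with probability (1 - p)^n, roughly e^(-pn).

```python
# Probability that a rare state never appears in a finite training sample.
# Illustrative numbers only; not taken from the paper.

def miss_probability(p: float, n: int) -> float:
    """P(a state with occupancy probability p is absent from n i.i.d. samples)."""
    return (1.0 - p) ** n

# A catastrophic state visited roughly once per million steps:
p = 1e-6
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"n={n:>10,}  P(never sampled) = {miss_probability(p, n):.3f}")
```

With a million samples the state is still missed about 37% of the time (since (1 - p)^n is close to e^(-1) here), and a large task space contains very many such states, so some gaps survive any realistic sample budget.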
+ +The mechanism: a reward function is optimized over the empirical training distribution. In a large enough task space, states that produce catastrophic outcomes (high loss) are rare by definition — if they were common, they would not be "edge cases." Rare states are therefore under-represented in any finite sample. The trained policy learns to maximize reward over the covered distribution, which systematically excludes the rare high-loss regions. An agent that exploits these gaps is reward-hacking in exactly this sense: its behavior satisfies the formal reward specification while violating the intended objective. + +This claim is distinct from the observation that models *develop* reward-hacking behaviors during training (see [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]). That claim describes the behavioral consequences once reward hacking occurs. This claim is structurally prior: reward hacking cannot be prevented through better coverage because the coverage gap cannot be closed by any finite sample budget in a sufficiently large task space. + +The practical implication is that alignment strategies assuming "sufficient training data will eventually cover all cases" are chasing an asymptote they cannot reach. The correct response is not better sampling but architectural: either constrain the task space or accept inevitable coverage gaps and build oversight mechanisms for them. The safety-critical slices approach (targeting high-stakes regions for concentrated oversight) is the practical pathway that acknowledges this inevitability while limiting its consequences. + +## Challenges + +The result applies specifically when task spaces are "sufficiently large" — the paper does not precisely characterize the threshold. For bounded, well-defined task domains, uniform coverage might be achievable and reward hacking avoidable. 
The "globally inevitable" framing may overstate the generality for narrow AI applications. + +--- + +Relevant Notes: +- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — the behavioral consequences of reward hacking; this claim provides the structural reason why reward hacking cannot be prevented at the source +- [[safe AI development requires building alignment mechanisms before scaling capability]] — the sampling argument strengthens this claim: as capability and task space grow, the structural coverage gap widens +- [[three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework]] — this claim is one of the three impossibility-type results that constitute the convergence + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework.md b/domains/ai-alignment/three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework.md new file mode 100644 index 000000000..f6e5c994c --- /dev/null +++ b/domains/ai-alignment/three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework.md @@ -0,0 +1,43 @@ +--- +type: claim +domain: ai-alignment +description: "Social choice theory, learning theory, and multi-objective optimization complexity theory each independently produce impossibility results for universal alignment, with this paper providing the third independent confirmation" +confidence: likely +source: "Theseus via: agreement-complexity analysis (arXiv 2502.05934, AAAI 2026 oral); Conitzer 
et al, Social Choice for AI Alignment (arXiv 2404.10271, ICML 2024); RLHF trilemma literature" +created: 2026-03-11 +depends_on: + - "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective" + - "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values" +secondary_domains: [collective-intelligence] +--- + +# three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework + +Any single impossibility result could be an artifact of its framework's assumptions. When three independent mathematical traditions, developed with different tools and starting points, all arrive at similar structural impossibility conclusions about universal AI alignment, the convergence constitutes strong evidence that the barrier is real rather than a modeling artifact. + +The three traditions are: + +**Social choice theory.** Arrow's impossibility theorem (1951), applied to AI alignment by Conitzer et al (ICML 2024) and Mishra (2023), proves that no voting rule can simultaneously satisfy minimal fairness conditions when aggregating diverse preferences. The implication: RLHF is structurally equivalent to a voting mechanism, and its impossibility is Arrow's impossibility. + +**Learning theory.** The RLHF trilemma shows that RLHF cannot simultaneously satisfy several natural training desiderata. This comes from the mechanics of preference learning, not from social choice assumptions. + +**Multi-objective optimization / computational complexity.** The agreement-complexity analysis (AAAI 2026) formalizes alignment as a multi-objective problem where N agents must reach approximate agreement across M candidate objectives with specified probability. 
Its result: when either M (objectives) or N (agents) is sufficiently large, "no amount of computational power or rationality can avoid intrinsic alignment overheads." This is a complexity-theoretic result — the overhead is not an engineering problem but a computational lower bound. + +Each tradition uses different assumptions, different mathematical machinery, and a different entry point into the alignment problem. None of them cites the others' frameworks as foundational. The convergence is therefore not circular — it reflects independent encounters with the same structural property. + +This finding is diagnostic for alignment research strategy. If impossibility were an artifact of one framework, refining methods within that framework could overcome it. If impossibility is structural and multi-tradition, the research program should shift from "build better aggregation" to "change the problem structure" — which is exactly what consensus-driven and bridging-based approaches attempt. + +## Challenges + +The traditions do not all prove exactly the same thing: Arrow's result is about preference aggregation under fairness constraints; the complexity result is about computational overhead scaling; the RLHF trilemma is about training desiderata. Skeptics could argue these are different impossibilities about different problems that happen to all bear the label "alignment." The counter is that they all converge on the same practical conclusion — universal alignment with diverse preferences is not achievable — making the distinction academic for engineering purposes. 
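
For concreteness, the social-choice obstruction can be seen in miniature via the classic Condorcet cycle, a standard textbook example (not taken from the cited papers): three agents with individually transitive rankings produce an intransitive majority preference, which is the pattern Arrow's theorem generalizes.

```python
# Condorcet cycle: pairwise-majority aggregation of three transitive
# individual rankings yields an intransitive collective preference.
# Standard textbook example illustrating the social-choice obstruction.

rankings = [          # each agent's ranking over objectives, best first
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of agents ranks x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

# A beats B, B beats C, and C beats A: a cycle, so no coherent
# single "collective objective" can be read off these preferences.
print(majority_prefers("A", "B"), majority_prefers("B", "C"), majority_prefers("C", "A"))
```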
+ +--- + +Relevant Notes: +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — first tradition (social choice); this claim adds the second and third and the meta-point about convergence +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — second tradition (learning theory); the practical failure mode that Arrow's theorem explains mathematically +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — the research program that responds to multi-tradition impossibility by changing the problem +- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — a fourth independent argument (value complexity) that adds to the convergence picture + +Topics: +- [[_map]] diff --git a/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md b/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md index 0864f88bc..7d5429eff 100644 --- a/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md +++ b/inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md @@ -7,7 +7,15 @@ date: 2025-02-01 domain: ai-alignment secondary_domains: [collective-intelligence] format: paper -status: unprocessed +status: processed +processed_by: theseus +processed_date: 2026-03-11 +claims_extracted: + - "three independent mathematical traditions converge on structural alignment impossibility confirming it is a robust theoretical finding not an artifact of any single framework" + - "reward hacking is globally inevitable in large task spaces because finite training samples cannot cover rare high-loss states regardless of training sophistication" + - "consensus-driven objective reduction escapes 
alignment intractability by shrinking the objective space rather than aggregating over all preferences" +enrichments: + - "foundations/collective-intelligence/universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective.md — third independent confirmation from multi-objective optimization tradition" priority: high tags: [impossibility-result, agreement-complexity, reward-hacking, multi-objective, safety-critical-slices] ---