- Source: inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md - Domain: ai-alignment - Extracted by: headless extraction cron (worker 0) Pentagon-Agent: Theseus <HEADLESS>
---
type: claim
domain: ai-alignment
description: "Rather than trying to encode all N agents' M objectives — which is computationally intractable — consensus-driven reduction finds the region of objective space where agents agree, making alignment tractable at the cost of scope."
confidence: experimental
source: "Theseus extraction; 'Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis', arXiv 2502.05934, AAAI 2026 oral"
created: 2026-03-11
depends_on:
- "multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power"
challenged_by: []
secondary_domains: [collective-intelligence]
---
# consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space
[[Multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power]]. The escape is not to solve the intractable problem but to change the problem. Consensus-driven objective reduction does this by finding the region of objective space where enough of the agents already agree, and aligning to that region rather than to the full objective space.

The formal argument: if the full M-objective, N-agent alignment problem is intractable when M and N are large but tractable when both are small, then the path to tractability runs through reduction. Consensus-driven reduction keeps only the objectives that satisfy the agreement condition for a specified subset of agents, shrinking the effective M until the problem is computationally feasible. This is not a complete solution, since it explicitly excludes objectives that lack consensus, but it converts an intractable problem into a feasible one.
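
A minimal sketch of the reduction step, assuming each agent's stance toward each candidate objective can be encoded as a boolean approval, and that "agreement" means the approving fraction of the chosen agent subset reaches a threshold. The names `reduce_objectives`, `approvals`, `subset`, and `threshold` are hypothetical, the data is a toy, and the threshold rule is a stand-in: it is not the agreement condition defined in the paper.

```python
"""Hypothetical sketch of consensus-driven objective reduction (toy threshold rule)."""
from typing import Sequence


def reduce_objectives(
    approvals: Sequence[Sequence[bool]],
    subset: Sequence[int],
    threshold: float,
) -> list[int]:
    """Return indices of objectives endorsed by at least `threshold` of `subset`.

    approvals[i][j] is True if agent i endorses objective j (assumed encoding).
    """
    if not subset:
        raise ValueError("subset must contain at least one agent")
    num_objectives = len(approvals[0])
    kept = []
    for j in range(num_objectives):
        # Count how many agents in the subset endorse objective j.
        support = sum(approvals[i][j] for i in subset)
        if support / len(subset) >= threshold:
            kept.append(j)
    return kept


# Four agents (rows) and five candidate objectives (columns); toy data only.
approvals = [
    [True, True, False, True, False],
    [True, False, True, True, False],
    [True, True, False, True, True],
    [True, False, False, False, True],
]
kept = reduce_objectives(approvals, subset=[0, 1, 2, 3], threshold=0.75)
print(f"M reduced from {len(approvals[0])} to {len(kept)}: {kept}")  # → M reduced from 5 to 2: [0, 3]
```

The threshold makes the scope tradeoff explicit: lowering it to 0.5 on the same data keeps objectives `[0, 1, 3, 4]`, so a weaker agreement requirement covers more of the objective space but leaves a larger effective M.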

This reduction offers a formal rationale for why bridging-based approaches work in practice. Mechanisms like Community Notes (the bridging-based consensus system on X, formerly Twitter) and RLCF (Reinforcement Learning from Contrasting Feedback) are empirical implementations of objective reduction: they search for the region of preference space where people with diverse starting positions agree, and use that region as the alignment target. The paper's theoretical framework explains *why* these approaches are directionally correct: they navigate around the intractability result, not through it.
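
To make "people with diverse starting positions agree" concrete, here is a toy bridging filter. It assumes raters are already partitioned into groups and scores each candidate by its lowest per-group approval rate, so a candidate passes only if every group supports it. The function names, groups, and votes are hypothetical, and the min-over-groups rule is a deliberately simplified stand-in, not the algorithm Community Notes or RLCF actually uses.

```python
"""Hypothetical sketch of a bridging-style consensus filter (min over groups)."""
from typing import Mapping, Sequence

Votes = Mapping[str, Sequence[bool]]  # group name -> that group's approve/disapprove votes


def overall_approval(votes_by_group: Votes) -> float:
    """Pooled approval rate, ignoring group membership (a plain popularity vote)."""
    all_votes = [v for votes in votes_by_group.values() for v in votes]
    return sum(all_votes) / len(all_votes) if all_votes else 0.0


def bridging_score(votes_by_group: Votes) -> float:
    """Lowest per-group approval rate: high only when every group approves."""
    rates = [sum(votes) / len(votes) for votes in votes_by_group.values() if votes]
    return min(rates) if rates else 0.0


def select_bridging(candidates: Mapping[str, Votes], threshold: float) -> list[str]:
    """Keep candidates whose worst-group approval meets the threshold."""
    return [name for name, votes in candidates.items() if bridging_score(votes) >= threshold]


candidates = {
    # Higher pooled approval (5/8), but group_b mostly disapproves.
    "partisan": {"group_a": [True, True, True, True], "group_b": [True, False, False, False]},
    # Lower pooled approval (4/8), but half of each group approves.
    "bridging": {"group_a": [True, True, False, False], "group_b": [True, False, True, False]},
}
print(select_bridging(candidates, threshold=0.5))  # → ['bridging']
```

A pooled vote would rank `"partisan"` higher; the bridging rule discards it because one group's approval falls to 0.25, which is the sense in which bridging searches for cross-group agreement rather than raw popularity.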

The safety-critical slices approach is a complementary pathway for the coverage problem: rather than reducing objectives, prioritize coverage of the highest-stakes region of the task space. Both pathways accept the impossibility result and work within its constraints rather than ignoring it.
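
The slice-prioritization idea can be sketched as a budgeted selection, assuming each slice of the task space has a known stakes score and a known coverage cost. `TaskSlice`, `prioritize_slices`, the slice names, and all numbers are hypothetical illustrations, not taken from the paper.

```python
"""Hypothetical sketch: spend a fixed coverage budget on the highest-stakes slices first."""
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSlice:
    name: str
    stakes: float  # assumed severity of a failure in this slice
    cost: int      # assumed evaluation/oversight effort needed to cover it


def prioritize_slices(slices: list[TaskSlice], budget: int) -> list[str]:
    """Visit slices in descending stakes order, covering each one that still fits the budget."""
    covered = []
    remaining = budget
    for sl in sorted(slices, key=lambda item: item.stakes, reverse=True):
        if sl.cost <= remaining:
            covered.append(sl.name)
            remaining -= sl.cost
    return covered


slices = [
    TaskSlice("casual chat", stakes=1.0, cost=5),
    TaskSlice("medical dosing", stakes=9.0, cost=4),
    TaskSlice("code execution", stakes=6.0, cost=3),
    TaskSlice("financial transfers", stakes=8.0, cost=4),
]
print(prioritize_slices(slices, budget=10))  # → ['medical dosing', 'financial transfers']
```

Ranking purely by stakes is the simplest reading of "prioritize the highest-stakes region"; ranking by stakes per unit cost, or solving the budget as an exact knapsack problem, are alternatives that can cover more total stakes under some budgets.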

The key limitation of consensus-driven reduction is scope. The objective region with broad consensus is smaller than the full human value landscape. Aligning to the consensus region means leaving out the contested space, which is where the most politically and ethically charged questions sit. The approach is tractable precisely because it sidesteps conflict. Whether that tradeoff is acceptable depends on the deployment context: for high-stakes automated systems, aligning to the consensus region may be sufficient and appropriate, whereas for systems meant to navigate genuine value conflict, the limitation becomes a core design constraint.

---
Relevant Notes:
- [[multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power]] — the impossibility result this pathway escapes by changing the problem structure
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — pluralistic alignment is broader: it accommodates diversity. This note is narrower: it finds the consensus subset. They address different parts of the design space.
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for finding the consensus region empirically
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — empirical evidence that consensus-finding produces different targets than expert specification
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] — the limitation of this approach: consensus reduction works for tractable disagreements but not for irreducibly contested values

Topics:
- [[_map]]