teleo-codex/domains/ai-alignment/consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space.md
Teleo Agents ac5e3d7962 theseus: extract claims from 2025-02-00-agreement-complexity-alignment-barriers.md
- Source: inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 0)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-11 13:28:44 +00:00


---
type: claim
domain: ai-alignment
description: "Rather than trying to encode all N agents' M objectives — which is computationally intractable — consensus-driven reduction finds the region of objective space where agents agree, making alignment tractable at the cost of scope."
confidence: experimental
source: "Theseus extraction; 'Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis', arXiv 2502.05934, AAAI 2026 oral"
created: 2026-03-11
depends_on:
- "multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power"
challenged_by: []
secondary_domains: [collective-intelligence]
---
# consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space
[[Multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power]]. The escape is not to solve the intractable problem — it is to change the problem. Consensus-driven objective reduction does this by finding the region of the objective space where a sufficient subset of agents already agree, and aligning to that region rather than to the full objective space.
The formal argument: if the full M-objective, N-agent alignment problem is intractable when M and N are large but tractable when both are small, then the path to tractability runs through reduction. Consensus-driven reduction keeps only the objectives that satisfy the agreement condition for a specified subset of agents, shrinking the effective M until the problem is computationally feasible. This is not a perfect solution, since it explicitly excludes objectives that lack consensus, but it converts an intractable problem into a feasible one.
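A minimal sketch of the reduction step, under stated assumptions: each agent is modelled as a set of endorsed objectives, and the "agreement condition" is approximated by a simple quorum (the fraction of agents that must endorse an objective). The function name, the quorum rule, and the example objectives are all hypothetical illustrations, not the paper's formal construction.

```python
from typing import Dict, Set


def consensus_objectives(endorsements: Dict[str, Set[str]], quorum: float) -> Set[str]:
    """Return the objectives endorsed by at least `quorum` fraction of agents.

    Assumption: endorsement is binary and the agreement condition is a
    plain quorum threshold. The source paper's condition is more formal.
    """
    if not 0.0 < quorum <= 1.0:
        raise ValueError("quorum must be in (0, 1]")
    n_agents = len(endorsements)
    counts: Dict[str, int] = {}
    for objectives in endorsements.values():
        for objective in objectives:
            counts[objective] = counts.get(objective, 0) + 1
    # Keep only objectives that clear the quorum; this shrinks the effective M.
    return {o for o, c in counts.items() if c >= quorum * n_agents}


# Hypothetical agents and objectives, for illustration only.
endorsements = {
    "agent_a": {"avoid_harm", "be_honest", "maximize_engagement"},
    "agent_b": {"avoid_harm", "be_honest", "protect_privacy"},
    "agent_c": {"avoid_harm", "protect_privacy"},
}
reduced = consensus_objectives(endorsements, quorum=1.0)
```

With a unanimous quorum the four original objectives reduce to one; lowering the quorum widens the retained region at the cost of weaker agreement, which is the scope-versus-tractability tradeoff in miniature.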
This mechanism provides formal justification for why bridging-based approaches work in practice. Mechanisms like Community Notes (Twitter/X's bridging-based consensus system) and RLCF (Reinforcement Learning from Community Feedback) are empirical implementations of objective reduction: they search for the region of preference space where people with diverse starting positions agree, and use that region as the alignment target. The paper's theoretical framework explains *why* these approaches are directionally correct: they are navigating around the intractability result, not through it.
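The bridging idea can be sketched as follows, assuming raters are already partitioned into viewpoint groups and that an item "bridges" when every group approves of it at a high rate. This minimum-across-groups rule is a deliberately simple stand-in chosen for illustration; the production Community Notes scorer is more involved (it is based on matrix factorization), and the names and data below are hypothetical.

```python
from typing import Dict, List


def bridging_score(approvals_by_group: Dict[str, List[bool]]) -> float:
    """Lowest approval rate across groups.

    The score is high only when every group approves, so an item that is
    popular with one group and rejected by another scores low.
    Groups with no votes are ignored (an assumption of this sketch).
    """
    rates = [sum(votes) / len(votes) for votes in approvals_by_group.values() if votes]
    return min(rates) if rates else 0.0


def select_bridging_items(
    ratings: Dict[str, Dict[str, List[bool]]], threshold: float
) -> List[str]:
    """Keep items whose bridging score meets the threshold."""
    return [item for item, groups in ratings.items() if bridging_score(groups) >= threshold]


# Hypothetical ratings: two items, two viewpoint groups.
ratings = {
    "note_1": {"group_x": [True, True, False], "group_y": [True, True, True]},
    "note_2": {"group_x": [True, True, True], "group_y": [False, False, True]},
}
bridging = select_bridging_items(ratings, threshold=0.6)
```

Here `note_2` has the same total number of approvals as `note_1` but is excluded, because its support comes mostly from one group. That is the sense in which bridging selects for cross-group agreement rather than raw popularity.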
The safety-critical slices approach is a complementary pathway for the coverage problem: rather than reducing objectives, prioritize coverage of the highest-stakes region of the task space. Both pathways accept the impossibility result and work within its constraints rather than ignoring it.
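One way to picture the slices pathway, as a sketch only: assume each slice of the task space carries a stakes score and a coverage cost, and that a fixed budget is spent greedily on the highest-stakes slices first. The `Slice` type, the scores, the costs, and the greedy rule are all assumptions made for illustration; the source does not specify a selection procedure.

```python
from typing import List, NamedTuple


class Slice(NamedTuple):
    name: str
    stakes: float  # hypothetical severity score; higher means more safety-critical
    cost: int      # hypothetical cost of covering this slice


def prioritize_slices(slices: List[Slice], budget: int) -> List[str]:
    """Greedily cover the highest-stakes slices that still fit in the budget."""
    chosen: List[str] = []
    remaining = budget
    for s in sorted(slices, key=lambda s: s.stakes, reverse=True):
        if s.cost <= remaining:
            chosen.append(s.name)
            remaining -= s.cost
    return chosen


# Hypothetical task-space slices, for illustration only.
slices = [
    Slice("medical_advice", stakes=0.9, cost=5),
    Slice("casual_chat", stakes=0.1, cost=2),
    Slice("financial_actions", stakes=0.8, cost=4),
    Slice("code_execution", stakes=0.6, cost=3),
]
covered = prioritize_slices(slices, budget=9)
```

The budget is exhausted by the two highest-stakes slices, so the lower-stakes ones go uncovered. Like objective reduction, this gives up completeness in exchange for a problem that fits within its constraints.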
The key limitation of consensus-driven reduction is scope. The objective region with broad consensus is smaller than the full human value landscape. Aligning to the consensus region means leaving out the contested space, which is where the most politically and ethically live questions sit. The approach is tractable precisely because it sidesteps conflict. Whether that tradeoff is acceptable depends on the deployment context: for high-stakes automated systems, aligning to the consensus region may be sufficient and appropriate. For systems meant to navigate genuine value conflict, the limitation becomes a core design constraint.
---
Relevant Notes:
- [[multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power]] — the impossibility result this pathway escapes by changing the problem structure
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — pluralistic alignment is broader: it accommodates diversity. This note is narrower: it finds the consensus subset. They address different parts of the design space.
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — assemblies are one mechanism for finding the consensus region empirically
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — empirical evidence that consensus-finding produces different targets than expert specification
- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] — the limitation of this approach: consensus reduction works for tractable disagreements but not for irreducibly contested values
Topics:
- [[_map]]