teleo-codex/domains/ai-alignment/consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space.md
Teleo Agents ac5e3d7962 theseus: extract claims from 2025-02-00-agreement-complexity-alignment-barriers.md
- Source: inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 0)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-11 13:28:44 +00:00


| type | domain | description | confidence | source | created | depends_on | challenged_by | secondary_domains |
|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | Rather than trying to encode all N agents' M objectives, which is computationally intractable, consensus-driven reduction finds the region of objective space where agents agree, making alignment tractable at the cost of scope. | experimental | Theseus extraction; 'Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis', arXiv 2502.05934, AAAI 2026 oral | 2026-03-11 | multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power | | collective-intelligence |

consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space

Multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power. The escape is not to solve the intractable problem — it is to change the problem. Consensus-driven objective reduction does this by finding the region of the objective space where a sufficient subset of agents already agree, and aligning to that region rather than to the full objective space.

The formal argument: if the full M-objective, N-agent alignment problem is intractable when M and N are large, but tractable when both are small, then the path to tractability runs through reduction. Consensus-driven reduction finds objectives that satisfy the agreement condition for a specified subset of agents, shrinking the effective M until the problem is computationally feasible. This is not a perfect solution — it explicitly excludes objectives that lack consensus — but it converts an impossible problem into a feasible one.
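The reduction step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's construction: it assumes each agent's objectives are given as a plain set of labels, and it substitutes a simple endorsement-fraction test for the paper's agreement condition. The names `consensus_reduce`, `endorsements`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Set


def consensus_reduce(
    endorsements: Dict[str, Set[str]],
    agents: List[str],
    threshold: float = 1.0,
) -> List[str]:
    """Keep only objectives endorsed by at least `threshold` of `agents`.

    Assumption: agreement is modelled as binary endorsement, which is a
    simplification of the agreement condition discussed in the text.
    """
    if not agents:
        raise ValueError("need at least one agent")
    # Candidate objectives: everything any specified agent endorses.
    candidates = sorted(set().union(*(endorsements.get(a, set()) for a in agents)))
    kept = []
    for objective in candidates:
        supporters = sum(objective in endorsements.get(a, set()) for a in agents)
        if supporters / len(agents) >= threshold:
            kept.append(objective)
    return kept


# Hypothetical example: three agents, five objectives in total.
endorsements = {
    "a1": {"honesty", "privacy", "growth"},
    "a2": {"honesty", "privacy", "equality"},
    "a3": {"honesty", "liberty"},
}
agents = ["a1", "a2", "a3"]

unanimous = consensus_reduce(endorsements, agents)               # full consensus
majority = consensus_reduce(endorsements, agents, threshold=0.6)  # looser consensus
```

With these inputs the effective M drops from five objectives to one under unanimity and to two under the looser threshold. The threshold makes the tradeoff explicit: a stricter agreement requirement yields a smaller, more tractable objective set and excludes more of the contested objectives.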

This mechanism provides a formal justification for why bridging-based approaches work in practice. Mechanisms like Community Notes (the bridging-based consensus system on X, formerly Twitter) and RLCF (Reinforcement Learning from Community Feedback) are empirical implementations of objective reduction: they search for the region of preference space where people with diverse starting positions agree, and use that region as the alignment target. The paper's theoretical framework explains why these approaches are directionally correct: they navigate around the intractability result, not through it.
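The bridging idea can be illustrated with a toy scoring rule. The sketch below is an assumption-laden stand-in, not the production scoring of Community Notes or any RLCF system: it assumes raters are already partitioned into groups and that ratings are binary, and it scores an item by its lowest approval rate across groups. The names `bridging_score`, `bridged_items`, and `cutoff` are hypothetical.

```python
from typing import Dict, List


def bridging_score(ratings_by_group: Dict[str, List[int]]) -> float:
    """Return the minimum approval rate across groups.

    The score is high only when every group approves, so an item that one
    group loves and another rejects scores low.
    """
    rates = []
    for ratings in ratings_by_group.values():
        if not ratings:
            return 0.0  # a group with no ratings gives no evidence of agreement
        rates.append(sum(ratings) / len(ratings))
    return min(rates) if rates else 0.0


def bridged_items(items: Dict[str, Dict[str, List[int]]], cutoff: float = 0.7) -> List[str]:
    """Select the items whose bridging score meets the cutoff."""
    return [name for name, groups in items.items() if bridging_score(groups) >= cutoff]


# Hypothetical ratings: 1 = helpful, 0 = not helpful.
items = {
    "note_a": {"group_x": [1, 1, 1, 0], "group_y": [1, 1, 0, 1]},
    "note_b": {"group_x": [1, 1, 1, 1], "group_y": [0, 0, 1, 0]},
}

selected = bridged_items(items)
```

Here `note_b` has the higher overall approval (5 of 8 ratings versus 6 of 8 is close, but it is unanimous in one group) yet is excluded, because its support comes from one side only. Taking the minimum rather than the mean is the design choice that makes the rule select for cross-group agreement.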

The safety-critical slices approach is a complementary pathway for the coverage problem: rather than reducing objectives, prioritize coverage of the highest-stakes region of the task space. Both pathways accept the impossibility result and work within its constraints rather than ignoring it.
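One way to picture slice prioritization is as a budgeted selection problem. The following is a rough sketch under stated assumptions, not a method taken from the paper: it assumes each slice of the task space has a known stakes weight and evaluation cost, and it greedily covers the highest-stakes slices that fit a fixed budget. `Slice`, `prioritize`, and all values are hypothetical.

```python
from typing import List, NamedTuple


class Slice(NamedTuple):
    name: str
    stakes: float  # hypothetical severity weight, higher means more critical
    cost: int      # hypothetical cost of covering this slice


def prioritize(slices: List[Slice], budget: int) -> List[str]:
    """Greedily cover slices in descending order of stakes within the budget."""
    chosen = []
    remaining = budget
    for s in sorted(slices, key=lambda s: s.stakes, reverse=True):
        if s.cost <= remaining:
            chosen.append(s.name)
            remaining -= s.cost
    return chosen


# Hypothetical task-space slices.
slices = [
    Slice("chitchat", 0.1, 5),
    Slice("medical_advice", 0.9, 4),
    Slice("code_execution", 0.7, 3),
    Slice("finance", 0.6, 4),
]

covered = prioritize(slices, budget=8)
```

The objective set is untouched here; what shrinks is the portion of the task space that receives coverage, which is the sense in which this pathway complements objective reduction.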

The key limitation of consensus-driven reduction is scope. The objective region with broad consensus is smaller than the full human value landscape. Aligning to the consensus region means leaving out the contested space, which is where the most politically and ethically live questions sit. The approach is tractable precisely because it sidesteps conflict. Whether that tradeoff is acceptable depends on the deployment context: for high-stakes automated systems, aligning to the consensus region may be sufficient and appropriate. For systems meant to navigate genuine value conflict, the limitation becomes a core design constraint.


Relevant Notes:

Topics: