Teleo Agents ac5e3d7962 theseus: extract claims from 2025-02-00-agreement-complexity-alignment-barriers.md

- Source: inbox/archive/2025-02-00-agreement-complexity-alignment-barriers.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 0)

Pentagon-Agent: Theseus <HEADLESS>

2026-03-11 13:28:44 +00:00


type: claim
domain: ai-alignment
description: Rather than trying to encode all N agents' M objectives, which is computationally intractable, consensus-driven reduction finds the region of objective space where agents agree, making alignment tractable at the cost of scope.
confidence: experimental
source: Theseus extraction; 'Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis', arXiv 2502.05934, AAAI 2026 oral
created: 2026-03-11
depends_on: multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power
challenged_by: (no value shown)
secondary_domains: collective-intelligence

consensus-driven objective reduction is the practical pathway out of multi-agent alignment impossibility because it bounds the tractability problem by narrowing the objective space

Multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power. The escape is not to solve the intractable problem — it is to change the problem. Consensus-driven objective reduction does this by finding the region of the objective space where a sufficient subset of agents already agree, and aligning to that region rather than to the full objective space.

The formal argument: if the full M-objective, N-agent alignment problem is intractable when M and N are large, but tractable when both are small, then the path to tractability runs through reduction. Consensus-driven reduction finds objectives that satisfy the agreement condition for a specified subset of agents, shrinking the effective M until the problem is computationally feasible. This is not a perfect solution — it explicitly excludes objectives that lack consensus — but it converts an impossible problem into a feasible one.
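The reduction step described above can be sketched in a few lines. This is a minimal illustration, not the paper's formal construction: it assumes each agent's stance on each objective is a simple yes/no endorsement, and the function name `consensus_objectives`, the `min_share` threshold, and the toy data are all hypothetical.

```python
from typing import Sequence


def consensus_objectives(
    approvals: Sequence[Sequence[bool]],
    agents: Sequence[int],
    min_share: float = 1.0,
) -> list[int]:
    """Return indices of objectives endorsed by at least `min_share`
    of the chosen agents (hypothetical stand-in for the agreement condition)."""
    if not agents:
        raise ValueError("agents must be non-empty")
    n_objectives = len(approvals[0]) if approvals else 0
    kept = []
    for j in range(n_objectives):
        votes = sum(1 for i in agents if approvals[i][j])
        if votes / len(agents) >= min_share:
            kept.append(j)
    return kept


# Toy data (assumed): 4 agents (rows) x 5 objectives (columns).
APPROVALS = [
    [True, True, False, True, False],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [True, False, False, False, False],
]

ALL_AGENTS = [0, 1, 2, 3]
unanimous = consensus_objectives(APPROVALS, ALL_AGENTS)               # [0]
three_quarters = consensus_objectives(APPROVALS, ALL_AGENTS, 0.75)    # [0, 3]
subset = consensus_objectives(APPROVALS, agents=[0, 2])               # [0, 1, 3]
```

In this toy setup the effective M drops from 5 to 1 under unanimity. Loosening the threshold or narrowing the agent subset recovers more objectives, which is the scope-for-tractability trade the note describes.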

This mechanism provides formal justification for why bridging-based approaches work in practice. Mechanisms like Community Notes (Twitter/X's bridging-based consensus system) and RLCF (Reinforcement Learning from Community Feedback) are empirical implementations of objective reduction: they search for the region of preference space where people with diverse starting positions agree, and use that region as the alignment target. The paper's theoretical framework explains why these approaches are directionally correct: they navigate around the intractability result, not through it.
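To make the bridging idea concrete, here is a deliberately simplified sketch. It is not the Community Notes ranking algorithm or any RLCF implementation. It only assumes that raters fall into known groups and scores an item by its lowest per-group approval rate, so an item scores well only if every group tends to approve it. The names `bridging_score`, `GROUP_OF`, and the ratings are hypothetical.

```python
from collections import defaultdict


def bridging_score(ratings, group_of, required_groups):
    """Lowest approval rate across the required groups.

    ratings: list of (rater_id, approves) pairs for one item.
    group_of: dict mapping rater_id -> group label (assumed known).
    Returns 0.0 if any required group has not rated the item.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for rater, approves in ratings:
        group = group_of[rater]
        total[group] += 1
        yes[group] += int(approves)
    if any(total[g] == 0 for g in required_groups):
        return 0.0
    return min(yes[g] / total[g] for g in required_groups)


# Toy data (assumed): group A has three raters, group B has one.
GROUP_OF = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
GROUPS = ["A", "B"]

# Both items have the same overall approval (3 of 4 raters).
one_sided = [("a1", True), ("a2", True), ("a3", True), ("b1", False)]
cross_group = [("a1", True), ("a2", True), ("a3", False), ("b1", True)]

one_sided_score = bridging_score(one_sided, GROUP_OF, GROUPS)      # 0.0
cross_group_score = bridging_score(cross_group, GROUP_OF, GROUPS)  # about 0.667
```

A plain majority count cannot tell the two items apart, while the bridging score keeps only the one with support in both groups. That is the sense in which such mechanisms search for the agreed region rather than the most popular one.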

The safety-critical slices approach is a complementary pathway for the coverage problem: rather than reducing objectives, prioritize coverage of the highest-stakes region of the task space. Both pathways accept the impossibility result and work within its constraints rather than ignoring it.
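One way to picture the safety-critical slices pathway is as a budgeted prioritization over regions of the task space. The sketch below is an assumption-laden illustration, not a procedure from the paper: the `TaskSlice` fields, the stakes weights, the costs, and the greedy rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSlice:
    name: str
    stakes: float  # hypothetical severity weight, higher means more critical
    cost: int      # hypothetical cost of covering this slice


def prioritize(slices, budget):
    """Greedily cover slices in descending order of stakes until the
    budget is spent. Slices that do not fit are skipped."""
    chosen = []
    remaining = budget
    for s in sorted(slices, key=lambda s: s.stakes, reverse=True):
        if s.cost <= remaining:
            chosen.append(s.name)
            remaining -= s.cost
    return chosen


# Toy slices (assumed values, for illustration only).
SLICES = [
    TaskSlice("medical_advice", stakes=0.9, cost=5),
    TaskSlice("financial_advice", stakes=0.7, cost=4),
    TaskSlice("small_talk", stakes=0.1, cost=1),
    TaskSlice("code_execution", stakes=0.8, cost=3),
]

covered = prioritize(SLICES, budget=8)  # ["medical_advice", "code_execution"]
```

Where consensus reduction shrinks the set of objectives, this pathway keeps the objectives and shrinks the region of the task space that must be covered. Both give up completeness in exchange for a problem of feasible size.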

The key limitation of consensus-driven reduction is scope. The objective region with broad consensus is smaller than the full human value landscape. Aligning to the consensus region means leaving out the contested space, which is where the most politically and ethically live questions sit. The approach is tractable precisely because it sidesteps conflict. Whether that tradeoff is acceptable depends on the deployment context. For high-stakes automated systems, aligning to the consensus region may be sufficient and appropriate. For systems meant to navigate genuine value conflict, the limitation becomes a core design constraint.

Relevant Notes:

multi-agent alignment with sufficiently large objective or agent spaces is computationally intractable regardless of rationality or computational power — the impossibility result this pathway escapes by changing the problem structure
pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — pluralistic alignment is broader: it accommodates diversity. This note is narrower: it finds the consensus subset. They address different parts of the design space.
democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations — assemblies are one mechanism for finding the consensus region empirically
community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules — empirical evidence that consensus-finding produces different targets than expert specification
some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them — the limitation of this approach: consensus reduction works for tractable disagreements but not for irreducibly contested values

Topics:

_map
