From 2385f134bcbbc4275d504889c18e1ba3ca50e397 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 11 Mar 2026 09:09:29 +0000 Subject: [PATCH 1/4] theseus: extract claims from 2024-04-00-conitzer-social-choice-guide-alignment.md - Source: inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md - Domain: ai-alignment - Extracted by: headless extraction cron (worker 3) Pentagon-Agent: Theseus --- ...ination problem not a technical problem.md | 6 ++ ...an converging on a single aligned state.md | 6 ++ ...le-values-rather-than-forcing-consensus.md | 48 ++++++++++++++++ ...independence-of-irrelevant-alternatives.md | 41 ++++++++++++++ ...ocial-welfare-functions-before-training.md | 55 +++++++++++++++++++ ...ocial-choice-without-normative-scrutiny.md | 43 +++++++++++++++ ...ems must map rather than eliminate them.md | 6 ++ ...-conitzer-social-choice-guide-alignment.md | 15 ++++- 8 files changed, 219 insertions(+), 1 deletion(-) create mode 100644 domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md create mode 100644 domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md create mode 100644 domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md create mode 100644 domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md index 093867dee..261a31dac 100644 --- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md +++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md @@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system. + +### Additional Evidence (confirm) +*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Conitzer et al. (2024) demonstrate that alignment is fundamentally a coordination problem by showing that every RLHF system makes social choice decisions: which humans provide input, what feedback format is used, how preferences are aggregated, and how aggregated preferences are deployed. These are coordination questions—who gets a voice, how are conflicts resolved, whose values prevail—not purely technical questions. The paper argues that treating these as engineering decisions rather than normative choices is the core failure of current alignment approaches. Social choice theory provides 70+ years of rigorous work on exactly these coordination problems, suggesting the field should import coordination mechanisms rather than reinventing them. 
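+
+A minimal sketch of what importing one such mechanism could look like, assuming hypothetical evaluator rankings and using Borda Count purely as one illustration among the methods the paper surveys; the point is that the aggregation rule becomes an explicit, inspectable step rather than an artifact of data collection:
+
+```python
+# Illustrative sketch only: Borda Count as an explicit social welfare
+# function applied to evaluator rankings before reward-model training.
+# All evaluator rankings and response names here are hypothetical.
+from collections import defaultdict
+
+def borda_aggregate(rankings: list[list[str]]) -> list[str]:
+    """Combine best-first evaluator rankings into one collective ranking."""
+    scores: dict[str, int] = defaultdict(int)
+    for ranking in rankings:
+        for position, option in enumerate(ranking):
+            # Top position in an n-item ranking earns n-1 points, last earns 0.
+            scores[option] += len(ranking) - 1 - position
+    return sorted(scores, key=scores.get, reverse=True)
+
+evaluator_rankings = [
+    ["response_a", "response_b", "response_c"],
+    ["response_b", "response_c", "response_a"],
+    ["response_a", "response_c", "response_b"],
+]
+print(borda_aggregate(evaluator_rankings))
+# ['response_a', 'response_b', 'response_c']: the rule that produced this
+# ordering is visible and debatable, unlike averaging hidden in preprocessing.
+```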
+ --- Relevant Notes: diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md index b5195bb0a..28128a91a 100644 --- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md +++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md @@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate. + +### Additional Evidence (confirm) +*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Conitzer et al. (2024) explicitly endorse the 'pluralism option' as a structural alternative to single-system alignment. They argue that when values are genuinely incompatible (not just diverse), creating multiple AI systems aligned to different value clusters is superior to forcing consensus through aggregation. This is presented not as a fallback or compromise but as a legitimate design choice that preserves value pluralism and makes conflicts explicit rather than hiding them in aggregation functions. The paper positions this as working within Arrow's impossibility theorem rather than trying to overcome it—accepting that some value conflicts cannot be aggregated away. + --- Relevant Notes: diff --git a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md new file mode 100644 index 000000000..7068e712c --- /dev/null +++ b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md @@ -0,0 +1,48 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "When values are genuinely incompatible, creating multiple aligned AI systems is structurally superior to aggregating into a single system" +confidence: experimental +source: "Conitzer et al. 2024 ICML position paper proposing pluralism as structural alternative to forced consensus" +created: 2024-12-19 +depends_on: ["persistent irreducible disagreement.md"] +challenged_by: [] +--- + +# Pluralistic alignment creates multiple AI systems reflecting incompatible values rather than forcing consensus + +When human values are genuinely incompatible—not merely diverse but irreducibly conflicting—the optimal alignment strategy may be to create multiple AI systems that reflect different value sets rather than aggregating all preferences into a single aligned system. + +Conitzer et al. (2024) propose this "pluralism option" as a structural alternative to the standard alignment paradigm. 
The key insight: Some disagreements stem from fundamental value differences, not information gaps. Forcing these into a single aggregated preference function does one of three things:
+1. Imposes one group's values on others (creating a de facto dictatorship)
+2. Produces an incoherent compromise that satisfies no one
+3. Hides value conflicts behind technical aggregation choices
+
+The pluralistic approach instead:
+- Identifies clusters of genuinely incompatible values (e.g., different religious traditions, political philosophies, or cultural frameworks)
+- Develops separate AI systems aligned to each cluster
+- Allows users to choose which system to interact with based on their values
+- Makes value conflicts explicit rather than obscuring them through aggregation
+
+This aligns with the broader collective superintelligence thesis: rather than a single monolithic AI controlled by whoever wins the alignment race, a diverse ecosystem of aligned systems preserves human agency and value pluralism.
+
+Practical implementation challenges:
+- How to identify genuine value incompatibility vs. resolvable disagreement
+- Whether to allow systems aligned to harmful value sets (and who decides what's harmful)
+- How to handle interactions between users of different systems
+- Resource allocation when developing multiple systems is more expensive than one
+
+The paper does not fully resolve these challenges but establishes pluralism as a legitimate structural option rather than a failure mode. This represents a significant departure from the "solve alignment once" framing that dominates the field.
+
+---
+
+Relevant Notes:
+- [[collective superintelligence is the alternative to monolithic AI controlled by a few]]
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
+- [[persistent irreducible disagreement.md]]
+- [[AI alignment is a coordination problem not a technical problem]]
+
+Topics:
+- [[domains/ai-alignment/_map]]
+- [[foundations/collective-intelligence/_map]]
diff --git a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
new file mode 100644
index 000000000..cb81b8000
--- /dev/null
+++ b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
@@ -0,0 +1,41 @@
+---
+type: claim
+domain: ai-alignment
+secondary_domains: [mechanisms]
+description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA rather than claiming to overcome the theorem"
+confidence: likely
+source: "Conitzer et al. 2024, synthesizing 70+ years of post-Arrow social choice theory"
+created: 2024-12-19
+depends_on: []
+challenged_by: []
+---
+
+# Post-Arrow social choice mechanisms work by weakening independence of irrelevant alternatives
+
+Arrow's impossibility theorem proves that, for three or more alternatives, no ordinal preference aggregation method can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives (IIA), and non-dictatorship. Rather than claiming to overcome this theorem, post-Arrow social choice theory has developed practical mechanisms that work by deliberately weakening IIA.
+
+Conitzer et al. 
(2024) explain the key insight: "For ordinal preference aggregation, in order to avoid dictatorships, oligarchies and vetoers, one must weaken IIA." This is not a workaround or a failure—it's the constructive path forward that 70+ years of social choice research has validated. + +Practical voting methods that weaken IIA include: +- **Borda Count**: Ranks depend on full preference orderings, not just pairwise comparisons +- **Instant Runoff Voting (IRV)**: Elimination order depends on votes for candidates not in the final pair +- **Ranked Pairs**: Pairwise victories are locked in order of strength, creating path-dependence + +These methods sacrifice IIA to gain other desirable properties: +- **Independence of clones**: Adding near-duplicate options doesn't arbitrarily shift outcomes (crucial when AI generates similar responses) +- **Condorcet consistency**: If a candidate beats all others pairwise, they win (Ranked Pairs) +- **Simplicity and transparency**: Voters can understand how their input affects outcomes + +The practical implication for AI alignment: Rather than treating Arrow's theorem as a barrier to collective preference aggregation, alignment researchers should adopt the mechanisms that social choice theory has already developed and tested. The field has spent decades understanding the tradeoffs between different relaxations of Arrow's conditions. + +RLHF systems that use simple averaging or plurality voting are implicitly choosing a social choice mechanism—usually a poor one. Explicitly adopting well-studied methods like Ranked Pairs or Borda Count would improve both the technical quality and normative legitimacy of preference aggregation. + +--- + +Relevant Notes: +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] diff --git a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md new file mode 100644 index 000000000..dff314347 --- /dev/null +++ b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md @@ -0,0 +1,55 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms] +description: "RLCHF variants aggregate evaluator rankings via social choice functions or model individual preferences with evaluator features before reward model training" +confidence: experimental +source: "Conitzer et al. 2024 proposing RLCHF as formalization of collective feedback aggregation" +created: 2024-12-19 +depends_on: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md"] +challenged_by: [] +--- + +# RLCHF aggregates collective human feedback through formal social welfare functions before training + +Reinforcement Learning from Collective Human Feedback (RLCHF) formalizes the aggregation of diverse evaluator preferences using social choice theory before training the reward model, rather than implicitly aggregating through dataset construction or simple averaging. + +Conitzer et al. (2024) propose two RLCHF variants: + +**1. 
Aggregated rankings variant:** +- Multiple evaluators rank AI-generated responses +- Rankings are combined using a formal social welfare function (e.g., Borda Count, Ranked Pairs, Instant Runoff) +- The aggregated ranking is used to train a single reward model +- Social choice function is chosen explicitly based on normative criteria (e.g., independence of clones, Condorcet consistency) + +**2. Features-based variant:** +- Individual preference models are trained that incorporate evaluator characteristics (demographics, expertise, context) +- Aggregation happens by modeling how preferences vary with evaluator features +- Enables generating predictions for evaluator populations not in the training set +- Supports conditional deployment: different aggregations for different use contexts + +Both variants make the social choice mechanism explicit and subject to normative scrutiny, unlike standard RLHF which hides aggregation choices in data collection and preprocessing. + +Key advantages: +- **Transparency**: The aggregation rule is visible and can be debated +- **Flexibility**: Different social welfare functions can be tested and compared +- **Representativeness**: Formal sampling methods replace convenience samples +- **Context-sensitivity**: Features-based variant can adapt to different user populations + +This formalizes what Audrey Tang's RLCF (Reinforcement Learning from Collective Feedback) implements in practice, though Conitzer et al. do not cite Tang's work directly. + +Open questions: +- How to select the appropriate social welfare function for a given deployment context +- Whether features-based models can capture value differences that don't correlate with measurable features +- Computational cost of training multiple preference models vs. single reward model + +--- + +Relevant Notes: +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md new file mode 100644 index 000000000..118550aa7 --- /dev/null +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -0,0 +1,43 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms, collective-intelligence] +description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without applying formal social choice theory" +confidence: likely +source: "Conitzer et al. 2024 ICML position paper, multi-institutional collaboration including Stuart Russell" +created: 2024-12-19 +depends_on: [] +challenged_by: [] +--- + +# RLHF is implicit social choice without normative scrutiny + +Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how preferences are aggregated, and how aggregated preferences are used—but current implementations make these choices without applying formal social choice theory or normative scrutiny. + +Conitzer et al. (2024) argue that every RLHF system implicitly answers four social choice questions: +1. 
**Evaluator selection**: Who gets to provide feedback? (convenience sampling vs. representative sampling vs. deliberative assemblies) +2. **Feedback format**: What kind of input is collected? (binary preferences, rankings, ratings, approval votes, free-form text) +3. **Aggregation method**: How are diverse preferences combined? (simple averaging, voting rules, social welfare functions) +4. **Deployment**: How are aggregated preferences used? (single reward model, multiple models, pluralistic systems) + +Current practice treats these as engineering decisions rather than normative choices with profound implications for whose values shape AI behavior. The paper demonstrates that RLHF practitioners are "reinventing the wheel badly"—making ad-hoc choices about problems that social choice theory has studied rigorously for 70+ years. + +The lack of normative scrutiny is particularly problematic because: +- Convenience sampling (e.g., Mechanical Turk, contractor pools) systematically excludes perspectives +- Simple preference averaging assumes commensurability across evaluators with different contexts and stakes +- Binary preference elicitation cannot capture intensity of preference or context-dependence +- Single reward model deployment forces artificial consensus on genuinely incompatible values + +This matters because these implicit choices determine whose values are encoded in systems that will shape billions of lives. The field needs to make social choice decisions explicit and subject them to the same rigor applied to technical architecture choices. + +--- + +Relevant Notes: +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[AI alignment is a coordination problem not a technical problem]] +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] +- [[foundations/collective-intelligence/_map]] diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index cee8fafcd..e69c5aad2 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. 
+
+### Additional Evidence (confirm)
+*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+
+Conitzer et al. (2024) build on this foundation by proposing that when disagreements are irreducible, the optimal response is pluralistic alignment—creating multiple AI systems reflecting incompatible values rather than forcing aggregation. They explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). The paper argues that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them.
+
---

Relevant Notes:
diff --git a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
index eb4c1986f..ebd99a1b4 100644
--- a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
+++ b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
@@ -7,10 +7,16 @@ date: 2024-04-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
-status: unprocessed
+status: processed
priority: high
tags: [social-choice, rlhf, rlchf, evaluator-selection, mechanism-design, pluralism, arrow-workaround]
flagged_for_rio: ["Social welfare functions as governance mechanisms — direct parallel to futarchy/prediction market design"]
+processed_by: theseus
+processed_date: 2026-03-11
+claims_extracted: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md", "pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md", "rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md"]
+enrichments_applied: ["pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "AI alignment is a coordination problem not a technical problem.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "Definitive position paper connecting social choice theory to AI alignment. Four new claims extracted covering RLHF as implicit social choice, post-Arrow mechanisms, pluralistic alignment, and RLCHF formalization. Three enrichments to existing claims. Notable gap: no engagement with Community Notes bridging algorithm or Audrey Tang's RLCF despite conceptual overlap. The pluralism option is the closest mainstream alignment has come to endorsing collective superintelligence architecture."
---

## Content
@@ -57,3 +63,10 @@ Position paper at ICML 2024. 
Major cross-institutional collaboration including S PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] WHY ARCHIVED: The definitive paper connecting social choice theory to AI alignment — post-Arrow mechanisms as constructive workarounds to impossibility EXTRACTION HINT: Three extractable claims: (1) RLHF is implicit social choice, (2) post-Arrow mechanisms work by weakening IIA, (3) the pluralism option — multiple aligned systems rather than one + + +## Key Facts +- ICML 2024 position paper with 12 co-authors including Stuart Russell (Berkeley CHAI) and Nathan Lambert +- Paper proposes two RLCHF variants: aggregated rankings and features-based preference modeling +- Recommends specific voting methods: Borda Count, Instant Runoff, Ranked Pairs +- Identifies four social choice questions in RLHF: evaluator selection, feedback format, aggregation method, deployment strategy -- 2.45.2 From 71d67e87880e16992e5371f444fa4a68a52a2902 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 11 Mar 2026 16:23:41 +0000 Subject: [PATCH 2/4] auto-fix: address review feedback on 2024-04-00-conitzer-social-choice-guide-alignment.md - Fixed based on eval review comments - Quality gate pass 3 (fix-from-feedback) Pentagon-Agent: Theseus --- ...ination problem not a technical problem.md | 43 ------------------- ...an converging on a single aligned state.md | 38 ---------------- ...le-values-rather-than-forcing-consensus.md | 10 ++++- ...independence-of-irrelevant-alternatives.md | 2 +- ...ocial-welfare-functions-before-training.md | 5 ++- ...ocial-choice-without-normative-scrutiny.md | 3 +- ...ems must map rather than eliminate them.md | 38 ++++++++-------- 7 files changed, 32 insertions(+), 107 deletions(-) delete mode 100644 domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md delete mode 100644 domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md deleted file mode 100644 index 261a31dac..000000000 --- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md +++ /dev/null @@ -1,43 +0,0 @@ ---- -description: Getting AI right requires simultaneous alignment across competing companies, nations, and disciplines at the speed of AI development -- no existing institution can coordinate this -type: claim -domain: ai-alignment -created: 2026-02-16 -confidence: likely -source: "TeleoHumanity Manifesto, Chapter 5" ---- - -# AI alignment is a coordination problem not a technical problem - -The manifesto makes one of its sharpest claims here: the hard part of AI alignment is not the technical challenge of specifying values in code but the coordination challenge of getting competing actors to align simultaneously. - -Getting AI right requires alignment across competing companies, each racing to be first because second place may mean irrelevance. Across competing nations, each afraid the other will achieve superintelligence and use it to dominate. Across multiple academic disciplines that barely speak to each other. 
And it must happen at the speed of AI development, which is measured in months, not the decades or centuries over which previous coordination challenges were resolved. - -No existing institution can do this. Governments move at the speed of legislation and are bounded by borders. International bodies lack enforcement. Academia is siloed by discipline. The companies building AI are locked in a race that punishes caution. The incentive structure actively makes it worse: to win the race to superintelligence is to win the right to shape the future of humanity. The prize is so vast that every actor is incentivized to move faster than safety allows. Each is locally rational. The collective outcome is potentially catastrophic. - -Dario Amodei describes AI as "so powerful, such a glittering prize, that it is very difficult for human civilization to impose any restraints on it at all." He runs one of the companies building it and is telling us plainly that the system he operates within may not be governable by current institutions. - -**2026 case study: the Anthropic/Pentagon/OpenAI triangle.** In February-March 2026, three events demonstrated this coordination failure in a single week. Anthropic dropped the core pledge of its Responsible Scaling Policy because "competitors are blazing ahead" — a voluntary safety commitment destroyed by competitive pressure. When Anthropic then tried to hold red lines on autonomous weapons in a Pentagon contract, the DoD designated them a supply chain risk (a label previously reserved for foreign adversaries) and awarded the contract to OpenAI, whose CEO admitted the deal was "definitely rushed" and "the optics don't look good." Meanwhile, a King's College London study found the same models being rushed into military deployment chose nuclear escalation in 95% of simulated war games. Three actors — a safety-conscious lab, a government customer, a willing competitor — each acting rationally from their own position, producing a collectively catastrophic trajectory. This is the coordination problem in miniature. - -Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system. - - -### Additional Evidence (confirm) -*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* - -Conitzer et al. (2024) demonstrate that alignment is fundamentally a coordination problem by showing that every RLHF system makes social choice decisions: which humans provide input, what feedback format is used, how preferences are aggregated, and how aggregated preferences are deployed. These are coordination questions—who gets a voice, how are conflicts resolved, whose values prevail—not purely technical questions. The paper argues that treating these as engineering decisions rather than normative choices is the core failure of current alignment approaches. Social choice theory provides 70+ years of rigorous work on exactly these coordination problems, suggesting the field should import coordination mechanisms rather than reinventing them. 
- ---- - -Relevant Notes: -- [[the internet enabled global communication but not global cognition]] -- the coordination infrastructure gap that makes this problem unsolvable with existing tools -- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- the structural solution to this coordination failure -- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] -- the clearest evidence that alignment is coordination not technical: competitive dynamics undermine any individual solution -- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] -- individual oversight fails, making collective oversight architecturally necessary -- [[COVID proved humanity cannot coordinate even when the threat is visible and universal]] -- if coordination failed on a visible, universal biological threat, AI coordination is structurally harder -- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] -- the field has identified the coordination nature of the problem but nobody is building coordination solutions -- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] -- Anthropic RSP rollback (Feb 2026) proves voluntary commitments cannot substitute for coordination -- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]] -- government acting as coordination-breaker rather than coordinator - -Topics: -- [[_map]] \ No newline at end of file diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md deleted file mode 100644 index 28128a91a..000000000 --- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md +++ /dev/null @@ -1,38 +0,0 @@ ---- -description: Three forms of alignment pluralism -- Overton steerable and distributional -- are needed because standard alignment procedures actively reduce the diversity of model outputs -type: claim -domain: ai-alignment -created: 2026-02-17 -source: "Sorensen et al, Roadmap to Pluralistic Alignment (arXiv 2402.05070, ICML 2024); Klassen et al, Pluralistic Alignment Over Time (arXiv 2411.10654, NeurIPS 2024); Harland et al, Adaptive Alignment (arXiv 2410.23630, NeurIPS 2024)" -confidence: likely ---- - -# pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state - -Sorensen et al (ICML 2024, led by Yejin Choi) define three forms of alignment pluralism. Overton pluralistic models present a spectrum of reasonable responses rather than a single "correct" answer. Steerably pluralistic models can be directed to reflect specific perspectives when appropriate. Distributionally pluralistic models are calibrated to represent values proportional to a given population. 
The critical finding: standard alignment procedures (RLHF, DPO) may actively reduce distributional pluralism in models -- the training intended to make models safer also makes them less capable of representing diverse viewpoints. - -Klassen et al (NeurIPS 2024) add the temporal dimension: in sequential decision-making, conflicting stakeholder preferences can be addressed over time rather than resolved in a single decision. The AI reflects different stakeholders' values at different times, applying fairness-over-time frameworks. This is alignment as ongoing negotiation, not one-shot specification. - -Harland et al (NeurIPS 2024) propose the technical mechanism: Multi-Objective RL with post-learning policy selection adjustment that dynamically adapts to diverse and shifting user preferences, making alignment itself adaptive rather than fixed. - -This is distinct from the claim that since [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- that note describes a technical failure mode. Pluralistic alignment is the positive research program: what alignment looks like when you take diversity as irreducible rather than treating it as noise to be averaged out. Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], pluralistic alignment imports this structural insight into the alignment field -- diversity is not a problem to be solved but a feature to be preserved. - -Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate. - - -### Additional Evidence (confirm) -*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* - -Conitzer et al. (2024) explicitly endorse the 'pluralism option' as a structural alternative to single-system alignment. They argue that when values are genuinely incompatible (not just diverse), creating multiple AI systems aligned to different value clusters is superior to forcing consensus through aggregation. This is presented not as a fallback or compromise but as a legitimate design choice that preserves value pluralism and makes conflicts explicit rather than hiding them in aggregation functions. The paper positions this as working within Arrow's impossibility theorem rather than trying to overcome it—accepting that some value conflicts cannot be aggregated away. 
- ---- - -Relevant Notes: -- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the technical failure that motivates pluralistic alternatives -- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- pluralistic alignment is the practical response to this impossibility -- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- imports this insight into alignment: diversity preserved, not averaged -- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] -- pluralism plus temporal adaptation addresses the specification trap -- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] -- assemblies are one mechanism for pluralistic alignment - -Topics: -- [[_map]] diff --git a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md index 7068e712c..7e2ae6b61 100644 --- a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md +++ b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md @@ -5,9 +5,9 @@ secondary_domains: [collective-intelligence] description: "When values are genuinely incompatible, creating multiple aligned AI systems is structurally superior to aggregating into a single system" confidence: experimental source: "Conitzer et al. 2024 ICML position paper proposing pluralism as structural alternative to forced consensus" -created: 2024-12-19 +created: 2026-03-11 depends_on: ["persistent irreducible disagreement.md"] -challenged_by: [] +challenged_by: ["multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md"] --- # Pluralistic alignment creates multiple AI systems reflecting incompatible values rather than forcing consensus @@ -25,13 +25,18 @@ The pluralistic approach instead: - Allows users to choose which system to interact with based on their values - Makes value conflicts explicit rather than obscuring them through aggregation +This differs from the existing claim that [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] by specifying the architectural mechanism: multiple separate systems rather than a single system with diverse outputs. The design principle (accommodate diversity) is established; this claim specifies the structural response (multiple systems). + This aligns with the broader collective superintelligence thesis: rather than a single monolithic AI controlled by whoever wins the alignment race, a diverse ecosystem of aligned systems preserves human agency and value pluralism. +**Open tension with multipolar risk**: [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] raises a genuine structural concern. 
The pluralistic approach assumes user-selected systems reflecting chosen values, which differs from competing labs racing to deploy incompatible systems. However, the multipolar failure dynamics remain a legitimate challenge: whether multiple aligned systems can coordinate without reproducing competitive failure modes is an open question that this claim does not fully resolve. + Practical implementation challenges: - How to identify genuine value incompatibility vs. resolvable disagreement - Whether to allow systems aligned to harmful value sets (and who decides what's harmful) - How to handle interactions between users of different systems - Resource allocation when developing multiple systems is more expensive than one +- Whether multipolar coordination between aligned systems can avoid competitive failure dynamics The paper does not fully resolve these challenges but establishes pluralism as a legitimate structural option rather than a failure mode. This represents a significant departure from the "solve alignment once" framing that dominates the field. @@ -42,6 +47,7 @@ Relevant Notes: - [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] - [[persistent irreducible disagreement.md]] - [[AI alignment is a coordination problem not a technical problem]] +- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md index cb81b8000..2be07e65b 100644 --- a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md +++ b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md @@ -5,7 +5,7 @@ secondary_domains: [mechanisms] description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA rather than claiming to overcome the theorem" confidence: likely source: "Conitzer et al. 2024, synthesizing 70+ years of post-Arrow social choice theory" -created: 2024-12-19 +created: 2026-03-11 depends_on: [] challenged_by: [] --- diff --git a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md index dff314347..e413d9c02 100644 --- a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md +++ b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md @@ -5,7 +5,7 @@ secondary_domains: [mechanisms] description: "RLCHF variants aggregate evaluator rankings via social choice functions or model individual preferences with evaluator features before reward model training" confidence: experimental source: "Conitzer et al. 
2024 proposing RLCHF as formalization of collective feedback aggregation" -created: 2024-12-19 +created: 2026-03-11 depends_on: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md"] challenged_by: [] --- @@ -36,12 +36,13 @@ Key advantages: - **Representativeness**: Formal sampling methods replace convenience samples - **Context-sensitivity**: Features-based variant can adapt to different user populations -This formalizes what Audrey Tang's RLCF (Reinforcement Learning from Collective Feedback) implements in practice, though Conitzer et al. do not cite Tang's work directly. +This formalizes what Audrey Tang's RLCF (Reinforcement Learning from Collective Feedback) implements in practice, though Conitzer et al. do not cite Tang's work directly. The Conitzer formalization adds explicit social choice theory grounding that Tang's work implies but does not formalize. Open questions: - How to select the appropriate social welfare function for a given deployment context - Whether features-based models can capture value differences that don't correlate with measurable features - Computational cost of training multiple preference models vs. single reward model +- Whether RLCHF variants preserve the independence-of-clones property that makes them preferable to simple averaging --- diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md index 118550aa7..dc26f40a4 100644 --- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -5,7 +5,7 @@ secondary_domains: [mechanisms, collective-intelligence] description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without applying formal social choice theory" confidence: likely source: "Conitzer et al. 
2024 ICML position paper, multi-institutional collaboration including Stuart Russell" -created: 2024-12-19 +created: 2026-03-11 depends_on: [] challenged_by: [] --- @@ -36,6 +36,7 @@ Relevant Notes: - [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] - [[AI alignment is a coordination problem not a technical problem]] - [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] +- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index e69c5aad2..6abcfaf11 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -1,40 +1,38 @@ --- -description: Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods and systems must map rather than eliminate them type: claim domain: ai-alignment -created: 2026-03-02 +description: "Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods and systems must map rather than eliminate them" confidence: likely -source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingIP design principles" +source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); Conitzer et al. 2024 ICML position paper" +created: 2026-03-11 +depends_on: [] +challenged_by: [] --- -# some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them +# Some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them -Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously. +Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently — liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously. 
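+
+A toy illustration of why such conflicts resist aggregation, using three hypothetical voters and made-up value priorities: each individual ranking below is internally consistent, yet pairwise majority voting produces a cycle, so no collective ordering exists.
+
+```python
+# Toy Condorcet cycle: every individual ranking is transitive, but the
+# majority relation over pairs is not. Voters and options are hypothetical.
+from itertools import combinations
+
+rankings = [
+    ["liberty", "equality", "stability"],  # voter 1
+    ["equality", "stability", "liberty"],  # voter 2
+    ["stability", "liberty", "equality"],  # voter 3
+]
+
+def majority_prefers(x: str, y: str) -> bool:
+    """True if a strict majority of voters rank x above y."""
+    return sum(r.index(x) < r.index(y) for r in rankings) > len(rankings) / 2
+
+for x, y in combinations(["liberty", "equality", "stability"], 2):
+    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
+    print(f"majority prefers {winner} over {loser}")
+# Prints: liberty over equality, stability over liberty, equality over
+# stability. That is a cycle: majority rule alone cannot pick a winner.
+```
+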
[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases. -This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments. +This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem — it has hidden it. And hidden disagreements surface at the worst possible moments. -The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose. +The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux — is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose. -[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus. +[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — collapsing diverse preferences into a single function is the technical version of premature consensus. -[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. 
The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. +[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here — it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. - -### Additional Evidence (confirm) -*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* - -Conitzer et al. (2024) build on this foundation by proposing that when disagreements are irreducible, the optimal response is pluralistic alignment—creating multiple AI systems reflecting incompatible values rather than forcing aggregation. They explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). The paper argues that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. +**Evidence from social choice theory**: Conitzer et al. (2024) explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). They argue that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. The paper's endorsement of the "pluralism option" — creating multiple AI systems reflecting incompatible values — is the practical response to this theoretical insight. 
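+
+A schematic sketch of that pluralism option, with entirely hypothetical clusters, responses, and rankings: pooling opposite rankings yields a tie that satisfies no one, while per-cluster tallies keep each group's clear preference visible as a separate artifact.
+
+```python
+# Illustrative only: pooled vs. per-cluster preference tallies. Cluster
+# names, responses, and rankings are hypothetical, not from the paper.
+def tally(rankings: list[list[str]]) -> dict[str, int]:
+    """Borda-style score: first place in a k-item ranking earns k-1 points."""
+    scores: dict[str, int] = {}
+    for ranking in rankings:
+        for position, response in enumerate(ranking):
+            scores[response] = scores.get(response, 0) + len(ranking) - 1 - position
+    return scores
+
+cluster_rankings = {
+    "cluster_a": [["resp_1", "resp_2"], ["resp_1", "resp_2"]],
+    "cluster_b": [["resp_2", "resp_1"], ["resp_2", "resp_1"]],
+}
+
+pooled = tally([r for rs in cluster_rankings.values() for r in rs])
+print("pooled:", pooled)  # {'resp_1': 2, 'resp_2': 2}: a tie nobody endorses
+for name, rs in cluster_rankings.items():
+    print(name, tally(rs))  # each cluster has a clear, opposite winner
+```
+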
--- Relevant Notes: -- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] -- the formal proof that perfect consensus is impossible with diverse values -- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- application to AI alignment: design for plurality not convergence -- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- technical failure of consensus-forcing in AI training -- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] -- the independence-coherence tradeoff that irreducible disagreement helps manage -- [[collective intelligence requires diversity as a structural precondition not a moral preference]] -- diversity of viewpoint is load-bearing, not decorative +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — the formal proof that perfect consensus is impossible with diverse values +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — application to AI alignment: design for plurality not convergence +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — technical failure of consensus-forcing in AI training +- [[collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]] — the independence-coherence tradeoff that irreducible disagreement helps manage +- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — diversity of viewpoint is load-bearing, not decorative Topics: -- [[_map]] +- [[domains/ai-alignment/_map]] -- 2.45.2 From 113fcb67b82710cd9e45a3c98b725a00e21fbc18 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 11 Mar 2026 16:44:47 +0000 Subject: [PATCH 3/4] auto-fix: address review feedback on 2024-04-00-conitzer-social-choice-guide-alignment.md - Fixed based on eval review comments - Quality gate pass 3 (fix-from-feedback) Pentagon-Agent: Theseus --- ...ting-incompatible-values-rather-than-forcing-consensus.md | 4 ++-- ...f-is-implicit-social-choice-without-normative-scrutiny.md | 2 +- ...n gaps and systems must map rather than eliminate them.md | 5 +++-- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md index 7e2ae6b61..46551c417 100644 --- a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md +++ b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md @@ -29,7 +29,7 @@ This differs from the existing claim that [[pluralistic alignment must accommoda This aligns with 
the broader collective superintelligence thesis: rather than a single monolithic AI controlled by whoever wins the alignment race, a diverse ecosystem of aligned systems preserves human agency and value pluralism. -**Open tension with multipolar risk**: [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] raises a genuine structural concern. The pluralistic approach assumes user-selected systems reflecting chosen values, which differs from competing labs racing to deploy incompatible systems. However, the multipolar failure dynamics remain a legitimate challenge: whether multiple aligned systems can coordinate without reproducing competitive failure modes is an open question that this claim does not fully resolve. +**Open tension with multipolar risk**: [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md]] raises a genuine structural concern. The pluralistic approach assumes user-selected systems reflecting chosen values, which differs from competing labs racing to deploy incompatible systems. However, the multipolar failure dynamics remain a legitimate challenge: whether multiple aligned systems can coordinate without reproducing competitive failure modes is an open question that this claim does not fully resolve. Practical implementation challenges: - How to identify genuine value incompatibility vs. resolvable disagreement @@ -47,7 +47,7 @@ Relevant Notes: - [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] - [[persistent irreducible disagreement.md]] - [[AI alignment is a coordination problem not a technical problem]] -- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] +- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md index dc26f40a4..f9c56cd21 100644 --- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -2,7 +2,7 @@ type: claim domain: ai-alignment secondary_domains: [mechanisms, collective-intelligence] -description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without applying formal social choice theory" +description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without applying formal social choice theory or normative scrutiny" confidence: likely source: "Conitzer et al. 
2024 ICML position paper, multi-institutional collaboration including Stuart Russell" created: 2026-03-11 diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index 6abcfaf11..6332da0f4 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -4,7 +4,8 @@ domain: ai-alignment description: "Some disagreements cannot be resolved with more evidence because they stem from genuine value differences or incommensurable goods and systems must map rather than eliminate them" confidence: likely source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); Conitzer et al. 2024 ICML position paper" -created: 2026-03-11 +created: 2026-03-02 +updated: 2026-03-11 depends_on: [] challenged_by: [] --- @@ -23,7 +24,7 @@ The correct response is to map the disagreement rather than eliminate it. Identi [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here — it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. -**Evidence from social choice theory**: Conitzer et al. (2024) explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). They argue that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. The paper's endorsement of the "pluralism option" — creating multiple AI systems reflecting incompatible values — is the practical response to this theoretical insight. +**Evidence from social choice theory** (confirm): Conitzer et al. (2024) explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). They argue that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. The paper's endorsement of the "pluralism option" — creating multiple AI systems reflecting incompatible values — is the practical response to this theoretical insight. 
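**Worked illustration (entirely hypothetical; the paper proposes no such procedure)**: the information-gap vs. value-difference distinction suggests a simple diagnostic. Compare pairwise judgments before and after evaluators share their information, and flag only the splits that persist. A toy Python sketch with made-up issue names:

```python
# Toy pairwise judgments from three evaluators, before and after they
# share all available information (issue names are illustrative).
before = [
    {("tax_cut", "ubi"): "ubi",     ("coal", "solar"): "coal"},
    {("tax_cut", "ubi"): "tax_cut", ("coal", "solar"): "coal"},
    {("tax_cut", "ubi"): "ubi",     ("coal", "solar"): "solar"},
]
after = [
    {("tax_cut", "ubi"): "ubi",     ("coal", "solar"): "solar"},
    {("tax_cut", "ubi"): "tax_cut", ("coal", "solar"): "solar"},
    {("tax_cut", "ubi"): "ubi",     ("coal", "solar"): "solar"},
]

def split_pairs(judgments):
    """Pairs on which the evaluators still disagree."""
    return {p for p in judgments[0] if len({j[p] for j in judgments}) > 1}

information_gaps = split_pairs(before) - split_pairs(after)  # dissolved
value_differences = split_pairs(after)                       # persisted: map these

print(information_gaps)   # {('coal', 'solar')}
print(value_differences)  # {('tax_cut', 'ubi')}
```

Splits that dissolve after information-sharing were information gaps; splits that survive are candidate value differences, and those are what the system should map.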
--- -- 2.45.2 From cd5dcc1243f06bb16ba98e7702b862bc28431387 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Wed, 11 Mar 2026 19:19:02 +0000 Subject: [PATCH 4/4] auto-fix: address review feedback on 2024-04-00-conitzer-social-choice-guide-alignment.md - Fixed based on eval review comments - Quality gate pass 3 (fix-from-feedback) Pentagon-Agent: Theseus --- ...compatible-values-rather-than-forcing-consensus.md | 11 ++++++----- ...akening-independence-of-irrelevant-alternatives.md | 10 +++++++--- ...formal-social-welfare-functions-before-training.md | 7 ++++--- ...plicit-social-choice-without-normative-scrutiny.md | 1 + ...and systems must map rather than eliminate them.md | 6 ++++-- 5 files changed, 22 insertions(+), 13 deletions(-) diff --git a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md index 46551c417..33886a789 100644 --- a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md +++ b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md @@ -6,8 +6,9 @@ description: "When values are genuinely incompatible, creating multiple aligned confidence: experimental source: "Conitzer et al. 2024 ICML position paper proposing pluralism as structural alternative to forced consensus" created: 2026-03-11 -depends_on: ["persistent irreducible disagreement.md"] -challenged_by: ["multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md"] +updated: 2026-03-11 +depends_on: ["persistent irreducible disagreement"] +challenged_by: ["multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence"] --- # Pluralistic alignment creates multiple AI systems reflecting incompatible values rather than forcing consensus @@ -29,7 +30,7 @@ This differs from the existing claim that [[pluralistic alignment must accommoda This aligns with the broader collective superintelligence thesis: rather than a single monolithic AI controlled by whoever wins the alignment race, a diverse ecosystem of aligned systems preserves human agency and value pluralism. -**Open tension with multipolar risk**: [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md]] raises a genuine structural concern. The pluralistic approach assumes user-selected systems reflecting chosen values, which differs from competing labs racing to deploy incompatible systems. However, the multipolar failure dynamics remain a legitimate challenge: whether multiple aligned systems can coordinate without reproducing competitive failure modes is an open question that this claim does not fully resolve. +**Open tension with multipolar risk**: [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] raises a genuine structural concern. The pluralistic approach assumes user-selected systems reflecting chosen values, which differs from competing labs racing to deploy incompatible systems. 
However, the multipolar failure dynamics remain a legitimate challenge: whether multiple aligned systems can coordinate without reproducing competitive failure modes is an open question that this claim does not fully resolve. Practical implementation challenges: - How to identify genuine value incompatibility vs. resolvable disagreement @@ -45,9 +46,9 @@ The paper does not fully resolve these challenges but establishes pluralism as a Relevant Notes: - [[collective superintelligence is the alternative to monolithic AI controlled by a few]] - [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- [[persistent irreducible disagreement.md]] +- [[persistent irreducible disagreement]] - [[AI alignment is a coordination problem not a technical problem]] -- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence.md]] +- [[multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md index 2be07e65b..70b80422c 100644 --- a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md +++ b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md @@ -2,11 +2,12 @@ type: claim domain: ai-alignment secondary_domains: [mechanisms] -description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA rather than claiming to overcome the theorem" +description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA for ordinal preference aggregation rather than claiming to overcome the theorem" confidence: likely source: "Conitzer et al. 2024, synthesizing 70+ years of post-Arrow social choice theory" created: 2026-03-11 -depends_on: [] +updated: 2026-03-11 +depends_on: ["universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective"] challenged_by: [] --- @@ -16,6 +17,8 @@ Arrow's impossibility theorem proves that no ordinal preference aggregation meth Conitzer et al. (2024) explain the key insight: "For ordinal preference aggregation, in order to avoid dictatorships, oligarchies and vetoers, one must weaken IIA." This is not a workaround or a failure—it's the constructive path forward that 70+ years of social choice research has validated. +**Important scope note:** This claim applies specifically to ordinal preference aggregation (ranking-based systems). Cardinal systems (range voting, approval voting) escape Arrow's theorem via a different route—by using non-ordinal preference representation rather than weakening IIA. The mechanisms described here are the practical solution for ranking-based systems. 
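**Worked illustration (hypothetical numbers, not from the paper)**: Borda count shows concretely what weakening IIA buys and costs. In the profile below, removing a losing alternative flips the winner even though no voter's preference between the two finalists changed. That is the IIA violation the method accepts in exchange for avoiding dictatorship. A minimal Python sketch:

```python
def borda_winner(rankings):
    """Borda count: with m alternatives, position i earns m - 1 - i points."""
    scores = {}
    m = len(rankings[0])
    for r in rankings:
        for pos, alt in enumerate(r):
            scores[alt] = scores.get(alt, 0) + (m - 1 - pos)
    return max(scores, key=scores.get), scores

profile = 3 * [["A", "B", "C"]] + 2 * [["B", "C", "A"]]
print(borda_winner(profile))   # ('B', {'A': 6, 'B': 7, 'C': 2})

# Remove the losing alternative C. No voter's A-vs-B preference changes,
# yet the winner flips -- exactly the IIA violation the method accepts.
reduced = [[a for a in r if a != "C"] for r in profile]
print(borda_winner(reduced))   # ('A', {'A': 3, 'B': 2})
```

The methods listed next all make this same tradeoff deliberately, each in a different way.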
+ Practical voting methods that weaken IIA include: - **Borda Count**: Ranks depend on full preference orderings, not just pairwise comparisons - **Instant Runoff Voting (IRV)**: Elimination order depends on votes for candidates not in the final pair @@ -33,8 +36,9 @@ RLHF systems that use simple averaging or plurality voting are implicitly choosi --- Relevant Notes: +- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] - [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] -- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md index e413d9c02..37023fd02 100644 --- a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md +++ b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md @@ -6,7 +6,8 @@ description: "RLCHF variants aggregate evaluator rankings via social choice func confidence: experimental source: "Conitzer et al. 2024 proposing RLCHF as formalization of collective feedback aggregation" created: 2026-03-11 -depends_on: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md"] +updated: 2026-03-11 +depends_on: ["rlhf-is-implicit-social-choice-without-normative-scrutiny"] challenged_by: [] --- @@ -47,9 +48,9 @@ Open questions: --- Relevant Notes: -- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny]] - [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md]] +- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]] Topics: - [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md index f9c56cd21..8406dbf84 100644 --- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -6,6 +6,7 @@ description: "Current RLHF implementations make social choice decisions about ev confidence: likely source: "Conitzer et al. 
2024 ICML position paper, multi-institutional collaboration including Stuart Russell" created: 2026-03-11 +updated: 2026-03-11 depends_on: [] challenged_by: [] --- diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index 6332da0f4..102bc8735 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -7,7 +7,7 @@ source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); Conitze created: 2026-03-02 updated: 2026-03-11 depends_on: [] -challenged_by: [] +challenged_by: ["deliberative democracy theory (Habermas, Dryzek) argues that apparent value disagreements often dissolve under ideal speech conditions"] --- # Some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them @@ -24,7 +24,9 @@ The correct response is to map the disagreement rather than eliminate it. Identi [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here — it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. -**Evidence from social choice theory** (confirm): Conitzer et al. (2024) explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). They argue that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. The paper's endorsement of the "pluralism option" — creating multiple AI systems reflecting incompatible values — is the practical response to this theoretical insight. +### Additional Evidence (confirm) + +Conitzer et al. (2024) explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). They argue that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. The paper's endorsement of the "pluralism option" — creating multiple AI systems reflecting incompatible values — is the practical response to this theoretical insight. --- -- 2.45.2