diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md index 093867dee..261a31dac 100644 --- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md +++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md @@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system. + +### Additional Evidence (confirm) +*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Conitzer et al. (2024) demonstrate that alignment is fundamentally a coordination problem by showing that every RLHF system makes social choice decisions: which humans provide input, what feedback format is used, how preferences are aggregated, and how aggregated preferences are deployed. These are coordination questions—who gets a voice, how are conflicts resolved, whose values prevail—not purely technical questions. The paper argues that treating these as engineering decisions rather than normative choices is the core failure of current alignment approaches. Social choice theory provides 70+ years of rigorous work on exactly these coordination problems, suggesting the field should import coordination mechanisms rather than reinventing them. + --- Relevant Notes: diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md index b5195bb0a..28128a91a 100644 --- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md +++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md @@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate. + +### Additional Evidence (confirm) +*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Conitzer et al. (2024) explicitly endorse the 'pluralism option' as a structural alternative to single-system alignment. They argue that when values are genuinely incompatible (not just diverse), creating multiple AI systems aligned to different value clusters is superior to forcing consensus through aggregation. 
This is presented not as a fallback or compromise but as a legitimate design choice that preserves value pluralism and makes conflicts explicit rather than hiding them in aggregation functions. The paper positions this as working within Arrow's impossibility theorem rather than trying to overcome it—accepting that some value conflicts cannot be aggregated away. + --- Relevant Notes: diff --git a/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md new file mode 100644 index 000000000..7068e712c --- /dev/null +++ b/domains/ai-alignment/pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md @@ -0,0 +1,48 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "When values are genuinely incompatible, creating multiple aligned AI systems is structurally superior to aggregating into a single system" +confidence: experimental +source: "Conitzer et al. 2024 ICML position paper proposing pluralism as structural alternative to forced consensus" +created: 2024-12-19 +depends_on: ["persistent irreducible disagreement.md"] +challenged_by: [] +--- + +# Pluralistic alignment creates multiple AI systems reflecting incompatible values rather than forcing consensus + +When human values are genuinely incompatible—not merely diverse but irreducibly conflicting—the optimal alignment strategy may be to create multiple AI systems that reflect different value sets rather than aggregating all preferences into a single aligned system. + +Conitzer et al. (2024) propose this "pluralism option" as a structural alternative to the standard alignment paradigm. The key insight: Some disagreements stem from fundamental value differences, not information gaps. Forcing these into a single aggregated preference function either: +1. Imposes one group's values on others (creating a de facto dictatorship) +2. Produces an incoherent compromise that satisfies no one +3. Hides value conflicts behind technical aggregation choices + +The pluralistic approach instead: +- Identifies clusters of genuinely incompatible values (e.g., different religious traditions, political philosophies, or cultural frameworks) +- Develops separate AI systems aligned to each cluster +- Allows users to choose which system to interact with based on their values +- Makes value conflicts explicit rather than obscuring them through aggregation + +This aligns with the broader collective superintelligence thesis: rather than a single monolithic AI controlled by whoever wins the alignment race, a diverse ecosystem of aligned systems preserves human agency and value pluralism. + +Practical implementation challenges: +- How to identify genuine value incompatibility vs. resolvable disagreement +- Whether to allow systems aligned to harmful value sets (and who decides what's harmful) +- How to handle interactions between users of different systems +- Resource allocation when developing multiple systems is more expensive than one + +The paper does not fully resolve these challenges but establishes pluralism as a legitimate structural option rather than a failure mode. This represents a significant departure from the "solve alignment once" framing that dominates the field. 
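A toy sketch can make the clustering step described in the approach above concrete. This is my illustration, not the paper's procedure: the synthetic ratings, the choice of k-means, and the "mean rating profile" standing in for a per-cluster reward model are all assumptions.

```python
# Hypothetical sketch: identify clusters of evaluators with opposed ratings and
# derive one preference profile per cluster, instead of one aggregate profile.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Rows: evaluators; columns: ratings of the same four candidate responses.
# Two synthetic populations with opposed ratings stand in for incompatible values.
group_a = rng.normal(loc=[5, 1, 4, 2], scale=0.5, size=(20, 4))
group_b = rng.normal(loc=[1, 5, 2, 4], scale=0.5, size=(20, 4))
ratings = np.vstack([group_a, group_b])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ratings)

# One "reward model" per cluster -- here simply the cluster's mean rating profile.
for c in range(2):
    profile = ratings[clusters == c].mean(axis=0)
    print(f"cluster {c}: preferred response = {int(profile.argmax())}")
```

In a real system the per-cluster profile would be replaced by a separately trained reward model, and cluster membership would be the user's explicit choice rather than an inferred label.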
+
+---
+
+Relevant Notes:
+- [[collective superintelligence is the alternative to monolithic AI controlled by a few]]
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
+- [[persistent irreducible disagreement.md]]
+- [[AI alignment is a coordination problem not a technical problem]]
+
+Topics:
+- [[domains/ai-alignment/_map]]
+- [[foundations/collective-intelligence/_map]]
diff --git a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
new file mode 100644
index 000000000..cb81b8000
--- /dev/null
+++ b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
@@ -0,0 +1,41 @@
+---
+type: claim
+domain: ai-alignment
+secondary_domains: [mechanisms]
+description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA rather than claiming to overcome the theorem"
+confidence: likely
+source: "Conitzer et al. 2024, synthesizing 70+ years of post-Arrow social choice theory"
+created: 2024-12-19
+depends_on: []
+challenged_by: []
+---
+
+# Post-Arrow social choice mechanisms work by weakening independence of irrelevant alternatives
+
+Arrow's impossibility theorem proves that, with three or more alternatives, no ordinal preference aggregation method can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives (IIA), and non-dictatorship. Rather than claiming to overcome this theorem, post-Arrow social choice theory has developed practical mechanisms that work by deliberately weakening IIA.
+
+Conitzer et al. (2024) explain the key insight: "For ordinal preference aggregation, in order to avoid dictatorships, oligarchies and vetoers, one must weaken IIA." This is not a workaround or a failure—it's the constructive path forward that 70+ years of social choice research has validated.
+
+Practical voting methods that weaken IIA include:
+- **Borda Count**: Ranks depend on full preference orderings, not just pairwise comparisons
+- **Instant Runoff Voting (IRV)**: Elimination order depends on votes for candidates not in the final pair
+- **Ranked Pairs**: Pairwise victories are locked in from strongest to weakest, skipping any that would create a cycle, which introduces path-dependence
+
+These methods sacrifice IIA to gain other desirable properties:
+- **Independence of clones** (IRV and Ranked Pairs, though not Borda Count): Adding near-duplicate options doesn't arbitrarily shift outcomes (crucial when AI generates similar responses)
+- **Condorcet consistency**: If a candidate beats all others pairwise, they win (Ranked Pairs)
+- **Simplicity and transparency**: Voters can understand how their input affects outcomes
+
+The practical implication for AI alignment: Rather than treating Arrow's theorem as a barrier to collective preference aggregation, alignment researchers should adopt the mechanisms that social choice theory has already developed and tested. The field has spent decades understanding the tradeoffs between different relaxations of Arrow's conditions.
+
+RLHF systems that use simple averaging or plurality voting are implicitly choosing a social choice mechanism—usually a poor one. Explicitly adopting well-studied methods like Ranked Pairs or Borda Count would improve both the technical quality and normative legitimacy of preference aggregation.
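A minimal sketch of the aggregated-rankings idea, assuming evaluators submit full rankings over the same candidate responses. The `borda_count` helper and the data are illustrative, not from the paper; the second example shows the IIA weakening in action: dropping a candidate flips the social order of the remaining two even though no individual's preference between them changed.

```python
from collections import defaultdict

def borda_count(rankings: list[list[str]]) -> list[str]:
    """Aggregate full rankings: position p in an n-item ranking earns n - 1 - p points."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)

# Three evaluators rank four candidate responses (hypothetical data).
print(borda_count([
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]))  # ['a', 'b', 'c', 'd']

# Why this weakens IIA: removing candidate "c" flips the a-vs-b order,
# even though no evaluator's preference between "a" and "b" changed.
with_c = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2
without_c = [["a", "b"]] * 3 + [["b", "a"]] * 2
print(borda_count(with_c))     # ['b', 'a', 'c']
print(borda_count(without_c))  # ['a', 'b']
```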
+ +--- + +Relevant Notes: +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] diff --git a/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md new file mode 100644 index 000000000..dff314347 --- /dev/null +++ b/domains/ai-alignment/rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md @@ -0,0 +1,55 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms] +description: "RLCHF variants aggregate evaluator rankings via social choice functions or model individual preferences with evaluator features before reward model training" +confidence: experimental +source: "Conitzer et al. 2024 proposing RLCHF as formalization of collective feedback aggregation" +created: 2024-12-19 +depends_on: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md"] +challenged_by: [] +--- + +# RLCHF aggregates collective human feedback through formal social welfare functions before training + +Reinforcement Learning from Collective Human Feedback (RLCHF) formalizes the aggregation of diverse evaluator preferences using social choice theory before training the reward model, rather than implicitly aggregating through dataset construction or simple averaging. + +Conitzer et al. (2024) propose two RLCHF variants: + +**1. Aggregated rankings variant:** +- Multiple evaluators rank AI-generated responses +- Rankings are combined using a formal social welfare function (e.g., Borda Count, Ranked Pairs, Instant Runoff) +- The aggregated ranking is used to train a single reward model +- Social choice function is chosen explicitly based on normative criteria (e.g., independence of clones, Condorcet consistency) + +**2. Features-based variant:** +- Individual preference models are trained that incorporate evaluator characteristics (demographics, expertise, context) +- Aggregation happens by modeling how preferences vary with evaluator features +- Enables generating predictions for evaluator populations not in the training set +- Supports conditional deployment: different aggregations for different use contexts + +Both variants make the social choice mechanism explicit and subject to normative scrutiny, unlike standard RLHF which hides aggregation choices in data collection and preprocessing. + +Key advantages: +- **Transparency**: The aggregation rule is visible and can be debated +- **Flexibility**: Different social welfare functions can be tested and compared +- **Representativeness**: Formal sampling methods replace convenience samples +- **Context-sensitivity**: Features-based variant can adapt to different user populations + +This formalizes what Audrey Tang's RLCF (Reinforcement Learning from Collective Feedback) implements in practice, though Conitzer et al. do not cite Tang's work directly. + +Open questions: +- How to select the appropriate social welfare function for a given deployment context +- Whether features-based models can capture value differences that don't correlate with measurable features +- Computational cost of training multiple preference models vs. 
single reward model + +--- + +Relevant Notes: +- [[rlhf-is-implicit-social-choice-without-normative-scrutiny.md]] +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md new file mode 100644 index 000000000..118550aa7 --- /dev/null +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -0,0 +1,43 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms, collective-intelligence] +description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without applying formal social choice theory" +confidence: likely +source: "Conitzer et al. 2024 ICML position paper, multi-institutional collaboration including Stuart Russell" +created: 2024-12-19 +depends_on: [] +challenged_by: [] +--- + +# RLHF is implicit social choice without normative scrutiny + +Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how preferences are aggregated, and how aggregated preferences are used—but current implementations make these choices without applying formal social choice theory or normative scrutiny. + +Conitzer et al. (2024) argue that every RLHF system implicitly answers four social choice questions: +1. **Evaluator selection**: Who gets to provide feedback? (convenience sampling vs. representative sampling vs. deliberative assemblies) +2. **Feedback format**: What kind of input is collected? (binary preferences, rankings, ratings, approval votes, free-form text) +3. **Aggregation method**: How are diverse preferences combined? (simple averaging, voting rules, social welfare functions) +4. **Deployment**: How are aggregated preferences used? (single reward model, multiple models, pluralistic systems) + +Current practice treats these as engineering decisions rather than normative choices with profound implications for whose values shape AI behavior. The paper demonstrates that RLHF practitioners are "reinventing the wheel badly"—making ad-hoc choices about problems that social choice theory has studied rigorously for 70+ years. + +The lack of normative scrutiny is particularly problematic because: +- Convenience sampling (e.g., Mechanical Turk, contractor pools) systematically excludes perspectives +- Simple preference averaging assumes commensurability across evaluators with different contexts and stakes +- Binary preference elicitation cannot capture intensity of preference or context-dependence +- Single reward model deployment forces artificial consensus on genuinely incompatible values + +This matters because these implicit choices determine whose values are encoded in systems that will shape billions of lives. The field needs to make social choice decisions explicit and subject them to the same rigor applied to technical architecture choices. 
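A minimal sketch, on assumed data, of how one of these implicit choices plays out: pooling all pairwise labels before reward-model training weights evaluators by how many comparisons they happen to label, while a one-evaluator-one-vote rule over the same data reaches the opposite conclusion. Neither rule is neutral; the point is that either way the choice is normative, not merely an engineering detail.

```python
from collections import Counter

# (evaluator_id, preferred_response, rejected_response) -- hypothetical labels
pairwise_labels = [
    ("e1", "a", "b"), ("e1", "a", "b"), ("e1", "a", "b"),  # one prolific evaluator
    ("e2", "b", "a"),
    ("e3", "b", "a"),
]

# Implicit rule in naive data pooling: count every comparison equally.
pooled = Counter(winner for _, winner, _ in pairwise_labels)
print(pooled.most_common(1))   # [('a', 3)] -- the prolific evaluator prevails

# An explicit alternative: one evaluator, one vote on the same pair.
first_vote = {}
for evaluator, winner, _ in pairwise_labels:
    first_vote.setdefault(evaluator, winner)
votes = Counter(first_vote.values())
print(votes.most_common(1))    # [('b', 2)] -- the majority of evaluators prevails
```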
+ +--- + +Relevant Notes: +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[AI alignment is a coordination problem not a technical problem]] +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] +- [[foundations/collective-intelligence/_map]] diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index cee8fafcd..e69c5aad2 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. + +### Additional Evidence (confirm) +*Source: [[2024-04-00-conitzer-social-choice-guide-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +Conitzer et al. (2024) build on this foundation by proposing that when disagreements are irreducible, the optimal response is pluralistic alignment—creating multiple AI systems reflecting incompatible values rather than forcing aggregation. They explicitly distinguish between disagreements that stem from information gaps (resolvable through deliberation and better information) and those that stem from fundamental value differences (requiring pluralistic accommodation). The paper argues that Arrow's impossibility theorem is not a bug but a feature: it reveals that some value conflicts cannot and should not be aggregated away. This supports the claim that systems must map irreducible disagreements rather than eliminate them. 
+
---
Relevant Notes:
diff --git a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
index eb4c1986f..ebd99a1b4 100644
--- a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
+++ b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md
@@ -7,10 +7,16 @@ date: 2024-04-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
-status: unprocessed
+status: processed
priority: high
tags: [social-choice, rlhf, rlchf, evaluator-selection, mechanism-design, pluralism, arrow-workaround]
flagged_for_rio: ["Social welfare functions as governance mechanisms — direct parallel to futarchy/prediction market design"]
+processed_by: theseus
+processed_date: 2024-04-01
+claims_extracted: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md", "pluralistic-alignment-creates-multiple-ai-systems-reflecting-incompatible-values-rather-than-forcing-consensus.md", "rlchf-aggregates-collective-human-feedback-through-formal-social-welfare-functions-before-training.md"]
+enrichments_applied: ["pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md", "AI alignment is a coordination problem not a technical problem.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "Definitive position paper connecting social choice theory to AI alignment. Four new claims extracted covering RLHF as implicit social choice, post-Arrow mechanisms, pluralistic alignment, and RLCHF formalization. Three enrichments to existing claims. Notable gap: no engagement with Community Notes bridging algorithm or Audrey Tang's RLCF despite conceptual overlap. The pluralism option is the closest mainstream alignment has come to endorsing collective superintelligence architecture."
---
## Content
@@ -57,3 +63,10 @@ Position paper at ICML 2024. Major cross-institutional collaboration including S
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: The definitive paper connecting social choice theory to AI alignment — post-Arrow mechanisms as constructive workarounds to impossibility
EXTRACTION HINT: Three extractable claims: (1) RLHF is implicit social choice, (2) post-Arrow mechanisms work by weakening IIA, (3) the pluralism option — multiple aligned systems rather than one
+
+
+## Key Facts
+- ICML 2024 position paper with 12 co-authors including Stuart Russell (Berkeley CHAI) and Nathan Lambert
+- Paper proposes two RLCHF variants: aggregated rankings and features-based preference modeling
+- Recommends specific voting methods: Borda Count, Instant Runoff, Ranked Pairs
+- Identifies four social choice questions in RLHF: evaluator selection, feedback format, aggregation method, deployment strategy