From e4506bd6ce8434c42895f278c82534446d79b56d Mon Sep 17 00:00:00 2001 From: Teleo Pipeline Date: Sun, 15 Mar 2026 16:08:32 +0000 Subject: [PATCH] extract: 2024-04-00-conitzer-social-choice-guide-alignment Pentagon-Agent: Ganymede --- ...-diversity-better-than-forced-consensus.md | 48 ++++++++++++++++++ ...independence-of-irrelevant-alternatives.md | 42 ++++++++++++++++ ...nce-platforms-for-ai-alignment-feedback.md | 47 +++++++++++++++++ ...e-function-before-reward-model-training.md | 49 ++++++++++++++++++ ...bling-aggregation-across-diverse-groups.md | 50 +++++++++++++++++++ ...ocial-choice-without-normative-scrutiny.md | 40 +++++++++++++++ ...-conitzer-social-choice-guide-alignment.md | 8 ++- 7 files changed, 283 insertions(+), 1 deletion(-) create mode 100644 domains/ai-alignment/pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md create mode 100644 domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md create mode 100644 domains/ai-alignment/representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md create mode 100644 domains/ai-alignment/rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md create mode 100644 domains/ai-alignment/rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md create mode 100644 domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md diff --git a/domains/ai-alignment/pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md b/domains/ai-alignment/pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md new file mode 100644 index 00000000..a40f5572 --- /dev/null +++ b/domains/ai-alignment/pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md @@ -0,0 +1,48 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence, mechanisms] +description: "Creating multiple AI systems reflecting genuinely incompatible values may be structurally superior to aggregating all preferences into one aligned system" +confidence: experimental +source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)" +created: 2026-03-11 +--- + +# Pluralistic AI alignment through multiple systems preserves value diversity better than forced consensus + +Conitzer et al. (2024) propose a "pluralism option": rather than forcing all human values into a single aligned AI system through preference aggregation, create multiple AI systems that reflect genuinely incompatible value sets. This structural approach to pluralism may better preserve value diversity than any aggregation mechanism. + +The paper positions this as an alternative to the standard alignment framing, which assumes a single AI system must be aligned with aggregated human preferences. When values are irreducibly diverse—not just different but fundamentally incompatible—attempting to merge them into one system necessarily distorts or suppresses some values. Multiple systems allow each value set to be faithfully represented. 
+
+This connects directly to the collective superintelligence thesis: rather than one monolithic aligned AI, an ecosystem of specialized systems with different value orientations, coordinating through explicit mechanisms. The paper doesn't fully develop this direction but identifies it as a viable path.
+
+## Evidence
+
+- Conitzer et al. (2024) explicitly propose "creating multiple AI systems reflecting genuinely incompatible values rather than forcing artificial consensus"
+- The paper cites [[persistent irreducible disagreement]] as a structural feature that aggregation cannot resolve
+- Stuart Russell's co-authorship signals this is a serious position within mainstream AI safety, not a fringe view
+
+## Relationship to Collective Superintelligence
+
+This is the closest mainstream AI alignment has come to the collective superintelligence thesis articulated in [[collective superintelligence is the alternative to monolithic AI controlled by a few]]. The paper doesn't use the term "collective superintelligence" but the structural logic is identical: value diversity is preserved through system plurality rather than aggregation.
+
+The key difference: Conitzer et al. frame this as an option among several approaches, while the collective superintelligence thesis argues this is the only path that preserves human agency at scale. The paper's pluralism option is permissive ("we could do this"), not prescriptive ("we must do this").
+
+## Open Questions
+
+- How do multiple value-aligned systems coordinate when their values conflict in practice?
+- What governance mechanisms determine which value sets get their own system?
+- Does this approach scale to thousands of value clusters or only to a handful?
+
+---
+
+Relevant Notes:
+- [[collective superintelligence is the alternative to monolithic AI controlled by a few]]
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
+- [[persistent irreducible disagreement]]
+- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]
+
+Topics:
+- domains/ai-alignment/_map
+- foundations/collective-intelligence/_map
+- core/mechanisms/_map
\ No newline at end of file
diff --git a/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
new file mode 100644
index 00000000..9aa9040d
--- /dev/null
+++ b/domains/ai-alignment/post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md
@@ -0,0 +1,42 @@
+---
+type: claim
+domain: ai-alignment
+secondary_domains: [mechanisms, collective-intelligence]
+description: "Practical voting methods like Borda Count and Ranked Pairs avoid Arrow's impossibility by sacrificing IIA rather than claiming to overcome the theorem"
+confidence: proven
+source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
+created: 2026-03-11
+---
+
+# Post-Arrow social choice mechanisms work by weakening independence of irrelevant alternatives
+
+Arrow's impossibility theorem proves that no ordinal preference aggregation method can simultaneously satisfy unrestricted domain, Pareto efficiency, independence of irrelevant alternatives (IIA), and non-dictatorship.
Rather than claiming to overcome this theorem, post-Arrow social choice theory has spent 70 years developing practical mechanisms that work by deliberately weakening IIA. + +Conitzer et al. (2024) emphasize this key insight: "for ordinal preference aggregation, in order to avoid dictatorships, oligarchies and vetoers, one must weaken IIA." Practical voting methods like Borda Count, Instant Runoff Voting, and Ranked Pairs all sacrifice IIA to achieve other desirable properties. This is not a failure—it's a principled tradeoff that enables functional collective decision-making. + +The paper recommends examining specific voting methods that have been formally analyzed for their properties rather than searching for a mythical "perfect" aggregation method that Arrow proved cannot exist. Different methods make different tradeoffs, and the choice should depend on the specific alignment context. + +## Evidence + +- Arrow's impossibility theorem (1951) establishes the fundamental constraint +- Conitzer et al. (2024) explicitly state: "Rather than claiming to overcome Arrow's theorem, the paper leverages post-Arrow social choice theory" +- Specific mechanisms recommended: Borda Count, Instant Runoff, Ranked Pairs—all formally analyzed for their properties +- The paper proposes RLCHF variants that use these established social welfare functions rather than inventing new aggregation methods + +## Practical Implications + +This resolves a common confusion in AI alignment discussions: people often cite Arrow's theorem as proof that preference aggregation is impossible, when the actual lesson is that perfect aggregation is impossible and we must choose which properties to prioritize. The 70-year history of social choice theory provides a menu of well-understood options. + +For AI alignment, this means: (1) stop searching for a universal aggregation method, (2) explicitly choose which Arrow conditions to relax based on the deployment context, (3) use established voting methods with known properties rather than ad-hoc aggregation. + +--- + +Relevant Notes: +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] +- [[collective intelligence requires diversity as a structural precondition not a moral preference]] +- [[persistent irreducible disagreement]] + +Topics: +- domains/ai-alignment/_map +- core/mechanisms/_map +- foundations/collective-intelligence/_map \ No newline at end of file diff --git a/domains/ai-alignment/representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md b/domains/ai-alignment/representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md new file mode 100644 index 00000000..79742e5d --- /dev/null +++ b/domains/ai-alignment/representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md @@ -0,0 +1,47 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms, collective-intelligence] +description: "AI alignment feedback should use citizens assemblies or representative sampling rather than crowdworker platforms to ensure evaluator diversity reflects actual populations" +confidence: likely +source: "Conitzer et al. 
(2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)" +created: 2026-03-11 +--- + +# Representative sampling and deliberative mechanisms should replace convenience platforms for AI alignment feedback + +Conitzer et al. (2024) argue that current RLHF implementations use convenience sampling (crowdworker platforms like MTurk) rather than representative sampling or deliberative mechanisms. This creates systematic bias in whose values shape AI behavior. The paper recommends citizens' assemblies or stratified representative sampling as alternatives. + +The core issue: crowdworker platforms systematically over-represent certain demographics (younger, more educated, Western, tech-comfortable) and under-represent others. If AI alignment depends on human feedback, the composition of the feedback pool determines whose values are encoded. Convenience sampling makes this choice implicitly based on who signs up for crowdwork platforms. + +Deliberative mechanisms like citizens' assemblies add a second benefit: evaluators engage with each other's perspectives and reasoning, not just their initial preferences. This can surface shared values that aren't apparent from aggregating isolated individual judgments. + +## Evidence + +- Conitzer et al. (2024) explicitly recommend "representative sampling or deliberative mechanisms (citizens' assemblies) rather than convenience platforms" +- The paper cites [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] as evidence that deliberative approaches work +- Current RLHF implementations predominantly use MTurk, Upwork, or similar platforms + +## Practical Challenges + +Representative sampling and deliberative mechanisms are more expensive and slower than crowdworker platforms. This creates competitive pressure: companies that use convenience sampling can iterate faster and cheaper than those using representative sampling. The paper doesn't address how to resolve this tension. + +Additionally: representative of what population? Global? National? Users of the specific AI system? Different choices lead to different value distributions. + +## Relationship to Existing Work + +This recommendation directly supports [[collective intelligence requires diversity as a structural precondition not a moral preference]]—diversity isn't just normatively desirable, it's necessary for the aggregation mechanism to work correctly. + +The deliberative component connects to [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]], which provides empirical evidence that deliberation improves alignment outcomes. 
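+
+## Sketch: Stratified Evaluator Sampling
+
+As a rough illustration of the stratified-sampling recommendation (a sketch under assumed data structures, not something specified in the paper), the following recruits an evaluator panel whose strata match target population shares instead of taking whoever happens to be available on a crowdwork platform. The `stratum` field, the share table, and the panel size are hypothetical placeholders.
+
+```python
+import random
+from collections import defaultdict
+
+def stratified_sample(evaluators, population_shares, n, seed=0):
+    """Recruit an evaluator panel whose strata match target population shares."""
+    rng = random.Random(seed)
+    by_stratum = defaultdict(list)
+    for evaluator in evaluators:
+        by_stratum[evaluator["stratum"]].append(evaluator)
+    panel = []
+    for stratum, share in population_shares.items():
+        quota = round(share * n)
+        pool = by_stratum.get(stratum, [])
+        # An under-recruited stratum shows up as a shortfall, not a silent back-fill.
+        panel.extend(rng.sample(pool, min(quota, len(pool))))
+    return panel
+
+# Hypothetical strata and shares, purely for illustration.
+pool = [{"id": i, "stratum": s} for i, s in enumerate(["urban", "rural", "suburban"] * 50)]
+shares = {"urban": 0.3, "rural": 0.2, "suburban": 0.5}
+panel = stratified_sample(pool, shares, n=100)
+```
+
+Leaving a shortfall visible rather than back-filling it from over-represented strata keeps the "representative of what population?" question explicit in the panel itself.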
+
+---
+
+Relevant Notes:
+- [[collective intelligence requires diversity as a structural precondition not a moral preference]]
+- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]
+- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
+
+Topics:
+- domains/ai-alignment/_map
+- core/mechanisms/_map
+- foundations/collective-intelligence/_map
\ No newline at end of file
diff --git a/domains/ai-alignment/rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md b/domains/ai-alignment/rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md
new file mode 100644
index 00000000..7f13ac1d
--- /dev/null
+++ b/domains/ai-alignment/rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md
@@ -0,0 +1,49 @@
+---
+type: claim
+domain: ai-alignment
+secondary_domains: [mechanisms]
+description: "The aggregated rankings variant of RLCHF applies formal social choice functions to combine multiple evaluator rankings before training the reward model"
+confidence: experimental
+source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
+created: 2026-03-11
+---
+
+# RLCHF aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
+
+Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings using a formal social welfare function (e.g., Borda Count, Ranked Pairs), (3) training the reward model on the aggregated ranking rather than individual preferences.
+
+This approach makes the social choice decision explicit and auditable. Instead of implicitly aggregating through dataset composition or reward model averaging, the aggregation happens at the ranking level using well-studied voting methods with known properties.
+
+The key architectural choice: aggregation happens before reward model training, not during or after. This means the reward model learns from a collective preference signal rather than trying to learn individual preferences and aggregate them internally.
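+
+To make the aggregation step (2) concrete, here is a minimal sketch of Borda Count as the social welfare function (an illustration under assumed data shapes, not code from the paper): each evaluator's ranking awards points by position, and the point totals define the single collective ranking that step (3) would train the reward model on. Ranked Pairs or Instant Runoff would simply swap in a different aggregation function.
+
+```python
+from collections import defaultdict
+
+def borda_aggregate(rankings):
+    """Borda Count: position i in a ranking of n items earns n - 1 - i points;
+    the collective ranking orders items by total points across evaluators."""
+    scores = defaultdict(int)
+    for ranking in rankings:
+        n = len(ranking)
+        for position, response_id in enumerate(ranking):
+            scores[response_id] += n - 1 - position
+    return sorted(scores, key=scores.get, reverse=True)
+
+# Three hypothetical evaluators rank four candidate responses, best first.
+rankings = [
+    ["a", "b", "c", "d"],
+    ["b", "a", "d", "c"],
+    ["a", "c", "b", "d"],
+]
+collective = borda_aggregate(rankings)  # ['a', 'b', 'c', 'd'] for this input
+# The reward model is then trained on the preference pairs implied by this
+# single collective ranking, not on any individual evaluator's ranking.
+```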
+
+## Evidence
+
+- Conitzer et al. (2024) describe two RLCHF variants; this is the first
+- The paper recommends specific social welfare functions: Borda Count, Instant Runoff, Ranked Pairs
+- This approach connects to 70+ years of social choice theory on voting methods
+
+## Comparison to Standard RLHF
+
+Standard RLHF typically aggregates preferences implicitly through:
+- Dataset composition (which evaluators are included)
+- Majority voting on pairwise comparisons
+- Averaging reward model predictions
+
+RLCHF makes this aggregation explicit and allows practitioners to choose aggregation methods based on their normative properties rather than computational convenience.
+
+## Relationship to Existing Work
+
+This mechanism directly addresses the failure mode identified in [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. By aggregating at the ranking level with formal social choice functions, RLCHF preserves more information about preference diversity than collapsing to a single reward function.
+
+The approach also connects to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]—both are attempts to handle preference heterogeneity more formally.
+
+---
+
+Relevant Notes:
+- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
+- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
+- [[post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives]]
+
+Topics:
+- domains/ai-alignment/_map
+- core/mechanisms/_map
\ No newline at end of file
diff --git a/domains/ai-alignment/rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md b/domains/ai-alignment/rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md
new file mode 100644
index 00000000..c6b1ad63
--- /dev/null
+++ b/domains/ai-alignment/rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md
@@ -0,0 +1,50 @@
+---
+type: claim
+domain: ai-alignment
+secondary_domains: [mechanisms]
+description: "The features-based RLCHF variant learns individual preference models that incorporate evaluator characteristics allowing aggregation across demographic or value-based groups"
+confidence: experimental
+source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
+created: 2026-03-11
+---
+
+# RLCHF features-based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
+
+The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling context-sensitive preference aggregation.
+
+This approach allows the system to learn: "People with characteristic X tend to prefer response type Y in context Z." Aggregation then happens by weighting or combining these learned preference functions according to a social choice rule, rather than aggregating raw rankings.
+
+The key advantage: this variant can handle preference heterogeneity more flexibly than the aggregated rankings variant. It can adapt aggregation based on context, represent minority preferences explicitly, and enable "what would group X prefer?" queries.
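+
+As a minimal sketch of what a feature-conditioned preference model could look like (all feature names, shapes, and the scikit-learn choice are illustrative assumptions, not the paper's specification), the following fits a logistic model on evaluator features concatenated with response-pair features, then answers a "what would group X prefer?" query by averaging predicted preferences over that group's evaluator profiles.
+
+```python
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+# Each row pairs evaluator features (hypothetical binary traits) with features of
+# a candidate response pair; the label is 1 if that evaluator preferred A over B.
+rng = np.random.default_rng(0)
+evaluator_features = rng.integers(0, 2, size=(500, 4)).astype(float)
+pair_features = rng.normal(size=(500, 6))
+preferred_a = rng.integers(0, 2, size=500)  # stand-in for observed pairwise choices
+
+model = LogisticRegression(max_iter=1000)
+model.fit(np.hstack([evaluator_features, pair_features]), preferred_a)
+
+def group_preference(group_rows, pair, model):
+    """Average the predicted probability of choosing A over B across the
+    evaluator profiles that describe a group."""
+    X = np.hstack([group_rows, np.tile(pair, (len(group_rows), 1))])
+    return model.predict_proba(X)[:, 1].mean()
+
+group = evaluator_features[evaluator_features[:, 0] == 1.0]
+p_group = group_preference(group, pair_features[0], model)
+# A social choice rule then weights or combines these per-group preference
+# functions, rather than aggregating raw rankings.
+```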
+
+## Evidence
+
+- Conitzer et al. (2024) describe this as the second RLCHF variant
+- The paper notes this approach "incorporates evaluator characteristics" and enables "aggregation across diverse groups"
+- This connects to the broader literature on personalized and pluralistic AI systems
+
+## Comparison to Aggregated Rankings Variant
+
+Where the aggregated rankings variant collapses preferences into a single collective ranking before training, the features-based variant preserves preference structure throughout. This allows:
+- Context-dependent aggregation (different social choice rules for different situations)
+- Explicit representation of minority preferences
+- Transparency about which groups prefer which responses
+
+The tradeoff: higher complexity and potential for misuse (e.g., demographic profiling, value discrimination).
+
+## Relationship to Existing Work
+
+This approach is conceptually similar to [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]], but more explicit about incorporating evaluator features. Both recognize that preference heterogeneity is structural, not noise.
+
+The features-based variant also connects to [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]—both emphasize that different communities have different legitimate preferences that should be represented rather than averaged away.
+
+---
+
+Relevant Notes:
+- [[modeling preference sensitivity as a learned distribution rather than a fixed scalar resolves DPO diversity failures without demographic labels or explicit user modeling]]
+- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
+- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
+
+Topics:
+- domains/ai-alignment/_map
+- core/mechanisms/_map
+- foundations/collective-intelligence/_map
\ No newline at end of file
diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
new file mode 100644
index 00000000..d8d679b8
--- /dev/null
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -0,0 +1,40 @@
+---
+type: claim
+domain: ai-alignment
+description: "Current RLHF implementations make social choice decisions about evaluator selection and preference aggregation without examining their normative properties"
+confidence: likely
+source: "Conitzer et al. (2024), 'Social Choice Should Guide AI Alignment' (ICML 2024)"
+created: 2026-03-11
+---
+
+# RLHF is implicit social choice without normative scrutiny
+
+Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how it's aggregated, and how it's used—but current implementations make these choices without examining their normative properties or drawing on 70+ years of social choice theory.
+
+Conitzer et al. (2024) argue that RLHF practitioners implicitly answer fundamental social choice questions: Who gets to evaluate? How are conflicting preferences weighted? What aggregation method combines diverse judgments? These decisions have profound implications for whose values shape AI behavior, yet they're typically made based on convenience (e.g., using readily available crowdworker platforms) rather than principled normative reasoning.
+
+The paper demonstrates that post-Arrow social choice theory has developed practical mechanisms that work within Arrow's impossibility constraints. RLHF essentially reinvented preference aggregation badly, ignoring decades of formal work on voting methods, welfare functions, and pluralistic decision-making.
+
+## Evidence
+
+- Conitzer et al.
(2024) position paper at ICML 2024, co-authored by Stuart Russell (Berkeley CHAI) and leading social choice theorists +- Current RLHF uses convenience sampling (crowdworker platforms) rather than representative sampling or deliberative mechanisms +- The paper proposes RLCHF (Reinforcement Learning from Collective Human Feedback) as the formal alternative that makes social choice decisions explicit + +## Relationship to Existing Work + +This claim directly addresses the mechanism gap identified in [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]. Where that claim focuses on the technical failure mode (single reward function), this claim identifies the root cause: RLHF makes social choice decisions without social choice theory. + +The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed. + +--- + +Relevant Notes: +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[collective intelligence requires diversity as a structural precondition not a moral preference]] +- [[AI alignment is a coordination problem not a technical problem]] + +Topics: +- domains/ai-alignment/_map +- core/mechanisms/_map +- foundations/collective-intelligence/_map \ No newline at end of file diff --git a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md index eb4c1986..de076d53 100644 --- a/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md +++ b/inbox/archive/2024-04-00-conitzer-social-choice-guide-alignment.md @@ -7,10 +7,16 @@ date: 2024-04-01 domain: ai-alignment secondary_domains: [mechanisms, collective-intelligence] format: paper -status: unprocessed +status: processed priority: high tags: [social-choice, rlhf, rlchf, evaluator-selection, mechanism-design, pluralism, arrow-workaround] flagged_for_rio: ["Social welfare functions as governance mechanisms — direct parallel to futarchy/prediction market design"] +processed_by: theseus +processed_date: 2026-03-11 +claims_extracted: ["rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "post-arrow-social-choice-mechanisms-work-by-weakening-independence-of-irrelevant-alternatives.md", "pluralistic-ai-alignment-through-multiple-systems-preserves-value-diversity-better-than-forced-consensus.md", "rlchf-aggregated-rankings-variant-combines-evaluator-rankings-via-social-welfare-function-before-reward-model-training.md", "rlchf-features-based-variant-models-individual-preferences-with-evaluator-characteristics-enabling-aggregation-across-diverse-groups.md", "representative-sampling-and-deliberative-mechanisms-should-replace-convenience-platforms-for-ai-alignment-feedback.md"] +enrichments_applied: ["pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state.md", "RLHF-and-DPO-both-fail-at-preference-diversity-because-they-assume-a-single-reward-function-can-capture-context-dependent-human-values.md", "collective-intelligence-requires-diversity-as-a-structural-precondition-not-a-moral-preference.md", "AI-alignment-is-a-coordination-problem-not-a-technical-problem.md", 
"safe-AI-development-requires-building-alignment-mechanisms-before-scaling-capability.md"] +extraction_model: "anthropic/claude-sonnet-4.5" +extraction_notes: "Definitive position paper connecting social choice theory to AI alignment. Six new claims extracted covering RLHF as implicit social choice, post-Arrow mechanisms, pluralism option, and RLCHF variants. Five enrichments to existing claims on preference diversity, collective intelligence, and coordination. No entity data. Key insight: mainstream AI alignment is converging toward collective superintelligence thesis through the 'pluralism option' without using that terminology. Stuart Russell co-authorship signals this is now a serious position within AI safety establishment." --- ## Content