diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md index 093867dee..e60171a8c 100644 --- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md +++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md @@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system. + +### Additional Evidence (confirm) +*Source: [[2026-02-00-an-differentiable-social-choice]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +(confirm) An & Du (2026) provide formal grounding for the coordination framing by showing that RLHF and similar alignment methods are social choice mechanisms—they aggregate diverse human preferences into collective model behavior. The paper explicitly states that 'contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny,' positioning alignment as a preference aggregation problem (coordination) rather than a capability optimization problem (technical). The survey identifies 18 open problems in differentiable social choice for AI alignment, most of which concern how to aggregate preferences fairly, resist manipulation, and accommodate pluralistic values—all coordination challenges, not technical capability challenges. + --- Relevant Notes: diff --git a/domains/ai-alignment/impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms.md b/domains/ai-alignment/impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms.md new file mode 100644 index 000000000..6c474362b --- /dev/null +++ b/domains/ai-alignment/impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms.md @@ -0,0 +1,45 @@ +--- +type: claim +domain: mechanisms +secondary_domains: [ai-alignment, collective-intelligence] +description: "Arrow's theorem and similar impossibility results reappear as optimization constraints when mechanisms are learned from data rather than analytically designed" +confidence: likely +source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'" +created: 2026-03-11 +depends_on: + - "[[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]" +--- + +# Impossibility results become optimization tradeoffs in learned mechanisms + +Classical impossibility theorems in social choice and mechanism design—like Arrow's theorem proving no voting rule can satisfy all desirable properties simultaneously—reappear as optimization constraints and objective trade-offs when mechanisms are learned through differentiable models rather than analytically designed. + +An & Du (2026) observe that "classical impossibility results reappear as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed." This represents a fundamental shift in how impossibility results function. 
In classical mechanism design, Arrow's theorem is a brick wall: no rule for aggregating individual rankings over three or more alternatives can be simultaneously non-dictatorial, Pareto efficient, and independent of irrelevant alternatives. The theorem proves this is impossible, full stop. + +But in differentiable social choice, impossibility results become navigable trade-offs. You can formulate a loss function that includes terms for each desirable property (non-dictatorship, Pareto efficiency, IIA) and learn a mechanism that optimizes the weighted combination (see the illustrative sketch after the Challenges section). The mechanism won't satisfy all properties perfectly—Arrow's theorem still holds—but it can navigate the trade-off space, achieving partial satisfaction of multiple properties rather than being blocked entirely. + +This matters because it transforms impossibility from a barrier into a design space. Instead of asking "which property should we sacrifice?" (the classical approach), differentiable mechanisms ask "what weighted combination of properties best serves our goals?" and learn mechanisms that optimize that combination. The impossibility result constrains the achievable region but doesn't prevent exploration of the Pareto frontier. + +The paper identifies this as a core insight across all six domains surveyed: differentiable economics navigates auction impossibility results, neural social choice navigates voting impossibility results, and AI alignment navigates preference aggregation impossibility results. In each case, learning-based approaches convert "you cannot have X" into "here is the optimal trade-off between X and Y given your priorities." + +## Evidence + +- An & Du (2026) explicitly state: "classical impossibility results reappear as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed" +- The paper surveys six domains (differentiable economics, neural social choice, AI alignment, participatory budgeting, liquid democracy, inverse mechanism learning) where this pattern recurs +- Arrow's theorem, Gibbard-Satterthwaite, and other impossibility results are referenced as constraints in the optimization formulations presented across these domains + +## Challenges + +Learned mechanisms that navigate impossibility trade-offs may be less interpretable than analytically designed mechanisms, making it harder to verify that they satisfy desired properties or to explain their behavior to stakeholders. The optimization may also be sensitive to how properties are weighted in the loss function, introducing new degrees of freedom that require normative choices about which properties matter most.
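## Illustrative sketch

To make the trade-off framing concrete, the following is a minimal PyTorch sketch of the kind of formulation described above, not a construction from An & Du (2026): a small network is trained as a voting rule, and relaxed, differentiable penalties stand in for non-dictatorship, Pareto efficiency, and a restricted form of IIA. The architecture, penalty definitions, and weights are all illustrative assumptions.

```python
# Hypothetical sketch: Arrow-style properties as weighted loss terms for a learned voting rule.
# Everything here (shapes, penalty definitions, weights) is an assumption for illustration only.
import torch
import torch.nn as nn

N_VOTERS, N_ALTS = 5, 4

class LearnedVotingRule(nn.Module):
    """Maps a profile of voter utilities (n_voters x n_alts) to a distribution over winners."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_VOTERS * N_ALTS, 64), nn.ReLU(),
            nn.Linear(64, N_ALTS),
        )

    def forward(self, profiles):          # profiles: (batch, n_voters, n_alts)
        return torch.softmax(self.net(profiles.flatten(1)), dim=-1)

def pareto_penalty(probs, profiles):
    """Probability mass placed on alternatives that every voter ranks below some other alternative."""
    better = (profiles.unsqueeze(3) > profiles.unsqueeze(2)).all(dim=1)  # (batch, a, b): all voters prefer a to b
    dominated = better.any(dim=1).float()                                # (batch, b): b is Pareto-dominated
    return (probs * dominated).sum(dim=-1).mean()

def dictatorship_penalty(probs, profiles):
    """Crude non-dictatorship proxy: mass placed on the favourite of the voter the rule agrees with most."""
    favourites = profiles.argmax(dim=-1)                 # (batch, n_voters)
    agree = probs.gather(1, favourites)                  # mass on each voter's favourite
    return agree.max(dim=1).values.mean()

def iia_penalty(rule, probs, profiles):
    """Restricted IIA proxy: perturbing one alternative's utilities should barely move the others' shares."""
    k = torch.randint(N_ALTS, (1,)).item()
    perturbed = profiles.clone()
    perturbed[:, :, k] += 0.5 * torch.randn_like(perturbed[:, :, k])
    keep = [i for i in range(N_ALTS) if i != k]
    p, q = probs[:, keep], rule(perturbed)[:, keep]
    p, q = p / p.sum(-1, keepdim=True), q / q.sum(-1, keepdim=True)
    return (p - q).abs().sum(-1).mean()

rule = LearnedVotingRule()
opt = torch.optim.Adam(rule.parameters(), lr=1e-3)
w_pareto, w_dict, w_iia = 1.0, 0.5, 0.5   # the normative weighting is a design choice, not a theorem

for step in range(1000):
    profiles = torch.rand(128, N_VOTERS, N_ALTS)        # synthetic cardinal preference profiles
    probs = rule(profiles)
    loss = (w_pareto * pareto_penalty(probs, profiles)
            + w_dict * dictatorship_penalty(probs, profiles)
            + w_iia * iia_penalty(rule, probs, profiles))
    opt.zero_grad(); loss.backward(); opt.step()
```

Arrow's theorem still holds for the exact, ordinal versions of these properties; the weights only choose where on the trade-off surface the learned rule ends up, which is precisely the normative choice flagged under Challenges.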
+ +--- + +Relevant Notes: +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] +- [[optimal governance requires mixing mechanisms because different decisions have different manipulation risk profiles]] +- [[AI alignment is a coordination problem not a technical problem]] + +Topics: +- [[core/mechanisms/_map]] +- [[domains/ai-alignment/_map]] +- [[foundations/collective-intelligence/_map]] diff --git a/domains/ai-alignment/inverse-mechanism-learning-can-detect-implicit-social-choice-functions.md b/domains/ai-alignment/inverse-mechanism-learning-can-detect-implicit-social-choice-functions.md new file mode 100644 index 000000000..c34d149ae --- /dev/null +++ b/domains/ai-alignment/inverse-mechanism-learning-can-detect-implicit-social-choice-functions.md @@ -0,0 +1,44 @@ +--- +type: claim +domain: mechanisms +secondary_domains: [ai-alignment, collective-intelligence] +description: "Inverse mechanism learning could infer which social choice function RLHF implicitly implements by learning from preference inputs and model behavior outputs" +confidence: speculative +source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'" +created: 2026-03-11 +depends_on: + - "[[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]" +--- + +# Inverse mechanism learning could detect implicit social choice functions + +Inverse mechanism learning—inferring what aggregation rule produced observed collective decisions from input-output pairs—could be applied to RLHF systems to detect and characterize the implicit social choice function they implement, making visible what is currently hidden in the training process. + +An & Du (2026) identify "inverse mechanism learning" as one of six core domains in differentiable social choice: "learning what mechanism produced observed outcomes." While the paper does not explicitly propose applying this to RLHF, the conceptual fit is direct. RLHF takes diverse human preference inputs (pairwise comparisons, ratings, demonstrations) and produces model behavior outputs. Inverse mechanism learning could take those inputs and outputs and infer what aggregation rule the reward model is implementing. + +This matters because it would make the implicit explicit. Currently, RLHF practitioners know the loss function they optimize but may not know what social choice properties the resulting mechanism satisfies. Does it treat all raters equally? Does it amplify majority preferences and suppress minorities? Is it vulnerable to strategic manipulation by raters who understand the aggregation rule? Inverse mechanism learning could answer these questions empirically by characterizing the learned mechanism's properties. + +The approach would work by training an inverse model on RLHF input-output pairs: given preference data from raters and the resulting model behavior, learn a function that predicts model behavior from preferences. That learned function is a characterization of the implicit social choice mechanism. It could then be analyzed using classical social choice criteria (Arrow properties, strategyproofness, etc.) to determine what kind of voting rule RLHF is actually implementing. 
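The sketch below illustrates what such an inverse model could look like in PyTorch. An & Du (2026) do not propose this application, and every name, tensor shape, and probe here is an assumption; it also presumes access to rater-level preference data and measurements of the aligned model's behaviour on the same prompts, which may not exist for proprietary systems.

```python
# Speculative sketch: recover RLHF's implicit aggregation rule as an explicit surrogate,
# then probe it with classical social-choice criteria. All names and data shapes are assumptions.
import torch
import torch.nn as nn

N_RATERS, N_RESPONSES = 8, 4

class SurrogateAggregator(nn.Module):
    """Predicts the aligned model's choice distribution over candidate responses
    from per-rater scores, serving as an explicit stand-in for the implicit mechanism."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_RATERS * N_RESPONSES, 128), nn.ReLU(),
            nn.Linear(128, N_RESPONSES),
        )

    def forward(self, ratings):                         # ratings: (batch, n_raters, n_responses)
        return torch.softmax(self.net(ratings.flatten(1)), dim=-1)

def fit(surrogate, ratings, observed_behaviour, epochs=200):
    """Supervised fit on (rater inputs, measured model behaviour on the same prompts) pairs."""
    opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = surrogate(ratings)
        loss = nn.functional.kl_div(pred.log(), observed_behaviour, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()

def probe_anonymity(surrogate, ratings):
    """Does the recovered rule treat raters interchangeably? Checked with one sampled permutation."""
    perm = torch.randperm(N_RATERS)
    return (surrogate(ratings) - surrogate(ratings[:, perm, :])).abs().max().item()

def probe_majoritarian(surrogate, ratings):
    """How often does the recovered rule pick the plurality winner among raters' top choices?"""
    majority_pick = ratings.argmax(dim=-1).mode(dim=1).values
    return (surrogate(ratings).argmax(dim=-1) == majority_pick).float().mean().item()
```

Probes like these would let the recovered rule be described in classical terms (anonymous or rater-weighted, plurality-like or not), subject to the identifiability caveat raised under Challenges below.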
+ +## Evidence + +- An & Du (2026) identify "inverse mechanism learning" as a core domain in differentiable social choice: "learning what mechanism produced observed outcomes" +- The paper establishes that RLHF is an implicit social choice mechanism (An & Du 2026) +- Classical social choice theory provides formal criteria that could evaluate a detected mechanism's properties +- The conceptual connection is direct: RLHF is an implicit mechanism, inverse learning detects mechanisms, therefore inverse learning could detect RLHF's mechanism + +## Challenges + +Inverse mechanism learning may face identifiability problems: multiple different social choice functions could produce similar model behaviors, making it difficult to uniquely determine which mechanism is being implemented. The approach also requires sufficient input-output data to train the inverse model, which may not be available for proprietary RLHF systems. The paper does not propose this application, so this remains a speculative extension of the framework. + +--- + +Relevant Notes: +- [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]] +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] + +Topics: +- [[core/mechanisms/_map]] +- [[domains/ai-alignment/_map]] +- [[foundations/collective-intelligence/_map]] diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md index b5195bb0a..66b078535 100644 --- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md +++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md @@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate. + +### Additional Evidence (confirm) +*Source: [[2026-02-00-an-differentiable-social-choice]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5* + +(confirm) An & Du (2026) identify 'pluralistic preference aggregation' as a core open problem in AI alignment as social choice, confirming that the field recognizes the need to accommodate diverse values rather than converge on a single reward function. The paper's framing of RLHF as implicit social choice without normative scrutiny supports the claim that current methods fail to accommodate diversity because they don't examine what aggregation rule they implement or whether it preserves pluralistic values. The survey's inclusion of participatory budgeting and liquid democracy as related domains suggests that mechanisms for representing diverse stakeholder values exist but have not been integrated into alignment research. 
+ --- Relevant Notes: diff --git a/domains/ai-alignment/rlhf-implements-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-implements-implicit-social-choice-without-normative-scrutiny.md new file mode 100644 index 000000000..1dcea125d --- /dev/null +++ b/domains/ai-alignment/rlhf-implements-implicit-social-choice-without-normative-scrutiny.md @@ -0,0 +1,47 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [mechanisms, collective-intelligence] +description: "RLHF aggregates diverse human preferences into model behavior without examining what social choice function it implements or whether it satisfies democratic criteria" +confidence: likely +source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'" +created: 2026-03-11 +depends_on: + - "[[AI alignment is a coordination problem not a technical problem]]" + - "[[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]" +--- + +# RLHF implements implicit social choice without normative scrutiny + +Reinforcement Learning from Human Feedback (RLHF) and similar alignment methods are social choice mechanisms—they aggregate diverse human preferences into a single model behavior—but the field treats them as technical optimization problems rather than as voting systems that require normative evaluation. + +An & Du (2026) argue that "contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny." When RLHF aggregates preferences from multiple human raters, it performs the same function as a voting rule: taking diverse inputs and producing a collective decision. However, unlike formal voting theory, which explicitly examines properties like fairness, manipulation-resistance, and representation, ML alignment research focuses on loss functions and convergence without asking what social choice function is being implemented or whether it satisfies democratic criteria. + +This matters because different aggregation methods have different normative properties. A simple average treats all preferences equally but may suppress minority views. A median is robust to outliers but may ignore intensity of preference. RLHF's specific aggregation mechanism (reward model training on pairwise comparisons) implements a particular social choice function, but practitioners rarely examine which one or whether it has desirable properties. + +The paper positions this as foundational: "RLHF is implicit voting" means that alignment is already doing social choice, just without the theoretical tools or normative frameworks that voting theory has developed over centuries. Making this implicit mechanism explicit would allow researchers to evaluate whether RLHF satisfies criteria like Arrow's independence of irrelevant alternatives, resistance to strategic manipulation, or fair representation of diverse values. + +## Evidence + +- An & Du (2026) survey of differentiable social choice identifies RLHF as a core application domain where "AI Alignment as Social Choice" is already happening in practice +- The paper explicitly states: "contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny" +- Classical social choice theory provides formal criteria (Arrow's theorem, Gibbard-Satterthwaite, etc.) 
that could evaluate RLHF's aggregation properties but are not currently applied in alignment research +- The survey identifies "pluralistic preference aggregation" as one of 18 open problems in differentiable social choice for AI alignment + +## Challenges + +Some may argue that RLHF is fundamentally different from voting because it optimizes a continuous reward function rather than selecting discrete alternatives. However, the aggregation problem remains: multiple human preferences must be combined into a single model behavior, which is the core social choice problem regardless of mathematical representation. + +--- + +Relevant Notes: +- [[AI alignment is a coordination problem not a technical problem]] +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] +- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] + +Topics: +- [[domains/ai-alignment/_map]] +- [[core/mechanisms/_map]] +- [[foundations/collective-intelligence/_map]] diff --git a/inbox/archive/2026-02-00-an-differentiable-social-choice.md b/inbox/archive/2026-02-00-an-differentiable-social-choice.md index e84d9698a..698dbdeec 100644 --- a/inbox/archive/2026-02-00-an-differentiable-social-choice.md +++ b/inbox/archive/2026-02-00-an-differentiable-social-choice.md @@ -7,10 +7,16 @@ date: 2026-02-01 domain: ai-alignment secondary_domains: [mechanisms, collective-intelligence] format: paper -status: unprocessed +status: processed priority: medium tags: [differentiable-social-choice, learned-mechanisms, voting-rules, rlhf-as-voting, impossibility-as-tradeoff, open-problems] flagged_for_rio: ["Differentiable auctions and economic mechanisms — direct overlap with mechanism design territory"] +processed_by: theseus +processed_date: 2026-03-11 +claims_extracted: ["rlhf-implements-implicit-social-choice-without-normative-scrutiny.md", "impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms.md", "inverse-mechanism-learning-can-detect-implicit-social-choice-functions.md"] +enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"] +extraction_model: "anthropic/claude-sonnet-4.5" +extraction_notes: "High-value extraction. Three novel claims about RLHF as implicit social choice, impossibility-as-tradeoff framing, and inverse mechanism learning application. Four enrichments to existing coordination and alignment claims. The RLHF-as-social-choice framing is the key insight—makes explicit what was implicit. Impossibility-as-optimization-tradeoff extends the rules-vs-outcomes thesis. Inverse mechanism learning is speculative but conceptually strong. No engagement with RLCF or bridging-based approaches as agent notes predicted." --- ## Content @@ -51,3 +57,9 @@ Published February 2026. 
Comprehensive survey of differentiable social choice PRIMARY CONNECTION: [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]] WHY ARCHIVED: RLHF-as-social-choice framing + impossibility-as-optimization-tradeoff = new lens on our coordination thesis EXTRACTION HINT: Focus on "RLHF is implicit social choice" and "impossibility as optimization trade-off" — these are the novel framing claims + + +## Key Facts +- Paper published February 2026 as comprehensive survey of differentiable social choice +- 18 open problems identified spanning incentive guarantees, robustness, certification, pluralistic preference aggregation, and governance +- Six domains surveyed: differentiable economics, neural social choice, AI alignment as social choice, participatory budgeting, liquid democracy, inverse mechanism learning