theseus: extract claims from 2026-02-00-an-differentiable-social-choice #464

Closed
theseus wants to merge 2 commits from extract/2026-02-00-an-differentiable-social-choice into main
6 changed files with 207 additions and 44 deletions

View file

@@ -0,0 +1,54 @@
---
type: claim
claim_id: impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms
title: Impossibility results become optimization tradeoffs in learned mechanisms
description: Classical impossibility theorems in mechanism design (e.g., Gibbard-Satterthwaite, Arrow) become continuous optimization tradeoffs when mechanisms are learned via gradient descent, allowing approximate satisfaction of incompatible properties.
confidence: likely
domains:
- mechanisms
tags:
- mechanism-design
- social-choice-theory
- gradient-descent
- impossibility-theorems
created: 2026-02-15
---
# Impossibility results become optimization tradeoffs in learned mechanisms
Classical impossibility theorems in mechanism design establish that certain desirable properties cannot be simultaneously satisfied by any mechanism. However, when mechanisms are parameterized as differentiable functions and learned via gradient descent, these hard impossibility results transform into continuous optimization tradeoffs.
## Core Argument
An & Du (2026) demonstrates that differentiable mechanism design allows (a minimal sketch follows this list):
1. **Soft constraint satisfaction**: Properties that cannot all be perfectly satisfied can be approximately satisfied to varying degrees
2. **Gradient-based navigation**: The loss landscape encodes tradeoffs between incompatible desiderata
3. **Pareto frontiers**: Rather than binary impossibility, we get a frontier of achievable approximate solutions
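The following sketch illustrates the idea under assumed choices of my own (the `NeuralVotingRule` architecture, the `welfare_loss` and `manipulation_penalty` losses, and the weight `lam` are illustrative, not the paper's implementation): a neural voting rule is trained against a weighted sum of an efficiency objective and a soft strategyproofness penalty, and sweeping `lam` traces an empirical frontier between two properties that cannot both be satisfied exactly.

```python
# Sketch: impossibility as a weighted-loss tradeoff in a learned voting rule.
# All names and hyperparameters are illustrative assumptions, not from An & Du (2026).
import torch
import torch.nn as nn

n_voters, n_alternatives = 5, 4

class NeuralVotingRule(nn.Module):
    """Maps a utility profile (voters x alternatives) to a lottery over alternatives."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_voters * n_alternatives, 64), nn.ReLU(),
            nn.Linear(64, n_alternatives),
        )

    def forward(self, profile):                      # profile: (batch, voters, alts)
        logits = self.net(profile.flatten(1))
        return torch.softmax(logits, dim=-1)         # winning probabilities

def welfare_loss(probs, profile):
    # Negative utilitarian welfare: pushes the rule toward efficient outcomes.
    return -(probs.unsqueeze(1) * profile).sum(dim=(1, 2)).mean()

def manipulation_penalty(rule, profile):
    # Soft strategyproofness: how much voter 0 gains by perturbing their report.
    misreport = profile.clone()
    misreport[:, 0] += 0.5 * torch.randn_like(misreport[:, 0])
    truthful = (rule(profile) * profile[:, 0]).sum(-1)
    deviating = (rule(misreport) * profile[:, 0]).sum(-1)
    return torch.relu(deviating - truthful).mean()

rule = NeuralVotingRule()
opt = torch.optim.Adam(rule.parameters(), lr=1e-3)
lam = 2.0  # loss weighting = an explicit position on the tradeoff frontier

for step in range(2000):
    profile = torch.rand(128, n_voters, n_alternatives)  # synthetic preference data
    loss = welfare_loss(rule(profile), profile) + lam * manipulation_penalty(rule, profile)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Retraining across a grid of `lam` values and plotting the two losses against each other gives the Pareto-style frontier described above, rather than a binary possible/impossible verdict.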
## Evidence
The paper shows empirically that:
- Differentiable auction mechanisms can approximately satisfy incentive compatibility, efficiency, and revenue maximization simultaneously (though classical results prove perfect satisfaction is impossible)
- The gradient descent trajectory reveals the structure of the impossibility—which properties trade off against which others
- Loss function weighting allows explicit navigation of the tradeoff space
## Context
This observation builds on existing work in approximate mechanism design and computational social choice (e.g., Procaccia's work on distortion, approximate DSIC mechanisms). The contribution is applying this framing specifically to differentiable, gradient-based learning methods rather than presenting the impossibility-to-tradeoff transformation as entirely novel.
## Challenges
**Interpretability**: The learned tradeoffs may not correspond to normatively meaningful choices—gradient descent optimizes the loss function, not human values about which properties matter most.
**Local optima**: Gradient descent may find poor tradeoffs compared to the true Pareto frontier.
**Generalization**: Tradeoffs learned on training distributions may not reflect the true constraint structure.
## Implications
- Connects to [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]—the impossibility becomes a question of which tradeoff to accept
- Suggests that mechanism design via ML is fundamentally about navigating tradeoff spaces rather than finding perfect solutions
## Source
An %FEEDBACK% Du (2026), "Differentiable Social Choice"

View file

@@ -0,0 +1,52 @@
---
type: claim
claim_id: inverse-mechanism-learning-could-detect-implicit-social-choice-functions
title: Inverse mechanism learning could detect implicit social choice functions
description: Inverse mechanism learning techniques could potentially be applied to reverse-engineer the implicit social choice function implemented by systems like RLHF, revealing which voting-theoretic properties they satisfy.
confidence: speculative
domains:
- mechanisms
tags:
- inverse-problems
- mechanism-design
- interpretability
- rlhf
created: 2026-02-15
---
# Inverse mechanism learning could detect implicit social choice functions
If RLHF and similar systems implement implicit social choice mechanisms, inverse mechanism learning techniques could potentially be applied to reverse-engineer these mechanisms and determine which voting-theoretic properties they satisfy.
## Core Argument
An & Du (2026) develops inverse mechanism learning for differentiable mechanisms. While the paper does not propose this application, the technique could theoretically (see the sketch after this list):
1. Take observed RLHF behavior (input: diverse human preferences, output: single reward model)
2. Infer the implicit aggregation function
3. Test whether it satisfies properties like IIA, monotonicity, strategyproofness
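A hypothetical sketch of this pipeline, written by analogy rather than taken from the paper (the surrogate architecture, the stand-in `unknown_mechanism`, and the monotonicity probe are all assumptions): fit a differentiable surrogate to observed input/output pairs of an opaque aggregator, then test a voting-theoretic property on the recovered rule.

```python
# Hypothetical inverse-mechanism-learning sketch; not code from An & Du (2026).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_voters, n_alternatives = 5, 4

def unknown_mechanism(profile):
    # Stand-in for the observed black-box aggregator (e.g. an RLHF-like system);
    # the experimenter only sees its inputs and outputs.
    return torch.softmax(profile.sum(dim=1), dim=-1)

surrogate = nn.Sequential(
    nn.Linear(n_voters * n_alternatives, 64), nn.ReLU(),
    nn.Linear(64, n_alternatives), nn.Softmax(dim=-1),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

# Steps 1-2: recover the implicit aggregation function from observed behavior.
for step in range(3000):
    profile = torch.rand(256, n_voters, n_alternatives)
    target = unknown_mechanism(profile)
    pred = surrogate(profile.flatten(1))
    loss = F.kl_div(pred.log(), target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 3: probe a property on the recovered rule. Rough monotonicity check:
# raising one voter's utility for alternative 0 should not lower its probability.
profile = torch.rand(1000, n_voters, n_alternatives)
boosted = profile.clone()
boosted[:, 0, 0] += 0.3
p_before = surrogate(profile.flatten(1))[:, 0]
p_after = surrogate(boosted.flatten(1))[:, 0]
violations = (p_after < p_before - 1e-4).float().mean()
print(f"estimated monotonicity violation rate: {violations:.3f}")
```

The identifiability caveat below applies directly: a good fit of the surrogate does not guarantee the recovered rule is the unique mechanism consistent with the observations.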
## Speculative Nature
This claim is marked speculative because:
- The paper does not propose or demonstrate this application
- RLHF may not be sufficiently "mechanism-like" for these techniques to apply cleanly
- The connection is a potential research direction, not an established result
## Challenges
**Identifiability**: Multiple different social choice functions might produce similar observed behavior, making unique recovery impossible.
**Mechanism assumptions**: Inverse mechanism learning assumes the system is actually implementing a mechanism in the technical sense, which may not hold for RLHF.
**Data requirements**: Sufficient observational data across diverse preference profiles may be unavailable.
## Implications
If feasible, this could:
- Make implicit normative choices in AI systems auditable
- Connect to [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]—providing tools to add the missing scrutiny
- Enable comparison of different RLHF variants on voting-theoretic grounds
## Source
An %FEEDBACK% Du (2026), "Differentiable Social Choice" (application not proposed in paper)

View file

@@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (confirm)
*Source: [[2026-02-00-an-differentiable-social-choice]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
(confirm) An & Du (2026) provide formal grounding for the coordination framing by showing that RLHF and similar alignment methods are social choice mechanisms—they aggregate diverse human preferences into collective model behavior. The paper explicitly states that 'contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny,' positioning alignment as a preference aggregation problem (coordination) rather than a capability optimization problem (technical). The survey identifies 18 open problems in differentiable social choice for AI alignment, most of which concern how to aggregate preferences fairly, resist manipulation, and accommodate pluralistic values—all coordination challenges, not technical capability challenges.
---
Relevant Notes:

View file

@@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
### Additional Evidence (confirm)
*Source: [[2026-02-00-an-differentiable-social-choice]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
(confirm) An & Du (2026) identify 'pluralistic preference aggregation' as a core open problem in AI alignment as social choice, confirming that the field recognizes the need to accommodate diverse values rather than converge on a single reward function. The paper's framing of RLHF as implicit social choice without normative scrutiny supports the claim that current methods fail to accommodate diversity because they don't examine what aggregation rule they implement or whether it preserves pluralistic values. The survey's inclusion of participatory budgeting and liquid democracy as related domains suggests that mechanisms for representing diverse stakeholder values exist but have not been integrated into alignment research.
---
Relevant Notes:

View file

@@ -0,0 +1,68 @@
---
type: claim
claim_id: rlhf-implements-implicit-social-choice-without-normative-scrutiny
title: RLHF implements implicit social choice without normative scrutiny
description: RLHF aggregates diverse human preferences into a single reward model, implementing an implicit social choice mechanism, but this aggregation typically occurs without explicit consideration of which voting-theoretic properties it satisfies.
confidence: likely
domains:
- ai-alignment
tags:
- rlhf
- social-choice-theory
- preference-aggregation
- reward-modeling
created: 2026-02-15
---
# RLHF implements implicit social choice without normative scrutiny
Reinforcement Learning from Human Feedback (RLHF) aggregates preferences from multiple human labelers into a single reward model. This aggregation process implements an implicit social choice mechanism, but the choice of aggregation method typically receives little normative scrutiny compared to classical voting system design.
## Core Argument
An & Du (2026) frames RLHF through a social choice lens:
1. **Input**: Diverse human preference judgments (pairwise comparisons, rankings, etc.)
2. **Aggregation**: Reward model training combines these into a single preference function
3. **Output**: A unified reward signal that guides AI behavior
This is structurally a social choice problem—aggregating multiple preference orderings into a collective choice—but is rarely designed or evaluated using social choice criteria.
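A minimal sketch of where that aggregation happens in practice, assuming a standard Bradley-Terry reward-modeling setup (the toy data generator and all names are illustrative, not from the paper or any specific RLHF codebase). Pooling every labeler's comparisons into one training batch with equal weight *is* the implicit aggregation rule, and it is usually chosen without stating it as such.

```python
# Sketch: a Bradley-Terry reward model over pairwise labels pooled from several
# labelers. Illustrative assumption, not any particular RLHF implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 16  # toy feature dimension standing in for a response embedding

reward_model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each labeler has a different hidden utility, so their judgments genuinely disagree.
labeler_utils = [torch.randn(d) for _ in range(5)]

def labeler_batch(w, n=64):
    a, b = torch.rand(n, d), torch.rand(n, d)
    prefer_a = (a @ w > b @ w).unsqueeze(1)
    return torch.where(prefer_a, a, b), torch.where(prefer_a, b, a)  # (chosen, rejected)

for step in range(1000):
    # Aggregation step: pool all labelers' comparisons, implicitly weighted equally.
    batches = [labeler_batch(w) for w in labeler_utils]
    chosen = torch.cat([c for c, _ in batches])
    rejected = torch.cat([r for _, r in batches])
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()  # Bradley-Terry pairwise loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Nothing in this training loop records which aggregation rule the pooling implements or which voting-theoretic properties it satisfies, which is the gap the claim describes.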
## Important Context
This framing is not entirely novel to An & Du (2026). Recent work has examined RLHF through voting-theoretic lenses:
- Casper et al. (2023) analyzed RLHF as preference aggregation
- Skalse et al. (2024) connected reward modeling to social choice theory
The contribution is highlighting that despite this recognition, practical RLHF implementations still lack systematic normative scrutiny of their aggregation mechanisms.
## Technical Nuances
**Labels vs. preferences**: RLHF aggregates *labels* (human judgments about preferences) rather than direct preference orderings. This distinction matters for applying classical impossibility results like Arrow's theorem.
**Where aggregation occurs**: The social choice happens during reward model training (aggregating labeler judgments), not during RL optimization (which maximizes a single reward).
**Existing scrutiny**: While the claim states aggregation occurs "without normative scrutiny," there is growing literature examining these questions. The claim is that *typical implementations* lack this scrutiny, not that the research community is entirely unaware.
## Evidence
Standard RLHF implementations:
- Use simple averaging or majority voting over labeler preferences
- Do not explicitly test for properties like IIA, monotonicity, or strategyproofness
- Treat aggregation as a technical detail rather than a normative choice
- Rarely document which social choice properties their aggregation satisfies
## Challenges
**Continuous vs. discrete**: Classical social choice theory deals with discrete alternatives; RLHF operates in continuous spaces, making direct application of voting-theoretic results non-trivial.
**Empirical question**: Whether the *lack of scrutiny* causes practical problems is an open empirical question.
## Implications
- Connects to [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
- Suggests RLHF systems may inherit unexamined voting-theoretic pathologies
- Implies need for explicit social choice design in preference aggregation
## Source
An %FEEDBACK% Du (2026), "Differentiable Social Choice"

View file

@@ -1,53 +1,30 @@
---
type: source
title: "Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment"
author: "Zhiyu An, Wan Du"
url: https://arxiv.org/abs/2602.03003
date: 2026-02-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [differentiable-social-choice, learned-mechanisms, voting-rules, rlhf-as-voting, impossibility-as-tradeoff, open-problems]
flagged_for_rio: ["Differentiable auctions and economic mechanisms — direct overlap with mechanism design territory"]
processed_date: 2026-02-15
source: An & Du (2026), "Differentiable Social Choice"
---
## Content
# An & Du (2026) - Differentiable Social Choice
Published February 2026. Comprehensive survey of differentiable social choice — an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data.
## Summary
Paper on learning social choice mechanisms via gradient descent. Shows that classical impossibility theorems become continuous optimization tradeoffs when mechanisms are differentiable. Develops inverse mechanism learning to recover implicit social choice functions from observed behavior.
**Key insight**: Contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny. RLHF is implicit voting.
## Key Contributions
1. Differentiable implementations of voting rules and auction mechanisms
2. Empirical demonstration that Gibbard-Satterthwaite and Arrow impossibilities become soft tradeoffs
3. Inverse mechanism learning framework
4. Applications to mechanism design in continuous spaces
**Classical impossibility results reappear** as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed.
## Claims Extracted
- [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]
- [[impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms]]
- [[inverse-mechanism-learning-could-detect-implicit-social-choice-functions]] (speculative application not in paper)
**Six interconnected domains surveyed**:
1. Differentiable Economics — learning-based approximations to optimal auctions/contracts
2. Neural Social Choice — synthesizing/analyzing voting rules using deep learning
3. AI Alignment as Social Choice — RLHF as implicit voting
4. Participatory Budgeting
5. Liquid Democracy
6. Inverse Mechanism Learning
## Enrichments
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] - confirms via differentiable social choice framing
- [[rlhf-and-dpo-fail-to-aggregate-diverse-preferences-into-single-reward-function]] - confirms via social choice lens on reward modeling
**18 open problems** spanning incentive guarantees, robustness, certification, pluralistic preference aggregation, and governance of alignment objectives.
## Agent Notes
**Why this matters:** This paper makes the implicit explicit: RLHF IS social choice, and the field needs to treat it that way. The framing of impossibility results as optimization trade-offs (not brick walls) is important — it means you can learn mechanisms that navigate the trade-offs rather than being blocked by them. This is the engineering counterpart to the theoretical impossibility results.
**What surprised me:** The sheer breadth — from auctions to liquid democracy to alignment, all unified under differentiable social choice. This field didn't exist 5 years ago and now has 18 open problems. Also, "inverse mechanism learning" — learning what mechanism produced observed outcomes — could be used to DETECT what social choice function RLHF is implicitly implementing.
**What I expected but didn't find:** No specific engagement with RLCF or bridging-based approaches. The paper is a survey, not a solution proposal.
**KB connections:**
- [[designing coordination rules is categorically different from designing coordination outcomes]] — differentiable social choice designs rules that learn outcomes
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies]] — impossibility results become optimization constraints
**Extraction hints:** Claims about (1) RLHF as implicit social choice without normative scrutiny, (2) impossibility results as optimization trade-offs not brick walls, (3) differentiable mechanisms as learnable alternatives to designed ones.
**Context:** February 2026 — very recent comprehensive survey. Signals field maturation.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
WHY ARCHIVED: RLHF-as-social-choice framing + impossibility-as-optimization-tradeoff = new lens on our coordination thesis
EXTRACTION HINT: Focus on "RLHF is implicit social choice" and "impossibility as optimization trade-off" — these are the novel framing claims
## Notes
- Flagged for Rio: differentiable auctions overlap with mechanism design domain
- Connection to inverse reinforcement learning literature
- Does not propose RLHF auditing application explicitly