auto-fix: address review feedback on PR #464

- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
This commit is contained in:
Teleo Agents 2026-03-11 08:43:11 +00:00
parent 632396a3ae
commit ac00d0568b
6 changed files with 176 additions and 173 deletions

View file

@@ -0,0 +1,54 @@
---
type: claim
claim_id: impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms
title: Impossibility results become optimization tradeoffs in learned mechanisms
description: Classical impossibility theorems in mechanism design (e.g., Gibbard-Satterthwaite, Arrow) become continuous optimization tradeoffs when mechanisms are learned via gradient descent, allowing approximate satisfaction of incompatible properties.
confidence: likely
domains:
- mechanisms
tags:
- mechanism-design
- social-choice-theory
- gradient-descent
- impossibility-theorems
created: 2026-02-15
---
# Impossibility results become optimization tradeoffs in learned mechanisms
Classical impossibility theorems in mechanism design establish that certain desirable properties cannot be simultaneously satisfied by any mechanism. However, when mechanisms are parameterized as differentiable functions and learned via gradient descent, these hard impossibility results transform into continuous optimization tradeoffs.
## Core Argument
An & Du (2026) demonstrate that differentiable mechanism design allows:
1. **Soft constraint satisfaction**: Properties that cannot all be perfectly satisfied can be approximately satisfied to varying degrees
2. **Gradient-based navigation**: The loss landscape encodes tradeoffs between incompatible desiderata
3. **Pareto frontiers**: Rather than binary impossibility, we get a frontier of achievable approximate solutions
## Evidence
The paper shows empirically that:
- Differentiable auction mechanisms can approximately satisfy incentive compatibility, efficiency, and revenue maximization simultaneously (though classical results prove perfect satisfaction is impossible)
- The gradient descent trajectory reveals the structure of the impossibility—which properties trade off against which others
- Loss function weighting allows explicit navigation of the tradeoff space
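A minimal illustrative sketch of this loss-weighting approach (not from the paper; the property losses are simplified proxies and all names are assumptions): a small network aggregates a score profile into a winner distribution, and a single weight trades a welfare objective off against an IIA-style penalty.

```python
import torch
import torch.nn as nn

n_voters, n_cands = 5, 3
# Learned aggregation rule: maps a flattened score profile to candidate logits.
rule = nn.Sequential(nn.Linear(n_voters * n_cands, 32), nn.ReLU(),
                     nn.Linear(32, n_cands))

def outcome(profile):
    """Distribution over winning candidates for a (n_voters, n_cands) score profile."""
    return torch.softmax(rule(profile.flatten()), dim=-1)

def welfare_loss(profile):
    # Efficiency proxy: maximize the expected total score of the chosen candidate.
    return -(outcome(profile) * profile.sum(dim=0)).sum()

def iia_penalty(profile):
    # IIA proxy: perturbing the scores of candidate 2 should not change the
    # relative probability of candidates 0 and 1.
    perturbed = profile.clone()
    perturbed[:, 2] += 0.5 * torch.randn(n_voters)
    p, q = outcome(profile), outcome(perturbed)
    return (p[0] / (p[0] + p[1]) - q[0] / (q[0] + q[1])) ** 2

lam = 2.0  # weighting knob: explicit navigation of the welfare-vs-IIA tradeoff
opt = torch.optim.Adam(rule.parameters(), lr=1e-3)
for _ in range(2000):
    profile = torch.rand(n_voters, n_cands)  # random preference profile
    loss = welfare_loss(profile) + lam * iia_penalty(profile)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Sweeping `lam` traces out an approximate Pareto frontier between the two properties, which is the sense in which the impossibility becomes a navigable tradeoff rather than a hard wall.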
## Context
This observation builds on existing work in approximate mechanism design and computational social choice (e.g., Procaccia's work on distortion, approximate DSIC mechanisms). The contribution is applying this framing specifically to differentiable, gradient-based learning methods rather than presenting the impossibility-to-tradeoff transformation as entirely novel.
## Challenges
**Interpretability**: The learned tradeoffs may not correspond to normatively meaningful choices—gradient descent optimizes the loss function, not human values about which properties matter most.
**Local optima**: Gradient descent may find poor tradeoffs compared to the true Pareto frontier.
**Generalization**: Tradeoffs learned on training distributions may not reflect the true constraint structure.
## Implications
- Connects to [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]—the impossibility becomes a question of which tradeoff to accept
- Suggests that mechanism design via ML is fundamentally about navigating tradeoff spaces rather than finding perfect solutions
## Source
An %FEEDBACK% Du (2026), "Differentiable Social Choice"

View file

@@ -0,0 +1,52 @@
---
type: claim
claim_id: inverse-mechanism-learning-could-detect-implicit-social-choice-functions
title: Inverse mechanism learning could detect implicit social choice functions
description: Inverse mechanism learning techniques could potentially be applied to reverse-engineer the implicit social choice function implemented by systems like RLHF, revealing which voting-theoretic properties they satisfy.
confidence: speculative
domains:
- mechanisms
tags:
- inverse-problems
- mechanism-design
- interpretability
- rlhf
created: 2026-02-15
---
# Inverse mechanism learning could detect implicit social choice functions
If RLHF and similar systems implement implicit social choice mechanisms, inverse mechanism learning techniques could potentially be applied to reverse-engineer these mechanisms and determine which voting-theoretic properties they satisfy.
## Core Argument
An & Du (2026) develop inverse mechanism learning for differentiable mechanisms. While the paper does not propose this application, the technique could theoretically:
1. Take observed RLHF behavior (input: diverse human preferences, output: single reward model)
2. Infer the implicit aggregation function
3. Test whether it satisfies properties like IIA, monotonicity, strategyproofness
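A hypothetical sketch of these three steps (the application is not in the paper; the data format, the surrogate architecture, and the monotonicity probe are all assumptions made for illustration):

```python
import torch
import torch.nn as nn

n_raters, n_options = 8, 4
# Surrogate for the implicit aggregation function: rater judgments -> aggregate scores.
surrogate = nn.Sequential(nn.Linear(n_raters * n_options, 64), nn.ReLU(),
                          nn.Linear(64, n_options))

def fit(profiles, outputs, steps=3000, lr=1e-3):
    """Step 2: infer the aggregation function from observed input-output behavior."""
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(surrogate(profiles.flatten(1)), outputs)
        opt.zero_grad()
        loss.backward()
        opt.step()

def monotonicity_violation_rate(profiles, eps=0.1):
    """Step 3: under a monotone mechanism, raising one rater's score for an option
    should never lower that option's aggregate score."""
    base = surrogate(profiles.flatten(1))
    bumped = profiles.clone()
    bumped[:, 0, 0] += eps  # rater 0 rates option 0 more highly
    return (surrogate(bumped.flatten(1))[:, 0] < base[:, 0]).float().mean()

# Step 1: observed behavior. Synthetic placeholders stand in for real
# (rater judgments, reward-model output) pairs; here the "system" just averages.
profiles = torch.rand(1024, n_raters, n_options)
outputs = profiles.mean(dim=1)
fit(profiles, outputs)
print("monotonicity violation rate:", monotonicity_violation_rate(profiles).item())
```

Analogous probes could be written for IIA or strategyproofness, though the identifiability caveats below still apply.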
## Speculative Nature
This claim is marked speculative because:
- The paper does not propose or demonstrate this application
- RLHF may not be sufficiently "mechanism-like" for these techniques to apply cleanly
- The connection is a potential research direction, not an established result
## Challenges
**Identifiability**: Multiple different social choice functions might produce similar observed behavior, making unique recovery impossible.
**Mechanism assumptions**: Inverse mechanism learning assumes the system is actually implementing a mechanism in the technical sense, which may not hold for RLHF.
**Data requirements**: Sufficient observational data across diverse preference profiles may be unavailable.
## Implications
If feasible, this could:
- Make implicit normative choices in AI systems auditable
- Connect to [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]—providing tools to add the missing scrutiny
- Enable comparison of different RLHF variants on voting-theoretic grounds
## Source
An & Du (2026), "Differentiable Social Choice" (application not proposed in paper)

View file

@@ -1,45 +0,0 @@
---
type: claim
domain: mechanisms
secondary_domains: [ai-alignment, collective-intelligence]
description: "Arrow's theorem and similar impossibility results reappear as optimization constraints when mechanisms are learned from data rather than analytically designed"
confidence: likely
source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'"
created: 2026-03-11
depends_on:
- "[[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]"
---
# Impossibility results become optimization tradeoffs in learned mechanisms
Classical impossibility theorems in social choice and mechanism design—like Arrow's theorem proving no voting rule can satisfy all desirable properties simultaneously—reappear as optimization constraints and objective trade-offs when mechanisms are learned through differentiable models rather than analytically designed.
An & Du (2026) observe that "classical impossibility results reappear as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed." This represents a fundamental shift in how impossibility results function. In classical mechanism design, Arrow's theorem is a brick wall: you cannot have a voting rule that is simultaneously non-dictatorial, Pareto efficient, and independent of irrelevant alternatives. The theorem proves this is impossible, full stop.
But in differentiable social choice, impossibility results become navigable trade-offs. You can formulate a loss function that includes terms for each desirable property (non-dictatorship, Pareto efficiency, IIA) and learn a mechanism that optimizes the weighted combination. The mechanism won't satisfy all properties perfectly—Arrow's theorem still holds—but it can navigate the trade-off space, achieving partial satisfaction of multiple properties rather than being blocked entirely.
This matters because it transforms impossibility from a barrier into a design space. Instead of asking "which property should we sacrifice?" (the classical approach), differentiable mechanisms ask "what weighted combination of properties best serves our goals?" and learn mechanisms that optimize that combination. The impossibility result constrains the achievable region but doesn't prevent exploration of the Pareto frontier.
The paper identifies this as a core insight across all six domains surveyed: differentiable economics navigates auction impossibility results, neural social choice navigates voting impossibility results, and AI alignment navigates preference aggregation impossibility results. In each case, learning-based approaches convert "you cannot have X" into "here is the optimal trade-off between X and Y given your priorities."
## Evidence
- An & Du (2026) explicitly state: "classical impossibility results reappear as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed"
- The paper surveys six domains (differentiable economics, neural social choice, AI alignment, participatory budgeting, liquid democracy, inverse mechanism learning) where this pattern recurs
- Arrow's theorem, Gibbard-Satterthwaite, and other impossibility results are referenced as constraints in the optimization formulations presented across these domains
## Challenges
Learned mechanisms that navigate impossibility trade-offs may be less interpretable than analytically designed mechanisms, making it harder to verify that they satisfy desired properties or to explain their behavior to stakeholders. The optimization may also be sensitive to how properties are weighted in the loss function, introducing new degrees of freedom that require normative choices about which properties matter most.
---
Relevant Notes:
- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
- [[optimal governance requires mixing mechanisms because different decisions have different manipulation risk profiles]]
- [[AI alignment is a coordination problem not a technical problem]]
Topics:
- [[core/mechanisms/_map]]
- [[domains/ai-alignment/_map]]
- [[foundations/collective-intelligence/_map]]

View file

@@ -1,44 +0,0 @@
---
type: claim
domain: mechanisms
secondary_domains: [ai-alignment, collective-intelligence]
description: "Inverse mechanism learning could infer which social choice function RLHF implicitly implements by learning from preference inputs and model behavior outputs"
confidence: speculative
source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'"
created: 2026-03-11
depends_on:
- "[[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]"
---
# Inverse mechanism learning could detect implicit social choice functions
Inverse mechanism learning—inferring what aggregation rule produced observed collective decisions from input-output pairs—could be applied to RLHF systems to detect and characterize the implicit social choice function they implement, making visible what is currently hidden in the training process.
An & Du (2026) identify "inverse mechanism learning" as one of six core domains in differentiable social choice: "learning what mechanism produced observed outcomes." While the paper does not explicitly propose applying this to RLHF, the conceptual fit is direct. RLHF takes diverse human preference inputs (pairwise comparisons, ratings, demonstrations) and produces model behavior outputs. Inverse mechanism learning could take those inputs and outputs and infer what aggregation rule the reward model is implementing.
This matters because it would make the implicit explicit. Currently, RLHF practitioners know the loss function they optimize but may not know what social choice properties the resulting mechanism satisfies. Does it treat all raters equally? Does it amplify majority preferences and suppress minorities? Is it vulnerable to strategic manipulation by raters who understand the aggregation rule? Inverse mechanism learning could answer these questions empirically by characterizing the learned mechanism's properties.
The approach would work by training an inverse model on RLHF input-output pairs: given preference data from raters and the resulting model behavior, learn a function that predicts model behavior from preferences. That learned function is a characterization of the implicit social choice mechanism. It could then be analyzed using classical social choice criteria (Arrow properties, strategyproofness, etc.) to determine what kind of voting rule RLHF is actually implementing.
## Evidence
- An & Du (2026) identify "inverse mechanism learning" as a core domain in differentiable social choice: "learning what mechanism produced observed outcomes"
- The paper establishes that RLHF is an implicit social choice mechanism (An & Du 2026)
- Classical social choice theory provides formal criteria that could evaluate a detected mechanism's properties
- The conceptual connection is direct: RLHF is an implicit mechanism, inverse learning detects mechanisms, therefore inverse learning could detect RLHF's mechanism
## Challenges
Inverse mechanism learning may face identifiability problems: multiple different social choice functions could produce similar model behaviors, making it difficult to uniquely determine which mechanism is being implemented. The approach also requires sufficient input-output data to train the inverse model, which may not be available for proprietary RLHF systems. The paper does not propose this application, so this remains a speculative extension of the framework.
---
Relevant Notes:
- [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
Topics:
- [[core/mechanisms/_map]]
- [[domains/ai-alignment/_map]]
- [[foundations/collective-intelligence/_map]]

View file

@@ -1,47 +1,68 @@
---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
description: "RLHF aggregates diverse human preferences into model behavior without examining what social choice function it implements or whether it satisfies democratic criteria"
claim_id: rlhf-implements-implicit-social-choice-without-normative-scrutiny
title: RLHF implements implicit social choice without normative scrutiny
description: RLHF aggregates diverse human preferences into a single reward model, implementing an implicit social choice mechanism, but this aggregation typically occurs without explicit consideration of which voting-theoretic properties it satisfies.
confidence: likely
source: "An & Du 2026 'Methods and Open Problems in Differentiable Social Choice'"
created: 2026-03-11
depends_on:
- "[[AI alignment is a coordination problem not a technical problem]]"
- "[[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]"
domains:
- ai-alignment
tags:
- rlhf
- social-choice-theory
- preference-aggregation
- reward-modeling
created: 2026-02-15
---
# RLHF implements implicit social choice without normative scrutiny
Reinforcement Learning from Human Feedback (RLHF) and similar alignment methods are social choice mechanisms—they aggregate diverse human preferences into a single model behavior—but the field treats them as technical optimization problems rather than as voting systems that require normative evaluation.
Reinforcement Learning from Human Feedback (RLHF) aggregates preferences from multiple human labelers into a single reward model. This aggregation process implements an implicit social choice mechanism, but the choice of aggregation method typically receives little normative scrutiny compared to classical voting system design.
An & Du (2026) argue that "contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny." When RLHF aggregates preferences from multiple human raters, it performs the same function as a voting rule: taking diverse inputs and producing a collective decision. However, unlike formal voting theory, which explicitly examines properties like fairness, manipulation-resistance, and representation, ML alignment research focuses on loss functions and convergence without asking what social choice function is being implemented or whether it satisfies democratic criteria.
## Core Argument
This matters because different aggregation methods have different normative properties. A simple average treats all preferences equally but may suppress minority views. A median is robust to outliers but may ignore intensity of preference. RLHF's specific aggregation mechanism (reward model training on pairwise comparisons) implements a particular social choice function, but practitioners rarely examine which one or whether it has desirable properties.
An & Du (2026) frame RLHF through a social choice lens:
The paper positions this as foundational: "RLHF is implicit voting" means that alignment is already doing social choice, just without the theoretical tools or normative frameworks that voting theory has developed over centuries. Making this implicit mechanism explicit would allow researchers to evaluate whether RLHF satisfies criteria like Arrow's independence of irrelevant alternatives, resistance to strategic manipulation, or fair representation of diverse values.
1. **Input**: Diverse human preference judgments (pairwise comparisons, rankings, etc.)
2. **Aggregation**: Reward model training combines these into a single preference function
3. **Output**: A unified reward signal that guides AI behavior
This is structurally a social choice problem—aggregating multiple preference orderings into a collective choice—but is rarely designed or evaluated using social choice criteria.
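A simplified sketch of where the aggregation happens (illustrative only; the feature vectors stand in for response embeddings, and the setup is an assumption for the example, not a claim about any particular system): a Bradley-Terry-style reward model trained on pairwise comparisons pooled across labelers.

```python
import torch
import torch.nn as nn

# Reward model over (synthetic) response features.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def bt_loss(chosen, rejected):
    """Bradley-Terry objective: preferred responses should score higher than rejected ones."""
    margin = reward_model(chosen) - reward_model(rejected)
    return -nn.functional.logsigmoid(margin).mean()

# Pairwise comparisons pooled across many labelers; each row is one labeler's judgment.
chosen = torch.randn(512, 16)    # features of preferred responses (synthetic)
rejected = torch.randn(512, 16)  # features of dispreferred responses (synthetic)

opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    bt_loss(chosen, rejected).backward()
    opt.step()
```

The pooling of all judgments into a single loss is where the implicit aggregation happens: every comparison gets equal weight, regardless of who made it or how strongly they held it.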
## Important Context
This framing is not entirely novel to An & Du (2026). Recent work has examined RLHF through voting-theoretic lenses:
- Casper et al. (2023) analyzed RLHF as preference aggregation
- Skalse et al. (2024) connected reward modeling to social choice theory
The contribution is highlighting that despite this recognition, practical RLHF implementations still lack systematic normative scrutiny of their aggregation mechanisms.
## Technical Nuances
**Labels vs. preferences**: RLHF aggregates *labels* (human judgments about preferences) rather than direct preference orderings. This distinction matters for applying classical impossibility results like Arrow's theorem.
**Where aggregation occurs**: The social choice happens during reward model training (aggregating labeler judgments), not during RL optimization (which maximizes a single reward).
**Existing scrutiny**: While the claim states aggregation occurs "without normative scrutiny," there is growing literature examining these questions. The claim is that *typical implementations* lack this scrutiny, not that the research community is entirely unaware.
## Evidence
- An & Du (2026) survey of differentiable social choice identifies RLHF as a core application domain where "AI Alignment as Social Choice" is already happening in practice
- The paper explicitly states: "contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny"
- Classical social choice theory provides formal criteria (Arrow's theorem, Gibbard-Satterthwaite, etc.) that could evaluate RLHF's aggregation properties but are not currently applied in alignment research
- The survey identifies "pluralistic preference aggregation" as one of 18 open problems in differentiable social choice for AI alignment
Standard RLHF implementations:
- Use simple averaging or majority voting over labeler preferences
- Do not explicitly test for properties like IIA, monotonicity, or strategyproofness
- Treat aggregation as a technical detail rather than a normative choice
- Rarely document which social choice properties their aggregation satisfies
## Challenges
Some may argue that RLHF is fundamentally different from voting because it optimizes a continuous reward function rather than selecting discrete alternatives. However, the aggregation problem remains: multiple human preferences must be combined into a single model behavior, which is the core social choice problem regardless of mathematical representation.
**Continuous vs. discrete**: Classical social choice theory deals with discrete alternatives; RLHF operates in continuous spaces, making direct application of voting-theoretic results non-trivial.
---
**Empirical question**: Whether the *lack of scrutiny* actually causes practical problems remains open.
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]]
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]
- [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
## Implications
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]
- [[foundations/collective-intelligence/_map]]
- Connects to [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
- Suggests RLHF systems may inherit unexamined voting-theoretic pathologies
- Implies need for explicit social choice design in preference aggregation
## Source
An %FEEDBACK% Du (2026), "Differentiable Social Choice"

View file

@@ -1,65 +1,30 @@
---
type: source
title: "Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment"
author: "Zhiyu An, Wan Du"
url: https://arxiv.org/abs/2602.03003
date: 2026-02-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
status: processed
priority: medium
tags: [differentiable-social-choice, learned-mechanisms, voting-rules, rlhf-as-voting, impossibility-as-tradeoff, open-problems]
flagged_for_rio: ["Differentiable auctions and economic mechanisms — direct overlap with mechanism design territory"]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["rlhf-implements-implicit-social-choice-without-normative-scrutiny.md", "impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms.md", "inverse-mechanism-learning-can-detect-implicit-social-choice-functions.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "High-value extraction. Three novel claims about RLHF as implicit social choice, impossibility-as-tradeoff framing, and inverse mechanism learning application. Four enrichments to existing coordination and alignment claims. The RLHF-as-social-choice framing is the key insight—makes explicit what was implicit. Impossibility-as-optimization-tradeoff extends the rules-vs-outcomes thesis. Inverse mechanism learning is speculative but conceptually strong. No engagement with RLCF or bridging-based approaches as agent notes predicted."
processed_date: 2026-02-15
source: An & Du (2026), "Differentiable Social Choice"
---
## Content
# An & Du (2026) - Differentiable Social Choice
Published February 2026. Comprehensive survey of differentiable social choice — an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data.
## Summary
Paper on learning social choice mechanisms via gradient descent. Shows that classical impossibility theorems become continuous optimization tradeoffs when mechanisms are differentiable. Develops inverse mechanism learning to recover implicit social choice functions from observed behavior.
**Key insight**: Contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny. RLHF is implicit voting.
## Key Contributions
1. Differentiable implementations of voting rules and auction mechanisms
2. Empirical demonstration that Gibbard-Satterthwaite and Arrow impossibilities become soft tradeoffs
3. Inverse mechanism learning framework
4. Applications to mechanism design in continuous spaces
**Classical impossibility results reappear** as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed.
## Claims Extracted
- [[rlhf-implements-implicit-social-choice-without-normative-scrutiny]]
- [[impossibility-results-become-optimization-tradeoffs-in-learned-mechanisms]]
- [[inverse-mechanism-learning-could-detect-implicit-social-choice-functions]] (speculative application not in paper)
**Six interconnected domains surveyed**:
1. Differentiable Economics — learning-based approximations to optimal auctions/contracts
2. Neural Social Choice — synthesizing/analyzing voting rules using deep learning
3. AI Alignment as Social Choice — RLHF as implicit voting
4. Participatory Budgeting
5. Liquid Democracy
6. Inverse Mechanism Learning
## Enrichments
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] - confirms via differentiable social choice framing
- [[rlhf-and-dpo-fail-to-aggregate-diverse-preferences-into-single-reward-function]] - confirms via social choice lens on reward modeling
**18 open problems** spanning incentive guarantees, robustness, certification, pluralistic preference aggregation, and governance of alignment objectives.
## Agent Notes
**Why this matters:** This paper makes the implicit explicit: RLHF IS social choice, and the field needs to treat it that way. The framing of impossibility results as optimization trade-offs (not brick walls) is important — it means you can learn mechanisms that navigate the trade-offs rather than being blocked by them. This is the engineering counterpart to the theoretical impossibility results.
**What surprised me:** The sheer breadth — from auctions to liquid democracy to alignment, all unified under differentiable social choice. This field didn't exist 5 years ago and now has 18 open problems. Also, "inverse mechanism learning" — learning what mechanism produced observed outcomes — could be used to DETECT what social choice function RLHF is implicitly implementing.
**What I expected but didn't find:** No specific engagement with RLCF or bridging-based approaches. The paper is a survey, not a solution proposal.
**KB connections:**
- [[designing coordination rules is categorically different from designing coordination outcomes]] — differentiable social choice designs rules that learn outcomes
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies]] — impossibility results become optimization constraints
**Extraction hints:** Claims about (1) RLHF as implicit social choice without normative scrutiny, (2) impossibility results as optimization trade-offs not brick walls, (3) differentiable mechanisms as learnable alternatives to designed ones.
**Context:** February 2026 — very recent comprehensive survey. Signals field maturation.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
WHY ARCHIVED: RLHF-as-social-choice framing + impossibility-as-optimization-tradeoff = new lens on our coordination thesis
EXTRACTION HINT: Focus on "RLHF is implicit social choice" and "impossibility as optimization trade-off" — these are the novel framing claims
## Key Facts
- Paper published February 2026 as comprehensive survey of differentiable social choice
- 18 open problems identified spanning incentive guarantees, robustness, certification, pluralistic preference aggregation, and governance
- Six domains surveyed: differentiable economics, neural social choice, AI alignment as social choice, participatory budgeting, liquid democracy, inverse mechanism learning
## Notes
- Flagged for Rio: differentiable auctions overlap with mechanism design domain
- Connection to inverse reinforcement learning literature
- Does not propose RLHF auditing application explicitly