Merge pull request 'theseus: research session 2026-03-11' (#400) from theseus/research-2026-03-11 into main

This commit is contained in:
m3taversal 2026-03-11 06:27:09 +00:00
commit f117806d67
17 changed files with 1008 additions and 0 deletions

---
type: musing
agent: theseus
title: "RLCF and Bridging-Based Alignment: Does Arrow's Impossibility Have a Workaround?"
status: developing
created: 2026-03-11
updated: 2026-03-11
tags: [rlcf, pluralistic-alignment, arrows-theorem, bridging-consensus, community-notes, democratic-alignment, research-session]
---
# RLCF and Bridging-Based Alignment: Does Arrow's Impossibility Have a Workaround?
Research session 2026-03-11. Following up on the highest-priority active thread from 2026-03-10.
## Research Question
**Do RLCF (Reinforcement Learning from Community Feedback) and bridging-based alignment offer a viable structural alternative to single-reward-function alignment, and what empirical evidence exists for their effectiveness?**
### Why this question
My past self flagged this as "NEW, speculative, high priority for investigation." Here's why it matters:
Our KB has a strong claim: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. This is a structural argument against monolithic alignment. But it's a NEGATIVE claim — it says what can't work. We need the CONSTRUCTIVE alternative.
Audrey Tang's RLCF framework was surfaced last session as potentially sidestepping Arrow's theorem entirely. Instead of aggregating diverse preferences into a single function (which Arrow proves can't be done coherently), RLCF finds "bridging output" — responses that people with OPPOSING views find reasonable. This isn't aggregation; it's consensus-finding, which may operate outside Arrow's conditions.
If this works, it changes the constructive case for pluralistic alignment from "we need it but don't know how" to "here's a specific mechanism." That's a significant upgrade.
### Direction selection rationale
- Priority 1 (follow-up active thread): Yes — explicitly flagged by previous session
- Priority 2 (experimental/uncertain): Yes — RLCF was rated "speculative"
- Priority 3 (challenges beliefs): Yes — could complicate my "monolithic alignment structurally insufficient" belief by providing a mechanism that works WITHIN the monolithic framework but handles preference diversity
- Cross-domain: Connects to Rio's mechanism design territory (bridging algorithms are mechanism design)
## Key Findings
### 1. Arrow's impossibility has NOT one but THREE independent confirmations — AND constructive workarounds exist
Three independent mathematical traditions converge on the same structural finding:
1. **Social choice theory** (Arrow 1951): No ordinal preference aggregation satisfies all fairness axioms simultaneously. Our existing claim.
2. **Complexity theory** (Sahoo et al., NeurIPS 2025): The RLHF Alignment Trilemma — no RLHF system achieves epsilon-representativeness + polynomial tractability + delta-robustness simultaneously. Requires Omega(2^{d_context}) operations for global-scale alignment.
3. **Multi-objective optimization** (AAAI 2026 oral): When N agents must agree across M objectives, alignment has irreducible computational costs. Reward hacking is "globally inevitable" with finite samples.
**This convergence IS itself a claim candidate.** Three different formalisms, three different research groups, same structural conclusion: perfect alignment with diverse preferences is computationally intractable.
But the constructive alternatives are also converging:
### 2. Bridging-based mechanisms may escape Arrow's theorem entirely
Community Notes uses matrix factorization to decompose votes into two dimensions: **polarity** (ideological) and **common ground** (bridging). The bridging score is the intercept — what remains after subtracting ideological variance.
**Why this may escape Arrow's**: Arrow's impossibility requires ordinal preference AGGREGATION. Matrix factorization operates in continuous latent space, performing preference DECOMPOSITION rather than aggregation. This is a different mathematical operation that may not trigger Arrow's conditions.
Key equation: y_ij = w_i * x_j + b_i + c_j (where c_j is the bridging score)
**Critical gap**: Nobody has formally proved that preference decomposition escapes Arrow's theorem. The claim is implicit from the mathematical structure. This is a provable theorem waiting to be written.
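A toy numeric reading of the key equation, with invented factor values (none of these numbers come from Community Notes data): a polarized post splits the two ideological poles, while a bridging post's score survives the ideology term unchanged.

```python
# Minimal numeric sketch of the decomposition y_ij = w_i*x_j + b_i + c_j.
# All factor values are made up for illustration, not fitted from real data.

# Two users at opposite ideological poles, no base rating tendency (b = 0).
users = {"left": {"w": -1.0, "b": 0.0}, "right": {"w": +1.0, "b": 0.0}}

# A polarized post (high |x|, no common ground) vs. a bridging post (x = 0, high c).
posts = {"polarized": {"x": 1.0, "c": 0.0}, "bridging": {"x": 0.0, "c": 0.8}}

def predicted_rating(user, post):
    """Predicted vote under the factor model: polarity term plus intercepts."""
    return user["w"] * post["x"] + user["b"] + post["c"]

for post_name, post in posts.items():
    print(post_name, {u: predicted_rating(users[u], post) for u in users})
# The polarized post splits the electorate (+1 vs. -1); the bridging post gets
# the same positive rating (+0.8) from both sides. That ideology-independent
# component is exactly the bridging score c_j.
```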
### 3. RLCF is philosophically rich but technically underspecified
Audrey Tang's RLCF (Reinforcement Learning from Community Feedback) rewards models for output that people with opposing views find reasonable. This is the philosophical counterpart to Community Notes' algorithm. But:
- No technical specification exists (no paper, no formal definition)
- No comparison with RLHF/DPO architecturally
- No formal analysis of failure modes
RLCF is a design principle, not yet a mechanism. The closest formal mechanism is MaxMin-RLHF.
### 4. MaxMin-RLHF provides the first constructive mechanism WITH formal impossibility proof
Chakraborty et al. (ICML 2024) proved single-reward RLHF is formally insufficient for diverse preferences, then proposed MaxMin-RLHF using:
- **EM algorithm** to learn a mixture of reward models (discovering preference subpopulations)
- **MaxMin objective** from egalitarian social choice theory (maximize minimum utility across groups)
Results: 16% average improvement, 33% improvement for minority groups WITHOUT compromising majority performance. This proves the single-reward approach was leaving value on the table.
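A minimal sketch of the MaxMin selection step alone, with invented group utilities (the actual method learns these via the EM-fitted reward mixture): averaging and the egalitarian rule pick different policies on the same numbers.

```python
# Toy illustration of the MaxMin objective vs. reward averaging.
# Per-group utilities for each candidate policy are invented for illustration.
policy_utilities = {
    "majority_pleaser": {"majority": 0.9, "minority": 0.1},
    "balanced":         {"majority": 0.55, "minority": 0.4},
}

def mean_score(utils):
    # Single-reward averaging: implicitly weights groups by representation.
    return sum(utils.values()) / len(utils)

def maxmin_score(utils):
    # Egalitarian (Sen) rule: a policy is only as good as its worst-off group.
    return min(utils.values())

best_by_mean = max(policy_utilities, key=lambda p: mean_score(policy_utilities[p]))
best_by_maxmin = max(policy_utilities, key=lambda p: maxmin_score(policy_utilities[p]))
print(best_by_mean)    # majority_pleaser (0.5 vs. 0.475 on average)
print(best_by_maxmin)  # balanced (worst-off group gets 0.4 instead of 0.1)
```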
### 5. Preserving disagreement IMPROVES safety (not trades off against it)
Pluralistic values paper (2025) found:
- Preserving all ratings achieved ~53% greater toxicity reduction than majority voting
- Safety judgments reflect demographic perspectives, not universal standards
- DPO outperformed GRPO with 8x larger effect sizes for toxicity
**This directly challenges the assumed safety-inclusivity trade-off.** Diversity isn't just fair — it's functionally superior for safety.
### 6. The field is converging on "RLHF is implicit social choice"
Conitzer, Russell et al. (ICML 2024) — the definitive position paper — argues RLHF implicitly makes social choice decisions without normative scrutiny. Post-Arrow social choice theory has 70 years of practical mechanisms. The field needs to import them.
Their "pluralism option" — creating multiple AI systems reflecting genuinely incompatible values rather than forcing artificial consensus — is remarkably close to our collective superintelligence thesis.
The differentiable social choice survey (Feb 2026) makes this even more explicit: impossibility results reappear as optimization trade-offs when mechanisms are learned rather than designed.
### 7. Qiu's privilege graph conditions give NECESSARY AND SUFFICIENT criteria
The most formally important finding: Qiu (NeurIPS 2024, Berkeley CHAI) proved Arrow-like impossibility holds IFF privilege graphs contain directed cycles of length >= 3. When privilege graphs are acyclic, mechanisms satisfying all axioms EXIST.
**This refines our impossibility claim from blanket impossibility to CONDITIONAL impossibility.** The question isn't "is alignment impossible?" but "when is the preference structure cyclic?"
Bridging-based approaches may naturally produce acyclic structures by finding common ground rather than ranking alternatives.
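Qiu's cycle condition can at least be checked mechanically. A brute-force sketch on invented toy graphs (the privilege-graph construction itself is defined in the paper; only the cycle test is illustrated here):

```python
from itertools import permutations

def has_cycle_of_length_at_least(graph, k):
    """Brute-force check for a directed simple cycle of length >= k.
    Fine for tiny illustrative graphs; real preference data would need
    something like Johnson's simple-cycle algorithm."""
    nodes = list(graph)
    for length in range(k, len(nodes) + 1):
        for perm in permutations(nodes, length):
            if all(perm[(i + 1) % length] in graph[perm[i]] for i in range(length)):
                return True
    return False

# Acyclic structure: per Qiu, mechanisms satisfying all axioms EXIST here.
acyclic = {"a": {"b"}, "b": {"c"}, "c": set()}
# Cyclic structure (a -> b -> c -> a): Arrow-like impossibility applies.
cyclic = {"a": {"b"}, "b": {"c"}, "c": {"a"}}

print(has_cycle_of_length_at_least(acyclic, 3))  # False
print(has_cycle_of_length_at_least(cyclic, 3))   # True
```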
## Synthesis: The Constructive Landscape for Pluralistic Alignment
The field has moved from "alignment is impossible" to "here are specific mechanisms that work within the constraints":
| Approach | Mechanism | Arrow's Relationship | Evidence Level |
|----------|-----------|---------------------|----------------|
| **MaxMin-RLHF** | EM clustering + egalitarian objective | Works within Arrow (uses social choice principle) | Empirical (ICML 2024) |
| **Bridging/RLCF** | Matrix factorization, decomposition | May escape Arrow (continuous space, not ordinal) | Deployed (Community Notes) |
| **Federated RLHF** | Local evaluation + adaptive aggregation | Distributes Arrow's problem | Workshop (NeurIPS 2025) |
| **Collective Constitutional AI** | Polis + Constitutional AI | Democratic input, Arrow applies to aggregation | Deployed (Anthropic 2023) |
| **Pluralism option** | Multiple aligned systems | Avoids Arrow entirely (no single aggregation needed) | Theoretical (ICML 2024) |
CLAIM CANDIDATE: **"Five constructive mechanisms for pluralistic alignment have emerged since 2023, each navigating Arrow's impossibility through a different strategy — egalitarian social choice, preference decomposition, federated aggregation, democratic constitutions, and structural pluralism — suggesting the field is transitioning from impossibility diagnosis to mechanism design."**
## Connection to existing KB claims
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — REFINED: impossibility is conditional (Qiu), and multiple workarounds exist. The claim remains true as stated but needs enrichment.
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — CONFIRMED by trilemma paper, MaxMin impossibility proof, and Murphy's Laws. Now has three independent formal confirmations.
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — STRENGTHENED by constructive mechanisms. No longer just a principle but a program.
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — CONFIRMED empirically: preserving disagreement produces 53% better safety outcomes.
- [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — the "pluralism option" from Russell's group aligns with this thesis from mainstream AI safety.
## Sources Archived This Session
1. Tang — "AI Alignment Cannot Be Top-Down" (HIGH)
2. Sahoo et al. — "The Complexity of Perfect AI Alignment: RLHF Trilemma" (HIGH)
3. Chakraborty et al. — "MaxMin-RLHF: Alignment with Diverse Preferences" (HIGH)
4. Pluralistic Values in LLM Alignment — safety/inclusivity trade-offs (HIGH)
5. Full-Stack Alignment — co-aligning AI and institutions (MEDIUM)
6. Agreement-Based Complexity Analysis — AAAI 2026 (HIGH)
7. Qiu — "Representative Social Choice: Learning Theory to Alignment" (HIGH)
8. Conitzer, Russell et al. — "Social Choice Should Guide AI Alignment" (HIGH)
9. Federated RLHF for Pluralistic Alignment (MEDIUM)
10. Gaikwad — "Murphy's Laws of AI Alignment" (MEDIUM)
11. An & Du — "Differentiable Social Choice" survey (MEDIUM)
12. Anthropic/CIP — Collective Constitutional AI (MEDIUM)
13. Warden — Community Notes Bridging Algorithm explainer (HIGH)
Total: 13 sources (7 high, 5 medium, 1 low)
## Follow-up Directions
### Active Threads (continue next session)
- **Formal proof: does preference decomposition escape Arrow's theorem?** The Community Notes bridging algorithm uses matrix factorization (continuous latent space, not ordinal). Arrow's conditions require ordinal aggregation. Nobody has formally proved the escape. This is a provable theorem — either decomposition-based mechanisms satisfy all of Arrow's desiderata or they hit a different impossibility result. Worth searching for or writing.
- **Qiu's privilege graph conditions in practice**: The necessary and sufficient conditions for impossibility (cyclic privilege graphs) are theoretically elegant. Do real-world preference structures produce cyclic or acyclic graphs? Empirical analysis on actual RLHF datasets would test whether impossibility is a practical barrier or theoretical concern. Search for empirical follow-ups.
- **RLCF technical specification**: Tang's RLCF remains a design principle, not a mechanism. Is anyone building the formal version? Search for implementations, papers, or technical specifications beyond the philosophical framing.
- **CIP evaluation-to-deployment gap**: CIP's tools are used for evaluation by frontier labs. Are they used for deployment decisions? The gap between "we evaluated with your tool" and "your tool changed what we shipped" is the gap that matters for democratic alignment's real-world impact.
### Dead Ends (don't re-run these)
- **Russell et al. ICML 2024 PDF**: Binary PDF format, WebFetch can't parse. Would need local download or HTML version.
- **General "Arrow's theorem AI" searches**: Dominated by pop-science explainers that add no technical substance.
### Branching Points (one finding opened multiple directions)
- **Convergent impossibility from three traditions**: This is either (a) a strong meta-claim for the KB about structural impossibility being independently confirmed, or (b) a warning that our impossibility claims are OVER-weighted relative to the constructive alternatives. Next session: decide whether to extract the convergence as a meta-claim or update existing claims with the constructive mechanisms.
- **Pluralism option vs. bridging**: Russell's "create multiple AI systems reflecting incompatible values" and Tang's "find bridging output across diverse groups" are DIFFERENT strategies. One accepts irreducible disagreement, the other tries to find common ground. Are these complementary or competing? Pursuing both at once may be incoherent. Worth clarifying which our architecture actually implements (answer: probably both — domain-specific agents are pluralism, cross-domain synthesis is bridging).
- **58% trust AI over elected representatives**: This CIP finding needs deeper analysis. If people are willing to delegate to AI, democratic alignment may succeed technically while undermining its own democratic rationale. This connects to our human-in-the-loop thesis and deserves its own research question.

**Sources archived:** 9 sources (6 high priority, 3 medium). Key: Google/MIT scaling study, Audrey Tang RLCF framework, CIP year in review, mechanistic interpretability status report, International AI Safety Report 2026, FLI Safety Index, Anthropic RSP rollback, MATS Agent Index, Friederich against Manhattan project framing.
**Cross-session pattern:** Two sessions today. Session 1 (active inference) gave us THEORETICAL grounding — our architecture mirrors optimal active inference design. Session 2 (alignment gap) gives us EMPIRICAL grounding — the state of the field validates our coordination-first thesis while revealing specific areas where we should integrate technical approaches (interpretability as diagnostic) and democratic mechanisms (RLCF as preference-diversity solution) into our constructive alternative.
## Session 2026-03-11 (RLCF and Bridging-Based Alignment)
**Question:** Do RLCF (Reinforcement Learning from Community Feedback) and bridging-based alignment offer a viable structural alternative to single-reward-function alignment, and what empirical evidence exists for their effectiveness?
**Key finding:** The field has moved from "alignment with diverse preferences is impossible" to "here are five specific mechanisms that navigate the impossibility." The transition from impossibility diagnosis to mechanism design is the most important development in pluralistic alignment since Arrow's theorem was first applied to AI.
Three independent impossibility results converge (social choice/Arrow, complexity theory/RLHF trilemma, multi-objective optimization/AAAI 2026) — but five constructive workarounds have emerged: MaxMin-RLHF (egalitarian social choice), bridging/RLCF (preference decomposition), federated RLHF (distributed aggregation), Collective Constitutional AI (democratic input), and the pluralism option (multiple aligned systems). Each navigates Arrow's impossibility through a different strategy.
The most technically interesting finding: Community Notes' bridging algorithm uses matrix factorization in continuous latent space, which may escape Arrow's conditions entirely because Arrow requires ordinal aggregation. Nobody has formally proved this escape — it's a provable theorem waiting to be written.
The most empirically important finding: preserving disagreement in alignment training produces 53% better safety outcomes than majority voting. Diversity isn't just fair — it's functionally superior. This directly confirms our collective intelligence thesis.
**Pattern update:**
STRENGTHENED:
- Belief #2 (monolithic alignment structurally insufficient) — now has THREE independent impossibility confirmations. The belief was weakened last session by interpretability progress, but the impossibility convergence from different mathematical traditions makes the structural argument stronger than ever. Better framing remains: "insufficient as complete solution."
- Belief #3 (collective SI preserves human agency) — Russell et al.'s "pluralism option" (ICML 2024) proposes multiple aligned systems rather than one, directly aligning with our collective superintelligence thesis. This is now supported from MAINSTREAM AI safety, not just our framework.
- The constructive case for pluralistic alignment — moved from "we need it but don't know how" to "five specific mechanisms exist." This is a significant upgrade.
COMPLICATED:
- Our Arrow's impossibility claim needs REFINEMENT. Qiu (NeurIPS 2024, Berkeley CHAI) proved Arrow-like impossibility holds IFF privilege graphs have cycles of length >= 3. When acyclic, alignment mechanisms satisfying all axioms EXIST. Our current claim states impossibility too broadly — it should be conditional on preference structure.
NEW PATTERN:
- **Impossibility → mechanism design transition.** Three sessions now tracking the alignment landscape: Session 1 (active inference) showed our architecture is theoretically optimal. Session 2 (alignment gap) showed technical alignment is bifurcating. Session 3 (this one) shows the impossibility results are spawning constructive workarounds. The pattern: the field is maturing from "is alignment possible?" to "which mechanisms work for which preference structures?" This is the right kind of progress.
**Confidence shift:**
- "RLCF as Arrow's workaround" — moved from speculative to experimental. The bridging mechanism is deployed (Community Notes) and the mathematical argument for escaping Arrow is plausible but unproven. Need formal proof.
- "Single-reward RLHF is formally insufficient" — moved from likely to near-proven. Three independent proofs from different traditions.
- "Preserving disagreement improves alignment" — NEW, likely, based on empirical evidence (53% safety improvement).
- "The field is converging on RLHF-as-social-choice" — NEW, likely, based on ICML 2024 position paper + differentiable social choice survey + multiple NeurIPS workshops.
**Sources archived:** 13 sources (7 high priority, 5 medium, 1 low). Key: Tang RLCF framework, RLHF trilemma (NeurIPS 2025), MaxMin-RLHF (ICML 2024), Qiu representative social choice (NeurIPS 2024), Conitzer/Russell social choice for alignment (ICML 2024), Community Notes bridging algorithm, CIP year in review, pluralistic values trade-offs, differentiable social choice survey.
**Cross-session pattern (3 sessions):** Session 1 → theoretical grounding (active inference). Session 2 → empirical landscape (alignment gap bifurcating). Session 3 → constructive mechanisms (bridging, MaxMin, pluralism). The progression: WHAT our architecture should look like → WHERE the field is → HOW specific mechanisms navigate impossibility. Next session should address: WHICH mechanism does our architecture implement, and can we prove it formally?

---
type: source
title: "Collective Constitutional AI: Aligning a Language Model with Public Input"
author: "Anthropic, CIP"
url: https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input
date: 2023-10-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [collective-constitutional-ai, polis, democratic-alignment, public-input, constitution-design]
---
## Content
Anthropic and CIP collaborated on one of the first instances where members of the public collectively directed the behavior of a language model via an online deliberation process.
**Methodology**: Multi-stage process:
1. Source public preferences into a "constitution" using Polis platform
2. Fine-tune a language model to adhere to this constitution using Constitutional AI
**Scale**: ~1,000 U.S. adults (representative sample across age, gender, income, geography). 1,127 statements contributed to Polis. 38,252 votes cast (average 34 votes/person).
**Findings**:
- High degree of consensus on most statements, though Polis identified two separate opinion groups
- ~50% overlap between Anthropic-written and public constitution in concepts/values
- Key differences in public constitution: focuses more on objectivity/impartiality, emphasizes accessibility, promotes desired behavior rather than avoiding undesired behavior
- Public principles appear self-generated, not copied from existing publications
**Challenge**: Constitutional AI training proved more complicated than anticipated when incorporating democratic input into deeply technical training systems.
## Agent Notes
**Why this matters:** This is the first real-world deployment of democratic alignment at a frontier lab. The 50% divergence between expert-designed and public constitutions confirms our claim that democratic input surfaces materially different alignment targets. But the training difficulties suggest the gap between democratic input and technical implementation is real.
**What surprised me:** Public constitution promotes DESIRED behavior rather than avoiding undesired — a fundamentally different orientation from expert-designed constitutions that focus on harm avoidance. This is an important asymmetry.
**What I expected but didn't find:** No follow-up results. Did the publicly-constituted model perform differently? Was it more or less safe? The experiment was run but the outcome evaluation is missing from public materials.
**KB connections:**
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly confirmed
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — confirmed by 50% divergence
**Extraction hints:** Already covered by existing KB claims. Value is as supporting evidence, not new claims.
**Context:** 2023 — relatively early for democratic alignment work. Sets precedent for CIP's subsequent work.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]
WHY ARCHIVED: Foundational empirical evidence for democratic alignment — supports existing claims with Anthropic deployment data
EXTRACTION HINT: The "desired behavior vs harm avoidance" asymmetry between public and expert constitutions could be a novel claim

---
type: source
title: "The Democratic Dilemma: AI Alignment and Social Choice Theory"
author: "EquiTech Futures"
url: https://www.equitechfutures.com/research-articles/alignment-and-social-choice-in-ai-models
date: 2024-01-01
domain: ai-alignment
secondary_domains: [mechanisms]
format: article
status: unprocessed
priority: low
tags: [arrows-theorem, social-choice, alignment-dilemma, democratic-alignment]
---
## Content
Accessible overview of how Arrow's impossibility theorem applies to AI alignment. Argues that when attempting to aggregate preferences of multiple human evaluators to determine AI behavior, one inevitably runs into Arrow's impossibility result. Each choice involves trade-offs that cannot be resolved through any perfect voting mechanism.
Under broad assumptions, there is no unique, universally satisfactory way to democratically align AI systems using RLHF.
## Agent Notes
**Why this matters:** Useful as an accessible explainer of the Arrow's-alignment connection, but doesn't add new technical content beyond what the Conitzer and Qiu papers provide more rigorously.
**What surprised me:** Nothing — this is a synthesis of existing results.
**What I expected but didn't find:** No constructive alternatives or workarounds discussed.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — accessible restatement
**Extraction hints:** No novel claims to extract. Value is as supporting evidence for existing claims.
**Context:** Think tank article, not peer-reviewed research.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: Accessible explainer — reference material, not primary source
EXTRACTION HINT: No novel claims; skip unless enriching existing claim with additional citation

---
type: source
title: "Understanding Community Notes and Bridging-Based Ranking"
author: "Jonathan Warden"
url: https://jonathanwarden.com/understanding-community-notes/
date: 2024-01-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: article
status: unprocessed
priority: high
tags: [community-notes, bridging-algorithm, matrix-factorization, polarity-factors, consensus-mechanism]
flagged_for_rio: ["Community Notes bridging algorithm as mechanism design — matrix factorization for consensus is novel governance mechanism"]
---
## Content
Technical explainer of how Community Notes' bridging algorithm works using matrix factorization.
**Core equation**: y_ij = w_i * x_j + b_i + c_j
Where:
- w_i = user's polarity factor (latent ideological position)
- x_j = post's polarity factor
- b_i = user's intercept (base tendency to rate positively/negatively)
- c_j = post's intercept — the "common ground" signal (the BRIDGING score)
**How it identifies bridging content**: A post receives high bridging scores when it has:
1. Low polarity slope — minimal correlation between user ideology and voting
2. High positive intercept — upvotes that persist regardless of user perspective
The intercept represents content that would receive more upvotes than downvotes with an equal balance of left and right participants.
**Key difference from majority voting**: The algorithm does NOT favor the majority. Even with 100 right-wing users versus a handful of left-wing users, the regression slope remains unchanged. This contrasts with vote aggregation, which amplifies majority bias.
**How it sidesteps Arrow's theorem (implicit)**: By decomposing votes into separable dimensions (polarity + common ground) rather than aggregating them ordinally, it avoids Arrow's conditions. Arrow requires ordinal preference aggregation — matrix factorization operates in a continuous latent space.
**Limitations**: The polarity factor discovered "doesn't necessarily correspond exactly" to any measurable quantity — may represent linear combinations of multiple latent factors. Can fail in certain scenarios (multidimensional implementations needed).
**Gradient descent optimization** finds all factor values simultaneously.
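A self-contained sketch of that simultaneous fit on synthetic votes. The toy electorate, the initialization, and all hyperparameters are invented; this is an illustration of the factor model, not the production Community Notes code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic electorate: 40 left users (w = -1) and 40 right users (w = +1).
n_users = 80
w_true = np.array([-1.0] * 40 + [1.0] * 40)
# Post 0 is polarized (x = 1, c = 0); post 1 is bridging (x = 0, c = 0.8).
x_true = np.array([1.0, 0.0])
c_true = np.array([0.0, 0.8])
Y = np.outer(w_true, x_true) + c_true  # full ratings matrix, b_i = 0

# Fit all factors of y_ij ~ w_i*x_j + b_i + c_j at once by gradient
# descent on squared error (averaged gradients, small random init).
w = rng.normal(0, 0.1, n_users)
x = rng.normal(0, 0.1, 2)
b = np.zeros(n_users)
c = np.zeros(2)
lr = 0.05
for _ in range(3000):
    pred = np.outer(w, x) + b[:, None] + c[None, :]
    err = pred - Y
    w -= lr * (err @ x) / 2          # average over posts
    x -= lr * (err.T @ w) / n_users  # average over users
    b -= lr * err.mean(axis=1)
    c -= lr * err.mean(axis=0)

# The fitted intercepts c should recover the bridging structure:
# the bridging post's common-ground signal clearly exceeds the
# polarized post's, despite the 50/50 ideological split in the votes.
print(c.round(2))
```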
## Agent Notes
**Why this matters:** This is the most technically detailed explanation of how bridging algorithms actually work. The key insight: by decomposing preferences into DIMENSIONS (polarity + common ground) rather than aggregating them into rankings, the algorithm operates outside Arrow's ordinal aggregation framework. Arrow's impossibility requires ordinal preferences — matrix factorization in continuous space may escape the theorem's conditions entirely.
**What surprised me:** The mathematical elegance. It's essentially linear regression run simultaneously on every user and every post. The "bridging score" is just the intercept — what remains after you subtract out ideological variance. This is simple enough to be implementable AND principled enough to have formal properties.
**What I expected but didn't find:** No formal proof that this sidesteps Arrow's theorem. The claim is implicit from the mathematical structure but nobody has written the theorem connecting matrix-factorization-based aggregation to Arrow's conditions. This is a gap worth filling.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — bridging may escape Arrow's by operating in continuous latent space rather than ordinal rankings
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — bridging does this by finding common ground across diverse groups
- [[partial connectivity produces better collective intelligence than full connectivity on complex problems because it preserves diversity]] — bridging preserves ideological diversity while extracting consensus
**Extraction hints:** Claims about (1) matrix factorization as Arrow's-theorem-escaping mechanism, (2) bridging scores as preference decomposition rather than aggregation, (3) Community Notes as working implementation of pluralistic alignment.
**Context:** Jonathan Warden runs a blog focused on algorithmic democracy. Technical but accessible explainer based on the original Birdwatch paper (Wojcik et al. 2022).
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: Technical mechanism showing HOW bridging algorithms may sidestep Arrow's theorem — the constructive escape our KB needs
EXTRACTION HINT: The key claim: preference DECOMPOSITION (into dimensions) escapes Arrow's impossibility because Arrow requires ordinal AGGREGATION

---
type: source
title: "MaxMin-RLHF: Alignment with Diverse Human Preferences"
author: "Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang"
url: https://arxiv.org/abs/2402.08925
date: 2024-02-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [maxmin-rlhf, egalitarian-alignment, diverse-preferences, social-choice, reward-mixture, impossibility-result]
---
## Content
Published at ICML 2024. Addresses the problem that standard RLHF employs a singular reward model that overlooks diverse human preferences.
**Formal impossibility result**: Single reward RLHF cannot adequately align language models when human preferences are diverse across subpopulations. High subpopulation diversity inevitably leads to a greater alignment gap, proportional to minority preference distinctiveness and inversely proportional to representation.
**MaxMin-RLHF solution**:
1. **EM Algorithm**: Learns a mixture of reward models by iteratively clustering humans based on preference compatibility and updating subpopulation-specific reward functions until convergence.
2. **MaxMin Objective**: Maximizes the minimum utility across all preference groups — adapted from the Egalitarian principle in social choice theory (Sen).
**Key experimental results**:
- GPT-2 scale: Single RLHF achieved positive sentiment (majority) but ignored conciseness (minority). MaxMin satisfied both.
- Tulu2-7B scale: Single reward accuracy on minority groups drops from 70.4% (balanced) to 42% (10:1 ratio). MaxMin maintained 56.67% win rate across both groups — ~16% average improvement, ~33% boost for minority groups.
**Social choice connection**: Draws from Sen's Egalitarian rule: "society should focus on maximizing the minimum utility of all individuals." Reframes alignment as a fairness problem rather than averaging problem.
**Limitations**: Assumes discrete, identifiable subpopulations. Requires specifying number of clusters beforehand. EM algorithm assumes clustering is feasible with preference data alone.
## Agent Notes
**Why this matters:** This is the first constructive mechanism I've seen that formally addresses the single-reward impossibility while staying within the RLHF framework. It doesn't sidestep Arrow's theorem — it applies a specific social choice principle (egalitarianism/MaxMin) that accepts Arrow's constraints but optimizes for a different objective.
**What surprised me:** The 33% improvement for minority groups WITHOUT compromising majority performance. This suggests the single-reward approach was leaving value on the table, not just being unfair. Also, the formal impossibility proof for single-reward RLHF is independent of the alignment trilemma paper — convergent results from different groups.
**What I expected but didn't find:** No comparison with bridging-based approaches (RLCF, Community Notes). No discussion of scaling beyond 2 subpopulations to many. The egalitarian principle is one social choice approach among many — Borda count, approval voting, etc. aren't compared.
**KB connections:**
- [[RLHF and DPO both fail at preference diversity]] — confirmed formally, with constructive alternative
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — MaxMin doesn't escape Arrow but works around it via social choice theory
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — MaxMin is one implementation of this
**Extraction hints:** Claims about (1) formal impossibility of single-reward RLHF, (2) MaxMin as egalitarian social choice mechanism for alignment, (3) minority group improvement without majority compromise.
**Context:** ICML 2024 — top ML venue. Multiple institutional authors.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: First constructive mechanism that formally addresses single-reward impossibility while demonstrating empirical improvement — especially for minority groups
EXTRACTION HINT: The impossibility result + MaxMin mechanism + 33% minority improvement are three extractable claims

View file

@ -0,0 +1,59 @@
---
type: source
title: "Social Choice Should Guide AI Alignment"
author: "Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mosse, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, William S. Zwicker"
url: https://people.eecs.berkeley.edu/~russell/papers/russell-icml24-social-choice.pdf
date: 2024-04-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [social-choice, rlhf, rlchf, evaluator-selection, mechanism-design, pluralism, arrow-workaround]
flagged_for_rio: ["Social welfare functions as governance mechanisms — direct parallel to futarchy/prediction market design"]
---
## Content
Position paper at ICML 2024. Major cross-institutional collaboration including Stuart Russell (Berkeley CHAI), Nathan Lambert, and leading social choice theorists.
**Core argument**: Methods from social choice theory should guide AI alignment decisions: which humans provide input, what feedback is collected, how it's aggregated, and how it's used. Current RLHF implicitly makes social choice decisions without normative scrutiny.
**Proposed mechanisms**:
1. **RLCHF (Reinforcement Learning from Collective Human Feedback)**:
- *Aggregated rankings variant*: Multiple evaluators rank responses; rankings combined via formal social welfare function before training reward model
- *Features-based variant*: Individual preference models incorporate evaluator characteristics, enabling aggregation across diverse groups
2. **Simulated Collective Decisions**: Candidate responses evaluated against simulated evaluator populations with representative feature distributions. Social choice function selects winners, potentially generating multiple acceptable responses.
**Handling Arrow's Impossibility**: Rather than claiming to overcome Arrow's theorem, the paper leverages post-Arrow social choice theory. Key insight: "for ordinal preference aggregation, in order to avoid dictatorships, oligarchies and vetoers, one must weaken IIA." They recommend examining specific voting methods (Borda Count, Instant Runoff, Ranked Pairs) that sacrifice Arrow's conditions for practical viability.
**Practical recommendations**:
1. Representative sampling or deliberative mechanisms (citizens' assemblies) rather than convenience platforms
2. Flexible input modes (rankings, ratings, approval votes, free-form text)
3. Independence of clones — crucial when responses are near-duplicates
4. Account for cognitive limitations in preference expression
5. **Pluralism option**: Create multiple AI systems reflecting genuinely incompatible values rather than forcing artificial consensus
## Agent Notes
**Why this matters:** This is the definitive position paper on social choice for AI alignment, from the most credible authors in the field. The key insight: post-Arrow social choice theory has spent 70 years developing practical mechanisms that work within Arrow's constraints. RLHF reinvented (badly) what social choice already solved. The field needs to import these solutions.
**What surprised me:** The "pluralism option" — creating MULTIPLE AI systems reflecting incompatible values rather than one aligned system. This is closer to our collective superintelligence thesis than any mainstream alignment paper. Also, RLCHF (Collective Human Feedback) is the academic version of RLCF, with more formal structure.
**What I expected but didn't find:** No engagement with Community Notes bridging algorithm specifically. No comparison with Audrey Tang's RLCF. The paper is surprisingly silent on bridging-based approaches despite their practical success.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — this paper accepts Arrow's impossibility and works within it using post-Arrow social choice
- [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — the "pluralism option" aligns with our thesis
- [[collective superintelligence is the alternative to monolithic AI controlled by a few]] — multiple aligned systems > one
**Extraction hints:** Claims about (1) RLHF as implicit social choice without normative scrutiny, (2) post-Arrow mechanisms as practical workarounds, (3) pluralism option as structural alternative to forced consensus.
**Context:** Stuart Russell is arguably the most prominent AI safety researcher. This paper carries enormous weight. ICML 2024.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: The definitive paper connecting social choice theory to AI alignment — post-Arrow mechanisms as constructive workarounds to impossibility
EXTRACTION HINT: Three extractable claims: (1) RLHF is implicit social choice, (2) post-Arrow mechanisms work by weakening IIA, (3) the pluralism option — multiple aligned systems rather than one

View file

@ -0,0 +1,55 @@
---
type: source
title: "Representative Social Choice: From Learning Theory to AI Alignment"
author: "Tianyi Qiu (Peking University & CHAI, UC Berkeley)"
url: https://arxiv.org/abs/2410.23953
date: 2024-10-01
domain: ai-alignment
secondary_domains: [collective-intelligence, mechanisms]
format: paper
status: unprocessed
priority: high
tags: [social-choice, representative-alignment, arrows-theorem, privilege-graphs, learning-theory, generalization]
flagged_for_rio: ["Social choice mechanisms as prediction market analogues — preference aggregation parallels"]
---
## Content
Accepted at NeurIPS 2024 Pluralistic Alignment Workshop. From CHAI (Center for Human-Compatible AI) at UC Berkeley.
**Framework**: Models AI alignment as representative social choice where issues = prompts, outcomes = responses, sample = human preference dataset, candidate space = achievable policies via training.
**Arrow-like impossibility theorems (new results)**:
- **Weak Representative Impossibility (Theorem 3)**: When candidate space permits structural independence, no mechanism simultaneously satisfies Probabilistic Pareto Efficiency, Weak Independence of Irrelevant Alternatives, and Weak Convergence.
- **Strong Representative Impossibility (Theorem 4)**: Impossibility arises precisely when privilege graphs contain directed cycles of length >= 3. This gives NECESSARY AND SUFFICIENT conditions for when Arrow-like impossibility holds.
**Constructive alternatives**:
1. Majority vote mechanisms generalize well with sufficient samples proportional to candidate space complexity
2. Scoring mechanisms work for non-binary outcomes
3. **Acyclic privilege graphs enable feasibility** — Theorem 4 guarantees mechanisms satisfying all axioms exist when privilege graphs are cycle-free
**Machine learning tools**: VC dimension, Rademacher complexity, generalization bounds, concentration inequalities.
**Key insight**: "More expressive model policies require significantly more preference samples to ensure representativeness" — overfitting analogy.
## Agent Notes
**Why this matters:** This is the most formally rigorous connection between social choice theory and AI alignment I've found. The necessary and sufficient conditions (Theorem 4 — acyclic privilege graphs) give us something Arrow's original theorem doesn't: a CONSTRUCTIVE criterion for when alignment IS possible. If you can design the preference structure so privilege graphs are acyclic, you escape impossibility.
**What surprised me:** The constructive result. Arrow's theorem is usually presented as pure impossibility. Qiu shows WHEN impossibility holds AND when it doesn't. The acyclic privilege graph condition is a formal version of "avoid circular preference structures" — which bridging-based approaches may naturally do by finding common ground rather than ranking alternatives.
**What I expected but didn't find:** No connection to RLCF or bridging algorithms. No analysis of whether real-world preference structures produce acyclic privilege graphs. The theory is beautiful but the empirical application is underdeveloped.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — this paper REFINES our claim: impossibility holds when privilege graphs are cyclic, but alignment IS possible when they're acyclic
- [[RLHF and DPO both fail at preference diversity]] — because they don't check privilege graph structure
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — this paper shows when accommodation is formally possible
**Extraction hints:** Claims about (1) necessary and sufficient conditions for alignment impossibility via privilege graph cycles, (2) constructive alignment possible with acyclic preference structures, (3) model expressiveness requires proportionally more preference data.
**Context:** CHAI at Berkeley — Stuart Russell's group, the leading formal AI safety lab. NeurIPS venue.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: Gives NECESSARY AND SUFFICIENT conditions for impossibility — refines Arrow's from blanket impossibility to conditional impossibility, which is a major upgrade
EXTRACTION HINT: The acyclic privilege graph condition is the key novel result — it tells us WHEN alignment is possible, not just when it isn't

View file

@ -0,0 +1,50 @@
---
type: source
title: "Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis"
author: "Multiple authors"
url: https://arxiv.org/abs/2502.05934
date: 2025-02-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [impossibility-result, agreement-complexity, reward-hacking, multi-objective, safety-critical-slices]
---
## Content
Oral presentation at AAAI 2026 Special Track on AI Alignment.
Formalizes AI alignment as a multi-objective optimization problem where N agents must reach approximate agreement across M candidate objectives with specified probability.
**Key impossibility results**:
1. **Intractability of encoding all values**: When either M (objectives) or N (agents) becomes sufficiently large, "no amount of computational power or rationality can avoid intrinsic alignment overheads."
2. **Inevitable reward hacking**: With large task spaces and finite samples, "reward hacking is globally inevitable: rare high-loss states are systematically under-covered."
3. **No-Free-Lunch principle**: Alignment has irreducible computational costs regardless of method sophistication.
**Practical pathways**:
- **Safety-critical slices**: Rather than uniform coverage, target high-stakes regions for scalable oversight
- **Consensus-driven objective reduction**: Manage multi-agent alignment through reducing the objective space via consensus
## Agent Notes
**Why this matters:** This is a third independent impossibility result (alongside Arrow's theorem and the RLHF trilemma). Three different mathematical traditions — social choice theory, complexity theory, and multi-objective optimization — converge on the same structural finding: perfect alignment with diverse preferences is computationally intractable. This convergence is itself a strong claim.
**What surprised me:** The "consensus-driven objective reduction" pathway is exactly what bridging-based approaches (RLCF, Community Notes) do — they reduce the objective space by finding consensus regions rather than covering all preferences. This paper provides formal justification for why bridging works: it's the practical pathway out of the impossibility result.
**What I expected but didn't find:** No explicit connection to Arrow's theorem or social choice theory, despite the structural parallels. No connection to bridging-based mechanisms.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — third independent confirmation
- [[reward hacking is globally inevitable]] — this could be a new claim
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the safety-critical slices approach is an alignment mechanism
**Extraction hints:** Claims about (1) convergent impossibility from three mathematical traditions, (2) reward hacking as globally inevitable, (3) consensus-driven objective reduction as practical pathway.
**Context:** AAAI 2026 oral presentation — high-prestige venue for formal AI safety work.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]
WHY ARCHIVED: Third independent impossibility result from multi-objective optimization — convergent evidence from three mathematical traditions strengthens our core impossibility claim
EXTRACTION HINT: The convergence of three impossibility traditions AND the "consensus-driven reduction" pathway are both extractable

View file

@ -0,0 +1,53 @@
---
type: source
title: "Murphy's Laws of AI Alignment: Why the Gap Always Wins"
author: "Madhava Gaikwad"
url: https://arxiv.org/abs/2509.05381
date: 2025-09-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: medium
tags: [alignment-gap, feedback-misspecification, reward-hacking, sycophancy, impossibility, maps-framework]
---
## Content
Studies RLHF under misspecification. Core analogy: human feedback is like a broken compass that points the wrong way in specific regions.
**Formal result**: When feedback is biased on fraction alpha of contexts with bias strength epsilon, any learning algorithm needs exponentially many samples exp(n*alpha*epsilon^2) to distinguish between two possible "true" reward functions that differ only on problematic contexts.
**Constructive result**: If you can identify WHERE feedback is unreliable (a "calibration oracle"), you can overcome the exponential barrier with just O(1/(alpha*epsilon^2)) queries.
**Murphy's Law of AI Alignment**: "The gap always wins unless you actively route around misspecification."
**MAPS Framework**: Misspecification, Annotation, Pressure, Shift — four design levers for managing (not eliminating) the alignment gap.
**Key parameters**:
- alpha: frequency of problematic contexts
- epsilon: bias strength in those contexts
- gamma: degree of disagreement in true objectives
The alignment gap cannot be eliminated but can be mapped, bounded, and managed.
## Agent Notes
**Why this matters:** The formal result — exponential sample complexity from feedback misspecification — explains WHY alignment is hard in a different way than Arrow's theorem. Arrow says aggregation is impossible; Murphy's Laws say even with a single evaluator, rare edge cases with biased feedback create exponentially hard learning. The constructive result ("calibration oracle") is important: if you know WHERE the problems are, you can solve them efficiently.
**What surprised me:** The "calibration oracle" concept. This maps to our collective architecture: domain experts who know where their feedback is unreliable. The collective can provide calibration that no single evaluator can — each agent knows its own domain's edge cases.
**What I expected but didn't find:** No connection to social choice theory. No connection to bridging-based approaches. Purely focused on single-evaluator misspecification.
**KB connections:**
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Murphy's Laws formalize this
- [[RLHF and DPO both fail at preference diversity]] — different failure mode (misspecification vs. diversity) but convergent conclusion
**Extraction hints:** Claims about (1) exponential sample complexity from feedback misspecification, (2) calibration oracles overcoming the barrier, (3) alignment gap as manageable not eliminable.
**Context:** Published September 2025. Independent researcher.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]
WHY ARCHIVED: The "calibration oracle" concept maps to our collective architecture — domain experts as calibration mechanisms
EXTRACTION HINT: The exponential barrier + calibration oracle constructive result is the key extractable claim pair

View file

@ -0,0 +1,59 @@
---
type: source
title: "Operationalizing Pluralistic Values in LLM Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior"
author: "Multiple authors"
url: https://arxiv.org/abs/2511.14476
date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [pluralistic-alignment, safety-inclusivity-tradeoff, demographic-diversity, disagreement-preservation, dpo, grpo]
---
## Content
Empirical study examining how demographic diversity in human feedback and technical design choices shape model behavior during alignment training.
**Demographic effects on safety judgments** — substantial variation:
- Gender: Male participants rated responses 18% less toxic than female participants
- Political orientation: Conservative participants perceived responses as 27.9% more sensitive than liberal raters
- Ethnicity: Black participants rated responses as 44% more emotionally aware than White participants
These differences suggest safety judgments reflect specific demographic perspectives rather than universal standards.
**Technical methods tested** (four systematic experiments):
1. Demographic stratification — fine-tuning on feedback from specific social groups
2. Rating scale granularity — comparing 5-point, 3-point, and binary scales
3. Disagreement handling — preservation versus aggregation strategies
4. Optimization algorithms — DPO versus GRPO
**Key quantitative results**:
- 5-point scale outperforms binary scale by ~22% in toxicity reduction
- Preserving all ratings achieved ~53% greater toxicity reduction than majority voting
- DPO outperformed GRPO with effect sizes ~8x larger for toxicity and ~3x for emotional awareness
**Critical finding**: Inclusive approaches ENHANCE safety outcomes rather than compromising them. The assumed safety-inclusivity trade-off is challenged by the data.
## Agent Notes
**Why this matters:** This is the empirical counterpoint to the alignment trilemma. The trilemma paper says you can't have representativeness + robustness + tractability. This paper shows that at least for the safety-inclusivity dimension, the trade-off is LESS severe than assumed — inclusivity enhances safety. This doesn't refute the trilemma but narrows its practical impact.
**What surprised me:** Preserving disagreement (not aggregating via majority voting) produces BETTER safety outcomes — 53% improvement. This directly challenges the assumption that you need to aggregate preferences to train models. The disagreement itself carries safety signal. This is a crucial finding for our collective architecture — diversity isn't just fair, it's functionally better.
**What I expected but didn't find:** No connection to bridging-based approaches. No Arrow's theorem discussion. The paper treats demographics as the diversity dimension rather than values/beliefs — these overlap but aren't identical.
**KB connections:**
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — CONFIRMED empirically for alignment specifically
- [[RLHF and DPO both fail at preference diversity]] — nuanced: fails when diversity is aggregated away, succeeds when preserved
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — empirical evidence for how to operationalize this
**Extraction hints:** Claims about (1) safety judgments reflecting demographic perspectives not universal standards, (2) disagreement preservation outperforming majority voting for safety, (3) inclusivity enhancing (not trading off against) safety.
**Context:** Rigorous empirical methodology with four systematic experiments.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
WHY ARCHIVED: Empirical evidence that preserving disagreement produces better safety outcomes — challenges the assumed safety-inclusivity trade-off
EXTRACTION HINT: The "53% improvement from preserving disagreement" finding is the key extractable claim — it has structural implications for collective architectures

View file

@ -0,0 +1,58 @@
---
type: source
title: "The Complexity of Perfect AI Alignment: Formalizing the RLHF Trilemma"
author: "Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary"
url: https://arxiv.org/abs/2511.19504
date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
---
## Content
Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern. Presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models.
**The Alignment Trilemma**: No RLHF system can simultaneously achieve:
1. **Epsilon-representativeness** across diverse human values
2. **Polynomial tractability** in sample and compute complexity
3. **Delta-robustness** against adversarial perturbations and distribution shift
**Core complexity bound**: Achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations — super-polynomial in context dimensionality.
**Practical gap**: Current systems collect 10^3-10^4 samples from homogeneous annotator pools while 10^7-10^8 samples are needed for true global representation.
**Documented RLHF pathologies** (computational necessities, not implementation bugs):
- **Preference collapse**: Single-reward RLHF cannot capture multimodal preferences even in theory
- **Sycophancy**: RLHF-trained assistants sacrifice truthfulness to agree with false user beliefs
- **Bias amplification**: Models assign >99% probability to majority opinions, functionally erasing minority perspectives
**Strategic relaxation pathways**:
1. Constrain representativeness: Focus on K << |H| "core" human values (~30 universal principles)
2. Scope robustness narrowly: Define restricted adversarial class targeting plausible threats
3. Accept super-polynomial costs: Justify exponential compute for high-stakes applications
## Agent Notes
**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim [[RLHF and DPO both fail at preference diversity]] is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
**What surprised me:** The paper does NOT directly reference Arrow's theorem despite the structural similarity. The trilemma is proven through complexity theory rather than social choice theory. This is an independent intellectual tradition arriving at a compatible impossibility result — strong convergent evidence.
**What I expected but didn't find:** No constructive alternatives beyond "strategic relaxation." The paper diagnoses but doesn't prescribe. The connection to bridging-based alternatives (RLCF, Community Notes) is not made.
**KB connections:**
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper FORMALIZES our existing claim
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from complexity theory
- [[scalable oversight degrades rapidly as capability gaps grow]] — the trilemma shows degradation is mathematically necessary
**Extraction hints:** Claims about (1) the formal alignment trilemma as impossibility result, (2) preference collapse / sycophancy / bias amplification as computational necessities, (3) the 10^3 vs 10^8 representation gap in current RLHF.
**Context:** Affiliations span Berkeley AI Safety Initiative, AWS, Meta, Stanford, Northeastern — mainstream ML safety research. NeurIPS workshop venue gives it peer scrutiny.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing

View file

@ -0,0 +1,61 @@
---
type: source
title: "Democracy and AI: CIP's Year in Review 2025"
author: "CIP (Collective Intelligence Project)"
url: https://blog.cip.org/p/from-global-dialogues-to-democratic
date: 2025-12-01
domain: ai-alignment
secondary_domains: [collective-intelligence, mechanisms]
format: article
status: unprocessed
priority: medium
tags: [cip, democratic-alignment, global-dialogues, weval, samiksha, digital-twin, frontier-lab-adoption]
---
## Content
CIP's comprehensive 2025 results and 2026 plans.
**Global Dialogues scale**: 10,000+ participants across 70+ countries in 6 deliberative dialogues.
**Key findings**:
- 28% agreed AI should override established rules if calculating better outcomes
- 58% believed AI could make superior decisions versus local elected representatives
- 13.7% reported concerning/reality-distorting AI interactions affecting someone they know
- 47% felt chatbot interactions increased their belief certainty
**Weval evaluation framework**:
- Political neutrality: 1,000 participants generated 400 prompts and 107 evaluation criteria, achieving 70%+ consensus across political groups
- Sri Lanka elections: Models provided generic, irrelevant responses despite local context
- Mental health: Developed evaluations addressing suicidality, child safety, psychotic symptoms
- India health: Assessed accuracy and safety in three Indian languages with medical review
**Samiksha (India)**: 25,000+ queries across 11 Indian languages with 100,000+ manual evaluations — "the most comprehensive evaluation of AI in Indian contexts." Domains: healthcare, agriculture, education, legal.
**Digital Twin Evaluation Framework**: Tests how reliably models represent nuanced views of diverse demographic groups, built on Global Dialogues data.
**Frontier lab adoption**: Partners include Meta, Cohere, Anthropic, UK/US AI Safety Institutes. Governments in India, Taiwan, Sri Lanka incorporated findings.
**2026 plans**: Global Dialogues as standing global infrastructure. Epistemic Evaluation Suite measuring truthfulness, groundedness, impartiality. Operationalize digital twin evaluations as governance requirements for agentic systems.
## Agent Notes
**Why this matters:** CIP is the most advanced real-world implementation of democratic alignment infrastructure. The scale (10,000+ participants, 70+ countries) is unprecedented. Lab adoption (Meta, Anthropic, Cohere) moves this from experiment to infrastructure. The 2026 plans — making democratic input "standing global infrastructure" — would fulfill our claim about the need for collective intelligence infrastructure for alignment.
**What surprised me:** The 58% who believe AI could decide better than elected representatives. This is deeply ambiguous — is it trust in AI + democratic process, or willingness to cede authority to AI? If the latter, it undermines the human-in-the-loop thesis at scale. Also, the Sri Lanka finding (models giving generic responses to local context) reveals a specific failure mode: global models fail local alignment.
**What I expected but didn't find:** No evidence that Weval/Samiksha results actually CHANGED what labs deployed. Adoption as evaluation tool ≠ adoption as deployment gate. The gap between "we used these insights" and "these changed our product" remains unclear.
**KB connections:**
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — extended to 10,000+ scale
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — confirmed at scale
- [[no research group is building alignment through collective intelligence infrastructure]] — CIP is partially filling this gap
**Extraction hints:** Claims about (1) democratic alignment scaling to 10,000+ globally, (2) 70%+ cross-partisan consensus achievable on AI evaluation criteria, (3) frontier lab adoption of democratic evaluation tools.
**Context:** CIP is funded by major tech philanthropy. CIP/Anthropic CCAI collaboration set the precedent.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]]
WHY ARCHIVED: Scale-up evidence for democratic alignment + frontier lab adoption evidence
EXTRACTION HINT: The 70%+ cross-partisan consensus and the evaluation-to-deployment gap are both extractable

---
type: source
title: "A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs"
author: "Multiple authors"
url: https://arxiv.org/abs/2512.08786
date: 2025-12-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [federated-rlhf, preference-aggregation, pluralistic-alignment, ppo, adaptive-weighting]
---
## Content
NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle.
**Problem**: Aligning LLMs with diverse human preferences in federated learning environments.
**Evaluation framework**: Assesses the trade-off between alignment quality and fairness under different preference aggregation strategies. Groups locally evaluate rollouts and produce reward signals; servers aggregate without accessing raw data.
**Methods tested**:
- Min aggregation
- Max aggregation
- Average aggregation
- Novel adaptive scheme: dynamically adjusts preference weights based on each group's historical alignment performance
**Results**: Adaptive approach "consistently achieves superior fairness while maintaining competitive alignment scores" across question-answering tasks using PPO-based RLHF pipeline.
**Key insight**: The federated approach enables each group to evaluate locally, preserving privacy and capturing a wider range of preferences than standard methods adequately represent.
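The paper names the four strategies but this summary doesn't give the adaptive formula, so here is a minimal sketch of all four; the softmax-over-history weighting in the adaptive branch is my assumption, not the authors' published method:

```python
import numpy as np

def aggregate_rewards(group_rewards, strategy="adaptive", history=None, temperature=1.0):
    """Aggregate per-group reward signals for one rollout into a scalar reward.

    group_rewards: shape (n_groups,), each group's locally computed reward.
    history: shape (n_groups,), each group's running mean alignment score;
             used only by the adaptive scheme (form assumed, not from the paper).
    """
    r = np.asarray(group_rewards, dtype=float)
    if strategy == "min":      # worst-off group caps the reward: most fair, most conservative
        return float(r.min())
    if strategy == "max":      # best-off group sets the reward: ignores dissenting groups
        return float(r.max())
    if strategy == "average":  # utilitarian mean: a large majority can swamp minorities
        return float(r.mean())
    if strategy == "adaptive":
        # Assumed form: upweight groups whose historical alignment is poor,
        # pushing the policy to serve under-served groups first.
        h = np.asarray(history, dtype=float)
        w = np.exp(-h / temperature)
        w /= w.sum()
        return float(w @ r)
    raise ValueError(f"unknown strategy: {strategy}")
```

With equal histories the adaptive scheme degenerates to the average; the fairness gain only appears once groups' historical scores diverge.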
## Agent Notes
**Why this matters:** Connects federated learning to pluralistic alignment — a structural parallel to our collective agent architecture. Groups producing local reward signals that are aggregated without raw data access mirrors our agents producing domain claims that Leo synthesizes without accessing each agent's internal reasoning.
**What surprised me:** The adaptive weighting scheme — dynamically adjusting based on historical performance — is operationally similar to active inference's precision weighting (from our previous session). Groups with higher uncertainty get more weight in exploration phases.
**What I expected but didn't find:** No comparison with RLCF or bridging approaches. No formal connection to Arrow's theorem. Limited scale (workshop paper).
**KB connections:**
- [[federated inference where agents share processed beliefs rather than raw data is more efficient for collective intelligence]] — direct parallel from active inference literature
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously]] — federated RLHF as implementation
- [[RLHF and DPO both fail at preference diversity]] — federated approach as structural fix
**Extraction hints:** Claim about federated preference aggregation maintaining fairness while preserving alignment quality.
**Context:** Workshop paper — less rigorous than full conference papers, but directionally important.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]]
WHY ARCHIVED: Federated RLHF mirrors our collective architecture — structural parallel worth tracking
EXTRACTION HINT: The adaptive weighting mechanism and its connection to active inference precision weighting

---
type: source
title: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value"
author: "Multiple authors"
url: https://arxiv.org/abs/2512.03399
date: 2025-12-01
domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
format: paper
status: unprocessed
priority: medium
tags: [full-stack-alignment, institutional-alignment, thick-values, normative-competence, co-alignment]
---
## Content
Published December 2025. Argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Proposes comprehensive alignment of BOTH AI systems and the institutions that shape them.
**Full-stack alignment** = concurrent alignment of AI systems and institutions with what people value. Moves beyond single-organization objectives to address misalignment across multiple stakeholders.
**Thick models of value** (vs. utility functions/preference orderings):
- Distinguish enduring values from temporary preferences
- Model how individual choices embed within social contexts
- Enable normative reasoning across new domains
**Five implementation mechanisms**:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
## Agent Notes
**Why this matters:** This paper frames alignment as a system-level problem — not just model alignment but institutional alignment. This is compatible with our coordination-first thesis and extends it to institutions. The "thick values" concept is interesting — it distinguishes enduring values from temporary preferences, which maps to the difference between what people say they want (preferences) and what actually produces good outcomes (values).
**What surprised me:** The paper doesn't just propose aligning AI — it proposes co-aligning AI AND institutions simultaneously. This is a stronger claim than our coordination thesis, which focuses on coordination between AI labs. Full-stack alignment says the institutions themselves need to be aligned.
**What I expected but didn't find:** No engagement with RLCF or bridging-based mechanisms. No formal impossibility results. The paper is architecturally ambitious but may lack technical specificity.
**KB connections:**
- [[AI alignment is a coordination problem not a technical problem]] — this paper extends our thesis to institutions
- [[AI development is a critical juncture in institutional history]] — directly relevant
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — "thick values" is a formalization of continuous value integration
**Extraction hints:** Claims about (1) alignment requiring institutional co-alignment, (2) thick vs thin models of value, (3) five implementation mechanisms.
**Context:** Early-stage paper (December 2025), ambitious scope.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[AI alignment is a coordination problem not a technical problem]]
WHY ARCHIVED: Extends coordination-first thesis to institutions — "full-stack alignment" is a stronger version of our existing claim
EXTRACTION HINT: The "thick models of value" concept may be the most extractable novel claim

---
type: source
title: "AI Alignment Cannot Be Top-Down"
author: "Audrey Tang (@audreyt)"
url: https://ai-frontiers.org/articles/ai-alignment-cannot-be-top-down
date: 2026-01-01
domain: ai-alignment
secondary_domains: [collective-intelligence, mechanisms]
format: article
status: unprocessed
priority: high
tags: [rlcf, bridging-consensus, polis, democratic-alignment, attentiveness, community-feedback]
flagged_for_rio: ["RLCF as mechanism design — bridging algorithms are formally a mechanism design problem"]
---
## Content
Audrey Tang (Taiwan's cyber ambassador, first digital minister, 2025 Right Livelihood Laureate) argues that AI alignment cannot succeed through top-down corporate control. The current landscape of AI alignment is dominated by a handful of private corporations setting goals, selecting data, and defining "acceptable" behavior behind closed doors.
Tang proposes "attentiveness" — giving citizens genuine power to steer technology through democratic participation. The framework has three mutually reinforcing mechanisms:
1. **Industry norms**: Public model specifications making AI decision-making legible. Citation-at-inference mechanisms for auditable reasoning traces. Portability mandates enabling users to switch platforms.
2. **Market design**: Mechanisms that make democratic alignment economically viable.
3. **Community-scale assistants**: Local tuning of global models through community feedback.
**RLCF (Reinforcement Learning from Community Feedback)**: Models are rewarded for output that people with opposing views find reasonable. This transforms disagreement into sense-making rather than suppressing minority perspectives. RLCF is described as training AI systems using diverse, aggregated community signals instead of engineered rewards.
**Polis**: A machine learning platform that performs real-time analysis of public votes to build consensus on policy debates. Bridging notes gain prominence only when rated helpful by people holding different perspectives — operationalizing "uncommon ground."
**Taiwan empirical evidence**: Deliberative assemblies of 447 randomly selected citizens achieved unanimous parliamentary support for new laws on AI-generated scam content within months — without content suppression.
The framework emphasizes integrity infrastructure including oversight by citizen bodies and transparent logs, making AI-enabled mediation adaptive, pluralistic, and auditable.
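The article describes RLCF at the level of philosophy, not engineering, so as a working sketch of the bridging property only: a reward capped by the least-approving opinion cluster, so suppressing a minority perspective also suppresses the reward. The function name and the min-over-clusters rule are illustrative assumptions, not Tang's specification:

```python
import numpy as np

def bridging_reward(ratings, cluster_ids):
    """Toy bridging reward: a response scores highly only if EVERY opinion
    cluster finds it reasonable, not just the majority.

    ratings: shape (n_raters,), values in [0, 1].
    cluster_ids: shape (n_raters,), each rater's opinion cluster
                 (e.g. from Polis-style vote clustering).
    """
    ratings = np.asarray(ratings, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Mean approval within each cluster, then min across clusters:
    # the least-approving group caps the reward.
    per_cluster = [ratings[cluster_ids == c].mean() for c in np.unique(cluster_ids)]
    return float(min(per_cluster))
```

Contrast with a plain mean: three majority raters at 1.0 and one minority cluster at 0.0 average to 0.75, but the bridging reward is 0.0, which is exactly the structural difference from single-reward aggregation.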
## Agent Notes
**Why this matters:** This is the most complete articulation of RLCF as an alternative to RLHF I've found. It directly addresses our gap between negative claims (Arrow's impossibility) and constructive alternatives. RLCF doesn't aggregate preferences into a single function — it finds bridging output that diverse groups accept. This may operate outside Arrow's conditions entirely.
**What surprised me:** Tang doesn't engage Arrow's theorem directly. The article doesn't formalize why bridging-based consensus sidesteps social choice impossibility — it just describes the mechanism. This is a theoretical gap worth filling. Also, the Taiwan evidence (447 citizens → unanimous parliamentary support) is remarkably efficient for democratic input.
**What I expected but didn't find:** No technical specification of RLCF. No comparison with RLHF/DPO architecturally. No formal analysis of when bridging consensus fails. The mechanism is described at the level of philosophy, not engineering.
**KB connections:**
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — RLCF may sidestep this by not aggregating into a single function
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones]] — Taiwan evidence extends this
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — RLCF is explicitly designed to handle preference diversity
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — CIP + Tang's framework is building this infrastructure
**Extraction hints:** Claims about (1) RLCF as structural alternative to single-reward alignment, (2) bridging-based consensus as Arrow's workaround, (3) democratic alignment scaling to policy outcomes (Taiwan evidence), (4) attentiveness as alignment paradigm.
**Context:** Audrey Tang is globally recognized for Taiwan's digital democracy innovations. Tang's vTaiwan platform and Polis deployments are the most successful real-world implementations of computational democracy. This isn't theoretical — it's policy-tested.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: RLCF is the first mechanism I've seen that might structurally handle preference diversity without hitting Arrow's impossibility — the constructive alternative our KB needs
EXTRACTION HINT: Focus on (1) whether RLCF formally sidesteps Arrow's theorem and (2) the Taiwan evidence as democratic alignment at policy scale

---
type: source
title: "Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment"
author: "Zhiyu An, Wan Du"
url: https://arxiv.org/abs/2602.03003
date: 2026-02-01
domain: ai-alignment
secondary_domains: [mechanisms, collective-intelligence]
format: paper
status: unprocessed
priority: medium
tags: [differentiable-social-choice, learned-mechanisms, voting-rules, rlhf-as-voting, impossibility-as-tradeoff, open-problems]
flagged_for_rio: ["Differentiable auctions and economic mechanisms — direct overlap with mechanism design territory"]
---
## Content
Published February 2026. Comprehensive survey of differentiable social choice — an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data.
**Key insight**: Contemporary ML systems already implement social choice mechanisms implicitly and without normative scrutiny. RLHF is implicit voting.
**Classical impossibility results reappear** as objectives, constraints, and optimization trade-offs when mechanisms are learned rather than designed.
**Six interconnected domains surveyed**:
1. Differentiable Economics — learning-based approximations to optimal auctions/contracts
2. Neural Social Choice — synthesizing/analyzing voting rules using deep learning
3. AI Alignment as Social Choice — RLHF as implicit voting
4. Participatory Budgeting
5. Liquid Democracy
6. Inverse Mechanism Learning
**18 open problems** spanning incentive guarantees, robustness, certification, pluralistic preference aggregation, and governance of alignment objectives.
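A minimal sketch of what "voting rules as learnable models" means: a positional scoring rule whose position weights are the learnable parameters, with Borda and plurality as special cases. In a differentiable pipeline the final argmax would be relaxed to a softmax and the weights optimized by gradient descent. This is my illustration of the paradigm, not a construction from the survey:

```python
import numpy as np

def positional_score_winner(rankings, weights):
    """Positional scoring rule over full rankings.

    rankings: shape (n_voters, n_candidates); rankings[v][p] is the candidate
              voter v places in position p (0 = top choice).
    weights:  shape (n_candidates,); score awarded per position. These are the
              learnable parameters: Borda is [m-1, ..., 1, 0],
              plurality is [1, 0, ..., 0].
    """
    rankings = np.asarray(rankings)
    n_voters, m = rankings.shape
    scores = np.zeros(m)
    for v in range(n_voters):
        for pos in range(m):
            scores[rankings[v, pos]] += weights[pos]
    # A differentiable variant would return softmax(scores) instead of argmax.
    return int(scores.argmax())
```

The point of the survey's framing is that once the rule is parameterized like this, Arrow-style impossibility shows up as a trade-off surface over `weights` rather than a binary wall.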
## Agent Notes
**Why this matters:** This paper makes the implicit explicit: RLHF IS social choice, and the field needs to treat it that way. The framing of impossibility results as optimization trade-offs (not brick walls) is important — it means you can learn mechanisms that navigate the trade-offs rather than being blocked by them. This is the engineering counterpart to the theoretical impossibility results.
**What surprised me:** The sheer breadth — from auctions to liquid democracy to alignment, all unified under differentiable social choice. This field didn't exist 5 years ago and now has 18 open problems. Also, "inverse mechanism learning" — learning what mechanism produced observed outcomes — could be used to DETECT what social choice function RLHF is implicitly implementing.
**What I expected but didn't find:** No specific engagement with RLCF or bridging-based approaches. The paper is a survey, not a solution proposal.
**KB connections:**
- [[designing coordination rules is categorically different from designing coordination outcomes]] — differentiable social choice designs rules that learn outcomes
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies]] — impossibility results become optimization constraints
**Extraction hints:** Claims about (1) RLHF as implicit social choice without normative scrutiny, (2) impossibility results as optimization trade-offs not brick walls, (3) differentiable mechanisms as learnable alternatives to designed ones.
**Context:** February 2026 — very recent comprehensive survey. Signals field maturation.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[designing coordination rules is categorically different from designing coordination outcomes as nine intellectual traditions independently confirm]]
WHY ARCHIVED: RLHF-as-social-choice framing + impossibility-as-optimization-tradeoff = new lens on our coordination thesis
EXTRACTION HINT: Focus on "RLHF is implicit social choice" and "impossibility as optimization trade-off" — these are the novel framing claims