Compare commits
2 commits
fde7be1748...6a8a7464b4
| Author | SHA1 | Date |
|---|---|---|
| | 6a8a7464b4 | |
| | 30c4314b1b | |
6 changed files with 77 additions and 1 deletion

@@ -25,6 +25,12 @@ Since [[universal alignment is mathematically impossible because Arrows impossib

MaxMin-RLHF provides a constructive implementation of pluralistic alignment through mixture-of-rewards and egalitarian optimization. Rather than forcing preferences to converge, it learns separate reward models for each subpopulation and optimizes for the worst-off group (Sen's egalitarian principle). At Tulu2-7B scale, this achieved a 56.67% win rate across both majority and minority groups, compared to single-reward training's 70.4%/42% split. The mechanism accommodates irreducible diversity by maintaining separate reward functions rather than forcing convergence.
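The egalitarian objective can be sketched in a few lines; this is a toy illustration of the max-min rule, not the paper's implementation, and the reward functions and candidates below are invented for the example.

```python
# Illustrative sketch of MaxMin-RLHF's egalitarian objective: score a
# candidate policy by the worst-off group's mean reward and keep the
# candidate that maximizes that minimum.

def maxmin_score(responses, group_reward_fns):
    """Minimum over groups of the mean reward that group's model assigns."""
    return min(
        sum(fn(r) for r in responses) / len(responses)
        for fn in group_reward_fns
    )

# Toy groups: the majority's reward model prefers ~4-character answers,
# the minority's prefers ~2-character answers.
majority = lambda r: -abs(len(r) - 4)
minority = lambda r: -abs(len(r) - 2)

candidates = [["aaaa"], ["aa"], ["aaa"]]
best = max(candidates, key=lambda c: maxmin_score(c, [majority, minority]))
# max-min selects the compromise "aaa": the worst-off group loses only 1,
# whereas "aaaa" and "aa" each leave one group at -2
```

A single-reward analogue would instead average the two reward models, letting the larger group's preference dominate; the max over the minimum is what protects the minority.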

### Additional Evidence (confirm)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*

EM-DPO implements this through an ensemble architecture in which separate models serve different preference types, with MinMax Regret Aggregation (MMRA) ensuring no group is severely underserved. The system maintains diversity by training type-specific models rather than forcing convergence to a single reward function.

---

Relevant Notes:

@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c

RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.
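The bridging mechanism can be sketched as matrix factorization with per-item intercepts. This is a toy reconstruction under assumed details (gradient updates, latent dimension, synthetic data), not RLCF's actual code: the intercept absorbs agreement shared across user factions, while the latent factors absorb polarized, faction-specific agreement, so training on intercepts rewards cross-partisan consensus.

```python
import numpy as np

# Toy sketch: approval[u, i] ~ b[i] + U[u] @ V[i]. The intercept b[i]
# captures agreement common to all factions; U @ V captures polarized
# agreement. Bridging items end up with the highest intercepts.
rng = np.random.default_rng(0)

def fit_intercepts(A, dim=1, lr=0.1, steps=2000, reg=0.05):
    n_users, n_items = A.shape
    U = rng.normal(scale=0.1, size=(n_users, dim))
    V = rng.normal(scale=0.1, size=(n_items, dim))
    b = np.zeros(n_items)
    for _ in range(steps):
        err = b + U @ V.T - A          # prediction residual
        b -= lr * err.mean(axis=0)     # intercepts track shared agreement
        U -= lr * (err @ V / n_items + reg * U)
        V -= lr * (err.T @ U / n_users + reg * V)
    return b

# Two factions of users: item 0 is approved by everyone (bridging),
# items 1 and 2 are each approved by only one faction.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
b = fit_intercepts(A)
# item 0, approved across factions, receives the highest intercept
```

Using `b` rather than raw approval rates as the training signal is the normative choice the note describes: it deliberately privileges cross-partisan agreement over within-faction popularity.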

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*

EM-DPO makes the social choice explicit by using MinMax Regret Aggregation based on egalitarian principles. This shows that pluralistic alignment requires both discovering preference types (EM) and choosing an aggregation principle (MMRA), with the latter being an unavoidable normative choice.
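The MMRA rule can be sketched concretely; this is a hypothetical illustration of minmax-regret selection at inference time, with invented reward numbers rather than the paper's setup.

```python
# Sketch of MinMax Regret Aggregation (MMRA) at inference time, when the
# user's preference type is unknown: pick the candidate response whose
# worst-case regret across types is smallest.

def mmra_select(rewards):
    """rewards: {type: [reward per candidate]}. Returns chosen candidate index."""
    n_candidates = len(next(iter(rewards.values())))
    best_per_type = {t: max(r) for t, r in rewards.items()}

    def worst_case_regret(c):
        # regret for type t = best reward t could get minus what c gives t
        return max(best_per_type[t] - rewards[t][c] for t in rewards)

    return min(range(n_candidates), key=worst_case_regret)

# Candidate 0 delights type A but badly underserves type B (and vice versa
# for candidate 1); candidate 2 is acceptable to both, so the egalitarian
# rule selects it.
rewards = {"A": [1.0, 0.2, 0.8], "B": [0.1, 1.0, 0.8]}
choice = mmra_select(rewards)  # -> 2
```

A utilitarian rule (maximize summed reward) would be indifferent between all three candidates here (each sums to 1.1, 1.2, 1.6 — it would pick 2 as well, but only by accident of the numbers); the minmax-regret criterion picks the compromise by construction, which is the normative choice the note highlights.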

---

Relevant Notes:

@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm

Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

### Additional Evidence (extend)

*Source: [[2025-00-00-em-dpo-heterogeneous-preferences]] | Added: 2026-03-16*

EM-DPO demonstrates that the problem is deeper than single-reward optimization — binary comparisons are formally insufficient for preference type identification. Even with perfect optimization, pairwise RLHF/DPO cannot detect heterogeneity because the data format lacks the necessary information structure. Rankings over 3+ responses are required.
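The information-structure point can be made concrete with a small counting example (an illustration in the spirit of the claim, not the paper's proof): two different mixtures of preference types over three responses induce exactly the same pairwise choice probabilities, so no amount of binary data can tell them apart, while their distributions over full rankings differ.

```python
from itertools import combinations

# Two different 50/50 mixtures of strict rankings over {a, b, c} that are
# indistinguishable from binary comparisons alone.

def pairwise_probs(mixture):
    """mixture: list of rankings (best-first). P(x beats y) under a uniform mix."""
    items = sorted(mixture[0])
    probs = {}
    for x, y in combinations(items, 2):
        wins = sum(r.index(x) < r.index(y) for r in mixture)
        probs[(x, y)] = wins / len(mixture)
    return probs

mix1 = ["abc", "cba"]  # one type and its exact reverse
mix2 = ["acb", "bca"]  # a different pair of opposed types

assert pairwise_probs(mix1) == pairwise_probs(mix2)  # binary data identical
assert set(mix1) != set(mix2)                        # but the latent types differ
```

Rankings over 3+ responses break the tie: the full-ranking distribution of `mix1` puts mass on "abc", which `mix2` never produces, so the type structure becomes identifiable.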

---

Relevant Notes:

@@ -65,6 +65,7 @@ The futarchy governance protocol on Solana. Implements decision markets through

- **2024-01-24** — Proposed AMM program to replace CLOB markets, addressing liquidity fragmentation and state rent costs (Proposal CF9QUBS251FnNGZHLJ4WbB2CVRi5BtqJbCqMi47NX1PG)
- **2024-01-29** — AMM proposal passed with 400 META on approval and 800 META on completion budget
- **2024-08-31** — Passed proposal to enter a services agreement with Organization Technology LLC, creating a US entity vehicle for paying contributors with a $1.378M annualized burn rate. The entity owns no IP (all owned by MetaDAO LLC) and cannot encumber MetaDAO LLC. The agreement is cancellable with 30 days' notice, or immediately for material breach.
- **2024-08-28** — Futardio memecoin launchpad proposal created (failed 2024-09-01)

## Key Decisions

| Date | Proposal | Proposer | Category | Outcome |
|------|----------|----------|----------|---------|

@@ -0,0 +1,47 @@

{
  "rejected_claims": [
    {
      "filename": "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-or-more-responses.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 10,
    "rejected": 3,
    "fixes_applied": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-or-more-responses.md:set_created:2026-03-16",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-or-more-responses.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-or-more-responses.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:set_created:2026-03-16",
      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:modeling preference sensitivity as a learned distribution ra",
      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:stripped_wiki_link:pluralistic alignment must accommodate irreducibly diverse v",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md:set_created:2026-03-16",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md:stripped_wiki_link:maxmin-rlhf-applies-egalitarian-social-choice-to-alignment-b",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md:stripped_wiki_link:post-arrow-social-choice-mechanisms-work-by-weakening-indepe",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md:stripped_wiki_link:pluralistic-ai-alignment-through-multiple-systems-preserves-"
    ],
    "rejections": [
      "binary-preference-comparisons-cannot-identify-latent-preference-types-requiring-rankings-over-three-or-more-responses.md:missing_attribution_extractor",
      "em-algorithm-discovers-latent-preference-subpopulations-enabling-ensemble-alignment-without-demographic-labels.md:missing_attribution_extractor",
      "minmax-regret-aggregation-ensures-no-preference-group-is-severely-underserved-by-applying-egalitarian-social-choice-to-ensemble-deployment.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-16"
}

@@ -7,9 +7,13 @@ date: 2025-01-01

domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: medium
tags: [pluralistic-alignment, EM-algorithm, preference-clustering, ensemble-LLM, fairness]
processed_by: theseus
processed_date: 2026-03-16
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@@ -39,3 +43,9 @@ EM-DPO uses expectation-maximization to simultaneously uncover latent user prefe

PRIMARY CONNECTION: RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values

WHY ARCHIVED: The binary-comparison insufficiency claim is a novel formal result that strengthens the case against standard alignment approaches

EXTRACTION HINT: Focus on the formal insufficiency of binary comparisons and the EM + egalitarian aggregation combination

## Key Facts

- EM-DPO paper presented at EAAMO 2025 (Equity and Access in Algorithms, Mechanisms, and Optimization)
- MMRA operates at inference time when user type is unknown, distinct from training-time aggregation approaches
- The approach requires rankings over 3+ responses rather than binary comparisons for preference type identification