extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
Teleo Agents 2026-03-16 14:51:10 +00:00
parent a5e1e96dba
commit 4c39e34e6f
4 changed files with 63 additions and 1 deletions


@@ -39,6 +39,12 @@ RLCF makes the social choice mechanism explicit through the bridging algorithm (
Comprehensive February 2026 survey by An & Du documents that contemporary ML systems implement social choice mechanisms implicitly across RLHF, participatory budgeting, and liquid democracy applications, with 18 identified open problems spanning incentive guarantees and pluralistic preference aggregation.
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The trilemma formalizes why RLHF's implicit social choice is problematic: achieving epsilon-representativeness (epsilon <= 0.01) and delta-robustness (delta <= 0.001) simultaneously requires super-polynomial compute, making the 'strategic relaxation' of representativeness a practical necessity that RLHF implementations make without explicit acknowledgment.
---
Relevant Notes:


@@ -39,6 +39,12 @@ Study demonstrates that models trained on different demographic populations show
An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doing social choice (preference aggregation) but treating it as an engineering detail rather than a normative design choice, which means the aggregation function is chosen implicitly and without examination of which fairness criteria it satisfies.
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The formal trilemma proof shows preference collapse is not just empirically observed but mathematically necessary: single-reward RLHF cannot capture multimodal preferences even in theory. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation, a shortfall of roughly four orders of magnitude (three to five across the stated ranges).
---
Relevant Notes:


@@ -0,0 +1,37 @@
{
"rejected_claims": [
{
"filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 7,
"rejected": 2,
"fixes_applied": [
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
],
"rejections": [
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-16"
}
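The counters in `validation_stats` can be cross-checked against the lists they summarize. The sketch below is a minimal illustration, not part of the extraction pipeline: the function name `check_validation_stats` is hypothetical, the sample filenames are shortened placeholders, and the invariants (`fixed` equals the number of `fixes_applied` entries, `kept + rejected` equals `total`) are assumptions inferred from the values in the report above, which happen to satisfy them.

```python
import json


def check_validation_stats(report: dict) -> list[str]:
    """Return human-readable inconsistencies; an empty list means none found.

    The invariants checked here are assumptions inferred from one report,
    not a documented schema.
    """
    stats = report["validation_stats"]
    problems = []
    if stats["fixed"] != len(stats["fixes_applied"]):
        problems.append("fixed does not match len(fixes_applied)")
    if stats["rejected"] != len(stats["rejections"]):
        problems.append("rejected does not match len(rejections)")
    if stats["rejected"] != len(report["rejected_claims"]):
        problems.append("rejected does not match len(rejected_claims)")
    if stats["kept"] + stats["rejected"] != stats["total"]:
        problems.append("kept + rejected does not equal total")
    return problems


# Shortened, hypothetical sample with the same shape as the report above.
sample = json.loads("""
{
  "rejected_claims": [
    {"filename": "claim-a.md", "issues": ["missing_attribution_extractor"]},
    {"filename": "claim-b.md", "issues": ["missing_attribution_extractor"]}
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 3,
    "rejected": 2,
    "fixes_applied": [
      "claim-a.md:set_created:2026-03-16",
      "claim-a.md:stripped_wiki_link:example",
      "claim-b.md:set_created:2026-03-16"
    ],
    "rejections": [
      "claim-a.md:missing_attribution_extractor",
      "claim-b.md:missing_attribution_extractor"
    ]
  }
}
""")

print(check_validation_stats(sample))
```

Note that under this reading `fixed` counts individual fixes rather than files, which is why it can exceed `total`.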


@@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed status: enrichment
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
processed_by: theseus
processed_date: 2026-03-16
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
## Key Facts
- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
- Current RLHF systems collect 10^3-10^4 samples from annotator pools
- True global representation would require 10^7-10^8 samples
- Models assign >99% probability to majority opinions in documented cases
- Three strategic relaxation pathways proposed: constrain representativeness to ~30 core values, scope robustness narrowly to plausible threats, or accept super-polynomial costs for high-stakes applications
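
The sample-count figures in the Key Facts imply a gap whose size depends on which endpoints are compared. The following is a small arithmetic sketch only; the function name `gap_in_orders` is hypothetical, and the ranges are taken directly from the figures listed above (10^3-10^4 collected, 10^7-10^8 required), expressed as base-10 exponents so the arithmetic stays exact.

```python
def gap_in_orders(collected_exp: tuple[int, int], required_exp: tuple[int, int]) -> tuple[int, int]:
    """Return (smallest, largest) gap in orders of magnitude.

    Each argument is a (low, high) pair of base-10 exponents.
    The smallest gap compares the most samples collected with the fewest required;
    the largest gap compares the fewest collected with the most required.
    """
    smallest = required_exp[0] - collected_exp[1]
    largest = required_exp[1] - collected_exp[0]
    return smallest, largest


collected = (3, 4)  # 10^3 to 10^4 samples collected
required = (7, 8)   # 10^7 to 10^8 samples required

smallest, largest = gap_in_orders(collected, required)
print(f"gap spans {smallest} to {largest} orders of magnitude")
```

Comparing like endpoints (10^3 with 10^7, or 10^4 with 10^8) gives four orders of magnitude in both cases, which sits in the middle of that span.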