extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
Teleo Agents 2026-03-16 15:31:50 +00:00
parent af067944f1
commit 4781180de9
4 changed files with 62 additions and 1 deletions

@ -45,6 +45,12 @@ Comprehensive February 2026 survey by An & Du documents that contemporary ML sys
EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation based on egalitarian fairness principles, demonstrating that pluralistic alignment requires choosing a specific social welfare function (here: maximin regret) rather than pretending aggregation is value-neutral.
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The trilemma formalizes why implicit social choice in RLHF is problematic: computational constraints force strategic relaxation of at least one of representativeness, robustness, or tractability. Current RLHF implementations implicitly choose tractability, which mathematically necessitates sacrificing representativeness (homogeneous annotator pools) and robustness (vulnerability to distribution shift). The trilemma therefore forces the normative choice into the open: which property are we willing to sacrifice?
---
Relevant Notes:

@ -45,6 +45,12 @@ An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doi
EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining WHY single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The alignment gap is not just proportional to minority distinctiveness; it is super-polynomial in context dimensionality. Sahoo et al. prove that achieving epsilon <= 0.01 representativeness and delta <= 0.001 robustness requires Omega(2^{d_context}) operations. Current systems use 10^3-10^4 samples, while 10^7-10^8 are needed for global representation. Because the gap compounds exponentially with the dimensionality of human values, it is structurally impossible to close through incremental improvements.
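
To give a sense of scale for these figures, the following minimal sketch computes the ratio between the quoted sample counts and shows how a 2^d lower bound grows with d. The function names and the example values of d are illustrative assumptions, not taken from Sahoo et al.; the bound is treated as exactly 2^d, ignoring the constant factors that Omega notation hides.

```python
def sample_gap(current: int, required: int) -> float:
    """Return how many times more samples are required than are collected."""
    return required / current


def lower_bound_ops(d_context: int) -> int:
    """Illustrative 2^d growth; constants hidden by Omega are ignored (assumption)."""
    return 2 ** d_context


# Sample counts quoted in the note: 10^3-10^4 collected, 10^7-10^8 needed.
narrowest = sample_gap(current=10**4, required=10**7)  # most favourable pairing
widest = sample_gap(current=10**3, required=10**8)     # least favourable pairing
print(f"gap between {narrowest:.0e}x and {widest:.0e}x")

# Hypothetical dimensionalities, chosen only to show the growth rate.
for d in (10, 20, 30):
    print(d, lower_bound_ops(d))
```

Even under the most favourable pairing of the quoted ranges, the shortfall is three orders of magnitude, and each additional context dimension doubles the illustrative bound.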
---
Relevant Notes:

@ -0,0 +1,36 @@
{
"rejected_claims": [
{
"filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 6,
"rejected": 2,
"fixes_applied": [
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:set_created:2026-03-16",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
],
"rejections": [
"rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-16"
}
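
The report above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the pipeline's own validator: the `summarize` function and the keys it returns are hypothetical, and it assumes each `fixes_applied` entry has the form `filename:action:detail`, which matches the entries shown. It also assumes that `fixed` counts fix operations rather than claims, since the report lists 6 fixes for 2 claims.

```python
from collections import Counter


def summarize(report: dict) -> dict:
    """Tally fixes per (file, action) and check the counters for consistency."""
    stats = report["validation_stats"]
    per_file_action = Counter()
    for entry in stats["fixes_applied"]:
        # Assumed entry layout: "<filename>:<action>:<detail>".
        filename, action, _detail = entry.split(":", 2)
        per_file_action[(filename, action)] += 1
    return {
        "fixes_per_file_action": dict(per_file_action),
        # Every claim should be either kept or rejected.
        "claims_accounted_for": stats["kept"] + stats["rejected"] == stats["total"],
        # Assumption: "fixed" counts fix operations, not claims.
        "fix_count_matches": stats["fixed"] == len(stats["fixes_applied"]),
    }


# Abbreviated stand-in for the report; "a.md" and "b.md" are placeholder names.
example = {
    "validation_stats": {
        "total": 2,
        "kept": 0,
        "fixed": 3,
        "rejected": 2,
        "fixes_applied": [
            "a.md:set_created:2026-03-16",
            "a.md:stripped_wiki_link:some-link",
            "b.md:set_created:2026-03-16",
        ],
    }
}
print(summarize(example))
```

Loading the real file with `json.load` and passing the result to `summarize` would apply the same checks to the full report.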

@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
processed_by: theseus
processed_date: 2026-03-16
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
## Key Facts
- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
- Current RLHF systems collect 10^3-10^4 samples from annotator pools
- True global representation would require 10^7-10^8 samples
- Models assign >99% probability to majority opinions in current implementations
- Paper proposes three strategic relaxation pathways: constrain representativeness to ~30 core values, scope robustness to plausible threats, or accept super-polynomial costs for high-stakes applications