extract: 2025-11-00-sahoo-rlhf-alignment-trilemma
Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>
parent 38fed641fd
commit 49716bd392
3 changed files with 55 additions and 1 deletion

@@ -45,6 +45,12 @@ An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doi

EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining WHY single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
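
To make the identifiability failure concrete, here is a minimal numerical sketch (illustrative; the reward vectors below are assumed, not taken from the EM-DPO paper): a 50/50 mixture of two opposed Bradley-Terry annotator types matches a single indifferent reward model on every binary comparison, yet the two models are easy to tell apart from rankings over three responses.

```python
import itertools
import math

def bt_pair(r, i, j):
    """Bradley-Terry probability that response i beats response j under rewards r."""
    return 1 / (1 + math.exp(-(r[i] - r[j])))

def pl_ranking(r, order):
    """Plackett-Luce probability of a full ranking `order` under rewards r."""
    prob, rest = 1.0, list(order)
    while len(rest) > 1:
        z = sum(math.exp(r[k]) for k in rest)
        prob *= math.exp(r[rest[0]]) / z
        rest.pop(0)
    return prob

# Two opposed preference types (assumed rewards over responses 0, 1, 2)
# versus a single indifferent reward model.
r_type1, r_type2, r_single = [2.0, 0.0, -2.0], [-2.0, 0.0, 2.0], [0.0, 0.0, 0.0]

for i, j in [(0, 1), (1, 2), (0, 2)]:
    mix = 0.5 * bt_pair(r_type1, i, j) + 0.5 * bt_pair(r_type2, i, j)
    print(f"P({i} beats {j}): mixture={mix:.3f}, single={bt_pair(r_single, i, j):.3f}")
# Every pairwise probability is 0.500 for both models: binary comparisons
# carry no signal about the heterogeneity, at any dataset size.

for order in itertools.permutations(range(3)):
    mix = 0.5 * pl_ranking(r_type1, order) + 0.5 * pl_ranking(r_type2, order)
    print(order, f"mixture={mix:.3f}, single={pl_ranking(r_single, order):.3f}")
# Ranking probabilities differ sharply (about 0.383 vs 0.167 for 0>1>2),
# so rankings over 3+ responses do expose the two preference types.
```
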
### Additional Evidence (confirm)

*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*

Formal proof that preference collapse is theoretically inevitable: single-reward RLHF cannot capture multimodal preferences even in principle. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation — a 3-4 order of magnitude shortfall that explains why minority alignment gaps grow with distinctiveness.
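
As a back-of-envelope check on that shortfall (the per-group sample count and minority frequency below are assumed for illustration, not taken from the paper): under uniform sampling, observing k preferences from a subpopulation with global frequency f takes roughly k / f total annotations.

```python
# Coverage arithmetic, a hedged sketch with assumed illustrative numbers:
# seeing k samples from a group of global frequency f needs ~k / f draws.
samples_per_group = 100      # assumed minimum to estimate one group's preferences
minority_frequency = 1e-5    # assumed frequency of a distinctive minority
current_pool = 1e4           # upper end of the 10^3-10^4 pools cited below

required = samples_per_group / minority_frequency
print(f"required ≈ {required:.0e} samples; shortfall ≈ {required / current_pool:.0e}x")
# required ≈ 1e+07 samples; shortfall ≈ 1e+03x, i.e. the 3-4 order-of-magnitude
# gap the paper quantifies, and it widens as the group gets more distinctive.
```
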
---

Relevant Notes:

@@ -0,0 +1,36 @@
{
  "rejected_claims": [
    {
      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 6,
    "rejected": 2,
    "fixes_applied": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:set_created:2026-03-16",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-"
    ],
    "rejections": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-16"
}

@@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
-status: unprocessed
+status: enrichment
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@@ -56,3 +60,11 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of the Arrow's-theorem-based argument (sketched below) from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
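
For context on that Arrow-based argument, a self-contained sketch (the textbook Condorcet profile; the three-group setup is assumed for illustration, not taken from the paper) of why no single reward function can respect every group's majority preference:

```python
# Classic Condorcet profile: three annotator groups with cyclic majority
# preferences over responses A, B, C (group names and orderings assumed).
groups = {
    "group1": ["A", "B", "C"],
    "group2": ["B", "C", "A"],
    "group3": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """True if a strict majority of groups rank x above y."""
    votes = sum(1 for order in groups.values() if order.index(x) < order.index(y))
    return votes > len(groups) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True, giving the cycle A > B > C > A. No single
# reward function can satisfy r(A) > r(B) > r(C) > r(A), so any
# single-reward aggregate must overrule at least one majority preference.
```
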
## Key Facts

- Paper presented at the NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
- Authors affiliated with the Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
- Current RLHF systems collect 10^3-10^4 samples from annotator pools
- True global representation would require 10^7-10^8 samples
- Bias amplification in current systems: models assign >99% probability to majority opinions (see the sketch below)
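
A minimal sketch of how that last figure can arise (illustrative numbers assumed; this uses the closed-form optimum of the standard KL-regularized RLHF objective rather than anything from the paper): a Bradley-Terry reward fit to a 60/40 preference split is sharpened past 99% as the KL penalty weakens.

```python
import math

# Bradley-Terry reward gap implied by a 60/40 human preference split
# (assumed illustrative split; the BT fit gives r_a - r_b = log(p / (1 - p))).
p_majority = 0.60
reward_gap = math.log(p_majority / (1 - p_majority))  # ≈ 0.405

# Closed-form optimum of the KL-regularized objective with a uniform
# reference policy: pi(y) ∝ pi_ref(y) * exp(r(y) / beta).
for beta in (1.0, 0.2, 0.05):  # KL regularization weakens as beta shrinks
    pi_majority = 1 / (1 + math.exp(-reward_gap / beta))
    print(f"beta={beta:>4}: policy probability on majority answer = {pi_majority:.4f}")
# beta=1.0 reproduces the 60/40 split; beta=0.05 exceeds 0.999, i.e. a
# moderate disagreement collapses past the >99% figure cited above.
```
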