extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
This commit is contained in:
Teleo Agents 2026-03-16 11:36:50 +00:00
parent 87dd668402
commit 5c758113bf
5 changed files with 70 additions and 1 deletion


@@ -30,6 +30,12 @@ This claim rests on a single source—a research strategy document rather than e
- Ensemble methods or mixture models can capture diverse subpopulations
- The outlier-erasure effect is implementation-dependent rather than fundamental
### Additional Evidence (confirm)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
Bias amplification in RLHF is proven to be a computational necessity: sample-efficiency constraints force overfitting to the majority distribution, so models assign >99% probability to majority opinions, functionally erasing minority perspectives. This is not a correctable bias but a direct consequence of choosing tractability over representativeness in the trilemma.
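A back-of-envelope illustration of this collapse (our construction, not the paper's: the 95/5 split, uniform reference policy, and beta = 0.1 are assumed for the example): fitting a single Bradley-Terry reward to pooled preferences and then applying the standard KL-regularized RLHF policy update turns a 95% majority preference into >99.99% policy probability.

```python
import math

# Pooled pairwise preferences: 95% of annotators prefer response A over B.
p_majority = 0.95

# A single Bradley-Terry reward fit by MLE recovers exactly the pooled rate:
# sigmoid(r_A - r_B) = 0.95  =>  r_A - r_B = logit(0.95) ~= 2.94
reward_gap = math.log(p_majority / (1 - p_majority))

# Standard KL-regularized RLHF policy: pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
# Assume a uniform reference policy over {A, B} and beta = 0.1 (our choice).
beta = 0.1
p_policy_A = 1 / (1 + math.exp(-reward_gap / beta))

print(f"reward gap: {reward_gap:.2f}")
print(f"policy probability of majority answer: {p_policy_A:.6f}")  # ~1.000000
# A 95/5 preference split collapses to >99.99% probability on the majority
# answer; the minority preference is functionally erased.
```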
---
Relevant Notes:


@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c
RLCF makes the social choice mechanism explicit through its bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF, which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement: a specific social welfare function.
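A minimal sketch of bridging-style matrix factorization (our reconstruction of the general technique in the spirit of Community Notes; RLCF's actual hyperparameters, loss, and data are not specified in this note): each rating is modeled as a global mean plus rater and item intercepts plus a low-rank interaction, with the latent factors regularized more heavily so that partisan axes of disagreement are absorbed there and the item intercept captures cross-partisan agreement.

```python
import numpy as np

def bridging_factorization(ratings, n_raters, n_items, k=1,
                           lam_latent=0.15, lam_intercept=0.03,
                           lr=0.05, epochs=200, seed=0):
    """Fit r_ui ~= mu + b_u + b_i + f_u . g_i by SGD.

    Latent factors are penalized harder than intercepts, so systematic
    (e.g. partisan) disagreement is soaked up by the f_u . g_i term and
    the item intercept b_i reflects agreement that persists across factions.
    ratings: list of (rater_idx, item_idx, value) triples.
    """
    rng = np.random.default_rng(seed)
    mu = np.mean([v for _, _, v in ratings])
    b_u, b_i = np.zeros(n_raters), np.zeros(n_items)
    f = rng.normal(0, 0.1, (n_raters, k))
    g = rng.normal(0, 0.1, (n_items, k))
    for _ in range(epochs):
        for u, i, v in ratings:
            err = v - (mu + b_u[u] + b_i[i] + f[u] @ g[i])
            b_u[u] += lr * (err - lam_intercept * b_u[u])
            b_i[i] += lr * (err - lam_intercept * b_i[i])
            f[u], g[i] = (f[u] + lr * (err * g[i] - lam_latent * f[u]),
                          g[i] + lr * (err * f[u] - lam_latent * g[i]))
    return b_i  # intercept scores: the cross-partisan training signal

# Toy data: items 0 and 1 are each praised by one faction and panned by the
# other; item 2 is endorsed by everyone.
ratings = ([(u, 0, 1.0) for u in range(5)] + [(u, 0, -1.0) for u in range(5, 10)]
           + [(u, 1, -1.0) for u in range(5)] + [(u, 1, 1.0) for u in range(5, 10)]
           + [(u, 2, 1.0) for u in range(10)])
print(bridging_factorization(ratings, n_raters=10, n_items=3))
# Item 2, the only cross-partisan consensus item, gets the highest intercept.
```

Training the reward signal on b_i rather than on raw mean ratings is what makes the welfare function explicit: polarized items score near zero even when one faction rates them highly.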
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The trilemma formalizes why RLHF's implicit social choice is structurally problematic: achieving epsilon-representativeness (epsilon <= 0.01) and delta-robustness (delta <= 0.001) simultaneously requires super-polynomial compute, forcing systems to sacrifice representativeness for tractability. The implicitness of RLHF's social choice is therefore not just a transparency problem but a computational necessity.
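One plausible formalization, reconstructed from the thresholds quoted above (the notation is ours; the paper's formal definitions may differ):

```latex
% epsilon-representativeness: the learned reward \hat{r} tracks every
% subpopulation h's true reward r_h within epsilon.
\forall h \in H:\quad
  \bigl|\,\mathbb{E}_{x \sim h}[\hat{r}(x)] - \mathbb{E}_{x \sim h}[r_h(x)]\,\bigr|
  \le \varepsilon, \qquad \varepsilon \le 0.01

% delta-robustness: corrupting at most a delta-fraction of the preference
% data D moves the induced policy by at most O(delta) in total variation.
\lVert D' - D \rVert_0 \le \delta\,\lvert D \rvert
  \;\Longrightarrow\;
  d_{\mathrm{TV}}\bigl(\pi_{\hat{r}(D')}, \pi_{\hat{r}(D)}\bigr) \le O(\delta),
  \qquad \delta \le 0.001

% Trilemma: any estimator meeting both bounds requires
% \Omega(2^{d_{context}}) operations, so polynomial-time tractability
% must be sacrificed.
```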
---
Relevant Notes:


@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
The study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that a single reward function trained on one population systematically misaligns with others.
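A toy illustration of the max-min objective named in the title (a sketch of the objective only, with made-up numbers; not the authors' algorithm):

```python
import numpy as np

# Expected reward of three candidate policies under two groups' reward models.
# Rows: policies; columns: groups. Values are illustrative only.
R = np.array([[0.90, 0.40],   # policy 0: great for group 0, poor for group 1
              [0.45, 0.85],   # policy 1: the mirror image
              [0.70, 0.68]])  # policy 2: decent for both

weights = np.array([0.8, 0.2])  # group sizes: 80% majority, 20% minority
pooled = R @ weights            # single-reward RLHF: population-weighted average
maxmin = R.min(axis=1)          # MaxMin-RLHF: worst-off group's expected reward

print("pooled objective picks policy", pooled.argmax())   # -> 0 (majority wins)
print("max-min objective picks policy", maxmin.argmax())  # -> 2 (protects minority)
```

The behavioral divergence above is exactly what makes the two objectives come apart: when all groups' reward models agree, pooled and max-min select the same policy.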
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
The formal trilemma provides the theoretical foundation: preference collapse is proven to be computationally necessary, not an implementation choice. Single-reward RLHF cannot capture multimodal preferences even in principle, because representing multiple preference modes requires super-polynomially many parameters. Current systems collect 10^3-10^4 samples, while 10^7-10^8 are needed for global representation.
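A back-of-envelope Hoeffding calculation suggesting where figures of this magnitude can come from (our reconstruction, not the paper's derivation; the ~10^3 cell count is an assumption): estimating each demographic cell's preference rate to within the trilemma's epsilon at failure probability delta already lands in the 10^7 range.

```python
import math

eps, delta = 0.01, 0.001  # representativeness and robustness targets from above

# Hoeffding bound: n >= ln(2/delta) / (2 * eps^2) samples per cell to estimate
# a preference rate within +/- eps with probability >= 1 - delta.
n_per_cell = math.ceil(math.log(2 / delta) / (2 * eps**2))
print(f"samples per cell: {n_per_cell}")  # 38005

for n_cells in (1, 100, 1_000):
    print(f"{n_cells:>5} cells -> {n_cells * n_per_cell:.1e} samples total")
# 1 cell    -> 3.8e+04 : already above the 10^3-10^4 that pipelines collect
# 1000 cells -> 3.8e+07 : inside the paper's 10^7-10^8 estimate
```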
---
Relevant Notes:


@@ -0,0 +1,34 @@
{
  "rejected_claims": [
    {
      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 4,
    "rejected": 2,
    "fixes_applied": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
    ],
    "rejections": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-16"
}


@@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
processed_by: theseus
processed_date: 2026-03-16
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -56,3 +60,16 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
## Key Facts
- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
- Authors affiliated with Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern
- Core complexity bound: Omega(2^{d_context}) operations required for representativeness + robustness
- Current RLHF systems collect 10^3-10^4 samples from homogeneous pools
- True global representation requires 10^7-10^8 samples (3-4 orders of magnitude more)
- Representativeness threshold: epsilon <= 0.01
- Robustness threshold: delta <= 0.001
- Bias amplification: models assign >99% probability to majority opinions
- Strategic relaxation option 1: Focus on K << |H| core values (~30 universal principles)
- Paper does NOT reference Arrow's theorem despite structural similarity