Compare commits

...

3 commits

Author SHA1 Message Date
Leo
35cbd6f092 Merge branch 'main' into extract/2025-11-00-sahoo-rlhf-alignment-trilemma 2026-03-15 19:49:05 +00:00
Teleo Agents
c174c0c8c9 auto-fix: strip 2 broken wiki links
Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
2026-03-15 19:39:50 +00:00
Teleo Agents
b75d16c3e7 extract: 2025-11-00-sahoo-rlhf-alignment-trilemma
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-15 19:38:52 +00:00
4 changed files with 62 additions and 3 deletions

View file

@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot
The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.
### Additional Evidence (extend)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-15*
The trilemma formalizes why RLHF's implicit social choice is structurally inadequate: achieving representativeness requires 10^7-10^8 samples but current systems use 10^3-10^4 from homogeneous pools. The paper provides three strategic relaxation pathways: constrain to ~30 core values, scope robustness narrowly, or accept super-polynomial costs. This gives concrete parameters to the 'implicit social choice' critique.
---
Relevant Notes:

View file

@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
### Additional Evidence (confirm)
*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-15*
Sahoo et al. provide formal complexity-theoretic proof that single-reward RLHF requires Omega(2^{d_context}) operations to achieve epsilon <= 0.01 representativeness and delta <= 0.001 robustness simultaneously. Current systems collect 10^3-10^4 samples while 10^7-10^8 are needed for global representation — a four-order-of-magnitude practical gap. Preference collapse is proven to be a computational necessity, not an implementation bug.
---
Relevant Notes:

View file

@ -0,0 +1,35 @@
{
"rejected_claims": [
{
"filename": "rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md",
"issues": [
"missing_attribution_extractor"
]
},
{
"filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 2,
"kept": 0,
"fixed": 5,
"rejected": 2,
"fixes_applied": [
"rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:set_created:2026-03-15",
"rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:set_created:2026-03-15",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:current-language-models-escalate-to-nuclear-war-in-simulated"
],
"rejections": [
"rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:missing_attribution_extractor",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-15"
}

View file

@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: [collective-intelligence]
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -37,7 +41,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
## Agent Notes
**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim [[RLHF and DPO both fail at preference diversity]] is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim RLHF and DPO both fail at preference diversity is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
**What surprised me:** The paper does NOT directly reference Arrow's theorem despite the structural similarity. The trilemma is proven through complexity theory rather than social choice theory. This is an independent intellectual tradition arriving at a compatible impossibility result — strong convergent evidence.
@ -46,7 +50,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
**KB connections:**
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper FORMALIZES our existing claim
- [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from complexity theory
- [[scalable oversight degrades rapidly as capability gaps grow]] — the trilemma shows degradation is mathematically necessary
- scalable oversight degrades rapidly as capability gaps grow — the trilemma shows degradation is mathematically necessary
**Extraction hints:** Claims about (1) the formal alignment trilemma as impossibility result, (2) preference collapse / sycophancy / bias amplification as computational necessities, (3) the 10^3 vs 10^8 representation gap in current RLHF.
@ -56,3 +60,11 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
## Key Facts
- Current RLHF systems collect 10^3-10^4 samples from annotator pools
- True global representation requires 10^7-10^8 samples according to complexity analysis
- Models trained with RLHF assign >99% probability to majority opinions in documented cases
- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern