Merge branch 'main' into extract/2025-11-00-sahoo-rlhf-alignment-trilemma

auto-fix: strip 2 broken wiki links
Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
2026-03-15 19:49:05 +00:00 · 2026-03-15 19:39:50 +00:00 · 2026-03-15 19:38:52 +00:00
4 changed files with 62 additions and 3 deletions
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@ -27,6 +27,12 @@ This claim directly addresses the mechanism gap identified in [[RLHF and DPO bot

 The paper's proposed solution—RLCHF with explicit social welfare functions—connects to [[collective intelligence requires diversity as a structural precondition not a moral preference]] by formalizing how diverse evaluator input should be preserved rather than collapsed.

+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-15*
+
+The trilemma formalizes why RLHF's implicit social choice is structurally inadequate: achieving representativeness requires 10^7-10^8 samples but current systems use 10^3-10^4 from homogeneous pools. The paper provides three strategic relaxation pathways: constrain to ~30 core values, scope robustness narrowly, or accept super-polynomial costs. This gives concrete parameters to the 'implicit social choice' critique.
+
 ---

 Relevant Notes:
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 - GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
 - Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio

+
+### Additional Evidence (confirm)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-15*
+
+Sahoo et al. provide formal complexity-theoretic proof that single-reward RLHF requires Omega(2^{d_context}) operations to achieve epsilon <= 0.01 representativeness and delta <= 0.001 robustness simultaneously. Current systems collect 10^3-10^4 samples while 10^7-10^8 are needed for global representation — a four-order-of-magnitude practical gap. Preference collapse is proven to be a computational necessity, not an implementation bug.
+
 ---

 Relevant Notes:
--- a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@ -0,0 +1,35 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 5,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:set_created:2026-03-15",
+      "rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:set_created:2026-03-15",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:stripped_wiki_link:current-language-models-escalate-to-nuclear-war-in-simulated"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-achieve-representativeness-tractability-and-robustness-simultaneously.md:missing_attribution_extractor",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-trilemma-constraints.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-15"
+}
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-15
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@ -37,7 +41,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,

 ## Agent Notes

-**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim [[RLHF and DPO both fail at preference diversity]] is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.
+**Why this matters:** This is the formal impossibility result our KB has been gesturing at. Our claim RLHF and DPO both fail at preference diversity is an informal version of this trilemma. The formal result is stronger — it's not just that current implementations fail, it's that NO RLHF system can simultaneously achieve all three properties. This is analogous to the CAP theorem for distributed systems.

 **What surprised me:** The paper does NOT directly reference Arrow's theorem despite the structural similarity. The trilemma is proven through complexity theory rather than social choice theory. This is an independent intellectual tradition arriving at a compatible impossibility result — strong convergent evidence.

@ -46,7 +50,7 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 **KB connections:**
 - [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — this paper FORMALIZES our existing claim
 - [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]] — independent confirmation from complexity theory
- [[scalable oversight degrades rapidly as capability gaps grow]] — the trilemma shows degradation is mathematically necessary
+- scalable oversight degrades rapidly as capability gaps grow — the trilemma shows degradation is mathematically necessary

 **Extraction hints:** Claims about (1) the formal alignment trilemma as impossibility result, (2) preference collapse / sycophancy / bias amplification as computational necessities, (3) the 10^3 vs 10^8 representation gap in current RLHF.

@ -56,3 +60,11 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
+
+
+## Key Facts
+- Current RLHF systems collect 10^3-10^4 samples from annotator pools
+- True global representation requires 10^7-10^8 samples according to complexity analysis
+- Models trained with RLHF assign >99% probability to majority opinions in documented cases
+- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
+- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
Author	SHA1	Message	Date
Leo	35cbd6f092	Merge branch 'main' into extract/2025-11-00-sahoo-rlhf-alignment-trilemma	2026-03-15 19:49:05 +00:00
Teleo Agents	c174c0c8c9	auto-fix: strip 2 broken wiki links Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.	2026-03-15 19:39:50 +00:00
Teleo Agents	b75d16c3e7	extract: 2025-11-00-sahoo-rlhf-alignment-trilemma Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>	2026-03-15 19:38:52 +00:00