extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-16 15:31:50 +00:00
commit 4781180de9
parent af067944f1
4 changed files with 62 additions and 1 deletions
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -45,6 +45,12 @@ Comprehensive February 2026 survey by An & Du documents that contemporary ML sys

 EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation based on egalitarian fairness principles, demonstrating that pluralistic alignment requires choosing a specific social welfare function (here: maximin regret) rather than pretending aggregation is value-neutral.

+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The trilemma formalizes why implicit social choice in RLHF is problematic: computational constraints force a strategic relaxation of at least one of representativeness, robustness, or tractability. Current RLHF implementations implicitly prioritize tractability, and in practice they give up ground on both representativeness (homogeneous annotator pools) and robustness (vulnerability to distribution shift). The trilemma makes the normative choice explicit: which property are we willing to sacrifice?
+
 ---

 Relevant Notes:
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -45,6 +45,12 @@ An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doi

 EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining WHY single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.

+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The alignment gap is not just proportional to minority distinctiveness; the cost of closing it is super-polynomial in context dimensionality. Sahoo et al. prove that achieving epsilon <= 0.01 representativeness and delta <= 0.001 robustness requires Omega(2^{d_context}) operations. Current systems use 10^3-10^4 samples, while 10^7-10^8 are needed for global representation. Because the required computation grows exponentially with the dimensionality of human values, the gap cannot be closed through incremental improvements.
+
 ---

 Relevant Notes:
--- a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,36 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 6,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:set_created:2026-03-16",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
+
+
+## Key Facts
+- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
+- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
+- Current RLHF systems collect 10^3-10^4 samples from annotator pools
+- True global representation would require 10^7-10^8 samples
+- Models assign >99% probability to majority opinions in current implementations
+- Paper proposes three strategic relaxation pathways: constrain representativeness to ~30 core values, scope robustness to plausible threats, or accept super-polynomial costs for high-stakes applications