extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-16 14:05:48 +00:00
commit 0df5a39824
parent 0447403656
4 changed files with 61 additions and 1 deletions
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c

 RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.

+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The alignment trilemma provides a formal, complexity-theoretic foundation for why RLHF's implicit social choice is problematic: computational constraints force a system to give up at least one of representativeness (minority values are excluded), tractability (the system becomes computationally infeasible), or robustness (behavior fails under distribution shift). The paper proves that this trade-off is a mathematical necessity, not a design choice.
+
 ---

 Relevant Notes:
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm

 Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.

+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The alignment trilemma formalizes why single-reward RLHF fails: achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) requires a super-polynomial number of operations. Current systems have a representation gap of 3-4 orders of magnitude (10^3-10^4 samples collected vs. 10^7-10^8 needed for global representation). Preference collapse is proven to be a computational necessity, not an implementation bug.
+
 ---

 Relevant Notes:
--- a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,36 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 6,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
@@ -56,3 +60,11 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
+
+
+## Key Facts
+- Current RLHF systems collect 10^3-10^4 samples from homogeneous annotator pools
+- True global representation would require 10^7-10^8 samples
+- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
+- Authors affiliated with Berkeley AI Safety Initiative, AWS, Meta, Stanford, and Northeastern
+- Three strategic relaxation pathways proposed: constrain to ~30 core values, scope robustness narrowly, or accept super-polynomial costs
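
Reviewer note: the sample counts in Key Facts can be turned into an order-of-magnitude gap with a one-line logarithm. The sketch below is only an illustration of that arithmetic, not code from the paper or from this repository; the function name `gap_orders` and the variable names are hypothetical, and the ranges are the 10^3-10^4 and 10^7-10^8 figures quoted above.

```python
import math


def gap_orders(collected: float, needed: float) -> float:
    """Return the gap between needed and collected samples in orders of magnitude."""
    return math.log10(needed / collected)


# Ranges quoted in the Key Facts above (assumed to be sample counts).
collected_range = (1e3, 1e4)  # samples current RLHF systems collect
needed_range = (1e7, 1e8)     # samples cited for global representation

# Narrowest gap: most samples collected vs. fewest needed.
narrowest = gap_orders(collected_range[1], needed_range[0])
# Widest gap: fewest samples collected vs. most needed.
widest = gap_orders(collected_range[0], needed_range[1])
# Matched endpoints: low vs. low and high vs. high.
matched = [gap_orders(c, n) for c, n in zip(collected_range, needed_range)]

print(f"narrowest={narrowest:.1f}, widest={widest:.1f}, matched={matched}")
```

Depending on which endpoints are paired, the gap comes out between 3 and 5 orders of magnitude, with 4 for matched endpoints; the "3-4 orders of magnitude" in the enrichment text sits inside that span.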