extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
2026-03-16 14:51:10 +00:00 · 4c39e34e6f
commit 4c39e34e6f
parent a5e1e96dba
4 changed files with 63 additions and 1 deletions
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -39,6 +39,12 @@ RLCF makes the social choice mechanism explicit through the bridging algorithm (
 Comprehensive February 2026 survey by An & Du documents that contemporary ML systems implement social choice mechanisms implicitly across RLHF, participatory budgeting, and liquid democracy applications, with 18 identified open problems spanning incentive guarantees and pluralistic preference aggregation.
 ### Additional Evidence (extend)
 *Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
 The trilemma formalizes why RLHF's implicit social choice is problematic: achieving epsilon-representativeness (epsilon <= 0.01) and delta-robustness (delta <= 0.001) simultaneously requires super-polynomial compute, making the 'strategic relaxation' of representativeness a practical necessity that RLHF implementations make without explicit acknowledgment.
 ---
 Relevant Notes:
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -39,6 +39,12 @@ Study demonstrates that models trained on different demographic populations show
 An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doing social choice (preference aggregation) but treating it as an engineering detail rather than a normative design choice, which means the aggregation function is chosen implicitly and without examination of which fairness criteria it satisfies.
 ### Additional Evidence (extend)
 *Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
 The formal trilemma proof shows preference collapse is not just empirically observed but mathematically necessary: single-reward RLHF cannot capture multimodal preferences even in theory. The paper quantifies the practical gap: current systems use 10^3-10^4 samples from homogeneous pools while 10^7-10^8 samples are needed for global representation, a shortfall of 3-4 orders of magnitude.
 ---
 Relevant Notes:
--- a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,37 @@
 {
  "rejected_claims": [
    {
      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 7,
    "rejected": 2,
    "fixes_applied": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
    ],
    "rejections": [
      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-16"
 }
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
 processed_by: theseus
 processed_date: 2026-03-16
 enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 ## Content
@@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
 ## Key Facts
 - Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
 - Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
 - Current RLHF systems collect 10^3-10^4 samples from annotator pools
 - True global representation would require 10^7-10^8 samples
 - Models assign >99% probability to majority opinions in documented cases
 - Three strategic relaxation pathways proposed: constrain representativeness to ~30 core values, scope robustness narrowly to plausible threats, or accept super-polynomial costs for high-stakes applications
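
The sample-count gap listed in the Key Facts above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not anything taken from the paper: the function names `matched_shortfall` and `widest_shortfall` are hypothetical, and the two ways of pairing the range endpoints are assumptions made purely for illustration. Only the ranges themselves (10^3-10^4 collected, 10^7-10^8 required) come from the notes.

```python
import math

# Sample-count ranges quoted in the Key Facts (the only values taken from the notes).
CURRENT_SAMPLES = (10**3, 10**4)    # what current RLHF systems collect
REQUIRED_SAMPLES = (10**7, 10**8)   # what global representation would require


def matched_shortfall(current, required):
    """Shortfall in orders of magnitude, pairing low with low and high with high.

    Assumption for illustration: the ranges scale together.
    """
    return tuple(math.log10(r / c) for c, r in zip(current, required))


def widest_shortfall(current, required):
    """Smallest and largest shortfall if any point in one range may pair with any in the other."""
    smallest = math.log10(required[0] / current[1])
    largest = math.log10(required[1] / current[0])
    return smallest, largest


if __name__ == "__main__":
    print("matched endpoints:", matched_shortfall(CURRENT_SAMPLES, REQUIRED_SAMPLES))
    print("widest pairing:   ", widest_shortfall(CURRENT_SAMPLES, REQUIRED_SAMPLES))
```

Under the matched-endpoint assumption the gap works out to about four orders of magnitude at both ends; under the widest pairing it spans roughly three to five. Which pairing is appropriate depends on how the paper derives its ranges, which these notes do not state.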