From f803306f472eedfc2110600576cc4ef7a08dd97d Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Mon, 16 Mar 2026 15:38:56 +0000 Subject: [PATCH] extract: 2025-11-00-sahoo-rlhf-alignment-trilemma Pentagon-Agent: Ganymede --- ...ocial-choice-without-normative-scrutiny.md | 6 ++++ ...roportional-to-minority-distinctiveness.md | 6 ++++ ...5-11-00-sahoo-rlhf-alignment-trilemma.json | 36 +++++++++++++++++++ ...025-11-00-sahoo-rlhf-alignment-trilemma.md | 14 +++++++- 4 files changed, 61 insertions(+), 1 deletion(-) create mode 100644 inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md index 611e8b364..66cb8b1a8 100644 --- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md +++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md @@ -45,6 +45,12 @@ Comprehensive February 2026 survey by An & Du documents that contemporary ML sys EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation based on egalitarian fairness principles, demonstrating that pluralistic alignment requires choosing a specific social welfare function (here: maximin regret) rather than pretending aggregation is value-neutral. + +### Additional Evidence (extend) +*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16* + +The trilemma formalizes why RLHF's implicit social choice is problematic: the system must choose between representativeness, tractability, and robustness, but this choice is hidden in hyperparameters and training procedures rather than made explicit. Current RLHF implementations sacrifice representativeness for tractability without acknowledging this is a normative choice about whose values matter. + --- Relevant Notes: diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md index c6ab6f2bf..584c39dc2 100644 --- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md +++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md @@ -45,6 +45,12 @@ An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doi EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining WHY single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary. + +### Additional Evidence (confirm) +*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16* + +The paper gives a formal proof that achieving epsilon <= 0.01 representativeness requires 10^7-10^8 samples for global populations, while current systems use 10^3-10^4 samples from homogeneous pools, a 3-4 order-of-magnitude gap. The paper also quantifies the alignment gap: models assign >99% probability to majority opinions, functionally erasing minority perspectives.
This is not fixable through better sampling because the computational complexity is super-polynomial. + --- Relevant Notes: diff --git a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json new file mode 100644 index 000000000..e4464db06 --- /dev/null +++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json @@ -0,0 +1,36 @@ +{ + "rejected_claims": [ + { + "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md", + "issues": [ + "missing_attribution_extractor" + ] + }, + { + "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md", + "issues": [ + "missing_attribution_extractor" + ] + } + ], + "validation_stats": { + "total": 2, + "kept": 0, + "fixed": 6, + "rejected": 2, + "fixes_applied": [ + "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16", + "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr", + "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-", + "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16", + "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a", + "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:rlhf-is-implicit-social-choice-without-normative-scrutiny.md" + ], + "rejections": [ + "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor", + "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-16" +} \ No newline at end of file diff --git a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md index 17c59596c..03c63b36c 100644 --- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md +++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md @@ -7,9 +7,13 @@ date: 2025-11-01 domain: ai-alignment secondary_domains: [collective-intelligence] format: paper -status: unprocessed +status: enrichment priority: high tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy] +processed_by: theseus +processed_date: 2026-03-16 +enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", 
"rlhf-is-implicit-social-choice-without-normative-scrutiny.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -56,3 +60,11 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing + + +## Key Facts +- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models +- Authors affiliated with Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern +- Current RLHF systems collect 10^3-10^4 samples from homogeneous annotator pools +- True global representation requires 10^7-10^8 samples +- Models trained with standard RLHF assign >99% probability to majority opinions