From 5c758113bfb67837af68c15a4318e9c9edbcc28c Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Mon, 16 Mar 2026 11:36:50 +0000
Subject: [PATCH] extract: 2025-11-00-sahoo-rlhf-alignment-trilemma

Pentagon-Agent: Ganymede
---
 ...here-vulnerable-populations-concentrate.md |  6 ++++
 ...ocial-choice-without-normative-scrutiny.md |  6 ++++
 ...roportional-to-minority-distinctiveness.md |  6 ++++
 ...5-11-00-sahoo-rlhf-alignment-trilemma.json | 34 +++++++++++++++++++
 ...025-11-00-sahoo-rlhf-alignment-trilemma.md | 19 ++++++++++-
 5 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json

diff --git a/domains/ai-alignment/machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md b/domains/ai-alignment/machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md
index f8ccda6e9..0ac90d2ea 100644
--- a/domains/ai-alignment/machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md
+++ b/domains/ai-alignment/machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md
@@ -30,6 +30,12 @@ This claim rests on a single source—a research strategy document rather than e
 - Ensemble methods or mixture models can capture diverse subpopulations
 - The outlier-erasure effect is implementation-dependent rather than fundamental
 
+
+### Additional Evidence (confirm)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+Bias amplification in RLHF is proven to be a computational necessity: sample-efficiency constraints force overfitting to the majority distribution, so models assign >99% probability to majority opinions, functionally erasing minority perspectives. This is not a correctable bias but a direct consequence of choosing tractability over representativeness in the trilemma.
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index dc59e9565..b9d951d8c 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c
 RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The trilemma formalizes why RLHF's implicit social choice is structurally problematic: achieving epsilon-representativeness (epsilon <= 0.01) and delta-robustness (delta <= 0.001) simultaneously requires super-polynomial compute, forcing systems to sacrifice representativeness for tractability. This makes the 'implicit' nature of RLHF's social choice not just a transparency problem but a computational necessity.
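+
+A toy calculation makes the compute barrier concrete (a sketch only; the per-cell constant and the cell decomposition are illustrative assumptions, not the paper's derivation):
+
+```python
+# Toy model of the representativeness/tractability tension. Assumes (not
+# from the paper) that contexts split into 2^d_context cells, per the
+# Omega(2^{d_context}) bound, and that epsilon-representativeness needs
+# roughly 1/epsilon^2 preference comparisons per cell.
+def samples_needed(d_context: int, eps: float) -> float:
+    cells = 2 ** d_context   # distinct context cells to cover
+    per_cell = 1 / eps ** 2  # concentration-style per-cell estimate
+    return cells * per_cell
+
+for d in (1, 4, 10):
+    print(d, f"{samples_needed(d, 0.01):.1e}")
+# d=1 -> 2.0e+04, d=4 -> 1.6e+05, d=10 -> 1.0e+07: already far past the
+# 10^3-10^4 samples current pipelines collect, heading toward 10^7-10^8.
+```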
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index a19a82ade..e03a8e7b0 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The formal trilemma provides the theoretical foundation: preference collapse is proven to be computationally necessary, not an implementation choice. Single-reward RLHF cannot capture multimodal preferences even in theory, because representing multiple preference modes requires a super-polynomial number of parameters. Current systems collect 10^3-10^4 samples, while 10^7-10^8 are needed for global representation.
+
 ---
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
new file mode 100644
index 000000000..e96e05da8
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,34 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 4,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:set_created:2026-03-16",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
"rlhf-pathologies-are-computational-necessities-not-implementation-bugs-because-preference-collapse-sycophancy-and-bias-amplification-follow-from-the-trilemma.md:missing_attribution_extractor" + ] + }, + "model": "anthropic/claude-sonnet-4.5", + "date": "2026-03-16" +} \ No newline at end of file diff --git a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md index 17c59596c..71e642548 100644 --- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md +++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md @@ -7,9 +7,13 @@ date: 2025-11-01 domain: ai-alignment secondary_domains: [collective-intelligence] format: paper -status: unprocessed +status: enrichment priority: high tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy] +processed_by: theseus +processed_date: 2026-03-16 +enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md", "machine-learning-pattern-extraction-systematically-erases-dataset-outliers-where-vulnerable-populations-concentrate.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -56,3 +60,16 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing + + +## Key Facts +- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models +- Authors affiliated with Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern +- Core complexity bound: Omega(2^{d_context}) operations required for representativeness + robustness +- Current RLHF systems collect 10^3-10^4 samples from homogeneous pools +- True global representation requires 10^7-10^8 samples (3-4 orders of magnitude more) +- Representativeness threshold: epsilon <= 0.01 +- Robustness threshold: delta <= 0.001 +- Bias amplification: models assign >99% probability to majority opinions +- Strategic relaxation option 1: Focus on K << |H| core values (~30 universal principles) +- Paper does NOT reference Arrow's theorem despite structural similarity