diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index 611e8b364..dfed044a0 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -45,6 +45,12 @@ Comprehensive February 2026 survey by An & Du documents that contemporary ML sys
 EM-DPO makes the social choice function explicit by using MinMax Regret Aggregation based on egalitarian fairness principles, demonstrating that pluralistic alignment requires choosing a specific social welfare function (here: maximin regret) rather than pretending aggregation is value-neutral.
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The trilemma formalizes why implicit social choice in RLHF is problematic: computational constraints force strategic relaxation of at least one of representativeness, robustness, or tractability. Current RLHF implementations implicitly choose tractability, which mathematically necessitates sacrificing representativeness (homogeneous annotator pools) and robustness (vulnerability to distribution shift). This framing makes the normative choice explicit: which property are we willing to sacrifice?
+
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index c6ab6f2bf..e4fa50e25 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -45,6 +45,12 @@ An & Du's survey reveals the mechanism behind single-reward failure: RLHF is doi
 EM-DPO provides formal proof that binary comparisons are mathematically insufficient for preference type identification, explaining WHY single-reward RLHF fails: the training signal format cannot contain the information needed to discover heterogeneity, regardless of dataset size. Rankings over 3+ responses are necessary.
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The alignment gap is not just proportional to minority distinctiveness; it is super-polynomial in context dimensionality. Sahoo et al. prove that achieving epsilon <= 0.01 representativeness and delta <= 0.001 robustness requires Omega(2^{d_context}) operations. Current systems use 10^3-10^4 samples while 10^7-10^8 are needed for global representation. The gap compounds exponentially with the dimensionality of human values, making it structurally impossible to close through incremental improvements.
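+
+A minimal sketch of the scale mismatch (ours, not from the paper). Assumptions for illustration only: d_context = 40, and one annotation sample is conflated with one operation so the budgets can be compared against the bound at all:
+
+```python
+# Compare the trilemma's Omega(2^{d_context}) lower bound against the
+# annotation budgets cited above. d_context = 40 is a hypothetical value.
+
+def required_ops(d_context: int) -> int:
+    """Lower bound on operations for epsilon <= 0.01 representativeness
+    and delta <= 0.001 robustness, per the trilemma."""
+    return 2 ** d_context
+
+need = required_ops(40)                # ~1.1e12
+for budget in (10**3, 10**4, 10**7):   # current low, current high, "global" low end
+    print(f"budget 10^{len(str(budget)) - 1}: covers {budget / need:.1e} of the bound")
+# Output: 9.1e-10, 9.1e-09, 9.1e-06 -- even the 10^7-sample "global
+# representation" budget covers under 0.001% of the bound, which is why
+# the gap is closed by relaxing a property, not by collecting more annotations.
+```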
+
 ---
 
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
new file mode 100644
index 000000000..09b11b8b9
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,36 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 6,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:set_created:2026-03-16",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:single-reward-rlhf-cannot-align-diverse-preferences-because-",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}
\ No newline at end of file
diff --git a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
index 17c59596c..ac1a25cd6 100644
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
+
+
+## Key Facts
+- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
+- Authors affiliated with Berkeley AI Safety Initiative, AWS, Stanford, Meta, and Northeastern
+- Current RLHF systems collect 10^3-10^4 samples from annotator pools
+- True global representation would require 10^7-10^8 samples
+- Models assign >99% probability to majority opinions in current implementations (see the sketch after this list)
+- Paper proposes three strategic relaxation pathways: constrain representativeness to ~30 core values, scope robustness to plausible threats, or accept super-polynomial costs for high-stakes applications
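+
+A minimal sketch (ours, not from the paper) of how the >99% collapse can arise: fit a Bradley-Terry reward to a 70/30 annotator split, then apply a KL-regularized policy pi(y) proportional to ref(y) * exp(r(y)/beta). The 70/30 split, the uniform reference policy, and the beta values are all assumptions chosen for illustration.
+
+```python
+import math
+
+maj = 0.70                              # fraction of annotators preferring answer A
+reward_gap = math.log(maj / (1 - maj))  # Bradley-Terry MLE for r_A - r_B, ~0.85
+
+# KL-regularized policy over {A, B} with a uniform reference:
+# pi(A) = sigmoid((r_A - r_B) / beta); smaller beta means a weaker KL penalty.
+for beta in (1.0, 0.3, 0.1):
+    p_majority = 1 / (1 + math.exp(-reward_gap / beta))
+    print(f"beta={beta}: policy puts {p_majority:.4f} on the majority answer")
+# beta=1.0 -> 0.7000, beta=0.3 -> 0.9440, beta=0.1 -> 0.9998:
+# a 70/30 disagreement collapses past 99% certainty as the KL penalty weakens.
+```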