diff --git a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
index dc59e9565..9b81783fc 100644
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@@ -33,6 +33,12 @@ The paper's proposed solution—RLCHF with explicit social welfare functions—c
 
 RLCF makes the social choice mechanism explicit through the bridging algorithm (matrix factorization with intercept scores). Unlike standard RLHF which aggregates preferences opaquely through reward model training, RLCF's use of intercepts as the training signal is a deliberate choice to optimize for cross-partisan agreement—a specific social welfare function.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+The trilemma formalizes why implicit social choice in RLHF fails: the computational constraint (polynomial tractability) forces systems to sacrifice either representativeness or robustness, making the social choice function structurally biased regardless of normative framework. Strategic relaxation pathways include constraining to K << |H| 'core' values (~30 universal principles) or accepting super-polynomial costs for high-stakes applications.
+
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index a19a82ade..4ac6b3e11 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -33,6 +33,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 
 Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.
 
+
+### Additional Evidence (extend)
+*Source: [[2025-11-00-sahoo-rlhf-alignment-trilemma]] | Added: 2026-03-16*
+
+Formal complexity bound: achieving epsilon <= 0.01 representativeness and delta <= 0.001 robustness requires 10^7-10^8 samples for global populations, while current systems use 10^3-10^4 samples from homogeneous pools — a 3-4 order of magnitude gap. The alignment gap is not just proportional to minority distinctiveness but grows super-polynomially with context dimensionality.
+
 ---
 
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
new file mode 100644
index 000000000..e02745b20
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-11-00-sahoo-rlhf-alignment-trilemma.json
@@ -0,0 +1,34 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 4,
+    "rejected": 2,
+    "fixes_applied": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:set_created:2026-03-16",
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:stripped_wiki_link:universal-alignment-is-mathematically-impossible-because-Arr",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:set_created:2026-03-16",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:stripped_wiki_link:emergent-misalignment-arises-naturally-from-reward-hacking-a"
+    ],
+    "rejections": [
+      "rlhf-alignment-trilemma-proves-no-system-can-simultaneously-achieve-representativeness-tractability-and-robustness.md:missing_attribution_extractor",
+      "rlhf-pathologies-are-computational-necessities-not-implementation-bugs.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-16"
+}
\ No newline at end of file
diff --git a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
index 17c59596c..6a45cc337 100644
--- a/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
+++ b/inbox/archive/2025-11-00-sahoo-rlhf-alignment-trilemma.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: [collective-intelligence]
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [alignment-trilemma, impossibility-result, rlhf, representativeness, robustness, tractability, preference-collapse, sycophancy]
+processed_by: theseus
+processed_date: 2026-03-16
+enrichments_applied: ["single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "rlhf-is-implicit-social-choice-without-normative-scrutiny.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -56,3 +60,12 @@ Position paper from Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford,
 PRIMARY CONNECTION: [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]]
 WHY ARCHIVED: Formalizes our informal impossibility claim with complexity-theoretic proof — independent confirmation of Arrow's-theorem-based argument from a different mathematical tradition
 EXTRACTION HINT: The trilemma is the key claim. Also extract the practical gap (10^3 vs 10^8) and the "pathologies as computational necessities" framing
+
+
+## Key Facts
+- Paper presented at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models
+- Authors affiliated with Berkeley AI Safety Initiative, AWS/Stanford, Meta/Stanford, and Northeastern
+- Current RLHF systems collect 10^3-10^4 samples from annotator pools
+- True global representation would require 10^7-10^8 samples
+- Models assign >99% probability to majority opinions in documented cases
+- Paper proposes three strategic relaxation pathways: constrain representativeness to ~30 core values, scope robustness narrowly, or accept super-polynomial costs