diff --git a/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md b/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md
index fb79aba8..addaf7fa 100644
--- a/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md
+++ b/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md
@@ -19,6 +19,12 @@ Since [[democratic alignment assemblies produce constitutions as effective as ex
 Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], community-centred norm elicitation is a concrete mechanism for ensuring the structural diversity that collective alignment requires. Without it, alignment defaults to the values of whichever demographic builds the systems.
+
+### Additional Evidence (confirm)
+*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
+
+Empirical study with 27,375 ratings from 1,095 participants shows that demographic composition of training data produces 3-5 percentage point differences in model behavior across emotional awareness and toxicity dimensions. This quantifies the magnitude of difference between community-sourced and developer-specified alignment targets.
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
index b587b34f..a19a82ad 100644
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@@ -27,6 +27,18 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
 - GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
 - Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
+
+### Additional Evidence (confirm)
+*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
+
+Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.
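+
+A minimal sketch of why the majority ratio matters, with notation mine rather than the paper's exact statement: single-reward RLHF effectively maximizes a population-weighted average of group rewards, so at 10:1 the majority term dominates the objective, whereas MaxMin-RLHF maximizes the worst-off group's expected reward:
+
+$$\max_{\pi} \; \sum_{h} w_h \, \mathbb{E}_{y \sim \pi}\!\left[ r_h(x, y) \right] \;\; \text{(single reward)} \qquad \text{vs.} \qquad \max_{\pi} \; \min_{h} \, \mathbb{E}_{y \sim \pi}\!\left[ r_h(x, y) \right] \;\; \text{(MaxMin)}$$
+
+Under the averaged objective, the minority's shortfall grows as its weight $w_h$ shrinks and as its reward $r_h$ diverges from the majority's, which is the proportionality in this note's title.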
+
 ---
 Relevant Notes:
diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
index cee8fafc..74c0b306 100644
--- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
+++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md
@@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi
 [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
+
+### Additional Evidence (confirm)
+*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
+
+Systematic variation of demographic composition in alignment training produced persistent behavioral differences across Liberal/Conservative, White/Black, and Female/Male populations, suggesting these reflect genuine value differences rather than information asymmetries that could be resolved.
+
 ---
 Relevant Notes:
diff --git a/inbox/archive/.extraction-debug/2025-11-00-operationalizing-pluralistic-values-llm-alignment.json b/inbox/archive/.extraction-debug/2025-11-00-operationalizing-pluralistic-values-llm-alignment.json
new file mode 100644
index 00000000..0a7030da
--- /dev/null
+++ b/inbox/archive/.extraction-debug/2025-11-00-operationalizing-pluralistic-values-llm-alignment.json
@@ -0,0 +1,24 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 1,
+    "kept": 0,
+    "fixed": 1,
+    "rejected": 1,
+    "fixes_applied": [
+      "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:set_created:2026-03-15"
+    ],
+    "rejections": [
+      "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-15"
+}
\ No newline at end of file
diff --git a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
index 55cab0e4..9998b09a 100644
--- a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
+++ b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md
@@ -7,9 +7,13 @@ date: 2025-11-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: enrichment
 priority: high
 tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
+processed_by: theseus
+processed_date: 2026-03-15
+enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -37,3 +41,11 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al
 PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
 WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
 EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training
+
+
+## Key Facts
+- Study included 27,375 ratings from 1,095 participants
+- Models fine-tuned on Liberal feedback showed 5.0 percentage point improvement over Conservative baseline
+- Models fine-tuned on White feedback showed 4.7 percentage point improvement over Black baseline
+- Models fine-tuned on Female feedback showed 3.4 percentage point improvement over Male baseline
+- Effects measured across emotional awareness and toxicity dimensions