Merge pull request 'extract: 2025-11-00-operationalizing-pluralistic-values-llm-alignment' (#1010) from extract/2025-11-00-operationalizing-pluralistic-values-llm-alignment into main

Leo 2026-03-15 20:28:17 +00:00
commit 0bc5544adf
5 changed files with 59 additions and 5 deletions

@@ -19,6 +19,12 @@ Since [[democratic alignment assemblies produce constitutions as effective as ex
Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], community-centred norm elicitation is a concrete mechanism for ensuring the structural diversity that collective alignment requires. Without it, alignment defaults to the values of whichever demographic builds the systems.
### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
Empirical study with 27,375 ratings from 1,095 participants shows that demographic composition of training data produces 3-5 percentage point differences in model behavior across emotional awareness and toxicity dimensions. This quantifies the magnitude of difference between community-sourced and developer-specified alignment targets.
---
Relevant Notes:

@@ -27,6 +27,12 @@ Chakraborty, Qiu, Yuan, Koppel, Manocha, Huang, Bedi, Wang. "MaxMin-RLHF: Alignm
- GPT-2 experiment: single RLHF achieved positive sentiment but ignored conciseness
- Tulu2-7B experiment: minority group accuracy dropped from 70.4% to 42% at 10:1 ratio
### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
Study demonstrates that models trained on different demographic populations show measurable behavioral divergence (3-5 percentage points), providing empirical evidence that single-reward functions trained on one population systematically misalign with others.
---
Relevant Notes:

@@ -11,15 +11,21 @@ source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingI
Not all disagreement is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously.
[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.
Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases.
This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments.
The correct response is to map the disagreement rather than eliminate it. Identify the common ground. Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose.
[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus.
Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus.
[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
### Additional Evidence (confirm)
*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-15*
Systematic variation of demographic composition in alignment training produced persistent behavioral differences across Liberal/Conservative, White/Black, and Female/Male populations, suggesting these reflect genuine value differences rather than information asymmetries that could be resolved.
---

@@ -0,0 +1,24 @@
{
"rejected_claims": [
{
"filename": "demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 1,
"rejected": 1,
"fixes_applied": [
"demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:set_created:2026-03-15"
],
"rejections": [
"demographic-composition-of-alignment-training-data-produces-measurable-behavioral-differences-in-model-outputs.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-15"
}

@@ -7,9 +7,13 @@ date: 2025-11-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: enrichment
priority: high
tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback]
processed_by: theseus
processed_date: 2026-03-15
enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -29,7 +33,7 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al
**Why this matters:** First large-scale empirical study varying DEMOGRAPHIC COMPOSITION of alignment training data. Proves that the composition question (whose preferences?) has measurable, quantitative effects on model behavior.
**What surprised me:** The magnitude of the effect (3-5 percentage points) from demographic composition alone. This is not a subtle effect.
**What I expected but didn't find:** Couldn't access full paper. Would need: interaction effects between demographics, comparison with PAL/MixDPO approaches, analysis of whether these effects compound.
**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]].
**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms some disagreements are permanently irreducible because they stem from genuine value differences not information gaps.
**Extraction hints:** Extract claim about demographic composition of alignment data materially affecting model behavior (3-5 pp effects).
**Context:** 1,095 participants is a large N for alignment research. Real human feedback, not synthetic.
@@ -37,3 +41,11 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al
PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules
WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern
EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training
## Key Facts
- Study included 27,375 ratings from 1,095 participants
- Models fine-tuned on Liberal feedback showed 5.0 percentage point improvement over Conservative baseline
- Models fine-tuned on White feedback showed 4.7 percentage point improvement over Black baseline
- Models fine-tuned on Female feedback showed 3.4 percentage point improvement over Male baseline
- Effects measured across emotional awareness and toxicity dimensions