From ab9d435dad651f3b7cfcbd70063f5610b2bdab95 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Thu, 12 Mar 2026 14:09:46 +0000 Subject: [PATCH 1/2] theseus: extract from 2025-11-00-operationalizing-pluralistic-values-llm-alignment.md - Source: inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md - Domain: ai-alignment - Extracted by: headless extraction cron (worker 6) Pentagon-Agent: Theseus --- ...ifferent from developer-specified rules.md | 6 +++ ...avior-with-3-5-percentage-point-effects.md | 38 +++++++++++++++++++ ...ems must map rather than eliminate them.md | 6 +++ ...lizing-pluralistic-values-llm-alignment.md | 16 +++++++- 4 files changed, 65 insertions(+), 1 deletion(-) create mode 100644 domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md diff --git a/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md b/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md index fb79aba86..d1c4dd67b 100644 --- a/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md +++ b/domains/ai-alignment/community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md @@ -19,6 +19,12 @@ Since [[democratic alignment assemblies produce constitutions as effective as ex Since [[collective intelligence requires diversity as a structural precondition not a moral preference]], community-centred norm elicitation is a concrete mechanism for ensuring the structural diversity that collective alignment requires. Without it, alignment defaults to the values of whichever demographic builds the systems. 
+ +### Additional Evidence (confirm) +*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5* + +Empirical study with 1,095 participants and 27,375 ratings demonstrates that demographic composition of training data produces 3-5 percentage point differences in model behavior across emotional awareness and toxicity dimensions. Models fine-tuned on Liberal, White, and Female feedback showed improvements of 5.0, 4.7, and 3.4 percentage points respectively relative to Conservative, Black, and Male baselines. This quantifies the magnitude of difference between norms elicited from different communities and provides evidence that 'whose preferences' is a measurable variable, not just a theoretical concern. + --- Relevant Notes: diff --git a/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md b/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md new file mode 100644 index 000000000..488691ac7 --- /dev/null +++ b/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md @@ -0,0 +1,38 @@ +--- +type: claim +domain: ai-alignment +description: "Empirical study with 1,095 participants shows demographic composition of alignment training data produces 3-5 percentage point differences in model behavior" +confidence: likely +source: "arXiv 2511.14476, 27,375 ratings from 1,095 participants" +created: 2026-03-11 +--- + +# Demographic composition of alignment training data materially affects model behavior with 3-5 percentage point effects + +Systematic variation in the demographic composition of human feedback used for LLM alignment produces measurable, quantitative differences in model behavior.
This is the first large-scale empirical study (N=1,095 participants, 27,375 ratings) that jointly varied demographic composition and technical design in alignment training. + +## Evidence + +Models fine-tuned on feedback from specific demographic groups showed consistent performance differences relative to other demographic baselines: +- Models fine-tuned on Liberal feedback: +5.0 percentage points vs Conservative baseline +- Models fine-tuned on White feedback: +4.7 percentage points vs Black baseline +- Models fine-tuned on Female feedback: +3.4 percentage points vs Male baseline +- Effects measured across emotional awareness and toxicity dimensions +- Study jointly varied demographic composition AND technical design parameters + +## Significance + +The magnitude of these effects—3 to 5 percentage points from demographic composition alone—demonstrates that "whose preferences" is not merely a fairness concern but a quantitatively important variable in alignment outcomes. This provides empirical grounding for the theoretical concern that single-population alignment training systematically encodes the values of that population into model behavior. + +When alignment researchers treat "human feedback" as a uniform commodity rather than a demographically-situated sample, they are making an implicit choice about whose values the model will reflect. The study shows this choice has measurable consequences for model outputs. 
+ +--- + +Relevant Notes: +- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] +- [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]] +- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] +- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] + +Topics: +- [[domains/ai-alignment/_map]] diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index cee8fafcd..5393c3636 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. 
The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. + +### Additional Evidence (confirm) +*Source: [[2025-11-00-operationalizing-pluralistic-values-llm-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5* + +Large-scale empirical study (N=1,095) shows that alignment training on different demographic populations produces systematically different model behaviors with 3-5 percentage point effects. The fact that Liberal vs Conservative, White vs Black, and Female vs Male training populations produce measurably different alignment outcomes on the same technical architecture suggests that these differences reflect genuine value variation, not information gaps that could be resolved through better training. + --- Relevant Notes: diff --git a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md index 55cab0e49..a6ecb0773 100644 --- a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md +++ b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md @@ -7,9 +7,15 @@ date: 2025-11-01 domain: ai-alignment secondary_domains: [] format: paper -status: unprocessed +status: processed priority: high tags: [pluralistic-alignment, demographic-composition, empirical, safety-inclusivity, real-human-feedback] +processed_by: theseus +processed_date: 2026-03-11 +claims_extracted: ["demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md"] +enrichments_applied: ["community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules.md", "some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md"] +extraction_model:
"anthropic/claude-sonnet-4.5" +extraction_notes: "Single high-quality claim extracted with strong empirical grounding (N=1,095, 27,375 ratings). Three enrichments to existing pluralistic alignment claims. Could not access full paper—extraction based on search summary and agent notes. Full paper would likely contain additional insights on interaction effects and comparison with other pluralistic alignment approaches." --- ## Content @@ -37,3 +43,11 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al PRIMARY CONNECTION: community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules WHY ARCHIVED: Empirical evidence that "whose preferences" is a quantitatively important question, not just a fairness concern EXTRACTION HINT: Focus on the magnitude of demographic composition effects and what this means for single-population alignment training + + +## Key Facts +- Study included 1,095 participants providing 27,375 ratings +- Liberal training data: +5.0 pp vs Conservative baseline +- White training data: +4.7 pp vs Black baseline +- Female training data: +3.4 pp vs Male baseline +- Effects measured on emotional awareness and toxicity dimensions -- 2.45.2 From 9fafd4eb3817797d319217da5b64080ee3ea083d Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Sat, 14 Mar 2026 11:22:04 +0000 Subject: [PATCH 2/2] auto-fix: strip 5 broken wiki links Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base. --- ...ects-model-behavior-with-3-5-percentage-point-effects.md | 2 +- ... 
gaps and systems must map rather than eliminate them.md | 6 +++--- ...-00-operationalizing-pluralistic-values-llm-alignment.md | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md b/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md index 488691ac7..ef5bbc689 100644 --- a/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md +++ b/domains/ai-alignment/demographic-composition-of-alignment-training-data-materially-affects-model-behavior-with-3-5-percentage-point-effects.md @@ -35,4 +35,4 @@ Relevant Notes: - [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] Topics: -- [[domains/ai-alignment/_map]] +- domains/ai-alignment/_map diff --git a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md index 5393c3636..3cc492fda 100644 --- a/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md +++ b/domains/ai-alignment/some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them.md @@ -11,15 +11,15 @@ source: "Arrow's impossibility theorem; value pluralism (Isaiah Berlin); LivingI Not all disagreement 
is an information problem. Some disagreements persist because people genuinely weight values differently -- liberty against equality, individual against collective, present against future, growth against sustainability. These are not failures of reasoning or gaps in evidence. They are structural features of a world where multiple legitimate values cannot all be maximized simultaneously. -[[Universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]]. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases. +Universal alignment is mathematically impossible because Arrow's impossibility theorem applies to aggregating diverse human preferences into a single coherent objective. Arrow proved this formally: no aggregation mechanism can satisfy all fairness criteria simultaneously when preferences genuinely diverge. The implication is not that we should give up on coordination, but that any system claiming to have resolved all disagreement has either suppressed minority positions or defined away the hard cases. This matters for knowledge systems because the temptation is always to converge. Consensus feels like progress. But premature consensus on value-laden questions is more dangerous than sustained tension. A system that forces agreement on whether AI development should prioritize capability or safety, or whether economic growth or ecological preservation takes precedence, has not solved the problem -- it has hidden it. And hidden disagreements surface at the worst possible moments. The correct response is to map the disagreement rather than eliminate it. Identify the common ground.
Build steelman arguments for each position. Locate the precise crux -- is it empirical (resolvable with evidence) or evaluative (genuinely about different values)? Make the structure of the disagreement visible so that participants can engage with the strongest version of positions they oppose. -[[Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus. +Pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state -- this is the same principle applied to AI systems. [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- collapsing diverse preferences into a single function is the technical version of premature consensus. -[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. +Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. 
The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively. ### Additional Evidence (confirm) diff --git a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md index a6ecb0773..ff4355cfd 100644 --- a/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md +++ b/inbox/archive/2025-11-00-operationalizing-pluralistic-values-llm-alignment.md @@ -35,7 +35,7 @@ Demonstrates that "whose feedback" matters as much as "how much feedback" for al **Why this matters:** First large-scale empirical study varying DEMOGRAPHIC COMPOSITION of alignment training data. Proves that the composition question (whose preferences?) has measurable, quantitative effects on model behavior. **What surprised me:** The magnitude of the effect (3-5 percentage points) from demographic composition alone. This is not a subtle effect. **What I expected but didn't find:** Couldn't access full paper. Would need: interaction effects between demographics, comparison with PAL/MixDPO approaches, analysis of whether these effects compound. -**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms [[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps]]. +**KB connections:** Directly supports [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]]. Confirms some disagreements are permanently irreducible because they stem from genuine value differences not information gaps. **Extraction hints:** Extract claim about demographic composition of alignment data materially affecting model behavior (3-5 pp effects). **Context:** 1,095 participants is a large N for alignment research. 
Real human feedback, not synthetic. -- 2.45.2