- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)


---
type: claim
domain: ai-alignment
description: "Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences enabling normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment: the specification trap, the failure mode in which values specified fully in advance cannot capture what people actually care about.

**What thick models do:**
1. **Distinguish enduring values from temporary preferences** — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
2. **Model individual choices within social contexts** — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
3. **Enable normative reasoning across new domains** — Allows systems to reason about values in contexts not explicitly covered by training data, rather than failing when they encounter novel situations
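The paper gives no formal specification of a thick model (see the limitations below), so the following Python sketch is purely illustrative: every class, field, and method name is an assumption chosen to make the three capabilities above concrete. The key structural point is that values and preferences are distinct types, preferences are bound to a context, and only values carry over to novel domains.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the Full-Stack Alignment paper does not
# define these structures; the names here are assumptions.

@dataclass(frozen=True)
class Preference:
    """A context-dependent, possibly volatile expressed want."""
    statement: str
    context: str  # a preference only holds within its context

@dataclass(frozen=True)
class Value:
    """An enduring value: stable across contexts, generalizable."""
    name: str
    rationale: str  # why acting on it tends to produce good outcomes

@dataclass
class ThickValueModel:
    values: list[Value] = field(default_factory=list)
    preferences: list[Preference] = field(default_factory=list)
    social_context: str = ""  # choices are embedded in social structures

    def applies_in(self, novel_context: str) -> list[Value]:
        # Values transfer to contexts never seen before;
        # preferences do not, because they are bound to their context.
        return list(self.values)

model = ThickValueModel(
    values=[Value("honesty", "trust enables cooperation")],
    preferences=[Preference("order pizza", context="friday night")],
    social_context="household",
)
# Only the enduring value transfers to a new domain:
transferable = model.applies_in("workplace negotiation")
```

A thin model, by contrast, would collapse both lists into a single preference ordering with no type distinction and no context field, which is exactly the failure mode discussed next.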

**Why this matters for alignment:**

This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:
- They cannot distinguish signal (enduring values) from noise (temporary preferences)
- They assume preferences are stable across contexts when they are actually highly context-dependent
- They cannot generalize to novel domains because they have no principled way to reason about values beyond training data

Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.
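The signal-versus-noise failure can be made concrete with a toy example (my own illustration, not from the paper): a thin model that aggregates observed choices into one context-free score discards exactly the context information it would need to generalize.

```python
# Toy illustration of the thin-model failure mode: a single
# context-free utility score forced to treat all preferences alike.

# The same person ranks options differently in different contexts --
# context-dependence that a flat preference count cannot represent.
observations = [
    {"context": "weekday", "chose": "salad", "over": "cake"},
    {"context": "birthday", "chose": "cake", "over": "salad"},
]

def thin_utility(observations: list[dict]) -> dict[str, int]:
    """Collapse choices into one score per option, dropping context."""
    scores: dict[str, int] = {}
    for obs in observations:
        scores[obs["chose"]] = scores.get(obs["chose"], 0) + 1
        scores.setdefault(obs["over"], 0)  # losing option still appears
    return scores

u = thin_utility(observations)
# u["salad"] == u["cake"] == 1: the thin model sees a tie and has no
# principled way to decide in a novel context, whereas a thick model
# could appeal to the enduring value behind each context-bound choice.
```

The "tie" is not a bug in the aggregation; it is the information loss the thick/thin distinction is pointing at.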
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
- The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
- The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack
## Limitations and Open Questions
- No formal specification of what constitutes a "thick model" or how to implement one in practice
- Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
- Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
- No empirical validation that thick models actually outperform thin models on alignment tasks
- The paper does not address how thick models handle genuinely conflicting values across populations
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — thin models fail at diversity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — relevant to the paternalism concern