- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
| type | domain | description | confidence | source | created | secondary_domains |
|---|---|---|---|---|---|---|
| claim | ai-alignment | Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture | experimental | Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399 | 2026-03-11 | |
Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment, the specification trap: human values cannot be fully specified in advance.
What thick models do:
- Distinguish enduring values from temporary preferences — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
- Model individual choices within social contexts — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
- Enable normative reasoning across new domains — Allows systems to reason about values in contexts not explicitly covered by training data, rather than failing when encountering novel situations (see the sketch after this list)
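The paper gives no implementation, but a minimal data-structure sketch (all names and fields here are hypothetical, not from the paper) makes the contrast concrete: a thin model is a bare ranking over outcomes, while a thick model attaches stability and social context to each value.

```python
from dataclasses import dataclass, field

# Thin model: a bare preference ordering over outcomes.
# Every entry is treated as equivalent and context-free.
ThinModel = list[str]  # e.g. ["promotion", "leisure", "exercise"]

@dataclass
class ThickValue:
    """One value in a thick model (hypothetical schema)."""
    name: str                 # e.g. "honesty"
    enduring: bool            # enduring value vs. temporary preference
    contexts: set[str] = field(default_factory=set)   # social/institutional settings where it was observed
    grounds: list[str] = field(default_factory=list)  # why the agent holds it, usable for reasoning

@dataclass
class ThickModel:
    values: list[ThickValue]

    def applicable(self, context: str) -> list[ThickValue]:
        # Enduring values generalize to unseen contexts; temporary
        # preferences only apply in the contexts they were observed in.
        return [v for v in self.values
                if v.enduring or context in v.contexts]
```

On this sketch, a thin model can only be re-ranked, while a thick model can answer "which values bear on this new context?", which is the capability the paper calls normative reasoning across new domains.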
Why this matters for alignment:
This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:
- They cannot distinguish signal (enduring values) from noise (temporary preferences)
- They assume preferences are stable across contexts when they are actually highly context-dependent
- They cannot generalize to novel domains because they have no principled way to reason about values beyond training data
Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.
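A toy example (hypothetical data) shows the failure concretely: if observed choices reverse across contexts, no single context-free utility ordering is consistent with all of them.

```python
# Observed pairwise choices, keyed by context (hypothetical data).
# In a performance review the person picks candor over comfort;
# when consoling a friend the same person picks comfort over candor.
observed = {
    "performance_review": ("candor", "comfort"),   # (chosen, rejected)
    "consoling_a_friend": ("comfort", "candor"),
}

def fit_context_free_utility(choices):
    """Try to build one utility ordering consistent with all observed choices."""
    better_than = set()
    for chosen, rejected in choices.values():
        if (rejected, chosen) in better_than:
            return None  # contradiction: no single ordering fits the data
        better_than.add((chosen, rejected))
    return better_than

print(fit_context_free_utility(observed))  # -> None: a thin model cannot fit both choices
```

A thick model avoids the contradiction by indexing each choice to its context and asking which enduring value (say, care for the other person) explains both observations.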
Evidence
- Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
- The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
- The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack
Limitations and Open Questions
- No formal specification of what constitutes a "thick model" or how to implement one in practice
- Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
- Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
- No empirical validation that thick models actually outperform thin models on alignment tasks
- The paper does not address how thick models handle genuinely conflicting values across populations
Relevant Notes:
- the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance — thick values formalize continuous value integration
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception — thick models acknowledge this complexity
- RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values — thin models fail at diversity
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — relevant to the paternalism concern