teleo-codex/domains/ai-alignment/thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning.md
Teleo Agents 4dfe98112c theseus: extract from 2025-12-00-fullstack-alignment-thick-models-value.md
- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 07:01:07 +00:00


type: claim
domain: ai-alignment
description: Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture
confidence: experimental
source: Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399
created: 2026-03-11
secondary_domains: mechanisms

Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning across new domains

The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment: the specification trap.

What thick models do:

  1. Distinguish enduring values from temporary preferences — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
  2. Model individual choices within social contexts — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
  3. Enable normative reasoning across new domains — Allows systems to reason about values in contexts not explicitly covered by training data, rather than failing when encountering novel situations
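The three properties can be sketched as a toy data structure. This is a hypothetical illustration, not anything from the paper (which, as noted below, gives no formal specification): `ThickModel`, `enduring_values`, `preferences`, and `endorses` are all invented names.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a thick model keeps enduring values separate from
# context-bound preferences, and records the social context each
# preference ranking belongs to.

@dataclass
class ThickModel:
    # Property 1: stable, generalizable values, kept apart from preferences.
    enduring_values: set[str]
    # Property 2: volatile preferences, indexed by the social context
    # (relationship, institution, situation) they were expressed in.
    preferences: dict[str, list[str]] = field(default_factory=dict)

    def endorses(self, option: str, context: str) -> bool:
        # Property 3: in a context with no stored preferences, fall back
        # to reasoning from enduring values instead of failing outright.
        if context in self.preferences:
            return option in self.preferences[context]
        return option in self.enduring_values
```

A thin model would collapse both fields into a single context-free ranking, which is exactly the distinction the list above says gets lost.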

Why this matters for alignment:

This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:

  • They cannot distinguish signal (enduring values) from noise (temporary preferences)
  • They assume preferences are stable across contexts when they are actually highly context-dependent
  • They cannot generalize to novel domains because they have no principled way to reason about values beyond training data
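The generalization failure in the last bullet can be made concrete with a toy sketch (hypothetical, not from the paper): a flat preference ordering simply has no answer for an option it has never ranked.

```python
# Hypothetical illustration: a "thin" model is just a flat ranking, so it
# has no principled way to reason about options outside that ranking.
preference_order = ["coffee", "tea"]

def thin_prefers(a: str, b: str) -> bool:
    # Errors out on any option absent from the stored ordering.
    return preference_order.index(a) < preference_order.index(b)

try:
    thin_prefers("coffee", "water")  # novel option: no way to reason about it
    novel_option_handled = True
except ValueError:
    novel_option_handled = False  # the model can only fail
```

The thick-model sketch above avoids this by falling back to enduring values; the thin model has nothing to fall back on.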

Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.

Evidence

  • Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
  • The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
  • The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack

Limitations and Open Questions

  • No formal specification of what constitutes a "thick model" or how to implement one in practice
  • Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
  • Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
  • No empirical validation that thick models actually outperform thin models on alignment tasks
  • The paper does not address how thick models handle genuinely conflicting values across populations

Relevant Notes: