- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)
| type | domain | description | confidence | source | created | secondary_domains |
|---|---|---|---|---|---|---|
| claim | ai-alignment | Thick value models distinguish stable enduring values from context-dependent preferences and embed individual choices within social contexts, enabling normative reasoning across new domains | experimental | Multiple authors, "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value" (December 2025) | 2026-03-11 | |
|
Thick models of value distinguish enduring values from temporary preferences and model how individual choices embed within social contexts, enabling normative reasoning across new domains.
The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for representing what people value. These models have three key properties:
- Distinguish enduring values from temporary preferences — They separate stable, long-term values from context-dependent, momentary preferences. This distinction recognizes that what people say they want in a given moment (preferences) may differ from what actually produces good outcomes over time (values).
- Model social embeddedness — They represent how individual choices are embedded within social contexts rather than treating preferences as purely individual atomic choices. Values are shaped by social context, cultural norms, and institutional structures.
- Enable normative reasoning across new domains — They support reasoning about values in new contexts and domains, not just optimization over fixed preferences. This allows alignment systems to extend to novel situations without requiring complete respecification of values.
This approach contrasts with thin models (utility functions, preference orderings) that treat all stated preferences equally and assume values can be captured in a single scalar or ordering. Thin models assume preferences are stable and context-independent, which fails when contexts change or when preferences conflict with deeper values.
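The representational contrast above can be made concrete with a minimal sketch. This is an illustrative assumption, not an API from the paper: the class names, fields, and methods (`ThinModel`, `ThickValueModel`, `choose`, `endorses`) are hypothetical, chosen only to show how a thin scalar-utility model collapses everything into one ordering while a thick model keeps enduring values, momentary preferences, and social context as separate structure.

```python
from dataclasses import dataclass

# Hypothetical sketch — names and fields are illustrative assumptions,
# not definitions from the full-stack alignment paper.

@dataclass
class ThinModel:
    """'Thin' representation: a single scalar utility per option."""
    utilities: dict[str, float]

    def choose(self, options: list[str]) -> str:
        # Optimizes stated preferences directly; there is no notion of
        # context, norms, or deeper values behind the numbers.
        return max(options, key=lambda o: self.utilities.get(o, 0.0))

@dataclass
class ThickValueModel:
    """'Thick' representation carrying the paper's three properties."""
    enduring_values: set[str]                # stable, long-term values
    momentary_preferences: dict[str, float]  # context-dependent preferences
    social_context: dict[str, set[str]]      # value -> norms shaping it

    def endorses(self, choice: str) -> bool:
        # Normative reasoning in a new domain: a choice is endorsed when it
        # serves an enduring value, even if momentary preference is low.
        return choice in self.enduring_values
```

Under this sketch, an agent whose momentary preference for honesty is low (say, under social pressure to flatter) still endorses honesty as an enduring value, while the thin model simply selects whichever option scores highest in the moment.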
The distinction maps to the difference between optimizing for stated preferences (which may be unstable or context-dependent) versus integrating values continuously as contexts evolve.
## Evidence
From "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value" (December 2025), the paper explicitly defines thick models of value as having three properties: (1) "distinguish enduring values from temporary preferences", (2) "model how individual choices embed within social contexts", and (3) "enable normative reasoning across new domains." These are presented as contrasts with utility functions and preference orderings, which the paper characterizes as "thin" models that fail to capture the complexity of human values.
## Relationship to Existing Claims
This claim formalizes the intuition behind "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md" by providing a technical framework for representing values that can evolve with context rather than becoming brittle. It also relates to "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values.md" (if that claim exists) by offering an alternative to single-reward-function approaches. Finally, it extends "super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md" by providing a concrete mechanism for how co-shaped values can be represented and reasoned about.
## Relevant Notes
- the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md — thick models address this instability
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md — thick models acknowledge this complexity
- super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md — thick models formalize how co-shaped values can be represented