teleo-codex/domains/ai-alignment/thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-competence.md

type: claim
domain: ai-alignment
secondary_domains: mechanisms
description: Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning across new domains
confidence: speculative
source: Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arXiv 2512.03399, December 2025)
created: 2026-03-11
enrichments:
  - the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance
  - specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception

Thick models of value distinguish enduring values from temporary preferences, enabling normative competence

The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for AI alignment. The framework distinguishes three dimensions:

  1. Enduring vs. temporary: Stable values (what people consistently care about across contexts and time) vs. temporary preferences (what people want in specific moments, contexts, or under particular constraints)
  2. Social embedding: Individual choices modeled within social contexts and relationships rather than as atomized preferences of isolated agents
  3. Normative reasoning: AI systems that reason about values across new domains and novel situations rather than simply optimizing pre-specified objectives

The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals or preference orderings.
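As a rough illustration only (the paper specifies no representation), the three dimensions could be sketched as a data structure; every name here (ThickValueModel, EnduringValue, and so on) is hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class EnduringValue:
    """A value the person endorses consistently across contexts and time."""
    name: str                # e.g. "honesty"
    endorsement_rate: float  # fraction of sampled contexts where it is affirmed


@dataclass
class TemporaryPreference:
    """A want tied to a specific moment, context, or constraint."""
    description: str         # e.g. "wants the shortest possible answer"
    context: str             # the situation in which it was expressed


@dataclass
class SocialContext:
    """Relationships that shape how an individual's choices are interpreted."""
    relationships: dict[str, str]  # person -> role, e.g. {"Ada": "colleague"}


@dataclass
class ThickValueModel:
    """Toy container for the three dimensions named by the framework."""
    enduring_values: list[EnduringValue] = field(default_factory=list)
    temporary_preferences: list[TemporaryPreference] = field(default_factory=list)
    social_context: SocialContext = field(default_factory=lambda: SocialContext({}))

    def relevant_values(self, new_domain: str) -> list[str]:
        # Placeholder for "normative reasoning": carry enduring values into a
        # domain where no preference has yet been stated.
        return [v.name for v in self.enduring_values]
```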

This concept formalizes the distinction between what people say they want (stated preferences, often context-dependent and unstable) and what actually produces good outcomes (enduring values, which are more stable across contexts). It proposes weaving values into system behavior continuously rather than specifying objectives in advance at training time.

Evidence

The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists that demonstrates thick value modeling at scale, and the computational requirements for modeling social context and distinguishing enduring from temporary values are unspecified.

The framework does not engage with existing work on the limitations of preference-based methods such as RLHF and DPO under preference diversity, nor does it explain how thick models would handle irreducible value disagreements between individuals or groups.

Challenges

Stability assumption (primary challenge): How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others. Without an operationalization method for distinguishing enduring values from temporary preferences, the framework remains conceptual rather than actionable.
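A minimal sketch of the timescale worry, using made-up endorsement data and an arbitrary stability threshold (none of this comes from the paper): a value that looks enduring within any five-year window can still drift over two decades.

```python
from statistics import pstdev

# Hypothetical endorsement scores (0-1) for one value, sampled yearly over 20
# years: stable within each 5-year window, but drifting across decades.
endorsements = [0.9] * 5 + [0.85] * 5 + [0.6] * 5 + [0.3] * 5


def looks_enduring(scores, window, tolerance=0.1):
    """Call a value 'enduring' if it varies little within each observation window."""
    windows = [scores[i:i + window] for i in range(0, len(scores), window)]
    return all(pstdev(w) <= tolerance for w in windows)


print(looks_enduring(endorsements, window=5))  # True: "enduring" at the 5-year scale
print(pstdev(endorsements) <= 0.1)             # False: drifts over the full 20 years
```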

Computational explosion: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive, and the paper provides no analysis of whether this is tractable at population scale.
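A back-of-envelope sketch of that blow-up, assuming (hypothetically) a Dunbar-scale 150 contacts per person and ignoring overlap between contact circles, which real social graphs would reduce:

```python
# How many people fall within k degrees of social separation if each person
# has roughly 150 contacts and circles do not overlap (an upper bound).
contacts_per_person = 150

for degrees in range(1, 4):
    reachable = contacts_per_person ** degrees
    print(f"within {degrees} degree(s): ~{reachable:,} people to model")
# within 1 degree(s): ~150 people to model
# within 2 degree(s): ~22,500 people to model
# within 3 degree(s): ~3,375,000 people to model
```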

Irreducible disagreement: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences. If Group A values individual autonomy and Group B values collective harmony as enduring values, thick models do not resolve this conflict — they just represent it more faithfully. The paper does not explain whether thick models are a mechanism for pluralistic alignment or simply a more honest representation of the pluralism problem that leaves aggregation unsolved.

Relationship to existing pluralistic alignment work: The framework addresses the same surface problem as the existing pluralistic alignment literature (Sorensen et al., Klassen et al., democratic alignment assemblies): how to accommodate diverse human values in AI systems. Yet it does not position thick models relative to that work, either as an input to those aggregation mechanisms or as an alternative that sidesteps them. This relationship should be explicit, and the paper's silence on it suggests the framework may reframe the pluralism problem rather than solve it.

Operationalization gap: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from merely stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions? Without operationalization, the framework remains an architectural proposal rather than a testable method.


Relevant Notes:

Topics: