| type | domain | secondary_domains | description | confidence | source | created | enrichments |
|---|---|---|---|---|---|---|---|
| claim | ai-alignment | | Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning across new domains | speculative | Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arXiv 2512.03399, December 2025) | 2026-03-11 | |
Thick models of value distinguish enduring values from temporary preferences, enabling normative competence
The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for AI alignment. The framework distinguishes three dimensions:
- Enduring vs. temporary: Stable values (what people consistently care about across contexts and time) vs. temporary preferences (what people want in specific moments, contexts, or under particular constraints)
- Social embedding: Individual choices modeled within social contexts and relationships rather than as atomized preferences of isolated agents
- Normative reasoning: AI systems that reason about values across new domains and novel situations rather than simply optimizing pre-specified objectives
The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals or preference orderings.
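The paper gives no formal schema for such a model, so the sketch below is purely illustrative: every class name, field, and threshold is an assumption invented here to make the three dimensions concrete, not the authors' design.

```python
from dataclasses import dataclass

@dataclass
class EnduringValue:
    name: str                    # e.g. "honesty"
    contexts_observed: set[str]  # distinct contexts in which the value held

@dataclass
class TemporaryPreference:
    want: str     # e.g. "skip the meeting"
    context: str  # the specific situation that produced it

@dataclass
class ThickValueModel:
    person: str
    values: list[EnduringValue]            # dimension 1: enduring vs. temporary
    preferences: list[TemporaryPreference]
    social_ties: dict[str, float]          # dimension 2: social embedding (tie strength)

    def transferable_values(self, min_contexts: int = 3) -> list[EnduringValue]:
        """Crude stand-in for dimension 3 (normative reasoning): treat a
        value as applicable to a new domain only if it has already held
        across several distinct contexts."""
        return [v for v in self.values
                if len(v.contexts_observed) >= min_contexts]
```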
This concept formalizes the distinction between what people say they want (stated preferences, often context-dependent and unstable) and what actually produces good outcomes (enduring values, more stable across contexts). It proposes continuously integrating values into system behavior rather than specifying objectives in advance at training time.
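A toy contrast between the two regimes, under the assumption that a value model can be represented as a callable scorer; the function names and the `update_from_feedback` hook are hypothetical, not from the paper.

```python
def advance_specification(objective, episodes):
    # Objective frozen before deployment; later value drift is invisible to it.
    return [objective(e) for e in episodes]

def continuous_integration(value_model, episodes, update_from_feedback):
    # The value model is revised from ongoing human input at every step
    # rather than fixed once at training time.
    scores = []
    for e in episodes:
        value_model = update_from_feedback(value_model, e)  # fold in new evidence
        scores.append(value_model(e))
    return scores
```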
Evidence
The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists that demonstrates thick value modeling at scale, and the computational requirements for modeling social context and distinguishing enduring from temporary values are unspecified.
The framework also does not engage with existing work on the limits of preference-based methods (RLHF, DPO) under diverse preferences, or explain how thick models would handle irreducible value disagreements between individuals or groups.
Challenges
Stability assumption: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others.
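One way to make the timescale point concrete is to measure the stability of some scalar proxy for value endorsement at different window lengths; the proxy and the drift rate below are invented for illustration.

```python
def stability(series: list[float], window: int) -> float:
    """Mean absolute change between readings `window` steps apart
    (0.0 = perfectly stable)."""
    diffs = [abs(series[i + window] - series[i])
             for i in range(len(series) - window)]
    return sum(diffs) / len(diffs)

# A value undergoing slow generational drift in endorsement strength:
endorsement = [0.9 - 0.01 * t for t in range(100)]
print(stability(endorsement, window=1))   # ~0.01: looks "enduring" step to step
print(stability(endorsement, window=50))  # ~0.50: clearly not enduring at scale
```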
Computational explosion: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive.
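A back-of-envelope sketch of the blow-up, assuming a uniform branching factor of 150 ties per person (a Dunbar-style estimate, not a figure from the paper):

```python
def neighborhood_size(avg_ties: int, degrees: int) -> int:
    """Upper bound on people within `degrees` of separation, assuming
    non-overlapping neighborhoods (the worst case for a modeler)."""
    return sum(avg_ties ** k for k in range(1, degrees + 1))

for k in range(1, 5):
    print(k, neighborhood_size(150, k))
# 1 -> 150; 2 -> 22,650; 3 -> ~3.4M; 4 -> ~5.1e8 people in scope
```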
Irreducible disagreement: The framework does not specify how thick models handle cases where different groups hold genuinely incompatible enduring values, not just differing preferences. If Group A holds individual autonomy and Group B holds collective harmony as enduring values, a thick model does not resolve the conflict; it only represents it more faithfully, leaving open whether thick models are a mechanism for pluralistic alignment or simply a more honest statement of the pluralism problem.
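A minimal sketch of this point; the value weights and scoring rule are hypothetical, and the code deliberately stops at representation because the framework specifies no aggregation step.

```python
group_values = {
    "A": {"individual_autonomy": +1, "collective_harmony": -1},
    "B": {"individual_autonomy": -1, "collective_harmony": +1},
}

def evaluate(action_effects: dict[str, int]) -> dict[str, int]:
    """Score an action under each group's enduring values; no aggregation."""
    return {g: sum(w * action_effects.get(v, 0) for v, w in vals.items())
            for g, vals in group_values.items()}

print(evaluate({"individual_autonomy": 1}))
# {'A': 1, 'B': -1}: the model represents the conflict faithfully
# but leaves the decision between the groups entirely open.
```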
Operationalization gap: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions?
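One hypothetical operationalization (not anything the paper proposes): treat an expressed value as a candidate enduring value only if it is endorsed consistently across several contexts and survives a stated trade-off against a competing incentive. Every threshold and field name below is an assumption.

```python
def looks_enduring(endorsements: dict[str, bool],
                   survives_tradeoff: bool,
                   min_contexts: int = 3) -> bool:
    """endorsements maps context -> whether the value was endorsed there."""
    consistent = (len(endorsements) >= min_contexts
                  and all(endorsements.values()))
    return consistent and survives_tradeoff

print(looks_enduring({"work": True, "family": True, "online": True},
                     survives_tradeoff=True))
# True; yet this still cannot separate an enduring value from a merely
# stable preference, which is exactly the gap the critique identifies.
```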
Relationship to existing pluralistic alignment work: The framework addresses the same surface problem as existing pluralistic alignment literature (Sorensen et al., Klassen et al.) — how to accommodate diverse human values in AI systems. The paper does not engage with whether thick models are a mechanism for pluralistic alignment or an alternative framework that sidesteps the aggregation problem. This relationship should be explicit.
Relevant Notes:
- the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance — thick values formalize continuous integration rather than advance specification
- specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception — thick models acknowledge this complexity and propose social embedding as a partial solution
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — thick models must handle value pluralism; unclear whether they solve or just represent the problem
- the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions — thick models attempt to address this through continuous integration
- super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance — complementary mechanism; Zeng grounds co-alignment in intrinsic moral development (self-awareness, Theory of Mind); full-stack grounds thick models in social embedding and enduring-vs-temporary distinctions
Topics: