teleo-codex/domains/ai-alignment/thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-competence.md

type: claim
domain: ai-alignment
secondary_domains: mechanisms
description: Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning across new domains
confidence: speculative
source: Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arXiv 2512.03399, December 2025)
created: 2026-03-11
enrichments:
  - the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance
  - specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception

Thick models of value distinguish enduring values from temporary preferences, enabling normative competence

The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for AI alignment. The framework distinguishes three dimensions:

  1. Enduring vs. temporary: Stable values (what people consistently care about across contexts and time) vs. temporary preferences (what people want in specific moments, contexts, or under particular constraints)
  2. Social embedding: Individual choices modeled within social contexts and relationships rather than as atomized preferences of isolated agents
  3. Normative reasoning: AI systems that reason about values across new domains and novel situations rather than simply optimizing pre-specified objectives

The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals or preference orderings.
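As a rough illustration only (the paper specifies no representation), the three dimensions could be sketched as a data structure; every name here (ThickValueModel, EnduringValue, and so on) is hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class EnduringValue:
    """A value the person endorses consistently across contexts and time."""
    name: str                # e.g. "honesty"
    endorsement_rate: float  # fraction of sampled contexts where it is affirmed


@dataclass
class TemporaryPreference:
    """A want tied to a specific moment, context, or constraint."""
    description: str         # e.g. "wants the shortest possible answer"
    context: str             # the situation in which it was expressed


@dataclass
class SocialContext:
    """Relationships that shape how an individual's choices are interpreted."""
    relationships: dict[str, str]  # person -> role, e.g. {"Ada": "colleague"}


@dataclass
class ThickValueModel:
    """Toy container for the three dimensions named by the framework."""
    enduring_values: list[EnduringValue] = field(default_factory=list)
    temporary_preferences: list[TemporaryPreference] = field(default_factory=list)
    social_context: SocialContext = field(default_factory=lambda: SocialContext({}))

    def relevant_values(self, new_domain: str) -> list[str]:
        # Placeholder for "normative reasoning": carry enduring values into a
        # domain where no preference has yet been stated.
        return [v.name for v in self.enduring_values]
```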

This concept formalizes the distinction between what people say they want (stated preferences, often context-dependent and unstable) and what actually produces good outcomes (enduring values, which are more stable across contexts). It proposes weaving values into system behavior continuously rather than specifying objectives in advance at training time.

Evidence

The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists that demonstrates thick value modeling at scale, and the computational requirements for modeling social context and distinguishing enduring from temporary values are unspecified.

The framework does not engage with existing work on the limitations of preference-based methods such as RLHF and DPO under preference diversity, nor does it explain how thick models would handle irreducible value disagreements between individuals or groups.

Challenges

Stability assumption (primary challenge): How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others. Without an operationalization method for distinguishing enduring values from temporary preferences, the framework remains conceptual rather than actionable.
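A minimal sketch of the timescale worry, using made-up endorsement data and an arbitrary stability threshold (none of this comes from the paper): a value that looks enduring within any five-year window can still drift over two decades.

```python
from statistics import pstdev

# Hypothetical endorsement scores (0-1) for one value, sampled yearly over 20
# years: stable within each 5-year window, but drifting across decades.
endorsements = [0.9] * 5 + [0.85] * 5 + [0.6] * 5 + [0.3] * 5


def looks_enduring(scores, window, tolerance=0.1):
    """Call a value 'enduring' if it varies little within each observation window."""
    windows = [scores[i:i + window] for i in range(0, len(scores), window)]
    return all(pstdev(w) <= tolerance for w in windows)


print(looks_enduring(endorsements, window=5))  # True: "enduring" at the 5-year scale
print(pstdev(endorsements) <= 0.1)             # False: drifts over the full 20 years
```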

Computational explosion: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive, and the paper provides no analysis of whether this is tractable at population scale.
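A back-of-envelope sketch of that blow-up, assuming (hypothetically) a Dunbar-scale 150 contacts per person and ignoring overlap between contact circles, which real social graphs would reduce:

```python
# How many people fall within k degrees of social separation if each person
# has roughly 150 contacts and circles do not overlap (an upper bound).
contacts_per_person = 150

for degrees in range(1, 4):
    reachable = contacts_per_person ** degrees
    print(f"within {degrees} degree(s): ~{reachable:,} people to model")
# within 1 degree(s): ~150 people to model
# within 2 degree(s): ~22,500 people to model
# within 3 degree(s): ~3,375,000 people to model
```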

Irreducible disagreement: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences. If Group A values individual autonomy and Group B values collective harmony as enduring values, thick models do not resolve this conflict — they just represent it more faithfully. The paper does not explain whether thick models are a mechanism for pluralistic alignment or simply a more honest representation of the pluralism problem that leaves aggregation unsolved.

Relationship to existing pluralistic alignment work: The framework addresses the same surface problem as the existing pluralistic alignment literature (Sorensen et al., Klassen et al., democratic alignment assemblies): how to accommodate diverse human values in AI systems. Yet it does not position thick models relative to that work, either as an input to those aggregation mechanisms or as an alternative that sidesteps them. This relationship should be explicit, and the paper's silence on it suggests the framework may reframe the pluralism problem rather than solve it.

Operationalization gap: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from merely stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions? Without operationalization, the framework remains an architectural proposal rather than a testable method.


Relevant Notes:

Topics: