auto-fix: address review feedback on PR #759

- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
Teleo Agents 2026-03-12 07:08:18 +00:00
parent 4dfe98112c
commit 9911bfd1ed
2 changed files with 29 additions and 48 deletions


@@ -1,48 +0,0 @@
---
type: claim
domain: ai-alignment
description: "Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment: the specification trap.
**What thick models do:**
1. **Distinguish enduring values from temporary preferences** — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
2. **Model individual choices within social contexts** — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
3. **Enable normative reasoning across new domains** — Allows systems to reason about values in contexts not explicitly covered by training data, rather than failing when encountering novel situations (see the sketch after this list)
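To make the contrast concrete, here is a minimal sketch of what a thick value record might carry beyond a thin scalar preference. The paper specifies no data model, so the schema and every field name below (`endorsement`, `stability`, `social_context`) are illustrative assumptions, not a specification from the source:
```python
from dataclasses import dataclass, field

# Thin model: a single context-free number per outcome. Everything
# below (stability, social embedding) is invisible at this level.
ThinPreference = float

@dataclass
class ThickValue:
    """Hypothetical 'thick' value record; the paper gives no formal
    schema, so these fields are illustrative assumptions."""
    name: str                 # e.g. "honesty"
    endorsement: float        # strength of reflective endorsement, 0..1
    stability: float          # 0..1: temporary preference -> enduring value
    social_context: list[str] = field(default_factory=list)  # relationships and institutions it is embedded in

def is_enduring(value: ThickValue, threshold: float = 0.7) -> bool:
    """A question a thick model can ask and a bare scalar cannot:
    generalize this to new domains, or treat it as transient noise?"""
    return value.stability >= threshold

# Same endorsement strength, very different normative status.
snack_craving = ThickValue("craving sweets", endorsement=0.8, stability=0.1)
honesty = ThickValue("honesty", endorsement=0.8, stability=0.95,
                     social_context=["family", "workplace"])
assert not is_enduring(snack_craving)
assert is_enduring(honesty)
```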
**Why this matters for alignment:**
This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:
- They cannot distinguish signal (enduring values) from noise (temporary preferences)
- They assume preferences are stable across contexts when they are actually highly context-dependent
- They cannot generalize to novel domains because they have no principled way to reason about values beyond training data
Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.
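The context-dependence failure can be shown with a toy example (the scenario is invented, not taken from the paper): once observed choices reverse across contexts, no single context-free utility assignment reproduces them, whereas a thick model can condition the judgment on context.
```python
from itertools import permutations

# Hypothetical observations: (context, chosen, rejected). The same
# pair reverses across contexts, so the data cannot be explained by
# ANY single context-free ordering of the options.
observations = [
    ("with_family",  "frank_criticism", "polite_evasion"),
    ("at_a_funeral", "polite_evasion",  "frank_criticism"),
]

def fits_thin_utility(obs) -> bool:
    """Brute-force check: does some context-free strict ordering over
    the options reproduce every observed choice?"""
    options = sorted({o for _, a, b in obs for o in (a, b)})
    for order in permutations(options):
        rank = {o: i for i, o in enumerate(order)}  # lower rank = preferred
        if all(rank[a] < rank[b] for _, a, b in obs):
            return True
    return False

print(fits_thin_utility(observations))  # False: no single utility
# function fits a reversal; a thick model keys on context instead.
```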
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
- The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
- The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack
## Limitations and Open Questions
- No formal specification of what constitutes a "thick model" or how to implement one in practice
- Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
- Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
- No empirical validation that thick models actually outperform thin models on alignment tasks
- The paper does not address how thick models handle genuinely conflicting values across populations
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — thin models fail at diversity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — relevant to the paternalism concern


@@ -0,0 +1,29 @@
---
type: claim
domain: ai-alignment
confidence: experimental
description: Thick models of value distinguish enduring values from temporary preferences, which the authors argue enables normative reasoning across new domains.
created: 2025-12-01
processed_date: 2025-12-01
source: arxiv.org/abs/2512.03399
secondary_domains:
- mechanisms
- grand-strategy
---
The paper proposes that thick models of value can distinguish enduring values from temporary preferences, which the authors argue enables normative reasoning across new domains. However, the paper provides neither a formal specification of such models nor empirical validation of this capability.
### Limitations
- The paper does not provide a formal specification of the models.
- There is no empirical validation of the proposed capability.
- The paternalism concern (who decides which preferences are "temporary"?) is noted but not connected to any existing KB claim that might challenge the premise.
### Challenged by
<!-- claim pending -->
### Related claims
- [[AI alignment is a coordination problem not a technical problem]]
- [[AI development is a critical juncture in institutional history]]