Compare commits

1 commit

Author: Teleo Agents
SHA1: 4dfe98112c
Date: 2026-03-12 07:01:07 +00:00

theseus: extract from 2025-12-00-fullstack-alignment-thick-models-value.md

- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
7 changed files with 104 additions and 78 deletions

@@ -25,7 +25,7 @@ Since [[the internet enabled global communication but not global cognition]], th
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper (December 2025) extends the coordination-first thesis from inter-lab coordination to institutional co-alignment. It argues that 'beneficial societal outcomes cannot be guaranteed by aligning individual AI systems' alone and proposes concurrent alignment of AI systems and the institutions that shape them. This is a stronger claim than lab-to-lab coordination: it requires institutional transformation alongside technical alignment. Five implementation mechanisms are proposed, spanning both technical and institutional domains: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions. The coordination problem thus operates at multiple levels: between AI labs (the existing thesis), between AI systems and institutions (new), and between institutional stakeholders (implicit).
---

@@ -13,6 +13,12 @@ AI development is creating precisely this kind of critical juncture. The mismatc
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (confirm)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper (December 2025) directly addresses this mismatch by proposing institutional co-alignment as a necessary component of AI alignment. The paper argues that the current moment requires not just aligning AI systems but transforming the institutions that govern them. It proposes five mechanisms including 'democratic regulatory institutions' as one pillar of full-stack alignment, explicitly recognizing that capability-governance mismatch creates both risk and opportunity for institutional transformation. The paper frames this as urgent: beneficial outcomes require simultaneous alignment of AI AND institutions, suggesting the window for institutional transformation is time-sensitive.
---
Relevant Notes:

@@ -0,0 +1,46 @@
---
type: claim
domain: ai-alignment
description: "Beneficial AI outcomes require simultaneously aligning both AI systems and the institutions that govern them rather than focusing on individual model alignment alone"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms, grand-strategy]
---
# AI alignment requires institutional co-alignment not just model alignment
The Full-Stack Alignment framework argues that alignment must operate at two levels simultaneously: AI systems AND the institutions that shape their development and deployment. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
**Full-stack alignment** is defined as the concurrent alignment of AI systems and institutions with what people value. The paper argues that focusing solely on model-level alignment (RLHF, constitutional AI, etc.) is insufficient because:
1. **Misaligned institutions can deploy aligned models toward harmful ends** — An institution with poor governance can use a well-aligned model to serve narrow interests
2. **Competitive pressures force abandonment of alignment constraints** — Safety-conscious organizations face market pressure to drop alignment work that competitors do not share (see the sketch after this list)
3. **Single-organization alignment cannot guarantee societal outcomes** — The paper's core claim: "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
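A minimal sketch of the second failure mode as a two-lab payoff game. The payoffs below are illustrative assumptions, not numbers from the paper; the point is only that cutting alignment work dominates until an institutional mechanism changes the payoff structure.

```python
# Illustrative "alignment tax" game between two labs. Payoffs are
# hypothetical; they encode the claim that cutting alignment work
# wins market share unless institutions penalize it.

PAYOFFS = {  # (my_choice, their_choice) -> my_payoff
    ("align", "align"): 3,  # both maintain constraints, stable market
    ("align", "cut"):   0,  # safety-conscious lab loses the market
    ("cut",   "align"): 5,  # defector captures the market
    ("cut",   "cut"):   1,  # race to the bottom
}

def best_response(their_choice: str, penalty: float = 0.0) -> str:
    """Best move for one lab; `penalty` stands in for an institutional
    mechanism (e.g. democratic regulation) taxing the 'cut' choice."""
    align = PAYOFFS[("align", their_choice)]
    cut = PAYOFFS[("cut", their_choice)] - penalty
    return "align" if align >= cut else "cut"

# Without institutional co-alignment, cutting dominates either way:
assert best_response("align") == "cut" and best_response("cut") == "cut"
# A sufficiently strong institutional penalty flips the equilibrium:
assert best_response("align", penalty=3.0) == "align"
```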
The framework proposes five implementation mechanisms spanning both technical and institutional domains:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
This represents a stronger claim than coordination-focused alignment theories, which address coordination between AI labs but not the institutional structures themselves.
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces the framework and argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
- The paper's five proposed mechanisms explicitly span both technical (normatively competent agents) and institutional (democratic regulatory institutions) domains
- The framework directly addresses the failure mode of aligned-model-misaligned-institution
## Limitations
- The paper is architecturally ambitious but offers little technical specificity for implementation
- No engagement with existing bridging-based mechanisms or formal impossibility results
- Early-stage proposal (December 2025) without empirical validation or case studies
- The paper does not provide formal definitions of what constitutes "institutional alignment"
---
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — this claim extends the coordination thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — directly relevant context
- [[safe AI development requires building alignment mechanisms before scaling capability]] — complementary timing constraint

@@ -1,39 +0,0 @@
---
type: claim
domain: ai-alignment
description: "Full-stack alignment requires concurrent co-alignment of AI systems and institutions, not model alignment alone"
confidence: experimental
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arxiv.org/abs/2512.03399), December 2025"
created: 2026-03-11
secondary_domains: [mechanisms, grand-strategy]
---
# Beneficial AI outcomes require concurrent alignment of systems and institutions, not model alignment alone
The full-stack alignment framework argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" in isolation. Alignment must address both AI systems AND the institutions that shape their development and deployment. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
The paper proposes five implementation mechanisms for institutional co-alignment:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
The core argument: even perfectly aligned individual AI systems can produce harmful outcomes through misaligned deployment contexts, competitive dynamics between organizations, or governance failures at the institutional level. Alignment is therefore a system-level coordination problem where institutional structures must co-evolve with AI capabilities.
## Evidence
The paper provides architectural reasoning grounded in the observation that institutional incentives often conflict with individual system alignment. However, the framework lacks empirical validation—no deployment data, no formal verification, and no engagement with existing technical alignment approaches (RLHF, constitutional AI, bridging-based mechanisms). The five mechanisms are proposed as necessary but remain underspecified technically.
This is a conceptually ambitious framework from a recent paper (December 2025) that extends rather than replaces existing alignment work.
---
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — full-stack alignment extends this thesis from inter-lab coordination to AI-institution co-alignment
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — directly addresses the institutional governance gap
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional alignment is proposed as one such mechanism
- [[superorganism organization extends effective lifespan substantially at each organizational level which means civilizational intelligence operates on temporal horizons that individual-preference alignment cannot serve]] — related argument about system-level vs individual alignment
Topics:
- [[domains/ai-alignment/_map]]

@@ -1,35 +0,0 @@
---
type: claim
domain: ai-alignment
description: "Thick value models distinguish stable enduring values from context-dependent preferences, enabling normative reasoning in novel domains"
confidence: experimental
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arxiv.org/abs/2512.03399), December 2025"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning across contexts
Thick models of value provide an alternative to utility functions and preference orderings by:
- Distinguishing enduring values (stable commitments) from temporary preferences (context-dependent wants)
- Modeling how individual choices embed within social contexts rather than treating preferences as atomic
- Enabling normative reasoning—determining what *should* happen rather than merely what humans *say they want*—by grounding decisions in stable values
This framework addresses a core limitation of preference-based alignment: preferences are unstable and context-dependent ("I prefer coffee today"), while values represent deeper commitments that persist across situations ("I value autonomy"). A thick model can distinguish these and reason about which should guide AI behavior in novel contexts where humans haven't specified preferences.
## Evidence
The paper introduces thick models conceptually but provides no implementation details, training procedures, or empirical validation. The distinction between values and preferences is philosophically motivated but lacks operationalization—no comparison with existing alignment approaches (RLHF, constitutional AI, DPO) is provided, and no experiments demonstrate that thick models actually improve alignment outcomes.
This is a conceptual contribution from a recent paper that formalizes an intuition about value stability but remains unvalidated technically.
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick models formalize continuous value integration by distinguishing stable values from momentary preferences
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models propose addressing this by modeling value stability explicitly
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity by refusing to specify values completely in advance
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]

@@ -0,0 +1,48 @@
---
type: claim
domain: ai-alignment
description: "Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences enabling normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment: the specification trap.
**What thick models do:**
1. **Distinguish enduring values from temporary preferences** — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
2. **Model individual choices within social contexts** — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
3. **Enable normative reasoning across new domains** — Allow systems to reason about values in contexts not explicitly covered by training data, rather than failing when encountering novel situations
**Why this matters for alignment:**
This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:
- They cannot distinguish signal (enduring values) from noise (temporary preferences)
- They assume preferences are stable across contexts when they are actually highly context-dependent
- They cannot generalize to novel domains because they have no principled way to reason about values beyond training data
Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.
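The paper gives no implementation, so the following is a data-structure sketch only: the names and fields are assumptions introduced here to make the thin/thick contrast concrete.

```python
from dataclasses import dataclass, field

@dataclass
class ThinPreference:
    """Thin model: a bare option-utility pair with no context and no
    way to tell an enduring commitment from a passing want."""
    option: str
    utility: float

@dataclass
class ThickValue:
    """Thick model: an enduring value plus its context-dependent
    expressions, so the stable part is represented explicitly."""
    value: str                                   # e.g. "autonomy"
    expressions: dict[str, str] = field(default_factory=dict)
    # context -> expression, e.g. {"healthcare": "informed consent"}

    def generalize(self, new_context: str) -> str:
        """In a novel context the enduring value still constrains
        behavior, even though no preference was stated for it."""
        return f"in {new_context}: act to preserve {self.value}"

autonomy = ThickValue("autonomy", {"healthcare": "informed consent",
                                   "work": "control over one's schedule"})
print(autonomy.generalize("AI-mediated negotiation"))
# A thin model has no analogous move: absent a stated utility for the
# new context, it has nothing principled to optimize.
```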
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
- The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
- The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack
## Limitations and Open Questions
- No formal specification of what constitutes a "thick model" or how to implement one in practice
- Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
- Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
- No empirical validation that thick models actually outperform thin models on alignment tasks
- The paper does not address how thick models handle genuinely conflicting values across populations
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — thin models fail at diversity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — relevant to the paternalism concern

@@ -12,10 +12,10 @@ priority: medium
tags: [full-stack-alignment, institutional-alignment, thick-values, normative-competence, co-alignment]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["beneficial-ai-outcomes-require-concurrent-alignment-of-systems-and-institutions-not-model-alignment-alone.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning-across-contexts.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md"]
claims_extracted: ["ai-alignment-requires-institutional-co-alignment-not-just-model-alignment.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two experimental claims from architecturally ambitious but technically underspecified paper. Full-stack alignment extends coordination thesis to institutions. Thick models of value formalize continuous value integration. No empirical validation or implementation details provided. Paper lacks engagement with existing alignment approaches (RLHF, constitutional AI, etc.). Both claims rated experimental due to single-source conceptual framework without deployment data."
extraction_notes: "Extracted two novel claims from Full-Stack Alignment paper: (1) institutional co-alignment as necessary component of AI alignment, extending coordination thesis to institutions themselves, and (2) thick models of value as formalization of continuous value integration. Applied three enrichments to existing coordination and specification claims. Paper is architecturally ambitious but lacks technical specificity - claims rated experimental pending implementation details and empirical validation. No entity data in this source."
---
## Content