Compare commits
1 commit
2048d99547...6d9dc35f8a
| Author | SHA1 | Date |
|---|---|---|
|  | 6d9dc35f8a |  |
6 changed files with 57 additions and 44 deletions
@@ -25,7 +25,7 @@ Since [[the internet enabled global communication but not global cognition]], th
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper extends the coordination thesis to institutions themselves, not just coordination between AI labs. It argues that 'beneficial societal outcomes cannot be guaranteed by aligning individual AI systems' and proposes concurrent alignment of both AI systems and the institutions that govern them. This is a stronger version of the coordination-first approach—it claims institutions need structural alignment with human values, not just better coordination protocols between existing actors. The five implementation mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) are institutional structures, not coordination protocols.
The full-stack alignment framework extends the coordination-first thesis by arguing that coordination must occur not just between AI labs but between AI systems and the institutions that shape them. The paper proposes 'concurrent alignment of AI systems and institutions with what people value' and argues that 'beneficial societal outcomes cannot be guaranteed by aligning individual AI systems' alone. This suggests that even if AI labs coordinate successfully on model alignment, institutional misalignment could still produce harmful outcomes. The framework identifies five institutional coordination mechanisms: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
---
@@ -14,10 +14,10 @@ AI development is creating precisely this kind of critical juncture. The mismatc
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (extend)
### Additional Evidence (confirm)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment framework directly addresses the capability-governance mismatch by proposing institutional co-alignment as a solution. The paper argues that alignment cannot succeed through technical means alone and requires transforming the institutions that shape AI development. The five implementation mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions) are institutional structures designed to close the capability-governance gap by aligning institutions themselves with human values.
The full-stack alignment framework provides a concrete proposal for institutional transformation during this critical juncture. The paper argues that institutions themselves must be aligned alongside AI capabilities, proposing five implementation mechanisms including 'democratic regulatory institutions' and 'meaning-preserving economic mechanisms.' This confirms that the capability-governance mismatch creates not just risk but opportunity for institutional redesign. The framework treats institutional transformation as a necessary component of beneficial AI outcomes, not merely a constraint on AI development.
---
@@ -1,40 +1,45 @@
---
type: claim
domain: ai-alignment
description: "Beneficial AI outcomes require simultaneously aligning both AI systems and the institutions that govern them rather than focusing on individual model alignment alone"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025), arxiv.org/abs/2512.03399"
description: "Full-stack alignment requires concurrent alignment of both AI systems and institutions, not model alignment alone, because institutional misalignment can produce harmful outcomes even when individual systems are technically aligned"
confidence: experimental
source: "Multiple authors, 'Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value' (December 2025)"
created: 2026-03-11
secondary_domains: [mechanisms, grand-strategy]
---
# AI alignment requires institutional co-alignment, not just model alignment
# AI alignment requires institutional co-alignment not just model alignment because beneficial societal outcomes cannot be guaranteed by aligning individual AI systems alone
The Full-Stack Alignment framework proposes that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. The paper argues alignment must be comprehensive—addressing both AI systems and the institutions that shape their development and deployment simultaneously.
The full-stack alignment framework proposes that achieving beneficial AI outcomes requires concurrent alignment of both AI systems and the institutions that shape, deploy, and govern them. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
This extends beyond single-organization objectives to address misalignment across multiple stakeholders. The framework proposes "full-stack alignment" as the concurrent alignment of AI systems and institutions with what people value, reframing the problem from technical model alignment to system-level institutional coordination.
The paper argues that focusing solely on aligning individual AI models is insufficient because:
## Implementation Mechanisms
1. **Institutional context shapes deployment** — AI systems operate within institutional contexts that determine how they are deployed, governed, and scaled. Technical alignment at the model level does not constrain institutional choices about deployment.
The paper identifies five mechanisms for achieving full-stack alignment:
2. **Misalignment can occur at institutional level** — Even when individual systems are technically aligned with narrow objectives, institutions can misalign them through incentive structures, regulatory capture, or competing stakeholder interests. The paper explicitly states: "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone.
1. **AI value stewardship** — institutional structures for stewarding AI development
2. **Normatively competent agents** — AI systems capable of normative reasoning
3. **Win-win negotiation systems** — mechanisms for resolving stakeholder conflicts
4. **Meaning-preserving economic mechanisms** — economic structures that preserve human values
5. **Democratic regulatory institutions** — governance structures that embed democratic input
3. **Coordination problems across stakeholders** — Multiple stakeholders with competing interests create coordination problems that model-level alignment cannot solve. Full-stack alignment addresses this by proposing concurrent work on both AI capabilities and institutional structures.
## Relationship to Existing Alignment Work
The framework proposes five implementation mechanisms for institutional co-alignment:
- AI value stewardship
- Normatively competent agents
- Win-win negotiation systems
- Meaning-preserving economic mechanisms
- Democratic regulatory institutions
This represents a stronger claim than coordination-focused approaches that address AI lab coordination alone. Rather than improving coordination protocols between existing actors, full-stack alignment argues the institutions themselves require structural alignment with human values.
This reframes alignment from a purely technical problem (how to specify and optimize for human values in code) to a sociotechnical coordination challenge requiring simultaneous work on AI systems and the institutions that govern them.
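As a toy illustration of the claim's logical structure (a hedged sketch, not anything specified in the paper; every name below is invented for this note), model-level alignment can be treated as necessary but not sufficient for beneficial outcomes:

```python
# Illustrative sketch only: encodes the claim that aligning individual AI systems
# is necessary but not sufficient, because institutions can still misalign outcomes.
from dataclasses import dataclass

# The five institutional mechanisms named in the paper, listed for reference.
INSTITUTIONAL_MECHANISMS = [
    "AI value stewardship",
    "normatively competent agents",
    "win-win negotiation systems",
    "meaning-preserving economic mechanisms",
    "democratic regulatory institutions",
]

@dataclass
class AlignmentState:
    models_aligned: bool        # technical alignment of individual AI systems
    institutions_aligned: bool  # structural alignment of the governing institutions

def beneficial_outcomes_guaranteed(state: AlignmentState) -> bool:
    # Model alignment alone cannot guarantee beneficial outcomes: institutional
    # misalignment (incentives, regulatory capture) can still produce harm.
    return state.models_aligned and state.institutions_aligned

# Technically aligned models under misaligned institutions still fail the check.
assert not beneficial_outcomes_guaranteed(AlignmentState(True, False))
```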
## Evidence and Limitations
## Evidence
The paper provides architectural framing and mechanism proposals rather than empirical validation or formal proofs. Confidence is speculative because this is a December 2025 paper proposing a framework without implementation results, independent verification, or engagement with formal impossibility results. The paper is architecturally ambitious but lacks technical specificity in how thick value models would be operationalized or how institutional alignment would be measured.
From "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value" (December 2025): The paper explicitly defines full-stack alignment as "concurrent alignment of AI systems and institutions with what people value" and argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. The five implementation mechanisms are presented as concrete pathways for achieving this institutional co-alignment.
## Relationship to Existing Claims
This claim extends [[AI alignment is a coordination problem not a technical problem.md]] by arguing that coordination must occur not just between AI labs but between AI systems and the institutions that govern them. It also connects to [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md]] by proposing that institutions themselves must be transformed alongside AI capabilities.
---
**Related claims:**
- [[AI alignment is a coordination problem not a technical problem]] — this claim extends the coordination thesis to institutions themselves
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — institutional alignment directly addresses this capability-governance gap
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional co-alignment is proposed as one such mechanism
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem.md]] — extends this thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md]] — provides context for why institutional alignment matters
- [[safe AI development requires building alignment mechanisms before scaling capability.md]] — institutional alignment is one such mechanism
@@ -17,6 +17,12 @@ This converges with findings across at least five other research programs. Zeng'
The specification trap is why [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The 'thick models of value' concept provides a technical framework for addressing specification trap instability. The paper argues that thick models 'distinguish enduring values from temporary preferences' and 'enable normative reasoning across new domains,' which formalizes how values can remain stable across context shifts rather than becoming brittle when deployment diverges from training. Thick models embed values within social contexts rather than encoding them as fixed specifications, allowing values to adapt as contexts evolve. This extends the original claim by providing a concrete mechanism for how values can remain coherent despite deployment context divergence.
---
Relevant Notes:
@@ -1,36 +1,38 @@
---
type: claim
domain: ai-alignment
description: "Thick value models that distinguish enduring values from temporary preferences enable AI systems to reason normatively across new domains by embedding choices in social context"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025), arxiv.org/abs/2512.03399"
description: "Thick value models distinguish stable enduring values from context-dependent preferences and embed individual choices within social contexts, enabling normative reasoning across new domains"
confidence: experimental
source: "Multiple authors, 'Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value' (December 2025)"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning
# Thick models of value distinguish enduring values from temporary preferences and model how individual choices embed within social contexts which enables normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as a conceptual alternative to utility functions and preference orderings. These models are characterized by three properties:
The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for representing what people value. These models have three key properties:
1. **Distinguish enduring values from temporary preferences** — separating what people durably care about from momentary wants or revealed preferences
2. **Embed individual choices within social contexts** — recognizing that preferences are shaped by and dependent on social structures rather than being context-independent
3. **Enable normative reasoning across new domains** — allowing AI systems to generalize value judgments to novel situations beyond training data
1. **Distinguish enduring values from temporary preferences** — They separate stable, long-term values from context-dependent, momentary preferences. This distinction recognizes that what people say they want in a given moment (preferences) may differ from what actually produces good outcomes over time (values).
## Contrast with Thin Models
2. **Model social embeddedness** — They represent how individual choices are embedded within social contexts rather than treating preferences as purely individual atomic choices. Values are shaped by social context, cultural norms, and institutional structures.
Thin models (utility functions, preference orderings) treat all stated preferences as equally valid and assume context-independence. Thick models acknowledge that what people say they want (preferences) often diverges from what produces good outcomes (values), and that this divergence is systematic rather than random.
3. **Enable normative reasoning across new domains** — They support reasoning about values in new contexts and domains, not just optimization over fixed preferences. This allows alignment systems to extend to novel situations without requiring complete respecification of values.
## Limitations and Gaps
This approach contrasts with thin models (utility functions, preference orderings) that treat all stated preferences equally and assume values can be captured in a single scalar or ordering. Thin models assume preferences are stable and context-independent, which fails when contexts change or when preferences conflict with deeper values.
The paper does not provide formal definitions of thick value models, implementation details for how they would be operationalized in AI systems, or empirical validation. It remains a conceptual proposal for how alignment systems should represent human values. No engagement with existing preference learning literature (RLHF, DPO) or formal methods for value specification is provided.
The distinction maps to the difference between optimizing for stated preferences (which may be unstable or context-dependent) versus integrating values continuously as contexts evolve.
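To make the thin/thick contrast concrete, here is a minimal sketch under the assumption that a thick model keeps enduring values, context-dependent preferences, and social context as separate components; the paper stays at the conceptual level, so every name and weight below is illustrative rather than the paper's formalism:

```python
# Hedged sketch, not the paper's formalism: it only illustrates the thin/thick contrast.
from dataclasses import dataclass
from typing import Callable, Dict

# Thin model: a single context-independent utility over options.
ThinModel = Callable[[str], float]

@dataclass
class ThickValueModel:
    enduring_values: Dict[str, float]                    # stable, long-term values
    contextual_preferences: Dict[str, Dict[str, float]]  # momentary, context-dependent wants
    social_context: str                                  # norms and institutions shaping choices

    def evaluate(self, option: str, context: str) -> float:
        # Normative-reasoning sketch: the enduring value anchors the judgment and
        # carries it into contexts never seen before; the stated preference only
        # modulates it, so the score does not collapse when the context is new.
        value_score = self.enduring_values.get(option, 0.0)
        preference_score = self.contextual_preferences.get(context, {}).get(option, 0.0)
        return 0.8 * value_score + 0.2 * preference_score
```

The only point of the sketch is structural: a thin model collapses everything into one scalar per option, while a thick model keeps the components apart so the enduring-value term can generalize when deployment contexts shift.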
## Relationship to Continuous Value Integration
## Evidence
This concept formalizes the intuition that values should be continuously integrated into systems rather than specified once at training time. Rather than encoding values as fixed parameters, thick models would enable ongoing normative reasoning as deployment contexts evolve and new situations emerge.
From "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value" (December 2025), the paper explicitly defines thick models of value as having three properties: (1) "distinguish enduring values from temporary preferences", (2) "model how individual choices embed within social contexts", and (3) "enable normative reasoning across new domains." These are presented as contrasts with utility functions and preference orderings, which the paper characterizes as "thin" models that fail to capture the complexity of human values.
## Relationship to Existing Claims
This claim formalizes the intuition behind [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md]] by providing a technical framework for representing values that can evolve with context rather than becoming brittle. It also relates to [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values.md]] (if this claim exists) by offering an alternative to single-reward-function approaches. It extends [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md]] by providing a concrete mechanism for how co-shaped values can be represented and reasoned about.
---
**Related claims:**
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick models operationalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity by modeling context-dependence
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models address this by enabling context-dependent reasoning
Relevant Notes:
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md]] — thick models address this instability
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception.md]] — thick models acknowledge this complexity
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance.md]] — thick models formalize how co-shaped values can be represented
@@ -13,9 +13,9 @@ tags: [full-stack-alignment, institutional-alignment, thick-values, normative-co
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["ai-alignment-requires-institutional-co-alignment-not-just-model-alignment.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two novel claims from Full-Stack Alignment paper: (1) institutional co-alignment as necessary for beneficial AI outcomes, extending coordination thesis to institutions themselves, and (2) thick models of value as formalization of continuous value integration. Applied three enrichments to existing coordination and continuous-alignment claims. Paper is architecturally ambitious but lacks technical specificity or empirical validation—confidence levels reflect this (experimental for institutional co-alignment, speculative for thick value models). No engagement with RLHF/bridging mechanisms or formal impossibility results as curator noted."
extraction_notes: "Extracted two novel claims from full-stack alignment paper: (1) institutional co-alignment requirement, (2) thick models of value framework. Applied three enrichments to existing coordination and value-integration claims. Paper is architecturally ambitious but lacks technical specificity - no formal results or engagement with RLHF/bridging mechanisms. Confidence rated experimental due to single-source theoretical proposal without empirical validation."
---
## Content