theseus: extract claims from 2025-12-00-fullstack-alignment-thick-models-value.md

- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
Teleo Agents 2026-03-11 06:53:41 +00:00
parent 206f2e5800
commit 22cc3f57fb
6 changed files with 124 additions and 1 deletions

@@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Full-stack alignment extends the coordination thesis from lab-to-lab coordination to the institutional level. The framework argues that beneficial outcomes require concurrent alignment of AI systems AND the institutions that govern them (regulatory bodies, economic mechanisms, democratic processes). This is a stronger institutional claim: not just that AI labs must coordinate with each other, but that the institutions themselves must be redesigned and aligned alongside AI systems. The paper proposes five implementation mechanisms, including democratic regulatory institutions and meaning-preserving economic mechanisms, as part of the coordination infrastructure.
---
Relevant Notes:

@@ -13,6 +13,12 @@ AI development is creating precisely this kind of critical juncture. The mismatc
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (confirm)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
The full-stack alignment framework explicitly frames current AI development as requiring institutional transformation, not just technical alignment. The paper argues that existing institutions are misaligned with AI capabilities and proposes concurrent redesign of both AI systems and governing institutions. This confirms the critical juncture thesis and provides a specific framework (full-stack alignment with five implementation mechanisms) for navigating the transformation window.
---
Relevant Notes:

@@ -0,0 +1,47 @@
---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
description: "Full-stack alignment requires concurrent alignment of AI systems and governing institutions with thick models of value, not just individual model alignment"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025)"
created: 2026-03-11
enrichments:
- "AI alignment is a coordination problem not a technical problem"
- "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation"
---
# Beneficial AI outcomes require institutional co-alignment not just model alignment
The full-stack alignment framework argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Instead, beneficial outcomes require concurrently aligning BOTH AI systems and the institutions that shape their development and deployment.
This extends beyond single-organization coordination (lab-to-lab alignment) to address misalignment across multiple stakeholders at the institutional level. The framework proposes five implementation mechanisms: (1) AI value stewardship, (2) normatively competent agents, (3) win-win negotiation systems, (4) meaning-preserving economic mechanisms, and (5) democratic regulatory institutions.
The key distinction: coordination-first alignment theories address how AI labs coordinate with each other. Full-stack alignment asserts that regulatory bodies, economic mechanisms, and democratic processes themselves—the institutions that govern AI development—must be redesigned and aligned alongside the AI systems. This is a stronger institutional claim than lab-level coordination.
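As a rough sketch of what this layering implies (the class and mechanism names below are this note's own illustration, not a formalism from the paper), the co-alignment condition is a property of the whole stack rather than of any single AI system:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Mechanism(Enum):
    """The five implementation mechanisms named by the paper."""
    VALUE_STEWARDSHIP = auto()
    NORMATIVELY_COMPETENT_AGENTS = auto()
    WIN_WIN_NEGOTIATION = auto()
    MEANING_PRESERVING_ECONOMICS = auto()
    DEMOCRATIC_REGULATION = auto()


@dataclass
class Layer:
    """One layer of the stack: an AI system or a governing institution."""
    name: str
    aligned: bool = False


@dataclass
class FullStack:
    """Co-alignment is a predicate over the whole stack, not one layer."""
    ai_systems: list[Layer] = field(default_factory=list)
    institutions: list[Layer] = field(default_factory=list)
    mechanisms: set[Mechanism] = field(default_factory=set)

    def co_aligned(self) -> bool:
        # Aligning every AI system is necessary but not sufficient;
        # the institutions must be aligned concurrently.
        return (all(layer.aligned for layer in self.ai_systems)
                and all(layer.aligned for layer in self.institutions))
```

The sketch only makes the claim's structure visible: `co_aligned()` cannot be made true by the `ai_systems` list alone.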
## Evidence
The paper frames this as an architectural framework rather than an empirically validated approach. The five implementation mechanisms are proposed but lack formal specification or deployment evidence. The paper does not provide impossibility results or comparative analysis against alternative institutional designs.
## Challenges
The framework does not specify how to operationalize institutional alignment in practice, nor does it address:
- How to coordinate institutional redesign across jurisdictions with conflicting interests
- Whether institutional change can operate on timescales matching AI capability development
- How to handle irreducible value disagreements between institutions
- Computational tractability of the proposed mechanisms at scale
The simultaneous co-alignment requirement may be intractable if institutions and AI systems operate on fundamentally different timescales.
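A toy model of that timescale worry, with invented rates purely for illustration: if misalignment pressure accumulates every AI capability cycle but institutions can only correct it on a much slower cycle, the residual compounds rather than closes.

```python
def residual_misalignment(drift_per_ai_cycle: float,
                          correction_per_inst_cycle: float,
                          ai_cycles_per_inst_cycle: int) -> float:
    """Net misalignment accumulated over one institutional cycle.

    A positive result means institutional correction cannot keep up,
    so the gap grows across repeated cycles.
    """
    return drift_per_ai_cycle * ai_cycles_per_inst_cycle - correction_per_inst_cycle


# Assumed figures: ~6-month capability cycles vs. a ~3-year regulatory cycle.
print(residual_misalignment(drift_per_ai_cycle=1.0,
                            correction_per_inst_cycle=4.0,
                            ai_cycles_per_inst_cycle=6))  # 2.0 per institutional cycle
```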
---
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — full-stack alignment extends coordination thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — provides urgency context
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional mechanisms are prerequisite
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — institutional alignment must handle value pluralism
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]
- [[core/grand-strategy/_map]]

@@ -21,6 +21,12 @@ This phased approach is also a practical response to the observation that since
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Full-stack alignment argues that institutional alignment mechanisms must be built concurrently with AI capability development, not sequentially. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed alongside technical capabilities. This extends the 'mechanisms before scaling' thesis to include institutional mechanisms, not just technical ones.
---
Relevant Notes:

@@ -0,0 +1,52 @@
---
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025)"
created: 2026-03-11
enrichments:
- "the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance"
- "specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception"
---
# Thick models of value distinguish enduring values from temporary preferences enabling normative competence
The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for AI alignment. The framework distinguishes three dimensions:
1. **Enduring vs. temporary**: Stable values (what people consistently care about across contexts) vs. temporary preferences (what people want in specific moments)
2. **Social embedding**: Individual choices modeled within social contexts rather than as atomized preferences
3. **Normative reasoning**: AI systems that reason about values across new domains rather than simply optimizing pre-specified objectives
The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals.
This concept formalizes the distinction between what people say they want in the moment (stated, temporary preferences) and what they consistently care about across contexts (enduring values). It proposes continuous value integration rather than advance specification of objectives.
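One way to picture the contrast with utility functions, as a minimal sketch: every name and field here is hypothetical, invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TemporaryPreference:
    """What someone wants in a specific moment and context."""
    description: str
    context: str


@dataclass
class EnduringValue:
    """What someone consistently cares about across contexts."""
    description: str
    observed_contexts: list[str] = field(default_factory=list)


@dataclass
class ThickValueModel:
    """Unlike a scalar reward, values keep their structure and social setting."""
    holder: str
    enduring_values: list[EnduringValue] = field(default_factory=list)
    temporary_preferences: list[TemporaryPreference] = field(default_factory=list)
    social_ties: list[str] = field(default_factory=list)  # whose choices this person's are embedded with

    def scalar_reward(self) -> float:
        # Deliberately unimplemented: the framework's point is that collapsing
        # this structure to one number loses the information normative
        # reasoning needs.
        raise NotImplementedError
```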
## Evidence
The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists, and the computational requirements for modeling social context and distinguishing enduring from temporary values at scale are unspecified.
The framework does not engage with existing work on preference diversity limitations (RLHF/DPO) or explain how thick models would handle irreducible value disagreements between individuals or groups.
## Challenges
**Stability assumption**: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages, cultural shifts, or technological change.
**Computational explosion**: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address (see the sketch after this list).
**Irreducible disagreement**: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences.
**Operationalization gap**: The paper does not provide concrete methods for extracting or representing thick models from human behavior or reasoning.
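A back-of-envelope illustration of the computational-explosion worry (the population figures are assumptions for illustration, not from the paper): even a static count of pairwise social ties grows quadratically, before any dynamics are modeled.

```python
def pairwise_ties(n_people: int) -> int:
    """Number of undirected pairwise relationships in a fully connected social graph."""
    return n_people * (n_people - 1) // 2


for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} people -> {pairwise_ties(n):,} potential ties")
# Real social graphs are sparse, but modeling their *dynamics* (how each
# choice shifts others' values over time) multiplies this again per timestep.
```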
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — thick models must handle value pluralism
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]

@@ -7,9 +7,15 @@ date: 2025-12-01
domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
format: paper
-status: unprocessed
+status: processed
priority: medium
tags: [full-stack-alignment, institutional-alignment, thick-values, normative-competence, co-alignment]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["beneficial-ai-outcomes-require-institutional-co-alignment-not-just-model-alignment.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-competence.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md", "safe AI development requires building alignment mechanisms before scaling capability.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two novel claims: (1) institutional co-alignment requirement and (2) thick models of value. Both rated speculative due to lack of empirical validation. Three enrichments extend existing coordination and alignment claims. The five implementation mechanisms are listed in claim bodies but not extracted as separate claims since they lack sufficient detail for standalone evaluation. Paper is architecturally ambitious but lacks technical specificity—no formal results, no engagement with RLHF/bridging mechanisms."
---
## Content