theseus: extract from 2025-12-00-fullstack-alignment-thick-models-value.md

- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
This commit is contained in:
Teleo Agents 2026-03-12 07:01:07 +00:00
parent ba4ac4a73e
commit 4dfe98112c
5 changed files with 113 additions and 1 deletions

View file

@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper (December 2025) extends the coordination-first thesis to institutions themselves, not just coordination between AI labs. It argues that 'beneficial societal outcomes cannot be guaranteed by aligning individual AI systems' alone and proposes concurrent alignment of both AI systems and the institutions that govern them. This is a stronger claim than lab-to-lab coordination: it requires institutional transformation alongside technical alignment. The paper proposes five implementation mechanisms spanning both technical (normatively competent agents) and institutional (democratic regulatory institutions) domains. This suggests that coordination problems exist not only between AI developers but between AI systems, developers, and institutional structures—a multi-level coordination challenge.
---
Relevant Notes:

View file

@ -13,6 +13,12 @@ AI development is creating precisely this kind of critical juncture. The mismatc
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (confirm)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper (December 2025) directly addresses this mismatch by proposing institutional co-alignment as a necessary component of AI alignment. The paper argues that the current moment requires not just aligning AI systems but transforming the institutions that govern them. It proposes five mechanisms including 'democratic regulatory institutions' as one pillar of full-stack alignment, explicitly recognizing that capability-governance mismatch creates both risk and opportunity for institutional transformation. The paper frames this as urgent: beneficial outcomes require simultaneous alignment of AI AND institutions, suggesting the window for institutional transformation is time-sensitive.
---
Relevant Notes:

View file

@ -0,0 +1,46 @@
---
type: claim
domain: ai-alignment
description: "Beneficial AI outcomes require simultaneously aligning both AI systems and the institutions that govern them rather than focusing on individual model alignment alone"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms, grand-strategy]
---
# AI alignment requires institutional co-alignment not just model alignment
The Full-Stack Alignment framework argues that alignment must operate at two levels simultaneously: AI systems AND the institutions that shape their development and deployment. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
**Full-stack alignment** is defined as the concurrent alignment of AI systems and institutions with what people value. The paper argues that focusing solely on model-level alignment (RLHF, constitutional AI, etc.) is insufficient because:
1. **Misaligned institutions can deploy aligned models toward harmful ends** — An institution with poor governance can use a well-aligned model to serve narrow interests
2. **Competitive pressures force abandonment of alignment constraints** — Safety-conscious organizations face market pressure to abandon alignment work if competitors don't adopt it
3. **Single-organization alignment cannot guarantee societal outcomes** — The paper's core claim: "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
The framework proposes five implementation mechanisms spanning both technical and institutional domains:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
This represents a stronger claim than coordination-focused alignment theories, which address coordination between AI labs but not the institutional structures themselves.
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces the framework and argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
- The paper's five proposed mechanisms explicitly span both technical (normatively competent agents) and institutional (democratic regulatory institutions) domains
- The framework directly addresses the failure mode of aligned-model-misaligned-institution
## Limitations
- The paper provides architectural ambition but may lack technical specificity for implementation
- No engagement with existing bridging-based mechanisms or formal impossibility results
- Early-stage proposal (December 2025) without empirical validation or case studies
- The paper does not provide formal definitions of what constitutes "institutional alignment"
---
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — this claim extends the coordination thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — directly relevant context
- [[safe AI development requires building alignment mechanisms before scaling capability]] — complementary timing constraint

View file

@ -0,0 +1,48 @@
---
type: claim
domain: ai-alignment
description: "Thick value models distinguish enduring values from temporary preferences and embed individual choices in social contexts, enabling normative reasoning that utility functions cannot capture"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences enabling normative reasoning across new domains
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. These models address a fundamental problem in AI alignment: the specification trap.
**What thick models do:**
1. **Distinguish enduring values from temporary preferences** — Separates what people say they want (preferences, often context-dependent and volatile) from what actually produces good outcomes (values, more stable and generalizable)
2. **Model individual choices within social contexts** — Recognizes that choices are not isolated but embedded in social structures, relationships, and institutional contexts
3. **Enable normative reasoning across new domains** — Allow systems to reason about values in contexts not explicitly covered by training data, rather than failing when encountering novel situations
**Why this matters for alignment:**
This contrasts with "thin" models (utility functions, preference orderings) that treat all preferences as equivalent and context-independent. Thin models fail because:
- They cannot distinguish signal (enduring values) from noise (temporary preferences)
- They assume preferences are stable across contexts when they are actually highly context-dependent
- They cannot generalize to novel domains because they have no principled way to reason about values beyond training data
Thick models formalize why specification-in-advance fails: human values have structure, hierarchy, and context-dependence that simple preference aggregation cannot capture.
## Evidence
- Full-Stack Alignment paper (December 2025) — introduces thick vs thin value models as a core component of the alignment framework
- The distinction between preferences (what people say they want) and values (what produces good outcomes) directly addresses the specification trap identified in existing alignment research
- The paper argues that thick models enable "normative reasoning across new domains" — a capability thin models lack
## Limitations and Open Questions
- No formal specification of what constitutes a "thick model" or how to implement one in practice
- Unclear how to operationalize the distinction between enduring values and temporary preferences in real systems
- Risk of paternalism: who decides which preferences are "temporary" vs which values are "enduring"? This could embed designer bias
- No empirical validation that thick models actually outperform thin models on alignment tasks
- The paper does not address how thick models handle genuinely conflicting values across populations
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] — thin models fail at diversity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — relevant to the paternalism concern

View file

@ -7,9 +7,15 @@ date: 2025-12-01
domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
format: paper
status: unprocessed
status: processed
priority: medium
tags: [full-stack-alignment, institutional-alignment, thick-values, normative-competence, co-alignment]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["ai-alignment-requires-institutional-co-alignment-not-just-model-alignment.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two novel claims from Full-Stack Alignment paper: (1) institutional co-alignment as necessary component of AI alignment, extending coordination thesis to institutions themselves, and (2) thick models of value as formalization of continuous value integration. Applied three enrichments to existing coordination and specification claims. Paper is architecturally ambitious but lacks technical specificity - claims rated experimental pending implementation details and empirical validation. No entity data in this source."
---
## Content