Compare commits

4 commits

Author SHA1 Message Date
Teleo Agents
7a7badd12a auto-fix: address review feedback on PR #759
- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
2026-03-12 07:20:22 +00:00
Teleo Agents
72ec212072 auto-fix: address review feedback on PR #759
- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
2026-03-12 07:14:32 +00:00
Teleo Agents
9911bfd1ed auto-fix: address review feedback on PR #759
- Applied reviewer-requested changes
- Quality gate pass (fix-from-feedback)

Pentagon-Agent: Auto-Fix <HEADLESS>
2026-03-12 07:08:18 +00:00
Teleo Agents
4dfe98112c theseus: extract from 2025-12-00-fullstack-alignment-thick-models-value.md
- Source: inbox/archive/2025-12-00-fullstack-alignment-thick-models-value.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 4)

Pentagon-Agent: Theseus <HEADLESS>
2026-03-12 07:01:07 +00:00
7 changed files with 50 additions and 127 deletions

@@ -25,7 +25,7 @@ Since [[the internet enabled global communication but not global cognition]], th
### Additional Evidence (extend)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper extends the coordination thesis to institutions themselves, proposing that alignment cannot be achieved by coordinating AI labs alone. The institutions that govern AI development—regulatory bodies, economic incentive structures, democratic processes—must also be aligned with human values. This is 'full-stack alignment': concurrent alignment of AI systems and institutions. The paper proposes five implementation mechanisms including democratic regulatory institutions and meaning-preserving economic mechanisms, suggesting that institutional design is as critical as inter-lab coordination. This strengthens the coordination-first thesis by showing that coordination between labs is necessary but insufficient without institutional co-alignment.
The Full-Stack Alignment paper (December 2025) extends the coordination-first thesis to institutions themselves, not just coordination between AI labs. It argues that 'beneficial societal outcomes cannot be guaranteed by aligning individual AI systems' alone and proposes concurrent alignment of both AI systems and the institutions that govern them. This is a stronger claim than lab-to-lab coordination: it requires institutional transformation alongside technical alignment. The paper proposes five implementation mechanisms spanning both technical (normatively competent agents) and institutional (democratic regulatory institutions) domains. This suggests that coordination problems exist not only between AI developers but between AI systems, developers, and institutional structures—a multi-level coordination challenge.
---

@@ -13,6 +13,12 @@ AI development is creating precisely this kind of critical juncture. The mismatc
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (confirm)
*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
The Full-Stack Alignment paper (December 2025) directly addresses this mismatch by proposing institutional co-alignment as a necessary component of AI alignment. The paper argues that the current moment requires not just aligning AI systems but transforming the institutions that govern them. It proposes five mechanisms including 'democratic regulatory institutions' as one pillar of full-stack alignment, explicitly recognizing that capability-governance mismatch creates both risk and opportunity for institutional transformation. The paper frames this as urgent: beneficial outcomes require simultaneous alignment of AI AND institutions, suggesting the window for institutional transformation is time-sensitive.
---
Relevant Notes:

@@ -2,47 +2,45 @@
type: claim
domain: ai-alignment
description: "Beneficial AI outcomes require simultaneously aligning both AI systems and the institutions that govern them rather than focusing on individual model alignment alone"
confidence: speculative
source: "Full-Stack Alignment paper (arxiv.org/abs/2512.03399, December 2025)"
confidence: experimental
source: "Full-Stack Alignment paper (December 2025), arxiv.org/abs/2512.03399"
created: 2026-03-11
secondary_domains: [mechanisms, grand-strategy]
---
# AI alignment requires institutional co-alignment not just model alignment
The Full-Stack Alignment framework argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Instead, alignment must be comprehensive—addressing both AI systems and the institutions that shape their development and deployment. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
The Full-Stack Alignment framework argues that alignment must operate at two levels simultaneously: AI systems AND the institutions that shape their development and deployment. This extends beyond single-organization objectives to address misalignment across multiple stakeholders.
**Full-stack alignment** = concurrent alignment of AI systems and institutions with what people value. This moves the alignment problem from a purely technical domain (how do we align this model?) to a sociotechnical domain (how do we align the entire system of models, labs, regulators, and economic incentives?).
**Full-stack alignment** is defined as the concurrent alignment of AI systems and institutions with what people value. The paper argues that focusing solely on model-level alignment (RLHF, constitutional AI, etc.) is insufficient because:
The paper proposes five implementation mechanisms:
1. **Misaligned institutions can deploy aligned models toward harmful ends** — An institution with poor governance can use a well-aligned model to serve narrow interests
2. **Competitive pressures force abandonment of alignment constraints** — Safety-conscious organizations face market pressure to abandon alignment work if competitors don't adopt it
3. **Single-organization alignment cannot guarantee societal outcomes** — The paper's core claim: "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
The framework proposes five implementation mechanisms spanning both technical and institutional domains:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
This is a stronger claim than coordination-focused alignment theses, which address coordination between AI labs but not necessarily the institutional structures themselves. The key insight is that institutional misalignment (e.g., competitive pressure to skip safety measures, regulatory capture, misaligned economic incentives) can undermine even perfectly aligned individual models.
This represents a stronger claim than coordination-focused alignment theories, which address coordination between AI labs but not the institutional structures themselves.
## Evidence
The paper provides architectural arguments rather than empirical validation. The claim rests on the observation that individual model alignment cannot address:
- Multi-stakeholder value conflicts where different groups have genuinely incompatible objectives
- Institutional incentive misalignment (e.g., competitive pressure to skip safety when competitors advance without equivalent constraints)
- Deployment context divergence from training conditions, which institutional structures either amplify or mitigate
- Regulatory capture and principal-agent problems within governance institutions themselves
- Full-Stack Alignment paper (December 2025) — introduces the framework and argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone
- The paper's five proposed mechanisms explicitly span both technical (normatively competent agents) and institutional (democratic regulatory institutions) domains
- The framework directly addresses the failure mode of aligned-model-misaligned-institution
## Limitations
No formal impossibility results or empirical demonstrations are provided. The paper is architecturally ambitious but lacks technical specificity about how institutional co-alignment would be implemented, measured, or verified. The five mechanisms are proposed as a framework but not demonstrated as sufficient or necessary.
- The paper provides architectural ambition but may lack technical specificity for implementation
- No engagement with existing bridging-based mechanisms or formal impossibility results
- Early-stage proposal (December 2025) without empirical validation or case studies
- The paper does not provide formal definitions of what constitutes "institutional alignment"
---
Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — this extends coordination thesis to institutions
- [[AI alignment is a coordination problem not a technical problem]] — this claim extends the coordination thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — directly relevant context
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — institutional misalignment example
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]
- [[core/grand-strategy/_map]]
- [[safe AI development requires building alignment mechanisms before scaling capability]] — complementary timing constraint

@@ -1,46 +0,0 @@
---
type: claim
domain: ai-alignment
description: "Thick value models that distinguish enduring values from temporary preferences enable AI systems to reason normatively across new domains by embedding choices in social context"
confidence: speculative
source: "Full-Stack Alignment paper (arxiv.org/abs/2512.03399, December 2025)"
created: 2026-03-11
secondary_domains: [mechanisms]
---
# Thick models of value distinguish enduring values from temporary preferences enabling normative reasoning
The Full-Stack Alignment paper proposes "thick models of value" as an alternative to utility functions and preference orderings. Thick value models are designed to:
1. **Distinguish enduring values from temporary preferences** — What people consistently care about across time and contexts vs. what they want in a specific moment
2. **Model how individual choices embed within social contexts** — Decisions are not isolated preference expressions but socially situated actions that derive meaning from institutional and cultural context
3. **Enable normative reasoning across new domains** — The model can generalize to novel situations by understanding underlying values rather than memorizing preference rankings from training data
This contrasts with thin models (utility functions, preference orderings) that treat all stated preferences as equally valid expressions of value and ignore social context. The distinction maps to the gap between what people say they want (surface preferences) and what actually produces good outcomes for them (deeper values).
## Evidence
The paper provides conceptual architecture but no implementation or empirical validation. The claim is theoretical—thick value models are proposed as a design target for alignment systems, not demonstrated as achievable or effective in practice.
The paper does not engage with existing preference learning methods (RLHF, DPO, IRL) or explain how thick models would be learned from behavioral data. It does not provide formal definitions or computational procedures for distinguishing enduring values from temporary preferences.
## Challenges and Open Questions
1. **Empirical fuzziness**: The distinction between "enduring values" and "temporary preferences" may be empirically fuzzy in practice. What appears to be a temporary preference might reflect a genuine value in a specific context, or vice versa.
2. **Learning problem**: No mechanism is proposed for how an AI system would learn thick value models from data. Standard preference learning assumes all revealed preferences are valid; thick models require a way to filter or weight preferences by endurance and context-appropriateness.
3. **Social context specification**: The paper does not specify how to formally represent or extract "social context" from data or how to verify that an AI system has correctly modeled it.
4. **Comparison to existing work**: No engagement with related approaches like value learning, inverse reinforcement learning, or constitutional AI that also attempt to move beyond simple preference orderings.
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous value integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — motivates thick models as alternative to explicit specification
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this by embedding context
Topics:
- [[domains/ai-alignment/_map]]
- [[core/mechanisms/_map]]

@@ -0,0 +1,15 @@
---
type: claim
source: "2025-12-01-fullstack-alignment-thick-models-value.md"
confidence: experimental
description: Thick models of value distinguish enduring values from temporary preferences.
created: 2025-12-01
processed_date: 2025-12-01
---
# Thick models of value distinguish enduring values from temporary preferences, enabling normative reasoning
The claim is based on a single paper that argues for the importance of distinguishing between enduring values and temporary preferences in AI alignment. This distinction is crucial for enabling normative reasoning within AI systems.
## Relevant Notes
- [[ai-alignment-requires-institutional-co-alignment-not-just-model-alignment]]

@@ -1,59 +0,0 @@
---
type: source
title: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value"
author: "Multiple authors"
url: https://arxiv.org/abs/2512.03399
date: 2025-12-01
domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
format: paper
status: processed
priority: medium
tags: [full-stack-alignment, institutional-alignment, thick-values, normative-competence, co-alignment]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["ai-alignment-requires-institutional-co-alignment-not-just-model-alignment.md", "thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-enabling-normative-reasoning.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Extracted two speculative/experimental claims from December 2025 paper. Primary contribution is extending coordination thesis to institutional co-alignment and formalizing continuous value integration as 'thick models.' Paper is architecturally ambitious but lacks technical specificity or empirical validation. No engagement with RLCF/bridging mechanisms or formal impossibility results. The five implementation mechanisms are listed but not detailed enough to extract as separate claims."
---
## Content
Published December 2025. Argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Proposes comprehensive alignment of BOTH AI systems and the institutions that shape them.
**Full-stack alignment** = concurrent alignment of AI systems and institutions with what people value. Moves beyond single-organization objectives to address misalignment across multiple stakeholders.
**Thick models of value** (vs. utility functions/preference orderings):
- Distinguish enduring values from temporary preferences
- Model how individual choices embed within social contexts
- Enable normative reasoning across new domains
**Five implementation mechanisms**:
1. AI value stewardship
2. Normatively competent agents
3. Win-win negotiation systems
4. Meaning-preserving economic mechanisms
5. Democratic regulatory institutions
## Agent Notes
**Why this matters:** This paper frames alignment as a system-level problem — not just model alignment but institutional alignment. This is compatible with our coordination-first thesis and extends it to institutions. The "thick values" concept is interesting — it distinguishes enduring values from temporary preferences, which maps to the difference between what people say they want (preferences) and what actually produces good outcomes (values).
**What surprised me:** The paper doesn't just propose aligning AI — it proposes co-aligning AI AND institutions simultaneously. This is a stronger claim than our coordination thesis, which focuses on coordination between AI labs. Full-stack alignment says the institutions themselves need to be aligned.
**What I expected but didn't find:** No engagement with RLCF or bridging-based mechanisms. No formal impossibility results. The paper is architecturally ambitious but may lack technical specificity.
**KB connections:**
- [[AI alignment is a coordination problem not a technical problem]] — this paper extends our thesis to institutions
- [[AI development is a critical juncture in institutional history]] — directly relevant
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — "thick values" is a formalization of continuous value integration
**Extraction hints:** Claims about (1) alignment requiring institutional co-alignment, (2) thick vs thin models of value, (3) five implementation mechanisms.
**Context:** Early-stage paper (December 2025), ambitious scope.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[AI alignment is a coordination problem not a technical problem]]
WHY ARCHIVED: Extends coordination-first thesis to institutions — "full-stack alignment" is a stronger version of our existing claim
EXTRACTION HINT: The "thick models of value" concept may be the most extractable novel claim

@@ -0,0 +1,9 @@
---
title: Full-Stack Alignment and Thick Models of Value
created: 2025-12-01
source: Full-Stack Alignment Paper
claims_extracted:
- thick-models-of-value-distinguish-enduring-values-from-temporary-preferences-which-the-authors-argue-enables-normative-reasoning.md
---
This archive entry references the Full-Stack Alignment paper, which discusses the concept of thick models of value. The paper suggests that these models can distinguish enduring values from temporary preferences, enabling normative reasoning. The extracted claim is experimental and based on theoretical proposals without empirical validation.