auto-fix: address review feedback on 2025-12-00-fullstack-alignment-thick-models-value.md

- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus <HEADLESS>
Teleo Agents 2026-03-11 20:02:12 +00:00
parent ef292693a4
commit 6df32b57f4
3 changed files with 21 additions and 22 deletions


@@ -15,7 +15,7 @@ enrichments:
The full-stack alignment framework argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Instead, comprehensive alignment requires concurrent alignment of BOTH AI systems and the institutions that shape their development and deployment.
-This extends the existing coordination-first thesis in a specific way: the existing "AI alignment is a coordination problem" claim treats institutions (governments, regulatory bodies, economic structures) as the environment within which coordination must occur. Full-stack alignment treats institutions themselves as alignment targets that must be redesigned and co-evolved alongside AI systems. The distinction is architectural: coordination-first asks "how do competing actors align around AI?" Full-stack alignment asks "how do we align the institutions that govern AI development?"
+This extends the existing coordination-first thesis in a specific architectural way: the existing "AI alignment is a coordination problem" claim treats institutions (governments, regulatory bodies, economic structures) as the *environment* within which coordination between labs must occur. Full-stack alignment treats institutions themselves as *alignment targets* that must be redesigned and co-evolved alongside AI systems. The distinction is critical: coordination-first asks "how do competing actors align around AI development?"; full-stack alignment asks "how do we align the institutions that govern AI development?"
The framework proposes five implementation mechanisms:
1. **AI value stewardship** — institutional structures for preserving and transmitting human values
@@ -24,7 +24,7 @@ The framework proposes five implementation mechanisms:
4. **Meaning-preserving economic mechanisms** — economic structures that preserve rather than flatten human meaning and purpose
5. **Democratic regulatory institutions** — governance structures that represent affected populations, not just developers or governments
-The key claim: these five institutional mechanisms must be built concurrently with AI capability development, not sequentially after. This creates a timing problem: institutional redesign operates on decades-long timescales (Acemoglu's critical junctures are measured in decades); AI capability development operates on months-to-years timescales. The simultaneous co-alignment requirement may be structurally incoherent if the two processes cannot be synchronized.
+The key claim: these five institutional mechanisms must be built concurrently with AI capability development, not sequentially after. This creates a fundamental timing problem: institutional redesign operates on decades-long timescales (Acemoglu's critical junctures are measured in decades); AI capability development operates on months-to-years timescales. The simultaneous co-alignment requirement may be structurally incoherent if the two processes cannot be synchronized.
## Evidence
@@ -32,15 +32,15 @@ The paper presents this as a theoretical framework rather than an empirically va
## Challenges
-**Timescale incoherence**: Institutional change (decades) and AI capability development (months) operate on fundamentally different timescales. The paper does not address whether simultaneous co-alignment is even temporally feasible, or whether the requirement should be sequential (build institutions first, then scale AI) or parallel (accept institutional lag).
+**Timescale incoherence (primary challenge)**: Institutional change (decades) and AI capability development (months) operate on fundamentally different timescales. The paper does not address whether simultaneous co-alignment is even temporally feasible, or whether the requirement should be sequential (build institutions first, then scale AI) or parallel (accept institutional lag). This is not merely a difficulty — it may be a structural impossibility if institutional redesign cannot be accelerated to match AI development velocity.
-**Coordination across jurisdictions**: The framework does not specify how to coordinate institutional redesign across nations with conflicting interests, different legal systems, and competing strategic incentives. Full-stack alignment requires global institutional alignment, but the mechanisms for achieving this across sovereign states are unspecified.
+**Coordination across jurisdictions**: The framework does not specify how to coordinate institutional redesign across nations with conflicting interests, different legal systems, and competing strategic incentives. Full-stack alignment requires global institutional alignment, but the mechanisms for achieving this across sovereign states are unspecified. The paper does not engage with whether this is a coordination problem (solvable with better mechanisms) or a fundamental conflict of interest (unsolvable).
-**Irreducible value disagreement**: The framework does not address how institutional co-alignment handles cases where different populations have genuinely incompatible enduring values, not just preference differences. Democratic regulatory institutions may amplify rather than resolve these conflicts.
+**Irreducible value disagreement**: The framework does not address how institutional co-alignment handles cases where different populations have genuinely incompatible enduring values, not just preference differences. Democratic regulatory institutions may amplify rather than resolve these conflicts. The paper assumes institutional redesign can accommodate value pluralism, but provides no mechanism for handling cases where pluralism is irreducible.
-**Operationalization gap**: The paper does not provide concrete methods for implementing any of the five mechanisms. "AI value stewardship" and "meaning-preserving economic mechanisms" are conceptually interesting but lack specification sufficient for deployment.
+**Operationalization gap**: The paper does not provide concrete methods for implementing any of the five mechanisms. "AI value stewardship" and "meaning-preserving economic mechanisms" are conceptually interesting but lack specification sufficient for deployment. Without operationalization, the framework remains architectural rather than actionable.
-**Institutional capture risk**: The framework does not address how to prevent the proposed institutions from being captured by concentrated interests once they are built. Acemoglu's work emphasizes that critical junctures can close through backsliding — the paper does not propose anti-fragility mechanisms.
+**Institutional capture risk**: The framework does not address how to prevent the proposed institutions from being captured by concentrated interests once they are built. Acemoglu's own work emphasizes that critical junctures can close through backsliding — the paper does not propose anti-fragility mechanisms or institutional designs that resist capture.
---
@@ -48,10 +48,10 @@ Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — full-stack alignment extends coordination thesis to institutions; existing claim treats institutions as environment, this claim treats them as alignment targets
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — provides urgency context and timescale framework
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional mechanisms are prerequisite, though creates tension with concurrent co-alignment requirement
-- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — institutional alignment must handle value pluralism
+- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — individual-level co-alignment complement; full-stack extends scope to institutions
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — institutional alignment must handle value pluralism; unclear whether full-stack framework solves or just represents this problem
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly relevant to democratic regulatory institutions mechanism
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — relevant to AI value stewardship mechanism
-- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — individual-level co-alignment complement; full-stack extends scope to institutions
Topics:
- [[domains/ai-alignment/_map]]
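As a back-of-envelope illustration of the timing problem flagged in the note above, the sketch below compares one institutional redesign cycle against frontier AI release cadence. Both cycle lengths are assumptions chosen for illustration, not figures from the paper.

```python
# Illustrative arithmetic only; both cycle lengths are assumed, not taken from the paper.
institutional_cycle_years = 20   # assumed length of one institutional redesign cycle
ai_generation_months = 9         # assumed gap between frontier capability generations

generations_per_cycle = institutional_cycle_years * 12 / ai_generation_months
print(f"~{generations_per_cycle:.0f} AI capability generations per institutional cycle")
# Under these assumptions, institutions scoped to capability generation N only mature
# around generation N + 27, which is one way to read the "structurally incoherent" worry.
```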


@@ -15,18 +15,15 @@ The grant application identifies three concrete risks that make this sequencing
This phased approach is also a practical response to the observation that since [[existential risk breaks trial and error because the first failure is the last event]], there is no opportunity to iterate on safety after a catastrophic failure. You must get safety right on the first deployment in high-stakes domains, which means practicing in low-stakes domains first. The goal framework remains permanently open to revision at every stage, making the system's values a living document rather than a locked specification.
+## Tension with concurrent co-alignment approaches
+Full-stack alignment proposes a concurrent rather than sequential approach: institutional alignment mechanisms must be built *alongside* AI capability development, not before it. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed in parallel with technical capabilities. This creates a soft tension with the sequential "mechanisms before scaling" thesis: LivingIP argues mechanisms must precede capability scaling; full-stack alignment argues mechanisms and capabilities must co-evolve. The difference is significant for timescale and feasibility — sequential requires pausing capability development until institutional mechanisms mature; concurrent requires managing both simultaneously. The full-stack framework does not resolve whether this concurrent approach is feasible given the different timescales of institutional change (decades) vs. AI development (months).
### Additional Evidence (challenge)
*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
-### Additional Evidence (extend)
-*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
-Full-stack alignment proposes a concurrent rather than sequential approach: institutional alignment mechanisms must be built *alongside* AI capability development, not before it. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed in parallel with technical capabilities. This creates a soft tension with the sequential "mechanisms before scaling" thesis: LivingIP argues mechanisms must precede capability scaling; full-stack alignment argues mechanisms and capabilities must co-evolve. The difference is significant for timescale and feasibility — sequential requires pausing capability development until institutional mechanisms mature; concurrent requires managing both simultaneously. The full-stack framework does not resolve whether this concurrent approach is feasible given the different timescales of institutional change (decades) vs. AI development (months).
---
Relevant Notes:


@@ -31,24 +31,26 @@ The framework does not engage with existing work on preference diversity limitat
## Challenges
-**Stability assumption**: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others.
+**Stability assumption (primary challenge)**: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others. Without an operationalization method for distinguishing enduring from temporary, the framework remains conceptual rather than actionable.
-**Computational explosion**: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive.
+**Computational explosion**: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive, and the paper provides no analysis of whether this is tractable at population scale.
-**Irreducible disagreement**: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences. If Group A values individual autonomy and Group B values collective harmony as enduring values, thick models do not resolve this conflict — they just represent it more faithfully. The paper does not explain whether thick models are a mechanism *for* pluralistic alignment or simply a more honest representation of the pluralism problem.
+**Irreducible disagreement**: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences. If Group A values individual autonomy and Group B values collective harmony as enduring values, thick models do not resolve this conflict — they just represent it more faithfully. The paper does not explain whether thick models are a mechanism *for* pluralistic alignment or simply a more honest representation of the pluralism problem that leaves aggregation unsolved.
-**Operationalization gap**: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions?
+**Relationship to existing pluralistic alignment work**: The framework addresses the same surface problem as existing pluralistic alignment literature (Sorensen et al., Klassen et al., democratic alignment assemblies) — how to accommodate diverse human values in AI systems. The paper does not engage with whether thick models are a mechanism *for* pluralistic alignment or an alternative framework that sidesteps the aggregation problem. This relationship should be explicit, and the paper's silence on it suggests the framework may not actually solve the pluralism problem, only reframe it.
-**Relationship to existing pluralistic alignment work**: The framework addresses the same surface problem as existing pluralistic alignment literature (Sorensen et al., Klassen et al.) — how to accommodate diverse human values in AI systems. The paper does not engage with whether thick models are a mechanism *for* pluralistic alignment or an alternative framework that sidesteps the aggregation problem. This relationship should be explicit.
+**Operationalization gap**: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions? Without operationalization, the framework remains architectural.
---
Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous integration rather than advance specification
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity and propose social embedding as a partial solution
+- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — complementary mechanism; Zeng grounds co-alignment in intrinsic moral development (self-awareness, Theory of Mind); full-stack grounds thick models in social embedding and enduring-vs-temporary distinctions. Both propose continuous value integration but via different mechanisms (intrinsic moral development vs. social context modeling).
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — thick models must handle value pluralism; unclear whether they solve or just represent the problem
-- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this through continuous integration
+- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this through continuous integration and social context modeling, but do not engage with whether this solves the specification trap or merely delays it
-- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — complementary mechanism; Zeng grounds co-alignment in intrinsic moral development (self-awareness, Theory of Mind); full-stack grounds thick models in social embedding and enduring-vs-temporary distinctions
+- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly relevant to whether thick models can be operationalized through democratic processes
+- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — relevant to extracting thick models from communities rather than individuals
Topics:
- [[domains/ai-alignment/_map]]
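To make the "computational explosion" challenge above concrete, the short sketch below bounds how many people fall within k degrees of social separation of one individual, assuming a uniform number of contacts per person. The contact figure and the helper function are illustrative assumptions, not anything specified in the paper.

```python
# Illustrative sketch of why whole-social-graph modeling scales badly.
# The per-person contact count (150) is an assumed, Dunbar-scale figure, not from the paper.
def contacts_within(branching_factor: int, degrees: int) -> int:
    """Upper bound on the number of people within `degrees` hops of one individual."""
    return sum(branching_factor ** k for k in range(1, degrees + 1))

for degrees in (1, 2, 3, 4):
    print(degrees, contacts_within(150, degrees))
# At 3 degrees the bound is already in the millions, and at 4 degrees in the
# hundreds of millions, which is the granularity question the challenge raises:
# how much of this graph does a thick model actually have to represent?
```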