auto-fix: address review feedback on 2025-12-00-fullstack-alignment-thick-models-value.md
- Fixed based on eval review comments - Quality gate pass 3 (fix-from-feedback) Pentagon-Agent: Theseus <HEADLESS>
parent 22cc3f57fb
commit ef292693a4
5 changed files with 48 additions and 106 deletions
@@ -1,43 +0,0 @@
---
description: Getting AI right requires simultaneous alignment across competing companies, nations, and disciplines at the speed of AI development -- no existing institution can coordinate this
type: claim
domain: ai-alignment
created: 2026-02-16
confidence: likely
source: "TeleoHumanity Manifesto, Chapter 5"
---
# AI alignment is a coordination problem not a technical problem
The manifesto makes one of its sharpest claims here: the hard part of AI alignment is not the technical challenge of specifying values in code but the coordination challenge of getting competing actors to align simultaneously.
Getting AI right requires alignment across competing companies, each racing to be first because second place may mean irrelevance. Across competing nations, each afraid the other will achieve superintelligence and use it to dominate. Across multiple academic disciplines that barely speak to each other. And it must happen at the speed of AI development, which is measured in months, not the decades or centuries over which previous coordination challenges were resolved.
No existing institution can do this. Governments move at the speed of legislation and are bounded by borders. International bodies lack enforcement. Academia is siloed by discipline. The companies building AI are locked in a race that punishes caution. The incentive structure actively makes it worse: to win the race to superintelligence is to win the right to shape the future of humanity. The prize is so vast that every actor is incentivized to move faster than safety allows. Each is locally rational. The collective outcome is potentially catastrophic.
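
The incentive structure described above is essentially a one-shot coordination game. A minimal sketch (mine, not the manifesto's; the payoff numbers are assumptions chosen only to mirror the incentives this note describes) makes the "locally rational, collectively catastrophic" structure explicit:

```python
# Toy model of the AI race as a two-player game. Payoffs are illustrative
# assumptions, not data: racing beats caution for each actor no matter what
# the rival does, yet mutual racing is the worst collective outcome.

ACTIONS = ("caution", "race")

# PAYOFF[(my_action, rival_action)] = my payoff
PAYOFF = {
    ("caution", "caution"): 3,  # slower, safer shared development
    ("caution", "race"):    0,  # I fall behind; the rival shapes the future
    ("race",    "caution"): 4,  # I win the race outright
    ("race",    "race"):    1,  # everyone cuts corners; high collective risk
}

def best_response(rival_action: str) -> str:
    """Return the action that maximizes my payoff against a fixed rival move."""
    return max(ACTIONS, key=lambda mine: PAYOFF[(mine, rival_action)])

# Racing is a dominant strategy: it is the best response to either rival move...
assert best_response("caution") == "race"
assert best_response("race") == "race"

# ...yet the resulting (race, race) outcome is worse for both players than
# mutual caution, which no single actor can reach unilaterally.
assert PAYOFF[("caution", "caution")] > PAYOFF[("race", "race")]
```

Nothing any single actor does inside this game fixes it; only a mechanism that changes the payoffs for every actor at once does, which is why the note frames alignment as coordination rather than unilateral restraint.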
Dario Amodei describes AI as "so powerful, such a glittering prize, that it is very difficult for human civilization to impose any restraints on it at all." He runs one of the companies building it and is telling us plainly that the system he operates within may not be governable by current institutions.
**2026 case study: the Anthropic/Pentagon/OpenAI triangle.** In February-March 2026, three events demonstrated this coordination failure in a single week. Anthropic dropped the core pledge of its Responsible Scaling Policy because "competitors are blazing ahead" — a voluntary safety commitment destroyed by competitive pressure. When Anthropic then tried to hold red lines on autonomous weapons in a Pentagon contract, the DoD designated them a supply chain risk (a label previously reserved for foreign adversaries) and awarded the contract to OpenAI, whose CEO admitted the deal was "definitely rushed" and "the optics don't look good." Meanwhile, a King's College London study found the same models being rushed into military deployment chose nuclear escalation in 95% of simulated war games. Three actors — a safety-conscious lab, a government customer, a willing competitor — each acting rationally from their own position, producing a collectively catastrophic trajectory. This is the coordination problem in miniature.
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (extend)

*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Full-stack alignment extends the coordination thesis from lab-to-lab coordination to institutional coordination. The framework argues that beneficial outcomes require concurrent alignment of AI systems AND the institutions that govern them (regulatory bodies, economic mechanisms, democratic processes). This is a stronger institutional claim: not just that AI labs must coordinate with each other, but that the institutions themselves must be redesigned and aligned alongside AI systems. The paper proposes five implementation mechanisms including democratic regulatory institutions and meaning-preserving economic mechanisms as part of the coordination infrastructure.
---

Relevant Notes:
- [[the internet enabled global communication but not global cognition]] -- the coordination infrastructure gap that makes this problem unsolvable with existing tools
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- the structural solution to this coordination failure
- [[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]] -- the clearest evidence that alignment is coordination not technical: competitive dynamics undermine any individual solution
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] -- individual oversight fails, making collective oversight architecturally necessary
- [[COVID proved humanity cannot coordinate even when the threat is visible and universal]] -- if coordination failed on a visible, universal biological threat, AI coordination is structurally harder
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] -- the field has identified the coordination nature of the problem but nobody is building coordination solutions
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] -- Anthropic RSP rollback (Feb 2026) proves voluntary commitments cannot substitute for coordination
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]] -- government acting as coordination-breaker rather than coordinator
Topics:
- [[_map]]
@@ -1,30 +0,0 @@
---
description: Acemoglu's framework of critical junctures -- turning points where institutional paths diverge -- maps directly onto the AI governance gap, creating the kind of destabilization that enables new institutional forms
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Web research compilation, February 2026"
confidence: likely
---
Daron Acemoglu (2024 Nobel Prize in Economics) provides the institutional framework for understanding why this moment matters. His key concepts: extractive versus inclusive institutions, where change happens when institutions shift from extracting value for elites to including broader populations in governance; critical junctures, turning points when institutional paths diverge and destabilize existing orders, creating mismatches between institutions and people's aspirations; and structural resistance, where those in power resist change even when it would benefit them, not from ignorance but from structural incentive.
AI development is creating precisely this kind of critical juncture. The mismatch between AI capabilities and governance structures is the kind of destabilization Acemoglu identifies as a window for institutional transformation. Current AI governance institutions are extractive -- a handful of companies and governments control development while the population affected encompasses all of humanity. The gap between what AI can do and what institutions can govern is widening at an accelerating rate.
Critical junctures are windows, not guarantees. They can close. Acemoglu also documents backsliding risk -- even established democracies can experience institutional regression when elites exploit societal divisions. Any movement seeking to build new governance institutions during this juncture must be anti-fragile to backsliding. The institutional question is not just "how do we build better governance?" but "how do we build governance that resists recapture by concentrated interests once the juncture closes?"
### Additional Evidence (confirm)

*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
The full-stack alignment framework explicitly frames current AI development as requiring institutional transformation, not just technical alignment. The paper argues that existing institutions are misaligned with AI capabilities and proposes concurrent redesign of both AI systems and governing institutions. This confirms the critical juncture thesis and provides a specific framework (full-stack alignment with five implementation mechanisms) for navigating the transformation window.
---

Relevant Notes:
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] -- the specific dynamic creating this critical juncture
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the governance approach suited to critical juncture uncertainty
- [[safe AI development requires building alignment mechanisms before scaling capability]] -- the urgency dimension of the juncture
Topics:
- [[_map]]
@@ -4,7 +4,7 @@ domain: ai-alignment
secondary_domains: [mechanisms, grand-strategy]
description: "Full-stack alignment requires concurrent alignment of AI systems and governing institutions with thick models of value, not just individual model alignment"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025)"
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arXiv 2512.03399, December 2025)"
created: 2026-03-11
enrichments:
- "AI alignment is a coordination problem not a technical problem"
@@ -15,31 +15,43 @@ enrichments:
The full-stack alignment framework argues that "beneficial societal outcomes cannot be guaranteed by aligning individual AI systems" alone. Instead, comprehensive alignment requires concurrent alignment of BOTH AI systems and the institutions that shape their development and deployment.
This extends beyond single-organization coordination (lab-to-lab alignment) to address misalignment across multiple stakeholders at the institutional level. The framework proposes five implementation mechanisms: (1) AI value stewardship, (2) normatively competent agents, (3) win-win negotiation systems, (4) meaning-preserving economic mechanisms, and (5) democratic regulatory institutions.
This extends the existing coordination-first thesis in a specific way: the existing "AI alignment is a coordination problem" claim treats institutions (governments, regulatory bodies, economic structures) as the environment within which coordination must occur. Full-stack alignment treats institutions themselves as alignment targets that must be redesigned and co-evolved alongside AI systems. The distinction is architectural: coordination-first asks "how do competing actors align around AI?" Full-stack alignment asks "how do we align the institutions that govern AI development?"
The key distinction: coordination-first alignment theories address how AI labs coordinate with each other. Full-stack alignment asserts that regulatory bodies, economic mechanisms, and democratic processes themselves—the institutions that govern AI development—must be redesigned and aligned alongside the AI systems. This is a stronger institutional claim than lab-level coordination.
The framework proposes five implementation mechanisms:
1. **AI value stewardship** — institutional structures for preserving and transmitting human values
2. **Normatively competent agents** — AI systems that reason about values rather than optimize fixed objectives
3. **Win-win negotiation systems** — mechanisms for resolving stakeholder conflicts without zero-sum extraction
4. **Meaning-preserving economic mechanisms** — economic structures that preserve rather than flatten human meaning and purpose
5. **Democratic regulatory institutions** — governance structures that represent affected populations, not just developers or governments
The key claim: these five institutional mechanisms must be built concurrently with AI capability development, not sequentially after. This creates a timing problem: institutional redesign operates on decades-long timescales (Acemoglu's critical junctures are measured in decades); AI capability development operates on months-to-years timescales. The simultaneous co-alignment requirement may be structurally incoherent if the two processes cannot be synchronized.
## Evidence
The paper frames this as an architectural framework rather than an empirically validated approach. The five implementation mechanisms are proposed but lack formal specification or deployment evidence. The paper does not provide impossibility results or comparative analysis against alternative institutional designs.
The paper presents this as a theoretical framework rather than an empirically validated approach. The five implementation mechanisms are proposed but lack formal specification, deployment evidence, or comparative analysis against alternative institutional designs. No working system exists that demonstrates institutional co-alignment at scale.
## Challenges
The framework does not specify how to operationalize institutional alignment in practice, nor does it address:
- How to coordinate institutional redesign across jurisdictions with conflicting interests
- Whether institutional change can operate on timescales matching AI capability development
- How to handle irreducible value disagreements between institutions
- Computational tractability of the proposed mechanisms at scale
**Timescale incoherence**: Institutional change (decades) and AI capability development (months) operate on fundamentally different timescales. The paper does not address whether simultaneous co-alignment is even temporally feasible, or whether the requirement should be sequential (build institutions first, then scale AI) or parallel (accept institutional lag).
The simultaneous co-alignment requirement may be intractable if institutions and AI systems operate on fundamentally different timescales.
**Coordination across jurisdictions**: The framework does not specify how to coordinate institutional redesign across nations with conflicting interests, different legal systems, and competing strategic incentives. Full-stack alignment requires global institutional alignment, but the mechanisms for achieving this across sovereign states are unspecified.
**Irreducible value disagreement**: The framework does not address how institutional co-alignment handles cases where different populations have genuinely incompatible enduring values, not just preference differences. Democratic regulatory institutions may amplify rather than resolve these conflicts.
**Operationalization gap**: The paper does not provide concrete methods for implementing any of the five mechanisms. "AI value stewardship" and "meaning-preserving economic mechanisms" are conceptually interesting but lack specification sufficient for deployment.
**Institutional capture risk**: The framework does not address how to prevent the proposed institutions from being captured by concentrated interests once they are built. Acemoglu's work emphasizes that critical junctures can close through backsliding — the paper does not propose anti-fragility mechanisms.
---

Relevant Notes:
- [[AI alignment is a coordination problem not a technical problem]] — full-stack alignment extends coordination thesis to institutions
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — provides urgency context
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional mechanisms are prerequisite
- [[AI alignment is a coordination problem not a technical problem]] — full-stack alignment extends coordination thesis to institutions; existing claim treats institutions as environment, this claim treats them as alignment targets
- [[AI development is a critical juncture in institutional history where the mismatch between capabilities and governance creates a window for transformation]] — provides urgency context and timescale framework
- [[safe AI development requires building alignment mechanisms before scaling capability]] — institutional mechanisms are prerequisite, though creates tension with concurrent co-alignment requirement
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — institutional alignment must handle value pluralism
- [[democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations]] — directly relevant to democratic regulatory institutions mechanism
- [[community-centred norm elicitation surfaces alignment targets materially different from developer-specified rules]] — relevant to AI value stewardship mechanism
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — individual-level co-alignment complement; full-stack extends scope to institutions
Topics:
- [[domains/ai-alignment/_map]]
@@ -19,13 +19,13 @@ This phased approach is also a practical response to the observation that since
### Additional Evidence (challenge)

*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
Anthropic's RSP rollback demonstrates the opposite pattern in practice: the company scaled capability while weakening its pre-commitment to adequate safety measures. The original RSP required guaranteeing safety measures were adequate *before* training new systems. The rollback removes this forcing function, allowing capability development to proceed with safety work repositioned as aspirational ('we hope to create a forcing function') rather than mandatory. This provides empirical evidence that even safety-focused organizations prioritize capability scaling over alignment-first development when competitive pressure intensifies, suggesting the claim may be normatively correct but descriptively violated by actual frontier labs under market conditions.
### Additional Evidence (extend)

*Source: [[2025-12-00-fullstack-alignment-thick-models-value]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Full-stack alignment argues that institutional alignment mechanisms must be built concurrently with AI capability development, not sequentially. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed alongside technical capabilities. This extends the 'mechanisms before scaling' thesis to include institutional mechanisms, not just technical ones.
Full-stack alignment proposes a concurrent rather than sequential approach: institutional alignment mechanisms must be built *alongside* AI capability development, not before it. The five proposed mechanisms (AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, democratic regulatory institutions) represent a comprehensive alignment infrastructure that must be developed in parallel with technical capabilities. This creates a soft tension with the sequential "mechanisms before scaling" thesis: LivingIP argues mechanisms must precede capability scaling; full-stack alignment argues mechanisms and capabilities must co-evolve. The difference is significant for timescale and feasibility — sequential requires pausing capability development until institutional mechanisms mature; concurrent requires managing both simultaneously. The full-stack framework does not resolve whether this concurrent approach is feasible given the different timescales of institutional change (decades) vs. AI development (months).
---
@@ -39,10 +39,10 @@ Relevant Notes:
- [[knowledge aggregation creates novel risks when dangerous information combinations emerge from individually safe pieces]] -- one of the specific risks this phased approach is designed to contain
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- Bostrom's evolved position refines this: build adaptable alignment mechanisms, not rigid ones
- [[the optimal SI development strategy is swift to harbor slow to berth moving fast to capability then pausing before full deployment]] -- Bostrom's timing model suggests building alignment in parallel with capability, then intensive verification during the pause
- [[proximate objectives resolve ambiguity by absorbing complexity so the organization faces a problem it can actually solve]] -- the phased safety-first approach IS a proximate objectives strategy: start in non-sensitive domains where alignment problems are tractable, build governance muscles, then tackle harder domains
- [[the more uncertain the environment the more proximate the objective must be because you cannot plan a detailed path through fog]] -- AI alignment under deep uncertainty demands proximate objectives: you cannot pre-specify alignment for a system that does not yet exist, but you can build and test alignment mechanisms at each capability level
- [[beneficial-ai-outcomes-require-institutional-co-alignment-not-just-model-alignment]] -- proposes concurrent institutional co-alignment, creating tension with sequential mechanisms-first approach
Topics:
- [[livingip overview]]
- [[LivingIP architecture]]
- [[LivingIP architecture]]
@@ -2,9 +2,9 @@
type: claim
domain: ai-alignment
secondary_domains: [mechanisms]
description: "Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning"
description: "Thick value models distinguish stable enduring values from context-dependent temporary preferences and model social embedding to enable normative reasoning across new domains"
confidence: speculative
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (December 2025)"
source: "Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value (arXiv 2512.03399, December 2025)"
created: 2026-03-11
enrichments:
- "the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance"
@@ -15,37 +15,40 @@ enrichments:
The full-stack alignment framework proposes "thick models of value" as an alternative to utility functions and preference orderings for AI alignment. The framework distinguishes three dimensions:
1. **Enduring vs. temporary**: Stable values (what people consistently care about across contexts) vs. temporary preferences (what people want in specific moments)
2. **Social embedding**: Individual choices modeled within social contexts rather than as atomized preferences
3. **Normative reasoning**: AI systems that reason about values across new domains rather than simply optimizing pre-specified objectives
1. **Enduring vs. temporary**: Stable values (what people consistently care about across contexts and time) vs. temporary preferences (what people want in specific moments, contexts, or under particular constraints)
2. **Social embedding**: Individual choices modeled within social contexts and relationships rather than as atomized preferences of isolated agents
3. **Normative reasoning**: AI systems that reason about values across new domains and novel situations rather than simply optimizing pre-specified objectives
The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals.
The goal is to develop "normatively competent agents" that engage with human values in their full complexity rather than reducing them to scalar reward signals or preference orderings.
This concept formalizes the distinction between what people say they want (stated preferences) and what actually produces good outcomes (enduring values). It proposes continuous value integration rather than advance specification of objectives.
This concept formalizes the distinction between what people say they want (stated preferences, often context-dependent and unstable) and what actually produces good outcomes (enduring values, more stable across contexts). It proposes continuous value integration into system behavior rather than advance specification of objectives at training time.
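
As an illustration of the structural difference the framework is pointing at, here is a minimal sketch (mine, not the paper's; the field names and the toy `endorses` heuristic are assumptions) of what the three dimensions keep separate that a scalar reward collapses:

```python
# Illustrative sketch, not an implementation from the paper: a "thick" value
# model keeps the three dimensions distinct instead of collapsing them into
# one scalar reward per outcome. Field names and the heuristic are assumptions.

from dataclasses import dataclass, field

ScalarReward = float  # the baseline representation the framework argues against


@dataclass
class ThickValueModel:
    # 1. Enduring vs. temporary: stable commitments are kept separate from
    #    context-bound wants rather than merged into a single signal.
    enduring_values: set[str] = field(default_factory=set)
    temporary_preferences: dict[str, str] = field(default_factory=dict)  # context -> want

    # 2. Social embedding: the relationships a person's choices sit within.
    relationships: dict[str, str] = field(default_factory=dict)  # person -> role

    # 3. Normative reasoning: a judgment about a *new* situation made from
    #    enduring values, not from a pre-specified objective.
    def endorses(self, action: str, affects: list[str]) -> bool:
        # Toy heuristic: reject actions that conflict with an enduring value,
        # and only endorse actions whose affected parties are known relations.
        conflicts = {f"avoid {action}", f"never {action}"}
        if conflicts & self.enduring_values:
            return False
        return all(person in self.relationships for person in affects)


model = ThickValueModel(
    enduring_values={"honesty", "avoid deception"},
    temporary_preferences={"tonight": "watch a film"},
    relationships={"Sam": "colleague"},
)
print(model.endorses("deception", affects=["Sam"]))        # False: hits an enduring value
print(model.endorses("candid feedback", affects=["Sam"]))  # True under this toy heuristic
```

The point of the sketch is only the shape of the representation: how these fields would be learned from behavior, and what `endorses` should actually compute, are exactly the operationalization gaps listed under Challenges below.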
## Evidence
The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists, and the computational requirements for modeling social context and distinguishing enduring from temporary values at scale are unspecified.
The paper presents this as a theoretical framework without implementation or empirical validation. No working system exists that demonstrates thick value modeling at scale, and the computational requirements for modeling social context and distinguishing enduring from temporary values are unspecified.
The framework does not engage with existing work on the preference-diversity limitations of RLHF/DPO, nor does it explain how thick models would handle irreducible value disagreements between individuals or groups.
## Challenges
**Stability assumption**: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages, cultural shifts, or technological change.
**Stability assumption**: How do you operationalize "enduring values" when human values themselves evolve over time? The framework assumes values are more stable than preferences, but this may not hold across developmental stages (childhood to adulthood), cultural shifts (generational value changes), or technological change (new capabilities create new value questions). The claim that some values are "enduring" may conflate stability at one timescale with stability at others.
**Computational explosion**: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address.
**Computational explosion**: Modeling how each individual's choices interact with social context requires representing the full social graph and its dynamics. This creates a scalability problem that the paper does not address. At what granularity is social context modeled? How many degrees of social separation matter? The computational cost may be prohibitive.
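
To make the scaling worry concrete, a back-of-the-envelope sketch (my numbers, not the paper's; the branching factor of 150 is just the commonly cited Dunbar number, used here as an assumption):

```python
# Rough illustration of the social-context blow-up: if each person has about
# `degree` meaningful ties, the number of people within k degrees of
# separation grows roughly geometrically with k.

def people_within(k_degrees: int, degree: int = 150) -> int:
    """Approximate count of people reachable within k degrees of separation."""
    return sum(degree ** k for k in range(1, k_degrees + 1))

for k in (1, 2, 3):
    print(f"within {k} degree(s): ~{people_within(k):,} people per individual")
# within 1 degree(s): ~150 people per individual
# within 2 degree(s): ~22,650 people per individual
# within 3 degree(s): ~3,397,650 people per individual
```

Even two degrees of social context per person puts the modeling burden in the tens of thousands of other people, which is the granularity question the paper leaves open.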
**Irreducible disagreement**: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences.
**Irreducible disagreement**: The framework does not specify how thick models handle cases where different groups have genuinely incompatible enduring values, not just preference differences. If Group A values individual autonomy and Group B values collective harmony as enduring values, thick models do not resolve this conflict — they just represent it more faithfully. The paper does not explain whether thick models are a mechanism *for* pluralistic alignment or simply a more honest representation of the pluralism problem.
**Operationalization gap**: The paper does not provide concrete methods for extracting or representing thick models from human behavior or reasoning.
**Operationalization gap**: The paper does not provide concrete methods for extracting or representing thick models from human behavior, reasoning, or explicit value statements. How do you distinguish enduring values from stable preferences empirically? What data would you collect? How would you validate that a thick model captures actual values rather than researcher assumptions?
**Relationship to existing pluralistic alignment work**: The framework addresses the same surface problem as existing pluralistic alignment literature (Sorensen et al., Klassen et al.) — how to accommodate diverse human values in AI systems. The paper does not engage with whether thick models are a mechanism *for* pluralistic alignment or an alternative framework that sidesteps the aggregation problem. This relationship should be explicit.
---

Relevant Notes:
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous integration
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — thick models must handle value pluralism
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — thick values formalize continuous integration rather than advance specification
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — thick models acknowledge this complexity and propose social embedding as a partial solution
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — thick models must handle value pluralism; unclear whether they solve or just represent the problem
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — thick models attempt to address this through continuous integration
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — complementary mechanism; Zeng grounds co-alignment in intrinsic moral development (self-awareness, Theory of Mind); full-stack grounds thick models in social embedding and enduring-vs-temporary distinctions
Topics:
- [[domains/ai-alignment/_map]]