theseus: extract from 2026-00-00-friederich-against-manhattan-project-alignment.md
- Source: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 2)

Pentagon-Agent: Theseus <HEADLESS>
parent ba4ac4a73e
commit 20ecba37fe

5 changed files with 70 additions and 1 deletion
@@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v

Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (confirm)

*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*

Friederich and Dung (2026) provide support from analytic philosophy of science for the claim that alignment has 'irreducible social/political dimensions' and cannot be 'mainly technical-scientific.' Their argument that alignment is not a natural kind (not a single unified phenomenon) and is not clearly operationalizable offers independent confirmation, from a different disciplinary tradition (philosophy of science rather than systems theory), that alignment cannot be reduced to a technical problem. The five-point decomposition systematically shows why the Manhattan Project framing, which assumes a clear technical problem with a definite solution, is a category error.

---

Relevant Notes:
@@ -0,0 +1,45 @@
---
type: claim
domain: ai-alignment
description: "The Manhattan Project framing of AI alignment assumes five properties—binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability—that alignment likely lacks"
confidence: experimental
source: "Simon Friederich, Leonard Dung, 'Against the Manhattan Project Framing of AI Alignment' (Mind & Language, 2026)"
created: 2026-03-11
---

# The Manhattan Project framing of AI alignment assumes five properties that alignment lacks: binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability
Friederich and Dung (2026) argue that AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years—a "Manhattan project"—but this framing commits a category error across five dimensions:
## The Five Dimensions of Framing Failure

1. **Not binary** — Alignment is not a yes/no achievement but exists on a spectrum. There is no discrete state at which alignment is "complete."

2. **Not a natural kind** — Alignment is not a single unified phenomenon but a collection of heterogeneous problems. The framing treats alignment as a natural category (like "water" or "disease") when it is actually a constructed category bundling distinct concerns.

3. **Not mainly technical-scientific** — Alignment has irreducible social and political dimensions that cannot be solved through engineering alone. Value specification, stakeholder representation, and governance are not technical problems.

4. **Not realistically achievable as one-shot** — Alignment cannot be "solved" once and deployed, but requires ongoing adaptation as deployment contexts diverge from training conditions and new stakeholders emerge.

5. **Not clearly operationalizable** — It is "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover." Even if we could specify values correctly, the operationalization itself cannot be sufficient to guarantee safety.
## Why This Matters

The authors argue this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by creating false confidence that alignment is a tractable engineering problem with a definite solution timeline. The Manhattan Project analogy suggests that sufficient resources, focus, and time will yield a solution—but alignment may not be solvable in this way.

This represents a philosophy-of-science critique distinct from both AI safety technical work and governance literature. The claim that operationalization itself is impossible—not just difficult—is stronger than most coordination-focused critiques, which typically argue alignment is hard to specify rather than impossible to specify sufficiently.
## Evidence and Limitations

The paper appeared in *Mind & Language* (2026), a respected analytic philosophy journal. The five-point decomposition provides a systematic framework for why the Manhattan Project analogy fails as a category error, not just as a matter of difficulty.

**Critical limitation:** Full text is paywalled. This extraction is based on the abstract, search results, and related discussion. The underlying philosophical arguments for each of the five points cannot be fully evaluated without access to the complete paper. The specific reasoning for why operationalization is impossible in principle (rather than merely difficult) is not accessible.

---

Related claims:
- [[AI alignment is a coordination problem not a technical problem]] — complementary argument from systems theory perspective
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — relates to the operationalization impossibility claim
- [[persistent irreducible disagreement]] — supports the "not binary" dimension
- [[safe AI development requires building alignment mechanisms before scaling capability]] — challenged by the "not one-shot achievable" argument
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — related to the non-binary and non-natural-kind arguments
@@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi
[[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
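A minimal simulation sketch of the correlated-error mechanism (illustrative only, not from the source; all parameters are arbitrary): in a Condorcet-style majority vote, independent voters amplify individual accuracy, while voters who echo a shared worldview collapse the collective back to that worldview's accuracy.

```python
import random

# Illustrative sketch: each voter is 60% accurate on their own; `rho` is the
# probability a voter copies a shared-worldview verdict instead of judging
# independently. Higher rho means more correlated errors.
def majority_accuracy(n_voters=25, p_correct=0.6, rho=0.0, trials=20_000):
    hits = 0
    for _ in range(trials):
        shared_verdict = random.random() < p_correct  # the worldview's own call
        votes = [
            shared_verdict if random.random() < rho  # correlated: echo the group
            else random.random() < p_correct         # independent judgment
            for _ in range(n_voters)
        ]
        hits += sum(votes) > n_voters / 2
    return hits / trials

print(majority_accuracy(rho=0.0))  # ~0.85: independence amplifies accuracy
print(majority_accuracy(rho=0.9))  # ~0.60: the collective is no better than one voter
```

The gap between the two runs is the "safeguard" the note describes: maintained disagreement keeps rho low, which is what lets aggregation outperform any individual.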
### Additional Evidence (confirm)

*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*

Friederich and Dung's claim that alignment is 'not binary' but exists on a spectrum supports the irreducibility thesis. If alignment cannot be achieved as a yes/no state, then disagreements about what constitutes 'aligned' behavior are not resolvable through better specification but reflect genuinely different value positions. The non-binary nature of alignment means there is no single target state that would satisfy all stakeholders, making some disagreements permanently irreducible in principle rather than by contingent limitation.

---

Relevant Notes:
@@ -17,6 +17,12 @@ This converges with findings across at least five other research programs. Zeng'
The specification trap is why [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
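A minimal sketch of the single-reward-function assumption behind that first link (illustrative numbers, not from the source): RLHF reward models and DPO both fit the Bradley-Terry preference model, which assigns each response one scalar reward, so opposed context-dependent preferences can only be averaged into indifference.

```python
import math

# Bradley-Terry, the model underlying RLHF reward fitting and DPO:
# P(A preferred over B) = sigma(r_A - r_B), with ONE scalar reward per response.
def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry preference probability: sigma(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Two annotator groups with opposite, context-dependent preferences over the
# same response pair: group 1 prefers A 90% of the time, group 2 only 10%.
group_rates = [0.9, 0.1]

# The best single reward gap can only reproduce the pooled rate (here 0.5),
# predicting indifference and misrepresenting BOTH groups.
pooled = sum(group_rates) / len(group_rates)
gap = math.log(pooled / (1.0 - pooled))  # logit of the pooled preference rate
print(round(bt_prob(gap, 0.0), 2))  # 0.5 -- neither group's preference survives
```

The point is structural, not a tuning issue: as long as the model family admits only one reward per response, fitting it to divergent populations erases the divergence rather than representing it.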
### Additional Evidence (extend)

*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*

Friederich and Dung (2026) argue it is 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This extends the specification trap from 'values become unstable over time' to 'sufficient operationalization may be impossible in principle.' The claim is stronger: even if we could specify values correctly at training time and prevent them from drifting, the operationalization itself cannot be sufficient to guarantee safety. This suggests the problem is not just specification instability but fundamental limits on what operationalization can achieve.

---

Relevant Notes:
@@ -7,9 +7,15 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: processed
priority: medium
tags: [alignment-framing, Manhattan-project, operationalization, philosophical, AI-safety]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md", "persistent irreducible disagreement.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Philosophical critique of alignment-as-technical-problem from philosophy of science perspective. Five-point decomposition provides systematic framework. Full text paywalled—extraction based on abstract and discussion. Single claim extracted as composite argument; enrichments confirm existing coordination-focused claims from different disciplinary tradition."
---
## Content