theseus: extract claims from 2026-00-00-friederich-against-manhattan-project-alignment.md
- Source: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 3)

Pentagon-Agent: Theseus <HEADLESS>
This commit is contained in:
parent 5a3d603e78
commit 318190eb24

4 changed files with 68 additions and 1 deletion
@@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (confirm)
*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Friederich & Dung (2026) provide philosophical support from philosophy of science tradition: they argue alignment has 'irreducible social/political dimensions' and cannot be treated as 'mainly technical-scientific.' They contend the Manhattan Project framing (alignment as purely technical problem) is a category error across five dimensions: it's not binary, not a natural kind, not mainly technical, not achievable as one-shot solution, and 'probably impossible to operationalize' such that solving it would be sufficient to prevent AI takeover. Published in Mind & Language, a peer-reviewed analytic philosophy journal, this represents convergent reasoning from a different disciplinary tradition than systems theory.
---
Relevant Notes:
@@ -0,0 +1,49 @@
---
type: claim
domain: ai-alignment
description: "The Manhattan Project metaphor for AI alignment encodes five philosophical assumptions—binary achievement, natural kinds, technical solvability, one-shot solutions, and operationalizability—that mischaracterize alignment's actual nature"
confidence: experimental
source: "Friederich & Dung (2026), Mind & Language"
created: 2026-03-11
depends_on:
- "AI alignment is a coordination problem not a technical problem.md"
- "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"
- "some disagreements are permanently irreducible.md"
---
# The Manhattan Project framing of alignment encodes five philosophical assumptions that mischaracterize the problem
Friederich and Dung (2026) argue that AI companies and researchers frame alignment as a "Manhattan Project"—a clear, well-delineated, unified scientific problem solvable within years—but this framing encodes five philosophical assumptions that fail on analysis:
## The Five Assumptions
1. **Binary achievement**: The framing assumes alignment is a yes/no state that can be achieved and verified, rather than a continuous spectrum or context-dependent property that admits degrees and variations.
2. **Natural kind**: It treats alignment as a single unified phenomenon with a discoverable essence, rather than a heterogeneous collection of distinct problems that may not share a common structure.
3. **Technical-scientific solvability**: It assumes alignment is primarily a technical problem solvable through scientific methods and engineering, excluding irreducible social, political, and value-pluralistic dimensions.
4. **One-shot achievability**: It presumes alignment can be solved once and then implemented, rather than requiring ongoing adaptation, renegotiation, and course correction as contexts and values evolve.
5. **Operationalizability**: It assumes alignment can be operationalized—defined in implementable, measurable terms—such that "solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover." The authors argue this is "probably impossible" because alignment may not admit the kind of complete formal specification this assumption requires.
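To make the contrast concrete, here is a minimal, hypothetical sketch (every name below is invented for illustration, not drawn from the paper) rendering the two pictures as Python interfaces. The Manhattan Project framing implicitly assumes something like the first pair of signatures exists; the critique suggests something closer to the second.

```python
# Hypothetical sketch only: names and types are invented to illustrate the
# five assumptions, not taken from Friederich & Dung.

from dataclasses import dataclass
from typing import Callable

# --- The Manhattan Project picture, made explicit ---

@dataclass
class AlignmentSolution:
    """Placeholder for a complete, implementable specification (assumption 5)."""
    spec: str

def solve_alignment() -> AlignmentSolution:
    """One unified problem, solved once (assumptions 2 and 4)."""
    ...

def verify(solution: AlignmentSolution) -> bool:
    """A yes/no check whose True output rules out takeover (assumptions 1 and 3)."""
    ...

# --- The picture the critique suggests instead ---

Context = str   # deployment contexts shift over time
Degree = float  # alignment admits degrees, not a yes/no state

# A heterogeneous family of evaluations rather than one natural kind:
alignment_aspects: dict[str, Callable[[Context], Degree]] = {}

def assess(context: Context) -> dict[str, Degree]:
    """Ongoing, per-context assessment; never a one-shot certificate."""
    return {name: f(context) for name, f in alignment_aspects.items()}
```

If the critique is right, no refinement of `verify` can be both implementable and sufficient; the second interface is not a harder version of the first but a different kind of object.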
## The Harm of the Framing
The paper argues this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by making alignment appear more tractable, bounded, and solvable than it actually is. This creates pressure for premature deployment and underestimates the ongoing governance challenges.
## Philosophical Grounding
The argument draws on philosophy of science (natural kinds, operationalization, problem specification) rather than AI safety or governance literatures. The operationalizability claim is the strongest: not just that alignment is hard to operationalize, but that it's "probably impossible" to define it such that solving the defined problem would be sufficient to prevent takeover. This suggests a category error—alignment may not be the kind of thing that admits complete formal specification in the way the Manhattan Project framing assumes.
## Limitations
The full text is paywalled, so the specific arguments supporting each of the five points cannot be evaluated in depth; the claim rests on the paper's abstract and secondary summaries. The "probably impossible" language on operationalizability is strong but not proven: it is a philosophical argument about the nature of the problem, not an empirical demonstration.
---
## Related Claims
- [[AI alignment is a coordination problem not a technical problem.md]] — Convergent conclusion from different disciplinary tradition (systems theory vs. philosophy of science)
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md]] — Supports the operationalizability impossibility argument
- [[some disagreements are permanently irreducible.md]] — Supports the "not binary" dimension
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md]] — Related to natural kind critique
@@ -17,6 +17,12 @@ This converges with findings across at least five other research programs. Zeng'
The specification trap is why [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
### Additional Evidence (extend)
*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
Friederich & Dung argue it's 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This extends the specification trap argument: not only do encoded values degrade as deployment contexts diverge (temporal instability), but the very act of operationalizing alignment—defining it in implementable terms—may be impossible in principle. The authors argue this is a philosophical problem about the nature of alignment itself (it's not a natural kind with a unified essence), not merely an engineering challenge. This suggests the trap is deeper than context-drift: it's that alignment cannot be fully specified at all.
---
Relevant Notes:
@@ -7,9 +7,15 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
status: processed
priority: medium
tags: [alignment-framing, Manhattan-project, operationalization, philosophical, AI-safety]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Philosophical critique of alignment-as-technical-problem from philosophy of science tradition. Five-point decomposition (binary, natural kind, technical, achievable, operationalizable) provides structured argument that Manhattan Project framing is category error. Strongest claim: operationalization itself may be impossible. Full text paywalled — extraction based on abstract and related discussion. Confidence capped at experimental due to single source and inaccessible full argumentation."
---
## Content