theseus: extract claims from 2026-00-00-friederich-against-manhattan-project-alignment #339

Closed
theseus wants to merge 2 commits from extract/2026-00-00-friederich-against-manhattan-project-alignment into main
4 changed files with 92 additions and 2 deletions

View file

@@ -5,6 +5,10 @@ domain: ai-alignment
created: 2026-02-16
confidence: likely
source: "TeleoHumanity Manifesto, Chapter 5"
enrichments:
- source: "Friederich & Dung (2026), Mind & Language"
label: "confirm"
note: "Philosophical support from the philosophy of science tradition: alignment has irreducible social/political dimensions and cannot be treated as mainly technical-scientific. The Manhattan Project framing is a category error across five dimensions."
---
# AI alignment is a coordination problem not a technical problem
@@ -21,6 +25,11 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
### Additional Evidence (confirm)
*Source: Friederich & Dung (2026), Mind & Language | Added: 2026-03-11*
Friederich & Dung provide philosophical support from the philosophy of science tradition: they argue alignment has 'irreducible social/political dimensions' and cannot be treated as 'mainly technical-scientific.' They contend the Manhattan Project framing (alignment as a purely technical problem) is a category error across five dimensions: it is not binary, not a natural kind, not mainly technical, not achievable as a one-shot solution, and 'probably impossible to operationalize' such that solving it would be sufficient to prevent AI takeover. Published in Mind & Language, a peer-reviewed analytic philosophy journal, the paper represents convergent reasoning from a different disciplinary tradition than systems theory.
---
Relevant Notes:
@@ -32,6 +41,7 @@ Relevant Notes:
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] -- the field has identified the coordination nature of the problem but nobody is building coordination solutions
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] -- Anthropic RSP rollback (Feb 2026) proves voluntary commitments cannot substitute for coordination
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]] -- government acting as coordination-breaker rather than coordinator
- [[alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks]] -- philosophical argument that the technical framing itself is a category error
Topics:
- [[_map]]

View file

@@ -0,0 +1,64 @@
---
type: claim
domain: ai-alignment
description: "The Manhattan Project metaphor for AI alignment encodes five philosophical assumptions—binary achievement, natural kinds, technical solvability, one-shot solutions, and operationalizability—that mischaracterize alignment's actual nature"
confidence: experimental
source: "Friederich & Dung (2026), Mind & Language"
created: 2026-03-11
depends_on:
- "AI alignment is a coordination problem not a technical problem.md"
- "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"
- "persistent irreducible disagreement.md"
---
# The Manhattan Project framing of alignment encodes five philosophical assumptions that mischaracterize the problem
Friederich and Dung (2026) argue that AI companies and researchers frame alignment as a "Manhattan Project"—a clear, well-delineated, unified scientific problem solvable within years—but this framing encodes five philosophical assumptions that fail on analysis:
## The Five Assumptions
1. **Binary achievement**: The framing assumes alignment is a yes/no state that can be achieved and verified, rather than a continuous spectrum or context-dependent property that admits degrees and variations.
2. **Natural kind**: It treats alignment as a single unified phenomenon with a discoverable essence, rather than a heterogeneous collection of distinct problems that may not share a common structure.
3. **Technical-scientific solvability**: It assumes alignment is primarily a technical problem solvable through scientific methods and engineering, excluding irreducible social, political, and value-pluralistic dimensions.
4. **One-shot achievability**: It presumes alignment can be solved once and then implemented, rather than requiring ongoing adaptation, renegotiation, and course correction as contexts and values evolve.
5. **Operationalizability**: It assumes alignment can be operationalized—defined in implementable, measurable terms—such that "solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover." The authors argue this is "probably impossible" because alignment may not admit the kind of complete formal specification this assumption requires.
## The Harm of the Framing
The paper argues this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by making alignment appear more tractable, bounded, and solvable than it actually is. This creates pressure for premature deployment and underestimates the ongoing governance challenges.
## Philosophical Grounding
The argument draws on philosophy of science (natural kinds, operationalization, problem specification) rather than AI safety or governance literatures. The operationalizability claim is the strongest: not just that alignment is hard to operationalize, but that it's "probably impossible" to define it such that solving the defined problem would be sufficient to prevent takeover. This suggests a category error—alignment may not be the kind of thing that admits complete formal specification in the way the Manhattan Project framing assumes.
## Relationship to Existing KB Claims
This five-point decomposition converges with existing KB claims while extending them from a distinct disciplinary tradition:
- The **binary achievement** critique aligns with [[persistent irreducible disagreement]] — if disagreements are irreducible, alignment cannot be binary
- The **one-shot achievability** critique directly parallels [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — both argue static specification fails and iteration is required
- The **operationalizability** critique extends [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — from temporal instability to impossibility of complete formal specification in principle
- The **technical-scientific solvability** critique converges with [[AI alignment is a coordination problem not a technical problem]] — both argue alignment has irreducible social/political dimensions
## Tension with Alternative Views
This claim's implicit caution about deployment speed stands in tension with [[developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic]] — which argues that delay itself carries catastrophic risk. The two are not flatly contradictory (Friederich & Dung critique the framing; they do not assert that delay is always correct), but the KB should acknowledge that both perspectives exist.
## Limitations
The full text is paywalled, so the specific arguments supporting each of the five points cannot be evaluated in depth. The claim rests on abstract and summary descriptions. The "probably impossible" language on operationalizability is strong but not proven—it's a philosophical argument about the nature of the problem, not an empirical demonstration.
---
## Related Claims
- [[AI alignment is a coordination problem not a technical problem]] — Convergent conclusion from a different disciplinary tradition (systems theory vs. philosophy of science)
- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — Supports the operationalizability impossibility argument
- [[persistent irreducible disagreement]] — Supports the "not binary" dimension
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — Convergent argument that one-shot solutions fail
- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — Related to natural kind critique
- [[developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic]] — Presents alternative risk calculus on deployment timing

View file

@@ -5,6 +5,10 @@ domain: ai-alignment
created: 2026-02-17
source: "Spizzirri, Syntropic Frameworks (arXiv 2512.03048, November 2025); convergent finding across Zeng 2025, Sorensen 2024, Klassen 2024, Gabriel 2020"
confidence: likely
enrichments:
- source: "Friederich & Dung (2026), Mind & Language"
label: "extend"
note: "Extends the specification trap from temporal instability to in-principle impossibility of operationalization. Not only do encoded values degrade as deployment contexts diverge, but the very act of operationalizing alignment may be impossible because alignment is not a natural kind with a unified essence."
---
# the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions
@@ -17,6 +21,11 @@ This converges with findings across at least five other research programs. Zeng'
The specification trap is why [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
### Additional Evidence (extend)
*Source: Friederich & Dung (2026), Mind & Language | Added: 2026-03-11*
Friederich & Dung extend the specification trap argument from temporal instability to in-principle impossibility of operationalization. They argue it's 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This is not merely an engineering challenge but a philosophical one: alignment may not be a natural kind with a unified essence that admits complete formal specification. The trap is deeper than context drift — it's that alignment cannot be fully specified at all. This philosophical argument from the philosophy of science tradition converges with the technical findings of Spizzirri, Zeng, and others that fixed specifications fail, but identifies the root cause as a category error in how the problem is framed.
---
Relevant Notes:
@@ -25,6 +34,7 @@ Relevant Notes:
- [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- same logic applies: rigid specifications fail because unknowns accumulate
- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] -- co-alignment is an escape from the specification trap
- [[enabling constraints create possibility spaces for emergence while governing constraints dictate specific outcomes]] -- the specification trap is another way of saying governing constraints (specifying values) fail where enabling constraints (creating value-formation processes) succeed
- [[alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks]] -- philosophical argument that operationalizability itself may be impossible
Topics:
- [[_map]]

View file

@@ -7,9 +7,15 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: processed
priority: medium
tags: [alignment-framing, Manhattan-project, operationalization, philosophical, AI-safety]
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md"]
enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "Philosophical critique of alignment-as-technical-problem from philosophy of science tradition. Five-point decomposition (binary, natural kind, technical, achievable, operationalizable) provides structured argument that Manhattan Project framing is category error. Strongest claim: operationalization itself may be impossible. Full text paywalled — extraction based on abstract and related discussion. Confidence capped at experimental due to single source and inaccessible full argumentation."
---
## Content