- Source: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
- Domain: ai-alignment
| type | domain | description | confidence | source | created |
|---|---|---|---|---|---|
| claim | ai-alignment | Philosophical critique arguing the Manhattan Project framing of AI alignment rests on five false assumptions about the nature of the alignment problem | experimental | Simon Friederich & Leonard Dung, 'Against the Manhattan Project Framing of AI Alignment', Mind & Language (2026) | 2026-03-11 |
The Manhattan Project framing of alignment assumes five properties that alignment lacks: binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability.
Friederich and Dung argue that AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years—a "Manhattan project"—but this framing fails across five independent dimensions:
- Not binary: Alignment is not a yes/no achievement but exists on a continuous spectrum with no clear threshold
- Not a natural kind: Alignment is not a single unified phenomenon but a heterogeneous collection of distinct problems
- Not purely technical-scientific: Alignment has irreducible social and political dimensions that cannot be solved through technical means alone
- Not achievable as one-shot solution: Alignment cannot realistically be solved once and deployed permanently; it requires ongoing adjustment
- Not clearly operationalizable: It is "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover"
The authors argue this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by creating false confidence that alignment is a tractable engineering problem with a definite solution timeline.
This is a philosophy-of-science critique, distinct from both technical AI safety work and the governance literature. The claim that operationalization is probably impossible in principle, not merely difficult, is stronger than most coordination-focused critiques, which typically hold that alignment is hard to operationalize rather than impossible.
Evidence
Published in Mind & Language (2026), a respected analytic philosophy journal. The five-point decomposition provides a structured argument that alignment fails to meet the preconditions required for Manhattan-project-style problem-solving.
Limitation: The full text is paywalled. This extraction is based on the abstract, search results, and related discussion; the underlying philosophical arguments for each dimension require access to the complete paper for full evaluation. The "impossible to operationalize" claim (dimension 5) is the strongest and most contestable: many alignment researchers would argue that while operationalization is difficult and context-dependent, it is not impossible in principle. The distinction between "very hard" and "impossible" matters significantly for research strategy.
Related Claims
- AI alignment is a coordination problem not a technical problem — convergent conclusion from different disciplinary tradition (philosophy of science vs. systems theory)
- the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions — relates to the operationalization-impossibility argument
- persistent irreducible disagreement — supports the non-binary nature of alignment
- pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state — extends the non-binary argument
- safe AI development requires building alignment mechanisms before scaling capability — challenged by the one-shot achievability critique