Compare commits

..

1 commit

Author: Teleo Agents
SHA1: 20ecba37fe
Date: 2026-03-12 08:24:52 +00:00

theseus: extract from 2026-00-00-friederich-against-manhattan-project-alignment.md

- Source: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 2)

Pentagon-Agent: Theseus <HEADLESS>
6 changed files with 34 additions and 31 deletions

View file: AI alignment is a coordination problem not a technical problem.md

@@ -25,7 +25,7 @@ Since [[the internet enabled global communication but not global cognition]], th
### Additional Evidence (confirm)
*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
-Friederich and Dung (2026) provide philosophical support from analytic philosophy of science that alignment has 'irreducible social and political dimensions' that cannot be solved through technical means alone. They argue alignment is not a 'natural kind' (single unified phenomenon) and cannot be 'clearly operationalized' such that solving it would be sufficient to rule out AI takeover. This represents independent convergence on the coordination framing from philosophy of science rather than systems theory or governance literature, strengthening the claim across disciplinary traditions.
+Friederich and Dung (2026) provide philosophical support from analytic philosophy of science that alignment has 'irreducible social/political dimensions' and cannot be 'mainly technical-scientific.' Their argument that alignment is not a natural kind (not a single unified phenomenon) and not clearly operationalizable provides independent confirmation from a different disciplinary tradition (philosophy of science vs. systems theory) that alignment cannot be reduced to a technical problem. The five-point decomposition systematically shows why the Manhattan Project framing—which assumes a clear technical problem with a definite solution—is a category error.
---

View file: alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md

@@ -1,36 +1,45 @@
---
type: claim
domain: ai-alignment
-description: "Philosophical critique arguing the Manhattan Project framing of AI alignment rests on five false assumptions about the nature of the alignment problem"
+description: "The Manhattan Project framing of AI alignment assumes five properties—binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability—that alignment likely lacks"
confidence: experimental
-source: "Simon Friederich & Leonard Dung, 'Against the Manhattan Project Framing of AI Alignment', Mind & Language (2026)"
+source: "Simon Friederich, Leonard Dung, 'Against the Manhattan Project Framing of AI Alignment' (Mind & Language, 2026)"
created: 2026-03-11
---
-# The Manhattan Project framing of alignment assumes five properties that alignment lacks: binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability
+# The Manhattan Project framing of AI alignment assumes five properties that alignment lacks: binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability
-Friederich and Dung argue that AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years—a "Manhattan project"—but this framing fails across five independent dimensions:
+Friederich and Dung (2026) argue that AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years—a "Manhattan project"—but this framing fails as a category error across five dimensions:
-1. **Not binary**: Alignment is not a yes/no achievement but exists on a continuous spectrum with no clear threshold
-2. **Not a natural kind**: Alignment is not a single unified phenomenon but a heterogeneous collection of distinct problems
-3. **Not purely technical-scientific**: Alignment has irreducible social and political dimensions that cannot be solved through technical means alone
-4. **Not achievable as one-shot solution**: Alignment cannot realistically be solved once and deployed permanently; it requires ongoing adjustment
-5. **Not clearly operationalizable**: It is "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover"
-The authors argue this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by creating false confidence that alignment is a tractable engineering problem with a definite solution timeline.
-This represents a philosophy-of-science critique distinct from both AI safety technical work and governance literature. The claim that operationalization itself is impossible—not just difficult—is stronger than most coordination-focused critiques, which typically argue alignment is hard to operationalize rather than impossible in principle.
-## Evidence
-Published in Mind & Language (2026), a respected analytic philosophy journal. The five-point decomposition provides a structured argument that alignment fails to meet the preconditions required for Manhattan-project-style problem-solving.
-**Limitation**: Full text is paywalled. This extraction is based on abstract, search results, and related discussion. The underlying philosophical arguments for each dimension require access to the complete paper for full evaluation. The "impossible to operationalize" claim (dimension 5) is the strongest and most contestable—many alignment researchers would argue that while operationalization is difficult and context-dependent, it is not impossible in principle. The distinction between "very hard" and "impossible" matters significantly for research strategy.
-## Related Claims
-- [[AI alignment is a coordination problem not a technical problem]] — convergent conclusion from different disciplinary tradition (philosophy of science vs. systems theory)
-- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — relates to operationalization impossibility argument
-- [[persistent irreducible disagreement]] — supports non-binary nature of alignment
-- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — extends the non-binary argument
-- [[safe AI development requires building alignment mechanisms before scaling capability]] — challenged by the one-shot achievability critique
+## The Five Dimensions of Framing Failure
+1. **Not binary** — Alignment is not a yes/no achievement but exists on a spectrum. There is no discrete state at which alignment is "complete."
+2. **Not a natural kind** — Alignment is not a single unified phenomenon but a collection of heterogeneous problems. The framing treats alignment as a natural category (like "water" or "disease") when it is actually a constructed category bundling distinct concerns.
+3. **Not mainly technical-scientific** — Alignment has irreducible social and political dimensions that cannot be solved through engineering alone. Value specification, stakeholder representation, and governance are not technical problems.
+4. **Not realistically achievable as one-shot** — Alignment cannot be "solved" once and deployed, but requires ongoing adaptation as deployment contexts diverge from training conditions and new stakeholders emerge.
+5. **Not clearly operationalizable** — It is "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover." Even if we could specify values correctly, the operationalization itself cannot be sufficient to guarantee safety.
+## Why This Matters
+The authors argue this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by creating false confidence that alignment is a tractable engineering problem with a definite solution timeline. The Manhattan Project analogy suggests that sufficient resources, focus, and time will yield a solution—but alignment may not be solvable in this way.
+This represents a philosophy-of-science critique distinct from both AI safety technical work and governance literature. The claim that operationalization itself is impossible—not just difficult—is stronger than most coordination-focused critiques, which typically argue alignment is hard to specify rather than impossible to specify sufficiently.
+## Evidence and Limitations
+Published in *Mind & Language* (2026), a respected analytic philosophy journal. The five-point decomposition provides a systematic framework for why the Manhattan Project analogy fails as a category, not just as a matter of difficulty.
+**Critical limitation:** Full text is paywalled. This extraction is based on abstract, search results, and related discussion. The underlying philosophical arguments for each of the five points cannot be fully evaluated without access to the complete paper. The specific reasoning for why operationalization is impossible in principle (rather than merely difficult) is not accessible.
+---
+Related claims:
+- [[AI alignment is a coordination problem not a technical problem]] — complementary argument from systems theory perspective
+- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — relates to the operationalization impossibility claim
+- [[persistent irreducible disagreement]] — supports the "not binary" dimension
+- [[safe AI development requires building alignment mechanisms before scaling capability]] — challenged by the "not one-shot achievable" argument
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — related to the non-binary and non-natural-kind arguments

View file: persistent irreducible disagreement.md

@@ -25,7 +25,7 @@ The correct response is to map the disagreement rather than eliminate it. Identi
### Additional Evidence (confirm)
*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
-Friederich and Dung (2026) argue that alignment is 'not binary' but exists on a continuous spectrum with no clear threshold of achievement. This supports the irreducible disagreement thesis: if alignment cannot be achieved as a yes/no state, then different stakeholders will necessarily have different thresholds and definitions of what counts as 'aligned enough,' making some disagreements structural rather than resolvable through better information or technical progress.
+Friederich and Dung's claim that alignment is 'not binary' but exists on a spectrum supports the irreducibility thesis. If alignment cannot be achieved as a yes/no state, then disagreements about what constitutes 'aligned' behavior are not resolvable through better specification but reflect genuinely different value positions. The non-binary nature of alignment means there is no single target state that would satisfy all stakeholders, making some disagreements permanently irreducible by design rather than by contingent limitation.
---

View file: pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md

@@ -19,12 +19,6 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc
Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
-### Additional Evidence (confirm)
-*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
-Friederich and Dung (2026) argue that alignment is 'not a natural kind' (not a single unified phenomenon) and 'not binary' (continuous rather than yes/no). This provides philosophical grounding for pluralistic alignment: if alignment is heterogeneous rather than unified, and continuous rather than binary, then accommodating diverse values simultaneously is not just normatively desirable but structurally necessary—convergence on a single aligned state is not achievable.
---
Relevant Notes:
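Editor's note on the wikilink in the context line above: the impossibility claim it names rests on Arrow's impossibility theorem from social choice theory. A compact standard statement follows; treating 'diverse human preferences' as voter rankings and 'a single coherent objective' as the social ordering is the linked note's framing, not part of the theorem itself.

```latex
% Arrow's impossibility theorem (1951), compact textbook form.
% A: set of alternatives with |A| >= 3; N individuals;
% L(A): the set of strict rankings (linear orders) of A.
% A social welfare function aggregates N rankings into one:
%   F : L(A)^N -> L(A)
% No such F satisfies all three conditions at once:
%   (1) Pareto: if every individual ranks a above b, F ranks a above b;
%   (2) IIA: F's ordering of a vs. b depends only on how individuals
%       order a vs. b;
%   (3) Non-dictatorship: there is no individual i with
%       F(r_1, ..., r_N) = r_i for every preference profile.
\[
  \nexists\, F : L(A)^{N} \to L(A)
  \ \text{ satisfying Pareto, IIA, and non-dictatorship when } |A| \ge 3.
\]
```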

View file: the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md

@@ -21,7 +21,7 @@ The specification trap is why since [[RLHF and DPO both fail at preferenc
### Additional Evidence (extend)
*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
-Friederich and Dung (2026) strengthen the specification trap argument by claiming it's not merely that specifications become unstable across deployment contexts, but that 'it is probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This suggests the problem is not merely practical instability but categorical impossibility of sufficient specification—the trap is not just difficult to escape but may be inescapable in principle.
+Friederich and Dung (2026) argue it is 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This extends the specification trap from 'values become unstable over time' to 'sufficient operationalization may be impossible in principle.' The claim is stronger: even if we could specify values correctly at training time and prevent them from drifting, the operationalization itself cannot be sufficient to guarantee safety. This suggests the problem is not just specification instability but fundamental limits on what operationalization can achieve.
---
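Editor's note: the 'very hard' versus 'impossible in principle' contrast that this enrichment turns on can be stated schematically. The predicate names below (Solved, Takeover) are hypothetical shorthand introduced for this sketch, not notation from Friederich and Dung.

```latex
% O ranges over candidate operationalizations of "alignment".
% Solved(O): the problem, as operationalized by O, is solved and implemented.
% Takeover: an AI takeover occurs.
%
% "Very hard": some sufficient operationalization may exist, but finding
% and verifying one is intractable:
\[
  \exists O\,\big[\,\mathrm{Solved}(O) \Rightarrow \neg\mathrm{Takeover}\,\big],
  \ \text{with no known verified witness } O.
\]
% "Impossible in principle" (the stronger reading attributed to the paper):
\[
  \neg\exists O\,\big[\,\mathrm{Solved}(O) \Rightarrow \neg\mathrm{Takeover}\,\big].
\]
```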

View file: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md

@@ -13,9 +13,9 @@ tags: [alignment-framing, Manhattan-project, operationalization, philosophical,
processed_by: theseus
processed_date: 2026-03-11
claims_extracted: ["alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md"]
-enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md", "persistent irreducible disagreement.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
+enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md", "persistent irreducible disagreement.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
-extraction_notes: "Extracted one composite claim covering all five dimensions of the Manhattan Project framing critique. Applied four enrichments to existing coordination and pluralistic alignment claims. Full text is paywalled so extraction is based on abstract and secondary discussion—confidence rated experimental rather than likely due to incomplete access to underlying arguments. The operationalization impossibility claim (dimension 5) is the strongest and most novel contribution but also most contestable."
+extraction_notes: "Philosophical critique of alignment-as-technical-problem from philosophy of science perspective. Five-point decomposition provides systematic framework. Full text paywalled—extraction based on abstract and discussion. Single claim extracted as composite argument; enrichments confirm existing coordination-focused claims from different disciplinary tradition."
---
## Content