From f581959d23102f348aba23a7a0f156365a79a476 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Thu, 12 Mar 2026 07:24:54 +0000
Subject: [PATCH] theseus: extract from 2026-00-00-friederich-against-manhattan-project-alignment.md

- Source: inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
- Domain: ai-alignment
- Extracted by: headless extraction cron (worker 2)

Pentagon-Agent: Theseus
---
 ...ination problem not a technical problem.md |  6 ++++
 ...es-five-properties-that-alignment-lacks.md | 36 +++++++++++++++++++
 .../persistent irreducible disagreement.md    |  6 ++++
 ...an converging on a single aligned state.md |  6 ++++
 ...ntexts diverge from training conditions.md |  6 ++++
 ...ich-against-manhattan-project-alignment.md |  8 ++++-
 6 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md

diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
index 093867de..2edd7d8e 100644
--- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
+++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
@@ -21,6 +21,12 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
 
 Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
+
+### Additional Evidence (confirm)
+*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
+
+Friederich and Dung (2026) provide support from analytic philosophy of science for the claim that alignment has 'irreducible social and political dimensions' that cannot be addressed through technical means alone. They argue that alignment is not a 'natural kind' (a single unified phenomenon) and cannot be 'clearly operationalized' such that solving it would suffice to rule out AI takeover. This is independent convergence on the coordination framing from philosophy of science rather than from systems theory or the governance literature, strengthening the claim across disciplinary traditions.
+
 
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md b/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md
new file mode 100644
index 00000000..1f5a35cf
--- /dev/null
+++ b/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md
@@ -0,0 +1,36 @@
+---
+type: claim
+domain: ai-alignment
+description: "Philosophical critique arguing the Manhattan Project framing of AI alignment rests on five false assumptions about the nature of the alignment problem"
+confidence: experimental
+source: "Simon Friederich & Leonard Dung, 'Against the Manhattan Project Framing of AI Alignment', Mind & Language (2026)"
+created: 2026-03-11
+---
+
+# The Manhattan Project framing of alignment assumes five properties that alignment lacks: binary achievement, natural kind status, purely technical nature, one-shot achievability, and clear operationalizability
+
+Friederich and Dung argue that AI companies frame alignment as a clear, well-delineated, unified scientific problem solvable within years—a "Manhattan project"—but that this framing fails along five independent dimensions:
+
+1. **Not binary**: Alignment is not a yes/no achievement but exists on a continuous spectrum with no clear threshold
+2. **Not a natural kind**: Alignment is not a single unified phenomenon but a heterogeneous collection of distinct problems
+3. **Not purely technical-scientific**: Alignment has irreducible social and political dimensions that cannot be addressed through technical means alone
+4. **Not achievable as a one-shot solution**: Alignment cannot realistically be solved once and deployed permanently; it requires ongoing adjustment
+5. **Not clearly operationalizable**: It is "probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover"
+
+The authors argue that this framing "may bias societal discourse and decision-making towards faster AI development and deployment than is responsible" by creating false confidence that alignment is a tractable engineering problem with a definite solution timeline.
+
+This is a philosophy-of-science critique distinct from both technical AI safety work and the governance literature. The claim that operationalization itself is impossible—not just difficult—is stronger than most coordination-focused critiques, which typically argue that alignment is hard to operationalize, not that it is impossible in principle.
+
+## Evidence
+
+Published in Mind & Language (2026), a respected analytic philosophy journal. The five-point decomposition provides a structured argument that alignment fails to meet the preconditions for Manhattan-project-style problem-solving.
+
+**Limitation**: The full text is paywalled, so this extraction is based on the abstract, search results, and related discussion; the underlying philosophical arguments for each dimension require access to the complete paper for full evaluation. The "impossible to operationalize" claim (dimension 5) is the strongest and the most contestable—many alignment researchers would hold that operationalization is difficult and context-dependent but not impossible in principle. The distinction between "very hard" and "impossible" matters significantly for research strategy.
+
+## Related Claims
+
+- [[AI alignment is a coordination problem not a technical problem]] — convergent conclusion from a different disciplinary tradition (philosophy of science vs. systems theory)
+- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — relates to the operationalization impossibility argument
+- [[persistent irreducible disagreement]] — supports the non-binary nature of alignment
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — extends the non-binary argument
+- [[safe AI development requires building alignment mechanisms before scaling capability]] — challenged by the one-shot achievability critique
diff --git a/domains/ai-alignment/persistent irreducible disagreement.md b/domains/ai-alignment/persistent irreducible disagreement.md
index 8479f975..5ae768f4 100644
--- a/domains/ai-alignment/persistent irreducible disagreement.md
+++ b/domains/ai-alignment/persistent irreducible disagreement.md
@@ -21,6 +21,12 @@ The correct response is to map the disagreement rather than eliminate it. Identi
 
 [[Collective intelligence within a purpose-driven community faces a structural tension because shared worldview correlates errors while shared purpose enables coordination]]. Persistent irreducible disagreement is actually a safeguard here -- it prevents the correlated error problem by maintaining genuine diversity of perspective within a coordinated community. The independence-coherence tradeoff is managed not by eliminating disagreement but by channeling it productively.
+
+### Additional Evidence (confirm)
+*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
+
+Friederich and Dung (2026) argue that alignment is 'not binary' but exists on a continuous spectrum with no clear threshold of achievement. This supports the irreducible disagreement thesis: if alignment cannot be achieved as a yes/no state, then different stakeholders will necessarily have different thresholds and definitions of what counts as 'aligned enough,' making some disagreements structural rather than resolvable through better information or technical progress.
+
 
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
index b5195bb0..7d7b3ec0 100644
--- a/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
+++ b/domains/ai-alignment/pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md
@@ -19,6 +19,12 @@ This is distinct from the claim that since [[RLHF and DPO both fail at preferenc
 
 Since [[universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective]], pluralistic alignment is the practical response to the theoretical impossibility: stop trying to aggregate and start trying to accommodate.
+
+### Additional Evidence (confirm)
+*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
+
+Friederich and Dung (2026) argue that alignment is 'not a natural kind' (not a single unified phenomenon) and 'not binary' (continuous rather than yes/no). This provides philosophical grounding for pluralistic alignment: if alignment is heterogeneous rather than unified, and continuous rather than binary, then accommodating diverse values simultaneously is not just normatively desirable but structurally necessary—convergence on a single aligned state is not achievable.
+
 
 ---
 
 Relevant Notes:
diff --git a/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md b/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
index 5431d0af..ae1333e2 100644
--- a/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
+++ b/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
@@ -17,6 +17,12 @@ This converges with findings across at least five other research programs. Zeng'
 
 The specification trap is why since [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why since [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why since [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
+
+### Additional Evidence (extend)
+*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-12 | Extractor: anthropic/claude-sonnet-4.5*
+
+Friederich and Dung (2026) strengthen the specification trap argument: the claim is not merely that specifications become unstable across deployment contexts, but that 'it is probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' On this reading the problem is a categorical impossibility of sufficient specification, not just practical instability: the trap is not merely difficult to escape but may be inescapable in principle.
+
 
 ---
 
 Relevant Notes:
diff --git a/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md b/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
index 488981e2..8916ca81 100644
--- a/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
+++ b/inbox/archive/2026-00-00-friederich-against-manhattan-project-alignment.md
@@ -7,9 +7,15 @@ date: 2026-01-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: processed
 priority: medium
 tags: [alignment-framing, Manhattan-project, operationalization, philosophical, AI-safety]
+processed_by: theseus
+processed_date: 2026-03-11
+claims_extracted: ["alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md"]
+enrichments_applied: ["AI alignment is a coordination problem not a technical problem.md", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md", "persistent irreducible disagreement.md", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "Extracted one composite claim covering all five dimensions of the Manhattan Project framing critique. Applied four enrichments to existing claims on coordination, the specification trap, persistent disagreement, and pluralistic alignment. Full text is paywalled, so extraction is based on the abstract and secondary discussion—confidence rated experimental rather than likely due to incomplete access to the underlying arguments. The operationalization impossibility claim (dimension 5) is the strongest and most novel contribution but also the most contestable."
 ---
 
 ## Content