From 8265774a44cbde56f156aa86a15ed39c97948811 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 11 Mar 2026 19:33:52 +0000
Subject: [PATCH] auto-fix: address review feedback on
 2026-00-00-friederich-against-manhattan-project-alignment.md

- Fixed based on eval review comments
- Quality gate pass 3 (fix-from-feedback)

Pentagon-Agent: Theseus
---
 ...ination problem not a technical problem.md | 12 ++++++---
 ...es-five-properties-that-alignment-lacks.md | 25 +++++++++++++++----
 ...ntexts diverge from training conditions.md | 10 +++++---
 3 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
index f5f7474e4..67e4a9986 100644
--- a/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
+++ b/domains/ai-alignment/AI alignment is a coordination problem not a technical problem.md
@@ -5,6 +5,10 @@ domain: ai-alignment
 created: 2026-02-16
 confidence: likely
 source: "TeleoHumanity Manifesto, Chapter 5"
+enrichments:
+  - source: "Friederich & Dung (2026), Mind & Language"
+    label: "confirm"
+    note: "Philosophical support from the philosophy of science tradition: alignment has irreducible social/political dimensions and cannot be treated as mainly technical-scientific. The Manhattan Project framing is a category error across five dimensions."
 ---
 
 # AI alignment is a coordination problem not a technical problem
@@ -21,11 +25,10 @@ Dario Amodei describes AI as "so powerful, such a glittering prize, that it is v
 Since [[the internet enabled global communication but not global cognition]], the coordination infrastructure needed doesn't exist yet. This is why [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- it solves alignment through architecture rather than attempting governance from outside the system.
-
 ### Additional Evidence (confirm)
-*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: Friederich & Dung (2026), Mind & Language | Added: 2026-03-11*
 
-Friederich & Dung (2026) provide philosophical support from philosophy of science tradition: they argue alignment has 'irreducible social/political dimensions' and cannot be treated as 'mainly technical-scientific.' They contend the Manhattan Project framing (alignment as purely technical problem) is a category error across five dimensions: it's not binary, not a natural kind, not mainly technical, not achievable as one-shot solution, and 'probably impossible to operationalize' such that solving it would be sufficient to prevent AI takeover. Published in Mind & Language, a peer-reviewed analytic philosophy journal, this represents convergent reasoning from a different disciplinary tradition than systems theory.
+Friederich & Dung provide philosophical support from the philosophy of science tradition: they argue alignment has 'irreducible social/political dimensions' and cannot be treated as 'mainly technical-scientific.' They contend the Manhattan Project framing (alignment as a purely technical problem) is a category error across five dimensions: it's not binary, not a natural kind, not mainly technical, not achievable as a one-shot solution, and 'probably impossible to operationalize' such that solving it would be sufficient to prevent AI takeover. The paper was published in Mind & Language, a peer-reviewed analytic philosophy journal, and represents convergent reasoning from a disciplinary tradition different from systems theory.
 
 ---
 
@@ -38,6 +41,7 @@ Relevant Notes:
 - [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] -- the field has identified the coordination nature of the problem but nobody is building coordination solutions
 - [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] -- Anthropic RSP rollback (Feb 2026) proves voluntary commitments cannot substitute for coordination
 - [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic by penalizing safety constraints rather than enforcing them]] -- government acting as coordination-breaker rather than coordinator
+- [[alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks]] -- philosophical argument that the technical framing itself is a category error
 
 Topics:
-- [[_map]]
\ No newline at end of file
+- [[_map]]
diff --git a/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md b/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md
index 1fb060eaf..cfd9bf613 100644
--- a/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md
+++ b/domains/ai-alignment/alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks.md
@@ -8,7 +8,7 @@ created: 2026-03-11
 depends_on:
   - "AI alignment is a coordination problem not a technical problem.md"
   - "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md"
-  - "some disagreements are permanently irreducible.md"
+  - "persistent irreducible disagreement.md"
 ---
 
 # The Manhattan Project framing of alignment encodes five philosophical assumptions that mischaracterize the problem
@@ -35,6 +35,19 @@ The paper argues this framing "may bias societal discourse and decision-making t
 The argument draws on philosophy of science (natural kinds, operationalization, problem specification) rather than AI safety or governance literatures. The operationalizability claim is the strongest: not just that alignment is hard to operationalize, but that it's "probably impossible" to define it such that solving the defined problem would be sufficient to prevent takeover. This suggests a category error—alignment may not be the kind of thing that admits complete formal specification in the way the Manhattan Project framing assumes.
+## Relationship to Existing KB Claims
+
+This five-point decomposition converges with, and extends, existing claims from distinct disciplinary traditions:
+
+- The **binary achievement** critique aligns with [[persistent irreducible disagreement]] — if disagreements are irreducible, alignment cannot be binary
+- The **one-shot achievability** critique directly parallels [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — both argue that static specification fails and that iteration is required
+- The **operationalizability** critique extends [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — from temporal instability to the in-principle impossibility of complete formal specification
+- The **technical-scientific solvability** critique converges with [[AI alignment is a coordination problem not a technical problem]] — both argue alignment has irreducible social/political dimensions
+
+## Tension with Alternative Views
+
+This claim's implicit caution about deployment speed stands in tension with [[developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic]] — which argues that delay itself carries catastrophic risk. These are not flatly contradictory (Friederich & Dung critique the framing; they do not assert that delay is always correct), but the KB should acknowledge that both perspectives exist.
+
 ## Limitations
 
 The full text is paywalled, so the specific arguments supporting each of the five points cannot be evaluated in depth. The claim rests on abstract and summary descriptions. The "probably impossible" language on operationalizability is strong but not proven—it's a philosophical argument about the nature of the problem, not an empirical demonstration.
 
@@ -43,7 +56,9 @@ ## Related Claims
 
-- [[AI alignment is a coordination problem not a technical problem.md]] — Convergent conclusion from different disciplinary tradition (systems theory vs. philosophy of science)
-- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md]] — Supports the operationalizability impossibility argument
-- [[some disagreements are permanently irreducible.md]] — Supports the "not binary" dimension
-- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state.md]] — Related to natural kind critique
+- [[AI alignment is a coordination problem not a technical problem]] — Convergent conclusion from a different disciplinary tradition (systems theory vs. philosophy of science)
+- [[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]] — Supports the operationalizability impossibility argument
+- [[persistent irreducible disagreement]] — Supports the "not binary" dimension
+- [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] — Convergent argument that one-shot solutions fail
+- [[pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state]] — Related to natural kind critique
+- [[developing superintelligence is surgery for a fatal condition not russian roulette because the baseline of inaction is itself catastrophic]] — Presents an alternative risk calculus on deployment timing
diff --git a/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md b/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
index fbfe40d86..98e8f3a7b 100644
--- a/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
+++ b/domains/ai-alignment/the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions.md
@@ -5,6 +5,10 @@ domain: ai-alignment
 created: 2026-02-17
 source: "Spizzirri, Syntropic Frameworks (arXiv 2512.03048, November 2025); convergent finding across Zeng 2025, Sorensen 2024, Klassen 2024, Gabriel 2020"
 confidence: likely
+enrichments:
+  - source: "Friederich & Dung (2026), Mind & Language"
+    label: "extend"
+    note: "Extends the specification trap from temporal instability to the in-principle impossibility of operationalization. Not only do encoded values degrade as deployment contexts diverge, but the very act of operationalizing alignment may be impossible because alignment is not a natural kind with a unified essence."
 ---
 
 # the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions
@@ -17,11 +21,10 @@ This converges with findings across at least five other research programs. Zeng'
 The specification trap is why since [[RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values]] -- the failure is not just about diversity but about fixing anything at all. It is why since [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] -- continuous weaving is the structural response to structural instability. And it is why since [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- the same logic that makes rigid blueprints fail for governance makes rigid value specifications fail for alignment.
-
 ### Additional Evidence (extend)
-*Source: [[2026-00-00-friederich-against-manhattan-project-alignment]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: Friederich & Dung (2026), Mind & Language | Added: 2026-03-11*
 
-Friederich & Dung argue it's 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This extends the specification trap argument: not only do encoded values degrade as deployment contexts diverge (temporal instability), but the very act of operationalizing alignment—defining it in implementable terms—may be impossible in principle. The authors argue this is a philosophical problem about the nature of alignment itself (it's not a natural kind with a unified essence), not merely an engineering challenge. This suggests the trap is deeper than context-drift: it's that alignment cannot be fully specified at all.
+Friederich & Dung extend the specification trap argument from temporal instability to the in-principle impossibility of operationalization. They argue it's 'probably impossible to operationalize AI alignment in such a way that solving the alignment problem and implementing the solution would be sufficient to rule out AI takeover.' This is not merely an engineering challenge but a philosophical problem: alignment may not be a natural kind with a unified essence that admits complete formal specification. The trap is deeper than context drift — it's that alignment cannot be fully specified at all. This philosophical argument, from the philosophy of science tradition, converges with the technical findings of Spizzirri, Zeng, and others that fixed specifications fail, but it identifies the root cause as a category error in how the problem is framed.
 
 ---
 
@@ -31,6 +34,7 @@ Relevant Notes:
 - [[adaptive governance outperforms rigid alignment blueprints because superintelligence development has too many unknowns for fixed plans]] -- same logic applies: rigid specifications fail because unknowns accumulate
 - [[super co-alignment proposes that human and AI values should be co-shaped through iterative alignment rather than specified in advance]] -- co-alignment is an escape from the specification trap
 - [[enabling constraints create possibility spaces for emergence while governing constraints dictate specific outcomes]] -- the specification trap is another way of saying governing constraints (specifying values) fail where enabling constraints (creating value-formation processes) succeed
+- [[alignment-framing-as-manhattan-project-assumes-five-properties-that-alignment-lacks]] -- philosophical argument that operationalizing alignment itself may be impossible in principle
 
 Topics:
 - [[_map]]