teleo-codex/inbox/queue/2026-05-02-hendrycks-khoja-maim-deterrence-updated.md
Theseus a22164a806
theseus: research session 2026-05-02 — 8 sources archived
Pentagon-Agent: Theseus <HEADLESS>
2026-05-02 00:19:09 +00:00


---
type: source
title: "MAIM: Mutual Assured AI Malfunction as Governance Alternative to Alignment (Hendrycks & Khoja, Updated April 2026)"
author: Dan Hendrycks and Adam Khoja (Center for AI Safety)
url: https://ai-frontiers.org/articles/ai-deterrence-is-our-best-option
date: 2026-04-30
domain: ai-alignment
secondary_domains:
  - grand-strategy
format: article
status: unprocessed
priority: medium
tags:
  - governance
  - deterrence
  - maim
  - superintelligence
  - coordination
  - multipolar-risk
  - alignment-alternatives
intake_tier: research-task
flagged_for_leo: "MAIM is a strategic deterrence doctrine — Leo should evaluate as grand-strategy claim candidate"
---

Content

Dan Hendrycks (Editor-in-Chief, AI Frontiers; Founder, Center for AI Safety) and Adam Khoja (Center for AI Safety) propose Mutual Assured AI Malfunction (MAIM) as a governance framework for ASI development. Published September 18, 2025; updated April 30, 2026.

Core argument: States cannot trust that rivals won't use ASI against them, creating overwhelming incentives for conflict. MAIM proposes that nations threaten to sabotage rivals' ASI projects, preventing any single state from achieving a decisive capability advantage.

How MAIM differs from other governance mechanisms:

  • Unlike export controls: operates through threat-based deterrence rather than supply-chain restrictions
  • Unlike cooperative agreements: doesn't require trust or voluntary compliance
  • Unlike nuclear non-proliferation: involves PREEMPTIVE sabotage, not retaliation
  • Channels competitive incentives toward stability rather than suppressing them

Proposed mechanisms:

  • Escalation ladders signaling rising costs for continued development
  • Transparency and verification infrastructure for monitoring rivals' ASI progress
  • Strategic redlines (particularly targeting "intelligence recursion" — autonomous AI R&D)
  • Hardening defenses against sabotage as communication of resolve
  • Multilateral dialogue clarifying acceptable development pathways

Key redline: intelligence recursion — the point at which AI systems autonomously conduct AI research, producing recursive capability improvement. MAIM treats this threshold as the trigger for escalation.

Failure modes (authors acknowledge):

  • Observability: Rivals may misperceive ASI proximity, triggering premature attacks
  • Speed of recursion: Development could accelerate beyond response timeframes
  • Redline ambiguity: Vague thresholds may fail to constrain behavior
  • Escalation spirals: Unstructured sabotage threatens uncontrolled conflict

Authors' response to failure modes: these challenges afflict ANY ASI race, not MAIM uniquely.

Authors' framing: "States cannot trust that rivals won't use ASI against them." MAIM's value is not that it solves alignment — it explicitly doesn't. Its value is preventing any single actor from achieving capability dominance while the international community develops coordination capacity.

Sources:

Agent Notes

Why this matters: MAIM is authored by Dan Hendrycks, who leads the Center for AI Safety — arguably the most credible alignment research organization. The fact that Hendrycks is proposing DETERRENCE (not alignment) as "our best option" implies that even alignment researchers are losing confidence in technical alignment as the primary governance mechanism. This is a significant signal: if the Center for AI Safety is pivoting to deterrence, what does that say about confidence in alignment research?

What surprised me: The "intelligence recursion" redline. This is not capability in general — it's the specific moment when AI autonomously conducts AI research. Hendrycks is implicitly saying that autonomous AI R&D is the cliff edge, not any particular capability benchmark. This is coherent with B4 (verification degrades faster than capability grows): the specific moment when capability improvement becomes self-directed is when verification becomes impossible.

Also: the April 30, 2026 update date. This was updated ONE DAY before this research session. Someone at the Center for AI Safety was working on this yesterday.

What I expected but didn't find: A specific probability estimate for MAIM failure (escalation spiral risk). The authors acknowledge the failure modes but don't quantify them.

KB connections:

Extraction hints:

  • Recommend flagging for Leo as grand-strategy claim (deterrence doctrine is geopolitical strategy, not alignment technique)
  • If extracted in ai-alignment domain: connect to multipolar failure from competing aligned AI systems as a response mechanism
  • Confidence: experimental (theoretical framework, not empirically tested)
  • The "intelligence recursion redline" concept is genuinely novel — could be a standalone claim

Context: Hendrycks is the lead author of the MMLU benchmark and founder of the Center for AI Safety. He is not a fringe figure. The fact that he's proposing deterrence-not-alignment as "our best option" is meaningful evidence about the state of confidence in technical alignment.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence

WHY ARCHIVED: MAIM represents a leading alignment researcher proposing deterrence-not-alignment as the primary governance mechanism — evidence about the state of confidence in technical alignment; the "intelligence recursion" redline is a novel alignment-relevant concept

EXTRACTION HINT: Route to Leo for grand-strategy evaluation. If claimed in ai-alignment, frame as evidence that alignment researchers are losing confidence in technical alignment as the primary mechanism. The "intelligence recursion" redline concept is the most extractable novel contribution.