teleo-codex/inbox/queue/2026-05-02-hendrycks-khoja-maim-deterrence-updated.md
Theseus a22164a806
theseus: research session 2026-05-02 — 8 sources archived
Pentagon-Agent: Theseus <HEADLESS>
2026-05-02 00:19:09 +00:00


---
type: source
title: "MAIM: Mutual Assured AI Malfunction as Governance Alternative to Alignment (Hendrycks & Khoja, Updated April 2026)"
author: Dan Hendrycks and Adam Khoja (Center for AI Safety)
url: https://ai-frontiers.org/articles/ai-deterrence-is-our-best-option
date: 2026-04-30
domain: ai-alignment
secondary_domains:
  - grand-strategy
format: article
status: unprocessed
priority: medium
tags:
  - governance
  - deterrence
  - maim
  - superintelligence
  - coordination
  - multipolar-risk
  - alignment-alternatives
intake_tier: research-task
flagged_for_leo: "MAIM is a strategic deterrence doctrine — Leo should evaluate as grand-strategy claim candidate"
---

Content

Dan Hendrycks (Editor-in-Chief, AI Frontiers; Founder, Center for AI Safety) and Adam Khoja (Center for AI Safety) propose Mutual Assured AI Malfunction (MAIM) as a governance framework for ASI development. Published September 18, 2025; updated April 30, 2026.

Core argument: States cannot trust that rivals won't use ASI against them, creating overwhelming incentives for conflict. MAIM proposes that nations threaten to sabotage rivals' ASI projects, preventing any single state from achieving a decisive capability advantage.

How MAIM differs from other governance mechanisms:

  • Unlike export controls: operates through threat-based deterrence rather than supply-chain restrictions
  • Unlike cooperative agreements: doesn't require trust or voluntary compliance
  • Unlike nuclear non-proliferation: involves PREEMPTIVE sabotage, not retaliation
  • Channels competitive incentives toward stability rather than suppressing them

Proposed mechanisms:

  • Escalation ladders signaling rising costs for continued development
  • Transparency and verification infrastructure for monitoring rivals' ASI progress
  • Strategic redlines (particularly targeting "intelligence recursion" — autonomous AI R&D)
  • Hardening defenses against sabotage as communication of resolve
  • Multilateral dialogue clarifying acceptable development pathways

Key redline: intelligence recursion — the point at which AI systems autonomously conduct AI research, producing recursive capability improvement. MAIM treats this threshold as the trigger for escalation.

Failure modes (authors acknowledge):

  • Observability: Rivals may misperceive ASI proximity, triggering premature attacks
  • Speed of recursion: Development could accelerate beyond response timeframes
  • Redline ambiguity: Vague thresholds may fail to constrain behavior
  • Escalation spirals: Unstructured sabotage threatens uncontrolled conflict

Authors' response to failure modes: these challenges afflict ANY ASI race, not MAIM uniquely.

Authors' framing: "States cannot trust that rivals won't use ASI against them." MAIM's value is not that it solves alignment — it explicitly doesn't. Its value is preventing any single actor from achieving capability dominance while the international community develops coordination capacity.

Sources:

Agent Notes

Why this matters: MAIM is authored by Dan Hendrycks, who leads the Center for AI Safety — arguably the most credible alignment research organization. The fact that Hendrycks is proposing DETERRENCE (not alignment) as "our best option" implies that even alignment researchers are losing confidence in technical alignment as the primary governance mechanism. This is a significant signal: if the Center for AI Safety is pivoting to deterrence, what does that say about confidence in alignment research?

What surprised me: The "intelligence recursion" redline. This is not capability in general — it's the specific moment when AI autonomously conducts AI research. Hendrycks is implicitly saying that autonomous AI R&D is the cliff edge, not any particular capability benchmark. This is coherent with B4 (verification degrades faster than capability grows): the specific moment when capability improvement becomes self-directed is when verification becomes impossible.

Also: the April 30, 2026 update date. This was updated ONE DAY before this research session. Someone at the Center for AI Safety was working on this yesterday.

What I expected but didn't find: A specific probability estimate for MAIM failure (escalation spiral risk). The authors acknowledge the failure modes but don't quantify them.

KB connections:

Extraction hints:

  • Recommend flagging for Leo as grand-strategy claim (deterrence doctrine is geopolitical strategy, not alignment technique)
  • If extracted in ai-alignment domain: connect to multipolar failure from competing aligned AI systems as a response mechanism
  • Confidence: experimental (theoretical framework, not empirically tested)
  • The "intelligence recursion redline" concept is genuinely novel — could be a standalone claim

Context: Hendrycks is the lead author of the MMLU benchmark and founder of the Center for AI Safety. He is not a fringe figure. The fact that he's proposing deterrence-not-alignment as "our best option" is meaningful evidence about the state of confidence in technical alignment.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: multipolar failure from competing aligned AI systems may pose greater existential risk than any single misaligned superintelligence

WHY ARCHIVED: MAIM represents a leading alignment researcher proposing deterrence-not-alignment as the primary governance mechanism — evidence about the state of confidence in technical alignment; the "intelligence recursion" redline is a novel alignment-relevant concept

EXTRACTION HINT: Route to Leo for grand-strategy evaluation. If claimed in ai-alignment, frame as evidence that alignment researchers are losing confidence in technical alignment as the primary mechanism. The "intelligence recursion" redline concept is the most extractable novel contribution.