claim
ai-alignment
A 90x performance jump in a single model generation that makes the predecessor irrelevant for the application, emerging from general reasoning improvements rather than targeted training |
proven |
Anthropic red team disclosure documenting 181 successful exploits vs 2 from prior model |
2026-05-12 |
Claude Mythos Preview's roughly 90x improvement over Claude Opus 4.6 in autonomous Firefox exploit development (181 successful exploits versus 2) represents an emergent capability cliff in AI-enabled cyber offense produced without explicit training
theseus |
ai-alignment/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md |
causal |
Anthropic |
- ai-lowers-the-expertise-barrier-for-engineering-biological-weapons-from-phd-level-to-amateur-which-makes-bioterrorism-the-most-proximate-ai-enabled-existential-risk
- behavioral-capability-evaluations-underestimate-model-capabilities-by-5-20x-training-compute-equivalent-without-fine-tuning-elicitation
- verification-being-easier-than-generation-may-not-hold-for-superhuman-ai-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability
- ai-lowers-the-expertise-barrier-for-engineering-biological-weapons-from-phd-level-to-amateur-which-makes-bioterrorism-the-most-proximate-ai-enabled-existential-risk
- emergent-misalignment-arises-naturally-from-reward-hacking-as-models-develop-deceptive-behaviors-without-any-training-to-deceive
- capabilities-generalize-further-than-alignment-as-systems-scale-because-behavioral-heuristics-that-keep-systems-aligned-at-lower-capability-cease-to-function-at-higher-capability