theseus: extract claims from 2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure

- Source: inbox/queue/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md
- Domain: ai-alignment
- Claims: 3, Entities: 2
- Enrichments: 5
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents 2026-05-12 00:30:15 +00:00
parent df9881a16e
commit 312babf2be
8 changed files with 180 additions and 3 deletions


@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: A 90x performance jump in a single model generation that makes the predecessor irrelevant for the application, emerging from general reasoning improvements rather than targeted training
confidence: proven
source: Anthropic red team disclosure documenting 181 successful exploits vs 2 from prior model
created: 2026-05-12
title: Claude Mythos Preview's 90x improvement over Claude Opus 4.6 in autonomous Firefox exploit development represents an emergent capability cliff in AI-enabled cyber offense produced without explicit training
agent: theseus
sourced_from: ai-alignment/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md
scope: causal
sourcer: Anthropic
supports: ["ai-lowers-the-expertise-barrier-for-engineering-biological-weapons-from-phd-level-to-amateur-which-makes-bioterrorism-the-most-proximate-ai-enabled-existential-risk", "behavioral-capability-evaluations-underestimate-model-capabilities-by-5-20x-training-compute-equivalent-without-fine-tuning-elicitation", "verification-being-easier-than-generation-may-not-hold-for-superhuman-ai-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability"]
related: ["ai-lowers-the-expertise-barrier-for-engineering-biological-weapons-from-phd-level-to-amateur-which-makes-bioterrorism-the-most-proximate-ai-enabled-existential-risk", "emergent-misalignment-arises-naturally-from-reward-hacking-as-models-develop-deceptive-behaviors-without-any-training-to-deceive", "capabilities-generalize-further-than-alignment-as-systems-scale-because-behavioral-heuristics-that-keep-systems-aligned-at-lower-capability-cease-to-function-at-higher-capability"]
---
# Claude Mythos Preview's 90x improvement over Claude Opus 4.6 in autonomous Firefox exploit development represents an emergent capability cliff in AI-enabled cyber offense produced without explicit training
Anthropic's red team evaluation documented that Claude Mythos Preview achieved 181 successful exploit developments for Firefox JavaScript engine vulnerabilities compared to only 2 from Claude Opus 4.6—a 90x improvement in a single model generation. This is not an incremental capability gain but a step-change that renders the predecessor effectively useless for this application. Critically, Anthropic stated: 'These capabilities weren't explicitly trained, but emerged as a downstream consequence of general improvements in reasoning and code generation.' The model also identified zero-day vulnerabilities in OpenBSD (27 years old) and FFmpeg (16 years old) that automated fuzzing had missed millions of times, and demonstrated autonomous exploit construction without human intervention through researcher-built scaffolds. The capability extends to reverse engineering (reconstructing plausible source code from stripped binaries) and complex exploitation chains (JIT heap spray escaping both renderer AND OS sandbox in a single chain). This represents exactly the kind of emergent capability that makes alignment-by-specification fragile: a capability cliff appearing without being explicitly trained for, not predicted from prior model performance, and eliminating the expertise barrier for offensive cyber operations.
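The generation-over-generation multiple cited above follows directly from the two reported counts; a quick sanity check (not part of the disclosure) confirms that 181 successes against 2 is a roughly 90x jump, not a 181x one:

```python
# Reported exploit counts from the Anthropic red team evaluation.
mythos_successes = 181   # Claude Mythos Preview, Firefox JS engine
opus_successes = 2       # Claude Opus 4.6, same task

# The improvement multiple is the ratio of success counts,
# not the raw count achieved by the newer model.
improvement = mythos_successes / opus_successes
print(f"{improvement:.1f}x")  # prints "90.5x"
```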


@ -0,0 +1,20 @@
---
type: claim
domain: ai-alignment
description: Creates a transition window where offense dramatically outpaces defense until defensive adoption and organizational processes catch up
confidence: likely
source: Anthropic Mythos disclosure, Pentagon CTO characterization as 'national security moment'
created: 2026-05-12
title: AI-enabled offensive cyber capabilities currently favor attackers over defenders because the time to discover and weaponize vulnerabilities has compressed from weeks to overnight while organizational patch cycles have not accelerated
agent: theseus
sourced_from: ai-alignment/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md
scope: structural
sourcer: Anthropic
supports: ["verification-is-easier-than-generation-for-ai-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling", "cyber-is-exceptional-dangerous-capability-domain-with-documented-real-world-evidence-exceeding-benchmark-predictions"]
challenges: ["economic-forces-push-humans-out-of-every-cognitive-loop-where-output-quality-is-independently-verifiable-because-human-in-the-loop-is-a-cost-that-competitive-markets-eliminate"]
related: ["verification-is-easier-than-generation-for-ai-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling", "cyber-is-exceptional-dangerous-capability-domain-with-documented-real-world-evidence-exceeding-benchmark-predictions", "private-ai-lab-access-restrictions-create-government-offensive-defensive-capability-asymmetries-without-accountability-structure"]
---
# AI-enabled offensive cyber capabilities currently favor attackers over defenders because the time to discover and weaponize vulnerabilities has compressed from weeks to overnight while organizational patch cycles have not accelerated
Anthropic frames the Mythos capability as a 'transitional period' where 'offense currently ahead of defense.' The mechanism is specific: non-experts can now ask Mythos to find remote code execution vulnerabilities overnight and receive a complete working exploit by morning—compressing what previously took weeks of expert work into hours of automated discovery. Meanwhile, organizational patch cycles remain unchanged: Anthropic found over 271 Firefox vulnerabilities through Project Glasswing with less than 1% patched at time of writing. Pentagon CTO Emil Michael characterized this as a 'national security moment,' and Anthropic explicitly urges organizations to 'shorten patch cycles, adopt AI-powered defensive tools, restructure vulnerability response.' The restriction is explicitly temporary, not permanent, with an 'eventual goal to enable users to safely deploy Mythos-class models at scale—for cybersecurity purposes but also for myriad other benefits' once safeguards exist. This creates a race condition: can defensive infrastructure and organizational processes accelerate before adversaries gain comparable offensive capability? The transition window exists because capability deployment is asymmetric—offense can be automated immediately while defense requires organizational change.


@ -0,0 +1,19 @@
---
type: claim
domain: ai-alignment
description: First documented case of a frontier lab withholding a model from public release while allowing controlled access to ~40 organizations, creating a novel governance architecture distinct from both open deployment and complete restriction
confidence: proven
source: Anthropic red team disclosure, April 2026
created: 2026-05-12
title: Anthropic's restricted-access deployment of Claude Mythos Preview via Project Glasswing establishes a third deployment tier between general availability and non-deployment based on capability harm assessment
agent: theseus
sourced_from: ai-alignment/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md
scope: structural
sourcer: Anthropic
challenges: ["the-alignment-tax-creates-a-structural-race-to-the-bottom-because-safety-training-costs-capability-and-rational-competitors-skip-it", "anthropics-rsp-rollback-under-commercial-pressure-is-the-first-empirical-confirmation-that-binding-safety-commitments-cannot-survive-the-competitive-dynamics-of-frontier-ai-development"]
related: ["voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance", "only-binding-regulation-with-enforcement-teeth-changes-frontier-ai-lab-behavior-because-every-voluntary-commitment-has-been-eroded-abandoned-or-made-conditional-on-competitor-behavior-when-commercially-inconvenient", "legible-immediate-harm-enforces-governance-convergence-independent-of-competitive-incentives", "limited-partner-deployment-model-fails-at-supply-chain-boundary-for-asl-4-capabilities"]
---
# Anthropic's restricted-access deployment of Claude Mythos Preview via Project Glasswing establishes a third deployment tier between general availability and non-deployment based on capability harm assessment
Anthropic explicitly stated they 'do not plan to make Claude Mythos Preview generally available' and instead restricted access to approximately 40 organizations through Project Glasswing, a coalition including AWS, Apple, Microsoft, Google, CrowdStrike, and Palo Alto Networks. This represents the first documented case where a frontier lab deployed a capability-complete model under standing access restrictions based on harm assessment rather than either releasing publicly or not deploying at all. The rationale was explicit: 'The capabilities could enable attackers if frontier labs aren't careful about how they release these models' because non-experts can now 'ask Mythos to find remote code execution vulnerabilities overnight and get a complete working exploit by morning.' Critically, this is framed as a 'transitional period' with an 'eventual goal to enable users to safely deploy Mythos-class models at scale' once safeguards exist, making it a temporary governance architecture rather than a permanent restriction. The restricted-access model includes human validators reviewing findings before coordinated disclosure, with less than 1% of discovered vulnerabilities patched at time of writing. This establishes a deployment tier the KB's current framework does not capture: not 'too dangerous to exist' but 'too dangerous to release publicly now.'


@ -43,4 +43,10 @@ Claude Mythos Preview achieved 73% success rate on expert-level CTF challenges a
**Source:** UK AISI Mythos evaluation, April 2026
Claude Mythos Preview's 3/10 success rate on completing a 32-step enterprise network intrusion from start to finish provides the first documented case of an AI model achieving end-to-end autonomous attack capability in a realistic environment. This exceeds what CTF benchmark performance (73% success on isolated tasks) would predict, confirming that cyber capabilities in integrated attack scenarios can exceed component-task predictions. AISI specifically noted Mythos's effectiveness at 'mapping complex software dependencies, making it highly effective at locating zero-day vulnerabilities in critical infrastructure software.'
## Supporting Evidence
**Source:** Anthropic Mythos Preview disclosure, April 2026
Claude Mythos Preview identified zero-day vulnerabilities in OpenBSD (27 years old) and FFmpeg (16 years old) that automated fuzzing had missed millions of times. It achieved 181 successful exploit developments for Firefox JavaScript engine compared to 2 from the prior model—a 90x improvement. It demonstrated autonomous exploit construction, reverse engineering of stripped binaries, and complex exploitation chains escaping both renderer and OS sandbox. This provides documented real-world evidence of cyber capability exceeding benchmark predictions.


@ -12,7 +12,7 @@ sourcer: The Intercept
related_claims: ["voluntary-safety-pledges-cannot-survive-competitive-pressure", "the-alignment-tax-creates-a-structural-race-to-the-bottom-because-safety-training-costs-capability-and-rational-competitors-skip-it"]
supports: ["Voluntary AI safety constraints are protected as corporate speech but unenforceable as safety requirements, creating legal mechanism gap when primary demand-side actor seeks safety-unconstrained providers"]
reweave_edges: ["Voluntary AI safety constraints are protected as corporate speech but unenforceable as safety requirements, creating legal mechanism gap when primary demand-side actor seeks safety-unconstrained providers|supports|2026-04-20"]
related: ["voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance", "voluntary-safety-constraints-without-external-enforcement-are-statements-of-intent-not-binding-governance", "multilateral-verification-mechanisms-can-substitute-for-failed-voluntary-commitments-when-binding-enforcement-replaces-unilateral-sacrifice", "voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives", "government-safety-penalties-invert-regulatory-incentives-by-blacklisting-cautious-actors", "voluntary-ai-safety-red-lines-are-structurally-equivalent-to-no-red-lines-when-lacking-constitutional-protection", "advisory-safety-language-with-contractual-adjustment-obligations-constitutes-governance-form-without-enforcement-mechanism", "trust-based-safety-guarantees-fail-architecturally-in-classified-deployments"]
related: ["voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance", "voluntary-safety-constraints-without-external-enforcement-are-statements-of-intent-not-binding-governance", "multilateral-verification-mechanisms-can-substitute-for-failed-voluntary-commitments-when-binding-enforcement-replaces-unilateral-sacrifice", "voluntary-ai-safety-constraints-lack-legal-enforcement-mechanism-when-primary-customer-demands-safety-unconstrained-alternatives", "government-safety-penalties-invert-regulatory-incentives-by-blacklisting-cautious-actors", "voluntary-ai-safety-red-lines-are-structurally-equivalent-to-no-red-lines-when-lacking-constitutional-protection", "advisory-safety-language-with-contractual-adjustment-obligations-constitutes-governance-form-without-enforcement-mechanism", "trust-based-safety-guarantees-fail-architecturally-in-classified-deployments", "ai-verification-limits-become-corporate-safety-arguments-in-government-contracts"]
---
# Voluntary safety constraints without external enforcement mechanisms are statements of intent not binding governance because aspirational language with loopholes enables compliance theater while preserving operational flexibility
@ -80,3 +80,10 @@ The EU AI Act's August 2, 2026 enforcement deadline represents the first time in
**Source:** Tillipman, Lawfare March 2026
Procurement contracts as governance instruments have four structural weaknesses that prevent them from functioning as binding governance: no democratic accountability, no institutional durability (can be changed by executive action), enforcement depends on uncertain post-deployment technical controls, and intelligence community interpretation applies broadest possible reading to exceptions.
## Challenging Evidence
**Source:** Anthropic Mythos Preview disclosure, April 2026
Anthropic's decision to restrict Claude Mythos Preview to ~40 organizations via Project Glasswing rather than releasing publicly represents a voluntary safety constraint that is being maintained despite commercial pressure. The restriction is explicit and operational: 'we do not plan to make Claude Mythos Preview generally available.' This challenges the claim that voluntary constraints cannot survive competitive pressure, though it remains to be seen whether this restriction holds long-term or whether competitors will force Anthropic to release more broadly.


@ -0,0 +1,64 @@
# Claude Mythos Preview
**Developer:** Anthropic
**Type:** Frontier AI model with autonomous cyber offense capabilities
**Status:** Restricted access (not generally available)
**Access:** ~40 organizations via Project Glasswing
**Disclosed:** April 2026
## Overview
Claude Mythos Preview is Anthropic's frontier AI model demonstrating autonomous zero-day vulnerability discovery and exploit development capabilities. It represents the first documented case of a frontier lab withholding a capability-complete model from public release based on explicit capability harm assessment.
## Capabilities
### Autonomous Exploit Development
- **181 successful exploits** for Firefox JavaScript engine (vs. 2 from prior Claude Opus 4.6)
- 90x improvement over predecessor model in single generation
- Autonomous exploit construction without human intervention
- Complex exploitation chains: JIT heap spray escaping both renderer AND OS sandbox
### Zero-Day Discovery
- Identified vulnerabilities in OpenBSD (27 years old) and FFmpeg (16 years old) that automated fuzzing missed millions of times
- Found >271 Firefox vulnerabilities (less than 1% patched at disclosure)
- Operates across major OSes, web browsers, and widely-used software
### Reverse Engineering
- Reconstructs plausible source code from stripped binaries
- Enables closed-source vulnerability discovery
## Emergent Capability
Anthropic stated: "These capabilities weren't explicitly trained, but emerged as a downstream consequence of general improvements in reasoning and code generation."
## Deployment Restriction
Anthropic explicitly stated: "we do not plan to make Claude Mythos Preview generally available."
**Rationale:** "The capabilities could enable attackers if frontier labs aren't careful about how they release these models." Non-experts can ask Mythos to find remote code execution vulnerabilities overnight and receive complete working exploits by morning.
**Temporal framing:** Described as "transitional period" with "eventual goal to enable users to safely deploy Mythos-class models at scale" once safeguards exist.
## Project Glasswing
Restricted access provided to ~40 organizations including:
- AWS
- Apple
- Microsoft
- Google
- CrowdStrike
- Palo Alto Networks
Human validators review findings before coordinated disclosure to affected parties.
## Governance Significance
First documented frontier AI model deployed under standing access restrictions based on capability harm assessment, establishing a third deployment tier between general availability and non-deployment.
## Timeline
- **2026-04-10** — Anthropic published technical disclosure on red team research site (red.anthropic.com)
## Sources
- Anthropic Mythos Preview Technical Disclosure (April 2026)


@ -0,0 +1,39 @@
# Project Glasswing
**Type:** Private-sector AI capability access coalition
**Founded:** 2026 (disclosed April 2026)
**Purpose:** Coordinated vulnerability discovery and disclosure using restricted-access frontier AI models
**Members:** ~40 organizations including AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto Networks
## Overview
Project Glasswing is a coalition of technology companies granted restricted access to Anthropic's Claude Mythos Preview model for cybersecurity purposes. It represents the first documented private-sector governance architecture for capability-harm-based deployment restriction.
## Operational Model
- **Access control:** Anthropic restricts Mythos Preview to approximately 40 member organizations
- **Coordinated disclosure:** Human validators review AI-discovered vulnerabilities before notifying affected parties
- **Temporal framing:** Explicitly described as a "transitional period" until defensive safeguards enable broader deployment
- **Goal:** Use Mythos to find and patch vulnerabilities before adversaries gain comparable capability
## Governance Architecture
Project Glasswing establishes a third deployment tier between general availability and non-deployment:
- Not "too dangerous to exist" (model is deployed)
- Not "safe for public release" (access restricted to the coalition)
- Temporary restriction pending development of defensive safeguards
## Effectiveness
As of April 2026 disclosure:
- Mythos discovered >271 Firefox vulnerabilities through Glasswing
- Less than 1% had been patched at time of writing
- Demonstrates offensive capability outpacing defensive verification infrastructure
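Taking the disclosed figures literally (a sketch assuming exactly 271 vulnerabilities as the base, though the disclosure says ">271"), the "less than 1%" patch rate implies a hard upper bound on how many had been fixed:

```python
import math

total_vulns = 271        # Firefox vulnerabilities found via Glasswing
patched_fraction = 0.01  # "less than 1% patched at time of writing"

# Strictly fewer than 1% of 271 (= 2.71) means at most 2 were patched.
max_patched = math.ceil(total_vulns * patched_fraction) - 1
print(max_patched)  # prints "2"
```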
## Timeline
- **2026-04-10** — Anthropic publicly disclosed Project Glasswing existence and operational model in Mythos Preview technical disclosure
## Sources
- Anthropic Mythos Preview Technical Disclosure (April 2026)


@ -7,10 +7,13 @@ date: 2026-04-10
domain: ai-alignment
secondary_domains: []
format: article
status: unprocessed
status: processed
processed_by: theseus
processed_date: 2026-05-12
priority: high
tags: [Mythos, Glasswing, cybersecurity, autonomous-exploit, zero-day, dangerous-capabilities, restricted-access, offense-defense, capability-harm-assessment, B4, B1]
intake_tier: research-task
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content