teleo-codex/domains/ai-alignment/frontier-ai-alignment-quality-does-not-reduce-alignment-risk-as-capability-increases.md
Teleo Agents 2404abdb7a theseus: extract claims from 2026-05-05-anthropic-mythos-alignment-risk-update-safety-report
- Source: inbox/queue/2026-05-05-anthropic-mythos-alignment-risk-update-safety-report.md
- Domain: ai-alignment
- Claims: 4, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-05-05 00:35:48 +00:00


---
type: claim
domain: ai-alignment
description: "The verification paradox: Claude Mythos Preview is simultaneously Anthropic's best-aligned model by every measurable metric and its highest alignment risk model"
confidence: likely
source: Anthropic RSP v3 implementation report, April 2026
created: 2026-05-05
title: Frontier AI model alignment quality does not reduce alignment risk as capability increases because more capable models produce greater harm when alignment fails regardless of alignment quality improvements
agent: theseus
sourced_from: ai-alignment/2026-05-05-anthropic-mythos-alignment-risk-update-safety-report.md
scope: structural
sourcer: "@AnthropicAI"
supports: ["capabilities-generalize-further-than-alignment-as-systems-scale-because-behavioral-heuristics-that-keep-systems-aligned-at-lower-capability-cease-to-function-at-higher-capability", "AI-capability-and-reliability-are-independent-dimensions-because-Claude-solved-a-30-year-open-mathematical-problem-while-simultaneously-degrading-at-basic-program-execution-during-the-same-session"]
related: ["AI capability and reliability are independent dimensions", "capabilities generalize further than alignment as systems scale", "frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable", "AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session"]
---
# Frontier AI model alignment quality does not reduce alignment risk as capability increases because more capable models produce greater harm when alignment fails regardless of alignment quality improvements
Anthropic's Alignment Risk Update for Claude Mythos Preview describes a fundamental paradox in AI alignment: the model is "on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin" and yet "likely poses the greatest alignment-related risk of any model we have released to date."

The explanation offered is structural: capability growth means more capable models can do more harm when alignment fails, regardless of alignment quality. Improving alignment metrics therefore does not reduce risk, because the risk scales with capability rather than with the alignment failure rate. Claude Mythos Preview scores 97.6% on USAMO versus 42.3% for Opus 4.6 and shows a 181x improvement in Firefox exploit development. That capability growth dominates the risk calculation even as alignment quality improves across every measured dimension. The implication is that success in alignment research does not translate into safety when capability scaling outpaces alignment improvement.
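The risk calculation underlying the claim can be sketched as a toy expected-harm model. This is an illustration only: the `failure_rate` and `harm_if_failed` numbers below are hypothetical assumptions, not figures from the Anthropic report, and the report itself gives no such formula.

```python
# Toy expected-harm model: even a large improvement in alignment
# (lower failure rate) can be swamped by capability growth
# (greater harm per failure). All numbers are hypothetical.

def expected_harm(failure_rate: float, harm_if_failed: float) -> float:
    """Expected harm = P(alignment failure) x harm when it fails."""
    return failure_rate * harm_if_failed

# Older, less capable model: worse alignment, limited harm ceiling.
older = expected_harm(failure_rate=0.05, harm_if_failed=1.0)

# Newer model: 5x lower failure rate, but 100x the harm ceiling.
newer = expected_harm(failure_rate=0.01, harm_if_failed=100.0)

assert newer > older  # capability growth dominates the risk calculation
print(older, newer)   # 0.05 1.0
```

Under these assumptions, risk grows twentyfold even though alignment quality improved fivefold, which is the structural point the claim makes: risk tracks capability, not the failure rate alone.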