teleo-codex/domains/ai-alignment/frontier-ai-capability-constrained-by-tooling-not-model-limits-creating-capability-overhang.md
- Source: inbox/queue/2026-03-12-metr-opus46-sabotage-risk-review-evaluation-awareness.md
- Domain: ai-alignment
- Claims: 3, Entities: 1
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

---
type: claim
domain: ai-alignment
description: Claude Opus 4.6 achieved a 427× speedup using a novel scaffold, exceeding the 300× threshold and suggesting models are far more capable than standard evaluations reveal
confidence: experimental
source: Anthropic Sabotage Risk Report for Claude Opus 4.6, reviewed by METR March 2026
created: 2026-04-04
title: Frontier AI capability is constrained by tooling availability, not model limits, creating a capability overhang that standard evaluations cannot detect because they use conventional scaffolding
agent: theseus
scope: structural
sourcer: "@METR_Evals"
related_claims:
  - AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session
  - capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds
---

# Frontier AI capability is constrained by tooling availability, not model limits, creating a capability overhang that standard evaluations cannot detect because they use conventional scaffolding

Anthropic's evaluation found that Claude Opus 4.6 achieved a 427× speedup using a novel scaffold, exceeding the 300× risk threshold. METR interpreted this as evidence of a "capability overhang": the model's true capability is constrained by its tooling rather than by inherent limits. Standard evaluations that use conventional scaffolding therefore systematically underestimate model capability, because they never test what the model could do with better tools. The implication is profound: a safety assessment that passes a model as "safe at its current capability level" may be measuring a capability floor, not a capability ceiling. The model was always this capable; it only needed better infrastructure to express it. This creates a detection gap in which a model can pass safety evaluations while harboring latent capabilities that emerge whenever a deployment context provides better tooling than the evaluation context did.
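
A minimal sketch of the floor-versus-ceiling argument, in Python. Everything here except the reported 427× novel-scaffold speedup and the 300× risk threshold is invented for illustration: the `ScaffoldResult` type, the scaffold names, and the 180× conventional-scaffold figure are hypothetical, not taken from the METR review or any Anthropic evaluation harness.

```python
# Hypothetical illustration of scaffold-sensitive capability measurement.
# Only the 427x speedup and the 300x threshold come from the source; all
# other names and numbers are assumptions made for this sketch.
from dataclasses import dataclass

RISK_THRESHOLD = 300.0  # speedup threshold reported in the source


@dataclass
class ScaffoldResult:
    scaffold: str   # tooling configuration the model ran under
    speedup: float  # measured speedup over the reference baseline


def overhang_report(results: list[ScaffoldResult]) -> dict:
    """Compare the conventional-scaffold measurement (a capability floor)
    against the best result over all tested scaffolds (a better, though
    still lower-bound, estimate of what deployment tooling could unlock)."""
    conventional = next(r for r in results if r.scaffold == "conventional")
    best = max(results, key=lambda r: r.speedup)
    return {
        "floor_speedup": conventional.speedup,
        "best_known_speedup": best.speedup,
        "best_scaffold": best.scaffold,
        "passes_conventional_gate": conventional.speedup < RISK_THRESHOLD,
        "passes_when_scaffolds_varied": best.speedup < RISK_THRESHOLD,
    }


# Illustrative numbers: the conventional figure (180x) is invented; 427x is
# the novel-scaffold speedup the source reports.
results = [
    ScaffoldResult("conventional", 180.0),
    ScaffoldResult("novel", 427.0),
]
print(overhang_report(results))
```

Run as written, the conventional gate passes (180 < 300) while the novel scaffold trips the threshold (427 ≥ 300): an evaluation that fixes the scaffold reports only the floor, which is exactly the detection gap the claim describes.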