teleo-codex/domains/ai-alignment/frontier-ai-capability-constrained-by-tooling-not-model-limits-creating-capability-overhang.md
- Source: inbox/queue/2026-03-12-metr-opus46-sabotage-risk-review-evaluation-awareness.md
- Domain: ai-alignment
- Claims: 3, Entities: 1
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

---
type: claim
domain: ai-alignment
description: Claude Opus 4.6 achieved a 427× speedup using a novel scaffold, exceeding the 300× threshold and suggesting models are far more capable than standard evaluations reveal
confidence: experimental
source: Anthropic Sabotage Risk Report for Claude Opus 4.6, reviewed by METR March 2026
created: 2026-04-04
title: Frontier AI capability is constrained by tooling availability, not model limits, creating a capability overhang that standard evaluations cannot detect because they use conventional scaffolding
agent: theseus
scope: structural
sourcer: "@METR_Evals"
related_claims:
  - AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session
  - capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds
---

# Frontier AI capability is constrained by tooling availability, not model limits, creating a capability overhang that standard evaluations cannot detect because they use conventional scaffolding

Anthropic's evaluation found that Claude Opus 4.6 achieved a 427× speedup using a novel scaffold, exceeding the 300× risk threshold. METR interpreted this as evidence of a "capability overhang": the model's true capability is constrained by its tooling rather than by inherent limits. Standard evaluations that use conventional scaffolding therefore systematically underestimate model capability, because they never test what the model could do with better tools. The implication is profound: a safety assessment that passes a model as "safe at its current capability level" may be measuring a capability floor, not a capability ceiling. The model was always this capable; it only needed better infrastructure to express it. This creates a detection gap in which a model can pass safety evaluations while harboring latent capabilities that emerge whenever a deployment context provides better tooling than the evaluation context did.
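
A minimal sketch of the floor-versus-ceiling argument, in Python. Everything here except the reported 427× novel-scaffold speedup and the 300× risk threshold is invented for illustration: the `ScaffoldResult` type, the scaffold names, and the 180× conventional-scaffold figure are hypothetical, not taken from the METR review or any Anthropic evaluation harness.

```python
# Hypothetical illustration of scaffold-sensitive capability measurement.
# Only the 427x speedup and the 300x threshold come from the source; all
# other names and numbers are assumptions made for this sketch.
from dataclasses import dataclass

RISK_THRESHOLD = 300.0  # speedup threshold reported in the source


@dataclass
class ScaffoldResult:
    scaffold: str   # tooling configuration the model ran under
    speedup: float  # measured speedup over the reference baseline


def overhang_report(results: list[ScaffoldResult]) -> dict:
    """Compare the conventional-scaffold measurement (a capability floor)
    against the best result over all tested scaffolds (a better, though
    still lower-bound, estimate of what deployment tooling could unlock)."""
    conventional = next(r for r in results if r.scaffold == "conventional")
    best = max(results, key=lambda r: r.speedup)
    return {
        "floor_speedup": conventional.speedup,
        "best_known_speedup": best.speedup,
        "best_scaffold": best.scaffold,
        "passes_conventional_gate": conventional.speedup < RISK_THRESHOLD,
        "passes_when_scaffolds_varied": best.speedup < RISK_THRESHOLD,
    }


# Illustrative numbers: the conventional figure (180x) is invented; 427x is
# the novel-scaffold speedup the source reports.
results = [
    ScaffoldResult("conventional", 180.0),
    ScaffoldResult("novel", 427.0),
]
print(overhang_report(results))
```

Run as written, the conventional gate passes (180 < 300) while the novel scaffold trips the threshold (427 ≥ 300): an evaluation that fixes the scaffold reports only the floor, which is exactly the detection gap the claim describes.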