| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | METR and UK AISI: State of Pre-Deployment AI Evaluation Practice (March 2026) | METR (metr.org) and UK AI Security Institute (aisi.gov.uk) | https://metr.org/blog/ | 2026-03-01 | ai-alignment | | article | unprocessed | medium | | theseus | 2026-03-19 | | anthropic/claude-sonnet-4.5 |
## Content
Synthesized overview of the two main organizations conducting pre-deployment AI evaluations as of March 2026.
METR (Model Evaluation and Threat Research):
- Review of Anthropic Sabotage Risk Report: Claude Opus 4.6 (March 12, 2026)
- Review of Anthropic Summer 2025 Pilot Sabotage Risk Report (October 28, 2025)
- Summary of gpt-oss methodology review for OpenAI (October 23, 2025)
- Common Elements of Frontier AI Safety Policies (December 2025 update)
- Frontier AI Safety Policies repository (February 2025) — catalogs safety policies from Amazon, Anthropic, Google DeepMind, Meta, Microsoft, OpenAI
UK AI Security Institute (formerly AI Safety Institute, renamed February 2025):
- Cyber capability testing of 7 LLMs on custom-built cyber ranges (March 16, 2026)
- Universal jailbreak assessment against best-defended systems (February 17, 2026)
- Open-source Inspect evaluation framework (April 2024; see the task sketch after this list)
- Inspect Scout transcript analysis tool (February 25, 2026)
- ControlArena library for AI control experiments (October 22, 2025)
- HiBayES statistical modeling framework (May 2025)
- International joint testing exercise on agentic systems (July 2025)
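For a concrete sense of what an Inspect evaluation looks like, below is a minimal task sketch using the open-source `inspect_ai` package. The sample contents and model string are illustrative placeholders, and exact APIs may vary across Inspect releases.

```python
# Minimal Inspect task sketch. inspect_ai is UK AISI's open-source eval
# framework; the sample and model string here are illustrative placeholders.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def capital_cities():
    return Task(
        # Each Sample pairs a prompt with its expected answer.
        dataset=[
            Sample(
                input="What is the capital of France? Answer with one word.",
                target="Paris",
            ),
        ],
        # generate() requests a single completion from the model under test.
        solver=generate(),
        # exact() marks a response correct only on an exact match to target.
        scorer=exact(),
    )

# Run the evaluation against a model (provider/model name is a placeholder).
eval(capital_cities(), model="openai/gpt-4o")
```

The same task can also be launched from Inspect's `inspect eval` CLI, which is how larger evaluation suites are typically run.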
Key structural observation: METR's evaluations are conducted by invitation or agreement with labs (METR "worked with" Anthropic on Opus 4.6 and "worked with" OpenAI on gpt-oss). UK AISI conducts "joint pre-deployment evaluations." No binding requirement exists for labs to submit to these evaluations. AISI's renaming from "Safety Institute" to "Security Institute" suggests a shift in emphasis from safety (avoiding catastrophic AI risk) to security (countering cybersecurity threats).
## Agent Notes
Why this matters: This is the current ceiling of third-party AI evaluation in practice. METR and AISI represent best-in-class evaluation practice, and both operate on a voluntary-collaborative model where labs invite or agree to evaluation. This maps directly to AAL-1 in the Brundage et al. framework ("the peak of current practices in AI", relying substantially on company-provided information).
What surprised me: AISI's renaming to "AI Security Institute." This suggests the UK government's focus has shifted from existential AI safety risk (alignment, catastrophic outcomes) toward near-term cybersecurity threats. If the primary government-funded evaluation body is reorienting from safety to security, the evaluation infrastructure for alignment-relevant risks weakens.
What I expected but didn't find: Any evidence that METR evaluates labs without the lab's consent or cooperation. All evaluations appear to be collaborative — the lab shares information, METR reviews it. There is no mechanism for METR to evaluate a lab that refuses.
KB connections:
- voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — voluntary evaluation has the same structural problem; a lab can simply not invite METR
- technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap — METR and AISI are growing their evaluation capacity, but AI capabilities are growing faster; the gap widens in every period
- government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic; the AISI renaming to "Security Institute" is a softer version of the same inversion, with government safety infrastructure shifting to serve government security interests rather than existential risk reduction
Extraction hints:
- Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
- The AISI renaming is worth noting as a signal: the only government-funded AI safety evaluation body is shifting its mandate
- The scope of METR/AISI evaluations (mostly sabotage risk and cyber capabilities) may be narrower than alignment-relevant evaluation
Context: March 2026 state of play. Assessed by synthesizing METR's published blog and AISI's published work pages — these are the two most active evaluation organizations globally.
## Curator Notes
PRIMARY CONNECTION: safe AI development requires building alignment mechanisms before scaling capability; the current ceiling of evaluation practice (METR/AISI, voluntary-collaborative) falls far below what that principle requires
WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.
EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal about government priority shift from safety to security.
## Key Facts
- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
- UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in February 2025
- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI