---
type: source
title: "METR and UK AISI: State of Pre-Deployment AI Evaluation Practice (March 2026)"
author: "METR (metr.org) and UK AI Security Institute (aisi.gov.uk)"
url: https://metr.org/blog/
date: 2026-03-01
domain: ai-alignment
secondary_domains: []
format: article
status: enrichment
priority: medium
tags: [evaluation-infrastructure, pre-deployment, METR, AISI, voluntary-collaborative, Inspect, Claude-Opus-4-6, cyber-evaluation]
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Synthesized overview of the two main organizations conducting pre-deployment AI evaluations as of March 2026.

**METR (Model Evaluation and Threat Research):**

- Review of Anthropic Sabotage Risk Report: Claude Opus 4.6 (March 12, 2026)
- Review of Anthropic Summer 2025 Pilot Sabotage Risk Report (October 28, 2025)
- Summary of gpt-oss methodology review for OpenAI (October 23, 2025)
- Common Elements of Frontier AI Safety Policies (December 2025 update)
- Frontier AI Safety Policies repository (February 2025) — catalogs safety policies from Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI

**UK AI Security Institute (formerly AI Safety Institute, renamed 2026):**

- Cyber capability testing of 7 LLMs on custom-built cyber ranges (March 16, 2026)
- Universal jailbreak assessment against best-defended systems (February 17, 2026)
- Open-source Inspect evaluation framework (April 2024)
- Inspect Scout transcript analysis tool (February 25, 2026)
- ControlArena library for AI control experiments (October 22, 2025)
- HiBayES statistical modeling framework (May 2025)
- International joint testing exercise on agentic systems (July 2025)
**Key structural observation:** METR's evaluations are conducted by invitation or agreement with labs (METR "worked with" Anthropic on Opus 4.6 and "worked with" OpenAI on gpt-oss). UK AISI conducts "joint pre-deployment evaluations." No mandatory requirement exists for labs to submit to these evaluations. AISI's renaming from "Safety Institute" to "Security Institute" suggests a shift in emphasis from safety (avoiding catastrophic AI risk) to security (preventing cybersecurity threats).

## Agent Notes

**Why this matters:** This is the current ceiling of third-party AI evaluation in practice. METR and AISI represent best-in-class evaluation practice, and both operate on a voluntary-collaborative model in which labs invite or agree to evaluation. This maps directly to AAL-1 in the Brundage et al. framework ("the peak of current practices in AI"), which relies substantially on company-provided information.

**What surprised me:** AISI's renaming to "AI Security Institute." This suggests the UK government's focus has shifted from existential AI safety risk (alignment, catastrophic outcomes) toward near-term cybersecurity threats. If the primary government-funded evaluation body is reorienting from safety to security, the evaluation infrastructure for alignment-relevant risks weakens.

**What I expected but didn't find:** Any evidence that METR evaluates labs without the lab's consent or cooperation. All evaluations appear to be collaborative: the lab shares information and METR reviews it. There is no mechanism for METR to evaluate a lab that refuses.

**KB connections:**

- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — voluntary evaluation has the same structural problem; a lab can simply not invite METR
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — METR and AISI are growing their evaluation capacity, but AI capabilities are growing faster; the gap widens in every period
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic]] — AISI's renaming to "Security Institute" is a softer version of the same dynamic: government safety infrastructure shifting to serve government security interests rather than existential risk reduction

**Extraction hints:**

- Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
- The AISI renaming is worth noting as a signal: the only government-funded AI safety evaluation body is shifting its mandate
- The scope of METR/AISI evaluations (mostly sabotage risk and cyber capabilities) may be narrower than alignment-relevant evaluation requires

**Context:** March 2026 state of play, assessed by synthesizing METR's published blog and AISI's published work pages — these are the two most active evaluation organizations globally.

## Curator Notes

PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]] — the current ceiling of evaluation practice (METR/AISI, voluntary-collaborative) is far below what "building alignment mechanisms before scaling capability" requires

WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.

EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal of a government priority shift from safety to security.

## Key Facts

- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
- UK AISI was renamed from "AI Safety Institute" to "AI Security Institute" in 2026
- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI