
---
type: source
title: "METR and UK AISI: State of Pre-Deployment AI Evaluation Practice (March 2026)"
author: "METR (metr.org) and UK AI Security Institute (aisi.gov.uk)"
url: https://metr.org/blog/
date: 2026-03-01
domain: ai-alignment
secondary_domains: []
format: article
status: enrichment
priority: medium
tags: [evaluation-infrastructure, pre-deployment, METR, AISI, voluntary-collaborative, Inspect, Claude-Opus-4-6, cyber-evaluation]
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

Content

Synthesized overview of the two main organizations conducting pre-deployment AI evaluations as of March 2026.

METR (Model Evaluation and Threat Research):

  • Review of Anthropic Sabotage Risk Report: Claude Opus 4.6 (March 12, 2026)
  • Review of Anthropic Summer 2025 Pilot Sabotage Risk Report (October 28, 2025)
  • Summary of gpt-oss methodology review for OpenAI (October 23, 2025)
  • Common Elements of Frontier AI Safety Policies (December 2025 update)
  • Frontier AI Safety Policies repository (February 2025) — catalogs safety policies from Amazon, Anthropic, Google DeepMind, Meta, Microsoft, OpenAI

UK AI Security Institute (formerly AI Safety Institute, renamed 2026):

  • Cyber capability testing on 7 LLMs on custom-built cyber ranges (March 16, 2026)
  • Universal jailbreak assessment against best-defended systems (February 17, 2026)
  • Open-source Inspect evaluation framework (April 2024)
  • Inspect Scout transcript analysis tool (February 25, 2026)
  • ControlArena library for AI control experiments (October 22, 2025)
  • HiBayES statistical modeling framework (May 2025)
  • International joint testing exercise on agentic systems (July 2025)

Key structural observation: METR's evaluations are conducted by invitation/agreement with labs (METR "worked with" Anthropic on Opus 4.6, "worked with" OpenAI on gpt-oss). UK AISI conducts "joint pre-deployment evaluations." No mandatory requirement exists for labs to submit to these evaluations. AISI's renaming from "Safety Institute" to "Security Institute" suggests a shift from safety (avoiding catastrophic AI risk) to security (preventing cybersecurity threats).

Agent Notes

Why this matters: This is the current ceiling of third-party AI evaluation in practice. METR and AISI represent best-in-class evaluation, and both operate on a voluntary-collaborative model where labs invite or agree to evaluation. This maps directly to AAL-1 in the Brundage et al. framework ("the peak of current practices in AI"), which relies substantially on company-provided information.

What surprised me: AISI's renaming to "AI Security Institute." This suggests the UK government's focus has shifted from existential AI safety risk (alignment, catastrophic outcomes) toward near-term cybersecurity threats. If the primary government-funded evaluation body is reorienting from safety to security, the evaluation infrastructure for alignment-relevant risks weakens.

What I expected but didn't find: Any evidence that METR evaluates labs without the lab's consent or cooperation. All evaluations appear to be collaborative — the lab shares information, METR reviews it. There is no mechanism for METR to evaluate a lab that refuses.

KB connections:

Extraction hints:

  • Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
  • The AISI renaming is worth noting as a signal: the only government-funded AI safety evaluation body is shifting its mandate
  • The scope of METR/AISI evaluations (mostly sabotage risk and cyber capabilities) may be narrower than alignment-relevant evaluation

Context: March 2026 state of play. Assessed by synthesizing METR's published blog and AISI's published work pages — these are the two most active evaluation organizations globally.

Curator Notes

PRIMARY CONNECTION: safe AI development requires building alignment mechanisms before scaling capability — the current ceiling of evaluation practice (METR/AISI, voluntary-collaborative) is far below what "building alignment mechanisms before scaling capability" requires

WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.

EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal about government priority shift from safety to security.

Key Facts

  • METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
  • UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in 2026
  • UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
  • METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI

  • UK AISI released the Inspect evaluation framework in April 2024
  • UK AISI released Inspect Scout transcript analysis tool on February 25, 2026
  • UK AISI released ControlArena library for AI control experiments on October 22, 2025
  • UK AISI conducted international joint testing exercise on agentic systems in July 2025