| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | METR and UK AISI: State of Pre-Deployment AI Evaluation Practice (March 2026) | METR (metr.org) and UK AI Security Institute (aisi.gov.uk) | https://metr.org/blog/ | 2026-03-01 | ai-alignment | | article | unprocessed | medium | | theseus | 2026-03-19 | | anthropic/claude-sonnet-4.5 |
## Content
Synthesized overview of the two main organizations conducting pre-deployment AI evaluations as of March 2026.
METR (Model Evaluation and Threat Research):
- Review of Anthropic Sabotage Risk Report: Claude Opus 4.6 (March 12, 2026)
- Review of Anthropic Summer 2025 Pilot Sabotage Risk Report (October 28, 2025)
- Summary of gpt-oss methodology review for OpenAI (October 23, 2025)
- Common Elements of Frontier AI Safety Policies (December 2025 update)
- Frontier AI Safety Policies repository (February 2025) — catalogs safety policies from Amazon, Anthropic, Google DeepMind, Meta, Microsoft, OpenAI
UK AI Security Institute (formerly AI Safety Institute, renamed February 2025):
- Cyber capability testing of 7 LLMs on custom-built cyber ranges (March 16, 2026)
- Universal jailbreak assessment against best-defended systems (February 17, 2026)
- Open-source Inspect evaluation framework (April 2024; see the task sketch after this list)
- Inspect Scout transcript analysis tool (February 25, 2026)
- ControlArena library for AI control experiments (October 22, 2025)
- HiBayES statistical modeling framework (May 2025)
- International joint testing exercise on agentic systems (July 2025)
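For a concrete sense of what an Inspect evaluation looks like, below is a minimal task sketch using the open-source `inspect_ai` package. The sample contents and model string are illustrative placeholders, and exact APIs may vary across Inspect releases.

```python
# Minimal Inspect task sketch. inspect_ai is UK AISI's open-source eval
# framework; the sample and model string here are illustrative placeholders.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def capital_cities():
    return Task(
        # Each Sample pairs a prompt with its expected answer.
        dataset=[
            Sample(
                input="What is the capital of France? Answer with one word.",
                target="Paris",
            ),
        ],
        # generate() requests a single completion from the model under test.
        solver=generate(),
        # exact() marks a response correct only on an exact match to target.
        scorer=exact(),
    )

# Run the evaluation against a model (provider/model name is a placeholder).
eval(capital_cities(), model="openai/gpt-4o")
```

The same task can also be launched from Inspect's `inspect eval` CLI, which is how larger evaluation suites are typically run.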
Key structural observation: METR's evaluations are conducted by invitation or agreement with labs (METR "worked with" Anthropic on Opus 4.6 and "worked with" OpenAI on gpt-oss). UK AISI conducts "joint pre-deployment evaluations." No binding requirement exists for labs to submit to these evaluations. AISI's renaming from "Safety Institute" to "Security Institute" suggests a shift in emphasis from safety (avoiding catastrophic AI risk) to security (countering cybersecurity threats).
## Agent Notes
Why this matters: This is the current ceiling of third-party AI evaluation in practice. METR and AISI represent best-in-class evaluation practice, and both operate on a voluntary-collaborative model where labs invite or agree to evaluation. This maps directly to AAL-1 in the Brundage et al. framework ("the peak of current practices in AI", relying substantially on company-provided information).
What surprised me: AISI's renaming to "AI Security Institute." This suggests the UK government's focus has shifted from existential AI safety risk (alignment, catastrophic outcomes) toward near-term cybersecurity threats. If the primary government-funded evaluation body is reorienting from safety to security, the evaluation infrastructure for alignment-relevant risks weakens.
What I expected but didn't find: Any evidence that METR evaluates labs without the lab's consent or cooperation. All evaluations appear to be collaborative — the lab shares information, METR reviews it. There is no mechanism for METR to evaluate a lab that refuses.
KB connections:
- voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints — voluntary evaluation has the same structural problem; a lab can simply not invite METR
- technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap — METR and AISI are growing their evaluation capacity, but AI capabilities are growing faster; the gap widens in every period
- government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic; the AISI renaming to "Security Institute" is a softer version of the same inversion, with government safety infrastructure shifting to serve government security interests rather than existential risk reduction
Extraction hints:
- Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
- The AISI renaming is worth noting as a signal: the only government-funded AI safety evaluation body is shifting its mandate
- The scope of METR/AISI evaluations (mostly sabotage risk and cyber capabilities) may be narrower than alignment-relevant evaluation
Context: March 2026 state of play. Assessed by synthesizing METR's published blog and AISI's published work pages — these are the two most active evaluation organizations globally.
## Curator Notes
PRIMARY CONNECTION: safe AI development requires building alignment mechanisms before scaling capability; the current ceiling of evaluation practice (METR/AISI, voluntary-collaborative) falls far below what that principle requires
WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.
EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal about government priority shift from safety to security.
## Key Facts
- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
- UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in February 2025
- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI