Merge pull request 'extract: 2026-03-00-metr-aisi-pre-deployment-evaluation-practice' (#1412) from extract/2026-03-00-metr-aisi-pre-deployment-evaluation-practice into main
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
This commit is contained in:
commit 53975fb1e3
3 changed files with 44 additions and 13 deletions
@@ -40,10 +40,16 @@ The voluntary-collaborative model adds a selection bias dimension to evaluation
 
 ### Additional Evidence (confirm)
 
-*Source: [[2026-02-23-shapira-agents-of-chaos]] | Added: 2026-03-19*
+*Source: 2026-02-23-shapira-agents-of-chaos | Added: 2026-03-19*
 
 Agents of Chaos study provides concrete empirical evidence: 11 documented case studies of security vulnerabilities (unauthorized compliance, identity spoofing, cross-agent propagation, destructive actions) that emerged only in realistic multi-agent deployment with persistent memory and system access—none of which would be detected by static single-agent benchmarks. The study explicitly argues that current evaluation paradigms are insufficient for realistic deployment conditions.
 
+### Additional Evidence (extend)
+
+*Source: [[2026-03-00-metr-aisi-pre-deployment-evaluation-practice]] | Added: 2026-03-19*
+
+METR and UK AISI evaluations as of March 2026 focus primarily on sabotage risk and cyber capabilities (METR's Claude Opus 4.6 sabotage assessment, AISI's cyber range testing of 7 LLMs). This narrow scope may miss alignment-relevant risks that don't manifest as sabotage or cyber threats. The evaluation infrastructure is optimizing for measurable near-term risks rather than harder-to-operationalize catastrophic scenarios.
+
 ---
 
 Relevant Notes:
@@ -52,5 +58,5 @@ Relevant Notes:
 - [[the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact]]
 
 Topics:
-- [[domains/ai-alignment/_map]]
-- [[core/grand-strategy/_map]]
+- domains/ai-alignment/_map
+- core/grand-strategy/_map
@@ -1,24 +1,34 @@
 {
   "rejected_claims": [
     {
-      "filename": "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md",
+      "filename": "pre-deployment-AI-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "UK-AISI-renaming-to-Security-Institute-signals-government-priority-shift-from-existential-safety-to-cybersecurity-threats.md",
       "issues": [
         "missing_attribution_extractor"
       ]
     }
   ],
   "validation_stats": {
-    "total": 1,
+    "total": 2,
     "kept": 0,
-    "fixed": 3,
-    "rejected": 1,
+    "fixed": 6,
+    "rejected": 2,
     "fixes_applied": [
-      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:set_created:2026-03-19",
-      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
+      "pre-deployment-AI-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:set_created:2026-03-19",
+      "pre-deployment-AI-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+      "pre-deployment-AI-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front",
+      "UK-AISI-renaming-to-Security-Institute-signals-government-priority-shift-from-existential-safety-to-cybersecurity-threats.md:set_created:2026-03-19",
+      "UK-AISI-renaming-to-Security-Institute-signals-government-priority-shift-from-existential-safety-to-cybersecurity-threats.md:stripped_wiki_link:government-designation-of-safety-conscious-AI-labs-as-supply",
+      "UK-AISI-renaming-to-Security-Institute-signals-government-priority-shift-from-existential-safety-to-cybersecurity-threats.md:stripped_wiki_link:compute-export-controls-are-the-most-impactful-AI-governance"
     ],
     "rejections": [
-      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:missing_attribution_extractor"
+      "pre-deployment-AI-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:missing_attribution_extractor",
+      "UK-AISI-renaming-to-Security-Institute-signals-government-priority-shift-from-existential-safety-to-cybersecurity-threats.md:missing_attribution_extractor"
     ]
   },
   "model": "anthropic/claude-sonnet-4.5",
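The `stripped_wiki_link` entries in the validation log above correspond to the bracket-stripping edits visible in the markdown hunks (e.g. `[[domains/ai-alignment/_map]]` becoming `domains/ai-alignment/_map`). A minimal sketch of how such a fix could work, assuming a simple regex over `[[target]]` and `[[target|alias]]` links (the function name and regex are illustrative, not the repository's actual tooling):

```python
import re

# Hypothetical helper, not the repository's actual tooling.
# Matches [[target]] and [[target|alias]] wiki links; group 1 is the
# link target, optional group 2 is the display alias.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def strip_wiki_links(text: str) -> tuple[str, list[str]]:
    """Replace wiki links with their visible text.

    Returns the cleaned string plus the stripped link targets, i.e.
    the names that a validator could record as stripped_wiki_link fixes.
    """
    stripped: list[str] = []

    def repl(m: re.Match) -> str:
        stripped.append(m.group(1).strip())
        # Keep the alias if present, otherwise the target itself.
        return (m.group(2) or m.group(1)).strip()

    return WIKI_LINK.sub(repl, text), stripped
```

For example, `strip_wiki_links("- [[domains/ai-alignment/_map]]")` yields the plain bullet text together with the single stripped target, mirroring the `Topics:` edit in the first file's diff.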
@@ -7,13 +7,17 @@ date: 2026-03-01
 domain: ai-alignment
 secondary_domains: []
 format: article
-status: unprocessed
+status: enrichment
 priority: medium
 tags: [evaluation-infrastructure, pre-deployment, METR, AISI, voluntary-collaborative, Inspect, Claude-Opus-4-6, cyber-evaluation]
 processed_by: theseus
 processed_date: 2026-03-19
 enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
+processed_by: theseus
+processed_date: 2026-03-19
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -49,7 +53,7 @@ Synthesized overview of the two main organizations conducting pre-deployment AI
 **KB connections:**
 - [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — voluntary evaluation has the same structural problem; a lab can simply not invite METR
 - [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — METR and AISI are growing their evaluation capacity, but AI capabilities are growing faster; the gap widens in every period
-- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic]] — AISI renaming to "Security Institute" is a softer version of the same dynamic — government safety infrastructure shifting to serve government security interests rather than existential risk reduction
+- government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic — AISI renaming to "Security Institute" is a softer version of the same dynamic — government safety infrastructure shifting to serve government security interests rather than existential risk reduction
 
 **Extraction hints:**
 - Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
@@ -72,3 +76,14 @@ EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation
 - UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in 2026
 - UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
 - METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI
+
+## Key Facts
+
+- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
+- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
+- UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in 2026
+- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI
+- UK AISI released the Inspect evaluation framework in April 2024
+- UK AISI released Inspect Scout transcript analysis tool on February 25, 2026
+- UK AISI released ControlArena library for AI control experiments on October 22, 2025
+- UK AISI conducted international joint testing exercise on agentic systems in July 2025