extract: 2026-03-21-metr-evaluation-landscape-2026 #1569
5 changed files with 70 additions and 1 deletion
@ -55,6 +55,12 @@ The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the

---

### Additional Evidence (extend)

*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21*

METR's pre-deployment sabotage risk reviews (March 2026: Claude Opus 4.6; October 2025: Anthropic Summer 2025 Pilot; November 2025: GPT-5.1-Codex-Max; August 2025: GPT-5; June 2025: DeepSeek/Qwen; April 2025: o3/o4-mini) represent the most operationally deployed AI evaluation infrastructure outside academic research, but these reviews remain voluntary and are not incorporated into mandatory compliance requirements by any regulatory body (EU AI Office, NIST). The institutional structure exists but lacks binding enforcement.

Relevant Notes:
- [[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]] — declining transparency compounds the evaluation problem
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — transparency commitments follow the same erosion lifecycle
@ -29,6 +29,12 @@ Anthropic's own language in RSP documentation: commitments are 'very hard to mee

---

### Additional Evidence (confirm)

*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21*

METR's pre-deployment sabotage reviews of Anthropic models (March 2026: Claude Opus 4.6; October 2025: Summer 2025 Pilot) document the evaluation infrastructure that exists, but the reviews are voluntary and occur within the same competitive environment where Anthropic rolled back RSP commitments. The existence of sophisticated evaluation infrastructure does not prevent commercial pressure from overriding safety commitments.

Relevant Notes:
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — the RSP rollback is the empirical confirmation
- [[AI alignment is a coordination problem not a technical problem]] — voluntary commitments fail; coordination mechanisms might not
@ -25,6 +25,12 @@ This claim describes a frontier-practitioner effect — top-tier experts getting

---

### Additional Evidence (challenge)

*Source: [[2026-03-21-metr-evaluation-landscape-2026]] | Added: 2026-03-21*

METR's developer productivity RCT found that AI tools made experienced developers take 19% longer to complete tasks, showing negative productivity for experts on time-to-completion metrics. This complicates the force multiplier hypothesis, but only partially: the RCT measured task completion speed, not delegation quality or the scope of what experts can attempt. An expert who takes longer but produces better-scoped, more ambitious outputs is compatible with both this finding and the original claim. However, if the productivity drag persists across task types, it provides counter-evidence to at least one dimension of the expertise advantage.

Relevant Notes:
- [[centaur team performance depends on role complementarity not mere human-AI combination]] — expertise enables the complementarity that makes centaur teams work
- [[AI is collapsing the knowledge-producing communities it depends on creating a self-undermining loop that collective intelligence can break]] — if expertise is a multiplier, eroding expert communities erodes collaboration quality
@ -0,0 +1,40 @@
{
  "rejected_claims": [
    {
      "filename": "metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md",
      "issues": ["missing_attribution_extractor"]
    },
    {
      "filename": "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md",
      "issues": ["missing_attribution_extractor"]
    },
    {
      "filename": "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md",
      "issues": ["missing_attribution_extractor"]
    }
  ],
  "validation_stats": {
    "total": 3,
    "kept": 0,
    "fixed": 3,
    "rejected": 3,
    "fixes_applied": [
      "metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md:set_created:2026-03-21",
      "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md:set_created:2026-03-21",
      "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md:set_created:2026-03-21"
    ],
    "rejections": [
      "metr-monitorability-evaluations-establish-two-sided-oversight-evasion-measurement.md:missing_attribution_extractor",
      "ai-autonomous-task-horizon-doubles-every-six-months-implying-months-long-projects-within-decade.md:missing_attribution_extractor",
      "malt-dataset-provides-first-systematic-corpus-of-evaluation-threatening-behaviors-from-real-deployments.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-21"
}
@ -7,9 +7,13 @@ date: 2026-03-01
domain: ai-alignment
secondary_domains: []
format: thread
-status: unprocessed
+status: processed
priority: high
tags: [METR, monitorability, MALT, sabotage-review, time-horizon, evaluation-infrastructure, oversight-evasion, Claude]
+processed_by: theseus
+processed_date: 2026-03-21
+enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md", "Anthropics RSP rollback under commercial pressure is the first empirical confirmation that binding safety commitments cannot survive the competitive dynamics of frontier AI development.md", "deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@ -57,3 +61,10 @@ METR's current evaluation portfolio as of March 2026:
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: METR's institutional portfolio is the most operationally deployed evaluation infrastructure; the Monitorability Evaluations specifically measure the two-sided oversight problem that the governance architecture is failing to address
EXTRACTION HINT: The time horizon finding (doubling every 6 months) deserves its own claim; the Monitorability Evaluations deserve a claim about what institutional evaluation infrastructure now exists

## Key Facts

- METR published RE-Bench in November 2024 measuring frontier model performance on ML research engineering tasks vs. human experts
- METR published Rogue Replication Threat Model on November 12, 2024 analyzing how AI agents might develop large resilient rogue autonomous populations
- METR published Reward Hacking Study in June 2025 documenting frontier model instances of exploiting scoring bugs
- METR's evaluation portfolio as of March 2026 includes oversight evasion, self-replication, autonomous task completion, and pre-deployment sabotage risk reviews