extract: 2026-01-29-metr-time-horizon-1-1-methodology-update

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
Teleo Agents 2026-03-23 00:17:53 +00:00
parent fb903b4005
commit 2e195f01b6
2 changed files with 52 additions and 1 deletion


@@ -0,0 +1,36 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 6,
+    "rejected": 2,
+    "fixes_applied": [
+      "ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:set_created:2026-03-23",
+      "ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:stripped_wiki_link:verification degrades faster than capability grows",
+      "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:set_created:2026-03-23",
+      "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:verification degrades faster than capability grows",
+      "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:economic forces push humans out of every cognitive loop wher",
+      "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:human verification bandwidth is the binding constraint on AG"
+    ],
+    "rejections": [
+      "ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:missing_attribution_extractor",
+      "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-23"
+}
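The validation log above implies a pass that backfills missing dates, strips wiki links, and rejects claims lacking attribution. The following is a minimal sketch of such a pass, assuming a simple claim schema: the stat keys and rule names mirror the JSON output, but the `attribution_extractor` field name, the 60-character link-text truncation, and the helper structure are assumptions, not the actual tool.

```python
import re
from datetime import date

# [[wiki link]] syntax as it appears in the stripped_wiki_link fixes above.
WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def validate_claims(claims, today):
    """Hypothetical validation pass producing stats shaped like the log above."""
    out = {"rejected_claims": [], "validation_stats": {
        "total": len(claims), "kept": 0, "fixed": 0, "rejected": 0,
        "fixes_applied": [], "rejections": []}}
    stats = out["validation_stats"]
    for claim in claims:
        name = claim["filename"]
        # Fix: backfill a missing created date.
        if not claim.get("created"):
            claim["created"] = today.isoformat()
            stats["fixes_applied"].append(f"{name}:set_created:{claim['created']}")
        # Fix: strip [[wiki links]] to plain text, logging each target
        # (truncated, which would explain the cut-off entries in the log).
        for target in WIKI_LINK.findall(claim.get("body", "")):
            stats["fixes_applied"].append(f"{name}:stripped_wiki_link:{target[:60]}")
        claim["body"] = WIKI_LINK.sub(r"\1", claim.get("body", ""))
        # Reject: a claim with no attribution extractor cannot be kept.
        if not claim.get("attribution_extractor"):
            stats["rejections"].append(f"{name}:missing_attribution_extractor")
            out["rejected_claims"].append(
                {"filename": name, "issues": ["missing_attribution_extractor"]})
        else:
            stats["kept"] += 1
    # `fixed` counts fix operations rather than files, consistent with the
    # log above showing fixed=6 against total=2.
    stats["fixed"] = len(stats["fixes_applied"])
    stats["rejected"] = len(stats["rejections"])
    return out
```

Under this reading, a rejected claim can still receive fixes first, which is why both files appear in `fixes_applied` and in `rejections`.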


@@ -7,9 +7,12 @@ date: 2026-01-29
 domain: ai-alignment
 secondary_domains: []
 format: blog-post
-status: unprocessed
+status: enrichment
 priority: high
 tags: [metr, time-horizon, capability-measurement, evaluation-methodology, autonomy, scaling, saturation]
+processed_by: theseus
+processed_date: 2026-03-23
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 ## Content
@@ -65,3 +68,15 @@ METR published an updated version of their autonomous AI capability measurement
 PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
 WHY ARCHIVED: Quantifies the capability-governance gap with the most precise measurement available; reveals that the measurement infrastructure itself is failing for frontier models
 EXTRACTION HINT: Two claims possible: one on the doubling rate as a governance-timeline mismatch; one on evaluation saturation as a new instance of B4. Check whether the doubling-rate number updates or supersedes existing claims.
+## Key Facts
+- METR Time Horizon 1.1 expanded the task suite from 170 to 228 tasks (34% growth)
+- Long tasks (8+ hours) more than doubled, from 14 to 31, in the updated framework
+- Only 5 of the 31 long tasks have measured human baseline times; the remainder use estimates
+- Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) 50% success horizon, later revised to ~14.5 hours
+- GPT-5.2 (Dec 2025): ~352 minutes 50% success horizon
+- Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
+- GPT-4 Turbo (2024): 3-10 minutes 50% success horizon
+- Infrastructure migrated from the in-house Vivaria platform to the open-source Inspect framework (UK AI Security Institute)
+- Different model versions use varying scaffolds: modular-public, flock-public, triframe_inspect
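The doubling-rate arithmetic behind figures like the "doubling every 131 days" claim can be sketched from two of the data points above. This is a back-of-envelope illustration only: it assumes clean exponential growth between two measurements, and the ~5.5-minute midpoint and release month used for GPT-4 Turbo are assumptions; METR's actual estimate is fit across the full model series.

```python
from datetime import date
from math import log2

def implied_doubling_days(d0, h0, d1, h1):
    """Days per doubling of the 50%-success time horizon, assuming
    exponential growth between the two measured points (d0, h0) and (d1, h1)."""
    elapsed = (d1 - d0).days
    doublings = log2(h1 / h0)  # how many times the horizon doubled
    return elapsed / doublings

# Illustrative use with figures from the list above: GPT-4 Turbo at ~5.5 min
# (assumed midpoint of 3-10 min; release month assumed) through
# Claude Opus 4.5 (Nov 2025) at ~320 min.
print(implied_doubling_days(date(2024, 4, 1), 5.5, date(2025, 11, 1), 320.0))
```

Varying which endpoints are chosen (e.g. the later Opus 4.6 revision to ~14.5 hours) shifts the implied rate substantially, which is one reason the headline doubling-time number moves between methodology versions.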