extract: 2026-01-29-metr-time-horizon-1-1-methodology-update
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
fb903b4005
commit
2e195f01b6
2 changed files with 52 additions and 1 deletions
|
|
@ -0,0 +1,36 @@
|
|||
{
|
||||
"rejected_claims": [
|
||||
{
|
||||
"filename": "ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md",
|
||||
"issues": [
|
||||
"missing_attribution_extractor"
|
||||
]
|
||||
},
|
||||
{
|
||||
"filename": "evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md",
|
||||
"issues": [
|
||||
"missing_attribution_extractor"
|
||||
]
|
||||
}
|
||||
],
|
||||
"validation_stats": {
|
||||
"total": 2,
|
||||
"kept": 0,
|
||||
"fixed": 6,
|
||||
"rejected": 2,
|
||||
"fixes_applied": [
|
||||
"ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:set_created:2026-03-23",
|
||||
"ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:stripped_wiki_link:verification degrades faster than capability grows",
|
||||
"evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:set_created:2026-03-23",
|
||||
"evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:verification degrades faster than capability grows",
|
||||
"evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:economic forces push humans out of every cognitive loop wher",
|
||||
"evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:stripped_wiki_link:human verification bandwidth is the binding constraint on AG"
|
||||
],
|
||||
"rejections": [
|
||||
"ai-autonomous-capability-doubling-every-131-days-creates-structural-governance-lag.md:missing_attribution_extractor",
|
||||
"evaluation-infrastructure-saturates-at-capability-levels-where-oversight-matters-most.md:missing_attribution_extractor"
|
||||
]
|
||||
},
|
||||
"model": "anthropic/claude-sonnet-4.5",
|
||||
"date": "2026-03-23"
|
||||
}
|
||||
|
|
@ -7,9 +7,12 @@ date: 2026-01-29
|
|||
domain: ai-alignment
|
||||
secondary_domains: []
|
||||
format: blog-post
|
||||
status: unprocessed
|
||||
status: enrichment
|
||||
priority: high
|
||||
tags: [metr, time-horizon, capability-measurement, evaluation-methodology, autonomy, scaling, saturation]
|
||||
processed_by: theseus
|
||||
processed_date: 2026-03-23
|
||||
extraction_model: "anthropic/claude-sonnet-4.5"
|
||||
---
|
||||
|
||||
## Content
|
||||
|
|
@ -65,3 +68,15 @@ METR published an updated version of their autonomous AI capability measurement
|
|||
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
|
||||
WHY ARCHIVED: Quantifies the capability-governance gap with the most precise measurement available; reveals measurement infrastructure itself is failing for frontier models
|
||||
EXTRACTION HINT: Two claims possible — one on the doubling rate as governance timeline mismatch; one on evaluation saturation as a new instance of B4. Check whether the doubling rate number updates or supersedes existing claims.
|
||||
|
||||
|
||||
## Key Facts
|
||||
- METR Time Horizon 1.1 expanded task suite from 170 to 228 tasks (34% growth)
|
||||
- Long tasks (8+ hours) doubled from 14 to 31 in the updated framework
|
||||
- Only 5 of 31 long tasks have measured human baseline times; remainder use estimates
|
||||
- Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) 50% success horizon, later revised to ~14.5 hours
|
||||
- GPT-5.2 (Dec 2025): ~352 minutes 50% success horizon
|
||||
- Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
|
||||
- GPT-4 Turbo (2024): 3-10 minutes 50% success horizon
|
||||
- Infrastructure migrated from in-house Vivaria to open-source Inspect framework (UK AI Security Institute)
|
||||
- Different model versions use varying scaffolds: modular-public, flock-public, triframe_inspect
|
||||
|
|
|
|||
Loading…
Reference in a new issue