extract: 2026-03-26-metr-gpt5-evaluation-time-horizon

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-26 00:36:12 +00:00
parent 99c7dc4ab7
commit a931485003
3 changed files with 57 additions and 1 deletions


@@ -134,6 +134,12 @@ METR, the primary producer of governance-relevant capability benchmarks, explici
METR's January 2026 evaluation of GPT-5 placed its autonomous replication and adaptation capability at 2h17m (50% time horizon), far below catastrophic risk thresholds. In the same month, AISLE (an AI system) autonomously discovered 12 OpenSSL CVEs, including a 30-year-old bug, through fully autonomous operation. This is direct evidence that formal pre-deployment evaluations are not capturing operationally dangerous autonomy that is already deployed at commercial scale.
### Additional Evidence (extend)
*Source: [[2026-03-26-metr-gpt5-evaluation-time-horizon]] | Added: 2026-03-26*
METR's HCAST benchmark showed 50-57% volatility in time horizon estimates between v1.0 and v1.1 for the same models (GPT-4 1106 dropped 57%, GPT-5 rose 55%), demonstrating that even the most rigorous pre-deployment evaluations have structural instability independent of actual capability changes. This extends the evaluation reliability critique from 'doesn't predict deployment risk' to 'cannot even maintain consistent capability measurements across benchmark versions.'


@@ -0,0 +1,35 @@
{
  "rejected_claims": [
    {
      "filename": "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 5,
    "rejected": 2,
    "fixes_applied": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:set_created:2026-03-26",
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:set_created:2026-03-26",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:three-conditions-gate-AI-takeover-risk-autonomy-robotics-and",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:capability-control-methods-are-temporary-at-best-because-a-s"
    ],
    "rejections": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:missing_attribution_extractor",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-26"
}


@@ -7,9 +7,13 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: report
status: unprocessed
status: enrichment
priority: medium
tags: [METR, GPT-5, time-horizon, capability-thresholds, safety-evaluation, holistic-evaluation, governance-thresholds, catastrophic-risk]
processed_by: theseus
processed_date: 2026-03-26
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -59,3 +63,14 @@ This suggests ~50% volatility in time horizon estimates between benchmark versio
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Provides formal numerical calibration of where current frontier models sit relative to governance thresholds — essential context for evaluating B1's "greatest outstanding problem" claim. The finding (2h17m vs the 40-hour threshold) partially challenges alarmist interpretations, while the 50%+ benchmark instability sustains the governance concern.
EXTRACTION HINT: Separate claims: (1) "Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D" — calibrating B1; (2) "METR's time horizon benchmark shifted 50-57% between v1.0 and v1.1 versions, making governance thresholds derived from it a moving target" — the reliability problem
## Key Facts
- GPT-5 achieved 50% time horizon of 2 hours 17 minutes on METR's HCAST evaluation
- GPT-5's 80% time horizon was below 8 hours
- METR's catastrophic risk threshold is 40 hours at 50% time horizon for software engineering/ML tasks
- METR's heightened scrutiny threshold is 8 hours at 80% time horizon
- HCAST v1.1 contains 228 tasks as of January 2026
- GPT-4 1106's time horizon estimate dropped 57% between HCAST v1.0 and v1.1
- GPT-5's time horizon estimate rose 55% between HCAST v1.0 and v1.1
- METR evaluations are used by OpenAI, Anthropic, and other frontier labs for safety milestone assessments
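As a quick sanity check on the "17x below threshold" figure, the margin follows directly from the key facts above (a minimal sketch; variable names are illustrative, not from METR's tooling):

```python
# Values taken from the Key Facts list above.
gpt5_horizon_hours = 2 + 17 / 60        # GPT-5 50% time horizon: 2h17m
catastrophic_threshold_hours = 40       # METR catastrophic risk threshold (50% horizon)

# Safety margin: how many times below the threshold GPT-5 evaluates.
margin = catastrophic_threshold_hours / gpt5_horizon_hours
print(f"GPT-5 evaluates ~{margin:.1f}x below the catastrophic threshold")
```

This yields a margin of roughly 17.5x, consistent with the "~17x" framing in the extraction hint.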