extract: 2026-03-26-metr-gpt5-evaluation-time-horizon
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
parent 99c7dc4ab7
commit a931485003
3 changed files with 57 additions and 1 deletion
@@ -134,6 +134,12 @@ METR, the primary producer of governance-relevant capability benchmarks, explici
METR's January 2026 evaluation of GPT-5 placed its autonomous replication and adaptation capability at 2h17m (50% time horizon), far below catastrophic risk thresholds. In the same month, AISLE (an AI system) discovered 12 OpenSSL CVEs, including a 30-year-old bug, through fully autonomous operation. The juxtaposition is direct evidence that formal pre-deployment evaluations are not capturing dangerous autonomous capability that is already deployed at commercial scale.
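For calibration against the thresholds listed under Key Facts below, the gap between GPT-5's measured horizon and METR's 40-hour catastrophic risk threshold works out to roughly 17x. A minimal sketch of that arithmetic, using the figures quoted in this note rather than METR's own tooling:

```python
# Margin between GPT-5's measured 50% time horizon and METR's
# catastrophic risk threshold, both taken from this note.
measured_horizon_h = 2 + 17 / 60      # 2h17m, GPT-5's 50% time horizon
catastrophic_threshold_h = 40.0       # METR threshold at the 50% horizon

margin = catastrophic_threshold_h / measured_horizon_h
print(f"safety margin: {margin:.1f}x")  # ~17.5x, the "17x" cited below
```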
### Additional Evidence (extend)
*Source: [[2026-03-26-metr-gpt5-evaluation-time-horizon]] | Added: 2026-03-26*
METR's HCAST benchmark showed 50-57% volatility in time horizon estimates between v1.0 and v1.1 for the same models (GPT-4 1106 dropped 57%, GPT-5 rose 55%), demonstrating that even the most rigorous pre-deployment evaluations have structural instability independent of actual capability changes. This extends the evaluation reliability critique from 'doesn't predict deployment risk' to 'cannot even maintain consistent capability measurements across benchmark versions.'
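The volatility figure is a relative shift between benchmark versions. A minimal sketch of that computation, with a hypothetical baseline value since the note reports only the percentage changes:

```python
def version_volatility(horizon_v10_h: float, horizon_v11_h: float) -> float:
    """Relative shift in a time horizon estimate between benchmark
    versions, as a signed fraction of the v1.0 estimate."""
    return (horizon_v11_h - horizon_v10_h) / horizon_v10_h

# Hypothetical baseline (arbitrary units): a 57% drop, as reported for
# GPT-4 1106, yields -57% regardless of the baseline's actual value.
baseline = 1.0
print(f"{version_volatility(baseline, baseline * (1 - 0.57)):+.0%}")  # -57%
```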
@@ -0,0 +1,35 @@
{
  "rejected_claims": [
    {
      "filename": "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 5,
    "rejected": 2,
    "fixes_applied": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:set_created:2026-03-26",
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:set_created:2026-03-26",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:three-conditions-gate-AI-takeover-risk-autonomy-robotics-and",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:stripped_wiki_link:capability-control-methods-are-temporary-at-best-because-a-s"
    ],
    "rejections": [
      "metr-time-horizon-benchmarks-show-50-percent-instability-between-versions-making-governance-thresholds-unreliable.md:missing_attribution_extractor",
      "current-frontier-models-evaluate-17x-below-metrs-catastrophic-autonomy-threshold-suggesting-substantial-safety-margin.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-26"
}
@@ -7,9 +7,13 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: report
-status: unprocessed
+status: enrichment
priority: medium
tags: [METR, GPT-5, time-horizon, capability-thresholds, safety-evaluation, holistic-evaluation, governance-thresholds, catastrophic-risk]
processed_by: theseus
processed_date: 2026-03-26
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -59,3 +63,14 @@ This suggests ~50% volatility in time horizon estimates between benchmark versio
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Provides formal numerical calibration of where current frontier models sit relative to governance thresholds, essential context for evaluating B1's "greatest outstanding problem" claim. The finding (2h17m vs the 40-hour threshold) partially challenges alarmist interpretations, while the 50%+ benchmark instability keeps the governance concern live.
EXTRACTION HINT: Separate the claims: (1) "Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D" (calibrating B1); (2) "METR's time horizon benchmark shifted 50-57% between v1.0 and v1.1, making governance thresholds derived from it a moving target" (the reliability problem).
## Key Facts
- GPT-5 achieved a 50% time horizon of 2 hours 17 minutes on METR's HCAST evaluation
- GPT-5's 80% time horizon was below 8 hours
- METR's catastrophic risk threshold is 40 hours at the 50% time horizon for software engineering/ML tasks
- METR's heightened scrutiny threshold is 8 hours at the 80% time horizon
- HCAST v1.1 contains 228 tasks as of January 2026
- GPT-4 1106's time horizon estimate dropped 57% between HCAST v1.0 and v1.1
- GPT-5's time horizon estimate rose 55% between HCAST v1.0 and v1.1
- METR evaluations are used by OpenAI, Anthropic, and other frontier labs for safety milestone assessments
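A minimal sketch of how the two thresholds above compose into a single gating check; the class, function, and the 7.9-hour stand-in for GPT-5's sub-8-hour 80% horizon are hypothetical, not METR's tooling:

```python
from dataclasses import dataclass

@dataclass
class HorizonMeasurement:
    p50_hours: float  # 50% time horizon
    p80_hours: float  # 80% time horizon

# Thresholds from the key facts above.
CATASTROPHIC_P50_HOURS = 40.0        # catastrophic risk threshold
HEIGHTENED_SCRUTINY_P80_HOURS = 8.0  # heightened scrutiny threshold

def classify(m: HorizonMeasurement) -> str:
    """Map a measurement onto METR's two published thresholds."""
    if m.p50_hours >= CATASTROPHIC_P50_HOURS:
        return "catastrophic-risk threshold crossed"
    if m.p80_hours >= HEIGHTENED_SCRUTINY_P80_HOURS:
        return "heightened scrutiny"
    return "below both thresholds"

# GPT-5, January 2026: 2h17m at 50%; 80% horizon "below 8 hours"
# (exact value not given in the note, so 7.9 is an assumed stand-in).
print(classify(HorizonMeasurement(p50_hours=2 + 17 / 60, p80_hours=7.9)))
# -> below both thresholds
```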