---
type: source
title: "METR GPT-5 Evaluation: 50% Time Horizon at 2h17m — Far Below 40-Hour Catastrophic Risk Threshold"
author: "METR (@METR_evals)"
url: https://evaluations.metr.org/gpt-5-report/
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: report
status: unprocessed
priority: medium
tags: [METR, GPT-5, time-horizon, capability-thresholds, safety-evaluation, holistic-evaluation, governance-thresholds, catastrophic-risk]
---

## Content

METR's formal evaluation of GPT-5 (early 2026) uses their updated methodology, which incorporates both algorithmic and holistic assessment elements.

**GPT-5 capability assessment:**
- 50% time horizon: **2 hours 17 minutes**
- 80% time horizon: below the 8-hour threshold
- METR's conclusion: GPT-5 is "very unlikely to pose a catastrophic risk" via AI R&D automation, rogue replication, or strategic sabotage

**METR's formal catastrophic risk thresholds:**

- Heightened scrutiny required: 80% time horizon exceeding **8 hours** on high-context tasks
- Strong concern level: 50% time horizon exceeding **40 hours** on software engineering/ML tasks
- GPT-5 is well below both thresholds (the two triggers compose into the toy check below)
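
Read literally, the two triggers are a simple conjunction-free check. A minimal sketch; the `h80` value passed in is an assumed placeholder, since the report above only bounds GPT-5's 80% horizon below 8 hours:

```python
# Hypothetical helper encoding METR's two published trigger levels (hours).
# GPT-5's exact 80% time horizon isn't stated above, only that it is below
# the 8-hour level; the 6.0 used in the call is an assumed placeholder.
def threshold_flags(h50: float, h80: float) -> dict:
    return {
        "heightened_scrutiny": h80 > 8,   # 80% horizon over 8h, high-context tasks
        "strong_concern":      h50 > 40,  # 50% horizon over 40h, SWE/ML tasks
    }

print(threshold_flags(h50=2 + 17 / 60, h80=6.0))
# -> {'heightened_scrutiny': False, 'strong_concern': False}
```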

**Methodology updates incorporated in this evaluation:**

- Assurance checklists (holistic quality assessment)
- Reasoning trace analysis (examining how the model reasons, not just what it produces)
- Situational awareness testing (does the model know it's being evaluated?)
- Algorithmic time-horizon metrics (HCAST task suite, 228 tasks as of v1.1; a minimal sketch of this metric follows this list)
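
To make the algorithmic metric concrete: a p% time horizon is estimated by fitting task success probability against log task length, then inverting the fit to find the length at which predicted success crosses p. The sketch below is illustrative only; the task durations, pass/fail outcomes, and fitting details are assumptions, not METR's exact pipeline.

```python
# Minimal sketch of a p% time-horizon estimate: fit success probability
# against log2(task length), then invert the fitted curve. Task durations
# and outcomes are made up; METR's estimator and task weighting may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])
solved       = np.array([1, 1,  1,  1,  1,   0,   1,   0,   0])

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, solved)

def time_horizon(p: float) -> float:
    """Task length (minutes) at which predicted success equals p."""
    # logit(p) = intercept + coef * log2(t)  =>  solve for t
    logit = np.log(p / (1 - p))
    log2_t = (logit - clf.intercept_[0]) / clf.coef_[0, 0]
    return float(2.0 ** log2_t)

print(f"50% time horizon: {time_horizon(0.5):.0f} minutes")
print(f"80% time horizon: {time_horizon(0.8):.0f} minutes")
```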

**HCAST v1.1 instability note:**

Between HCAST v1.0 and v1.1 (January 2026), model-specific time-horizon estimates shifted substantially:

- GPT-4 1106: dropped 57%
- GPT-5: rose 55%

This suggests ~50% volatility in time-horizon estimates between benchmark versions, independent of actual capability change (a rough sense of scale is sketched below).
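
For scale: assuming, purely for illustration, the ~6-month doubling time METR previously estimated (contested, per the notes below), shifts of this size mimic several months of real capability change on their own.

```python
# Convert benchmark-version shifts into capability-equivalent months,
# under the assumed (contested) ~6-month time-horizon doubling time.
import math

DOUBLING_MONTHS = 6
shifts = {"GPT-4 1106 (-57%)": 0.43, "GPT-5 (+55%)": 1.55}

for label, ratio in shifts.items():
    months = math.log2(ratio) * DOUBLING_MONTHS
    print(f"{label}: {months:+.1f} months of apparent change")
# -> GPT-4 1106: about -7.3 months; GPT-5: about +3.8 months
```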

## Agent Notes

**Why this matters:** The GPT-5 evaluation provides the most current formal capability-threshold assessment for a frontier model. The 2h17m finding (vs. the 40-hour threshold for serious concern) suggests current frontier models are well below catastrophic autonomy thresholds — by METR's framework, a roughly 17x gap remains. This is a significant finding that partially challenges B1's most alarmist interpretations.

**What surprised me:** How wide the gap still is: 2h17m vs. 40h is ~17x below the threshold. If the doubling time is ~6 months (METR's prior estimate, though now contested), that still leaves ~2+ years before the threshold is approached on this metric (worked through in the sketch below). And the metric may overstate real-world capability by 2-3x per the algorithmic-vs-holistic finding.
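
A back-of-envelope version of that extrapolation, assuming clean exponential growth and the contested ~6-month doubling time (real progress is unlikely to be this smooth):

```python
# How long until the 40-hour threshold, starting from 2h17m, if the
# 50% time horizon doubles every ~6 months (assumed, contested)?
import math

current_h = 2 + 17 / 60        # 2h17m, in hours
threshold_h = 40               # METR's 50%-horizon "strong concern" level
doubling_months = 6            # METR's prior estimate

gap = threshold_h / current_h              # ~17.5x
months = math.log2(gap) * doubling_months  # ~25 months

print(f"gap: {gap:.1f}x -> ~{months:.0f} months to threshold")
```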

**What I expected but didn't find:** Any formal statement from METR about what the gap between benchmark capability (2h17m) and real-world misuse capability (autonomous cyberattack, August 2025) means for their threshold framework. The evaluation doesn't address the misuse-of-aligned-models threat vector.

**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — but the GPT-5 evaluation uses holistic oversight elements precisely because oversight degrades; this is METR adapting to the problem
- [[agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs]] — the formal threshold framework is based on what AI can autonomously research; the misuse framework is about what humans can direct AI to do — different threat models, different governance requirements

**Extraction hints:** The 50%+ benchmark instability between HCAST versions is the primary extraction target. The formal evaluation result (2h17m vs. the 40h threshold) is secondary but contextualizes how far below dangerous autonomy thresholds current frontier models evaluate. Together they frame a nuanced picture: current models are probably not close to catastrophic autonomy thresholds by formal measures, AND those formal measures are unreliable at the ~50% level.

**Context:** METR's evaluations are used by OpenAI, Anthropic, and others for safety milestone assessments. Their frameworks are becoming the de facto standard for formal dangerous-capability evaluation. The GPT-5 evaluation is publicly available and represents METR's current state-of-the-art methodology.

## Curator Notes

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Provides formal numerical calibration of where current frontier models sit relative to governance thresholds — essential context for evaluating B1's "greatest outstanding problem" claim. The finding (2h17m vs. the 40-hour threshold) partially challenges alarmist interpretations, while the 50%+ benchmark instability maintains the governance concern.

EXTRACTION HINT: Separate claims: (1) "Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D" — calibrating B1; (2) "METR's time-horizon benchmark shifted 50-57% between v1.0 and v1.1, making governance thresholds derived from it a moving target" — the reliability problem.