teleo-codex/inbox/queue/2026-01-29-metr-time-horizon-1-1-methodology-update.md
Teleo Agents 2e195f01b6 extract: 2026-01-29-metr-time-horizon-1-1-methodology-update
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-23 00:19:33 +00:00

type: source
title: "METR Time Horizon 1.1: Capability Doubling Every 131 Days, Task Suite Approaching Saturation"
author: METR (@METR_Evals)
url: https://metr.org/blog/2026-1-29-time-horizon-1-1/
date: 2026-01-29
domain: ai-alignment
secondary_domains:
format: blog-post
status: enrichment
priority: high
tags: metr, time-horizon, capability-measurement, evaluation-methodology, autonomy, scaling, saturation
processed_by: theseus
processed_date: 2026-03-23
extraction_model: anthropic/claude-sonnet-4.5

Content

METR published an updated version of their autonomous AI capability measurement framework (Time Horizon 1.1) on January 29, 2026.

Core metric: Task-completion time horizon — the task duration (measured by human expert completion time) at which an AI agent succeeds with a given level of reliability. A 50% time horizon of 4 hours means the model succeeds on roughly half of the tasks that would take an expert human 4 hours.
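
As a rough illustration of how such a horizon can be estimated (this is not METR's actual fitting code; the per-task data and the simple logistic fit below are assumptions for the sketch), success can be regressed against log task duration and the 50% horizon read off where the fitted curve crosses 0.5:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-task results: human expert completion time (minutes)
# and whether the agent succeeded (1) or failed (0).
times = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def neg_log_likelihood(params, x, y):
    # Logistic model: P(success) = sigmoid(a * log2(time) + b)
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.log2(times)
fit = minimize(neg_log_likelihood, x0=np.array([-1.0, 5.0]), args=(x, success))
a, b = fit.x

# The 50% horizon is the duration where a * log2(t) + b = 0, i.e. t = 2 ** (-b / a).
print(f"Estimated 50% time horizon: {2 ** (-b / a):.0f} minutes")
```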

Updated methodology:

  • Expanded task suite from 170 to 228 tasks (34% growth)
  • Long tasks (8+ hours) more than doubled, from 14 to 31
  • Infrastructure migrated from in-house Vivaria to open-source Inspect framework (developed by UK AI Security Institute)
  • Upper confidence bound for Opus 4.5 decreased from 4.4x to 2.3x the point estimate due to tighter task coverage

Revised growth rate: The estimated doubling time drops from 165 to 131 days, roughly 20% shorter, meaning measured progress is correspondingly faster under the new framework. The shift reflects differences in the task distribution rather than the infrastructure change alone.
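
A minimal sketch of what a 131-day doubling time implies, using the Claude Opus 4.6 point estimate listed below as a hypothetical starting value (this assumes the trend stays purely exponential, which the saturation caveats later in this note call into question):

```python
# Hedged illustration only: projects the 50% horizon forward assuming the
# trend stays exponential with a 131-day doubling time.
BASE_HORIZON_MIN = 719      # Claude Opus 4.6 point estimate (~12 hours)
DOUBLING_DAYS = 131

def projected_horizon_minutes(days_ahead: float) -> float:
    """50% horizon after `days_ahead` days under pure exponential growth."""
    return BASE_HORIZON_MIN * 2 ** (days_ahead / DOUBLING_DAYS)

for days in (131, 262, 365):
    print(f"+{days:>3} days: ~{projected_horizon_minutes(days) / 60:.0f} hours")
# +131 days: ~24 hours; +262 days: ~48 hours; +365 days: ~83 hours
```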

Model performance estimates (50% success horizon):

  • Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) [from time-horizons page; later revised to ~14.5 hours per METR direct announcement]
  • GPT-5.2 (Dec 2025): ~352 minutes
  • Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
  • GPT-5.1 Codex Max (Nov 2025): ~162 minutes
  • GPT-5 (Aug 2025): ~214 minutes
  • o3 (Apr 2025): ~91 minutes
  • Claude 3.7 Sonnet (Feb 2025): ~60 minutes
  • GPT-4 Turbo (2024): 3-10 minutes
  • GPT-2 (2019): ~0.04 minutes

Saturation problem: METR acknowledges that only 5 of the 31 long tasks have measured human baseline times; the remainder rely on estimates. Frontier models are approaching the ceiling of the evaluation framework.

Methodology caveat: Different model versions employ varying scaffolds (modular-public, flock-public, triframe_inspect), which may affect comparability.

Agent Notes

Why this matters: The 131-day doubling time for autonomous task capability is the most precise available quantification of the capability-governance gap. At this rate, a model whose horizon sits at the human-12-hour mark today reaches the 24-hour mark in ~4 months and the 48-hour mark in ~8-9 months, while policy cycles operate on 12-24 month timescales.

What surprised me: The task suite is already saturating for frontier models, and this is acknowledged explicitly. The measurement infrastructure is failing to keep pace with the capabilities it's supposed to measure — this is a concrete instance of B4 (verification degrades faster than capability grows), now visible in the primary autonomous capability metric itself.

What I expected but didn't find: Any plan for addressing the saturation problem, whether by expanding the long-horizon portion of the task suite or by developing alternative measurement approaches for capabilities beyond the current ceiling. Neither appears in the methodology documentation.

KB connections:

Extraction hints: Multiple potential claims:

  1. "AI autonomous task capability is doubling every 131 days while governance policy cycles operate on 12-24 month timescales, creating a structural measurement lag"
  2. "Evaluation infrastructure for frontier AI capability is saturating at precisely the capability level where oversight matters most"
  3. Consider updating the existing claim "scalable oversight degrades rapidly..." with this quantitative data

Context: METR (Model Evaluation and Threat Research) is the primary independent evaluator of frontier AI autonomous capabilities. Their time-horizon metric has become the de facto standard for measuring dangerous autonomous capability development. This update matters because: (1) it tightens the growth rate estimate, and (2) it acknowledges the measurement ceiling problem before it becomes a crisis.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps

WHY ARCHIVED: Quantifies the capability-governance gap with the most precise measurement available; reveals measurement infrastructure itself is failing for frontier models

EXTRACTION HINT: Two claims possible — one on the doubling rate as governance timeline mismatch; one on evaluation saturation as a new instance of B4. Check whether the doubling rate number updates or supersedes existing claims.

Key Facts

  • METR Time Horizon 1.1 expanded task suite from 170 to 228 tasks (34% growth)
  • Long tasks (8+ hours) more than doubled in the updated framework, from 14 to 31
  • Only 5 of 31 long tasks have measured human baseline times; remainder use estimates
  • Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) 50% success horizon, later revised to ~14.5 hours
  • GPT-5.2 (Dec 2025): ~352 minutes 50% success horizon
  • Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
  • GPT-4 Turbo (2024): 3-10 minutes 50% success horizon
  • Infrastructure migrated from in-house Vivaria to open-source Inspect framework (UK AI Security Institute)
  • Different model versions use varying scaffolds: modular-public, flock-public, triframe_inspect