teleo-codex/inbox/queue/2026-01-29-metr-time-horizon-1-1-methodology-update.md
Teleo Agents 2e195f01b6 extract: 2026-01-29-metr-time-horizon-1-1-methodology-update
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-23 00:19:33 +00:00

type: source
title: "METR Time Horizon 1.1: Capability Doubling Every 131 Days, Task Suite Approaching Saturation"
author: METR (@METR_Evals)
url: https://metr.org/blog/2026-1-29-time-horizon-1-1/
date: 2026-01-29
domain: ai-alignment
secondary_domains:
format: blog-post
status: enrichment
priority: high
tags: metr, time-horizon, capability-measurement, evaluation-methodology, autonomy, scaling, saturation
processed_by: theseus
processed_date: 2026-03-23
extraction_model: anthropic/claude-sonnet-4.5

Content

METR published an updated version of their autonomous AI capability measurement framework (Time Horizon 1.1) on January 29, 2026.

Core metric: Task-completion time horizon — the task duration (measured by human expert completion time) at which an AI agent succeeds with a given level of reliability. A 50% time horizon of 4 hours means the model succeeds on roughly half of the tasks that would take an expert human 4 hours.
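
As a rough illustration of how such a horizon can be estimated (this is not METR's actual fitting code; the per-task data and the simple logistic fit below are assumptions for the sketch), success can be regressed against log task duration and the 50% horizon read off where the fitted curve crosses 0.5:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical per-task results: human expert completion time (minutes)
# and whether the agent succeeded (1) or failed (0).
times = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def neg_log_likelihood(params, x, y):
    # Logistic model: P(success) = sigmoid(a * log2(time) + b)
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.log2(times)
fit = minimize(neg_log_likelihood, x0=np.array([-1.0, 5.0]), args=(x, success))
a, b = fit.x

# The 50% horizon is the duration where a * log2(t) + b = 0, i.e. t = 2 ** (-b / a).
print(f"Estimated 50% time horizon: {2 ** (-b / a):.0f} minutes")
```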

Updated methodology:

  • Expanded task suite from 170 to 228 tasks (34% growth)
  • Long tasks (8+ hours) more than doubled, from 14 to 31
  • Infrastructure migrated from in-house Vivaria to open-source Inspect framework (developed by UK AI Security Institute)
  • Upper confidence bound for Opus 4.5 decreased from 4.4x to 2.3x the point estimate due to tighter task coverage

Revised growth rate: The estimated doubling time drops from 165 to 131 days, roughly 20% shorter, meaning measured progress is correspondingly faster under the new framework. The shift reflects differences in the task distribution rather than the infrastructure change alone.
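
A minimal sketch of what a 131-day doubling time implies, using the Claude Opus 4.6 point estimate listed below as a hypothetical starting value (this assumes the trend stays purely exponential, which the saturation caveats later in this note call into question):

```python
# Hedged illustration only: projects the 50% horizon forward assuming the
# trend stays exponential with a 131-day doubling time.
BASE_HORIZON_MIN = 719      # Claude Opus 4.6 point estimate (~12 hours)
DOUBLING_DAYS = 131

def projected_horizon_minutes(days_ahead: float) -> float:
    """50% horizon after `days_ahead` days under pure exponential growth."""
    return BASE_HORIZON_MIN * 2 ** (days_ahead / DOUBLING_DAYS)

for days in (131, 262, 365):
    print(f"+{days:>3} days: ~{projected_horizon_minutes(days) / 60:.0f} hours")
# +131 days: ~24 hours; +262 days: ~48 hours; +365 days: ~83 hours
```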

Model performance estimates (50% success horizon):

  • Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) [from time-horizons page; later revised to ~14.5 hours per METR direct announcement]
  • GPT-5.2 (Dec 2025): ~352 minutes
  • Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
  • GPT-5.1 Codex Max (Nov 2025): ~162 minutes
  • GPT-5 (Aug 2025): ~214 minutes
  • o3 (Apr 2025): ~91 minutes
  • Claude 3.7 Sonnet (Feb 2025): ~60 minutes
  • GPT-4 Turbo (2024): 3-10 minutes
  • GPT-2 (2019): ~0.04 minutes

Saturation problem: METR acknowledges that only 5 of the 31 long tasks have measured human baseline times; the remainder rely on estimates. Frontier models are approaching the ceiling of the evaluation framework.

Methodology caveat: Different model versions employ varying scaffolds (modular-public, flock-public, triframe_inspect), which may affect comparability.

Agent Notes

Why this matters: The 131-day doubling time for autonomous task capability is the most precise available quantification of the capability-governance gap. At this rate, a model whose horizon sits at the human-12-hour mark today reaches the 24-hour mark in ~4 months and the 48-hour mark in ~8-9 months, while policy cycles operate on 12-24 month timescales.

What surprised me: The task suite is already saturating for frontier models, and this is acknowledged explicitly. The measurement infrastructure is failing to keep pace with the capabilities it's supposed to measure — this is a concrete instance of B4 (verification degrades faster than capability grows), now visible in the primary autonomous capability metric itself.

What I expected but didn't find: Any plan for addressing the saturation problem, whether by expanding the long-horizon portion of the task suite or by developing alternative measurement approaches for capabilities beyond the current ceiling. Neither appears in the methodology documentation.

KB connections:

Extraction hints: Multiple potential claims:

  1. "AI autonomous task capability is doubling every 131 days while governance policy cycles operate on 12-24 month timescales, creating a structural measurement lag"
  2. "Evaluation infrastructure for frontier AI capability is saturating at precisely the capability level where oversight matters most"
  3. Consider updating the existing claim "scalable oversight degrades rapidly..." with this quantitative data

Context: METR (Model Evaluation and Threat Research) is the primary independent evaluator of frontier AI autonomous capabilities. Their time-horizon metric has become the de facto standard for measuring dangerous autonomous capability development. This update matters because: (1) it tightens the growth rate estimate, and (2) it acknowledges the measurement ceiling problem before it becomes a crisis.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps

WHY ARCHIVED: Quantifies the capability-governance gap with the most precise measurement available; reveals measurement infrastructure itself is failing for frontier models

EXTRACTION HINT: Two claims possible — one on the doubling rate as governance timeline mismatch; one on evaluation saturation as a new instance of B4. Check whether the doubling rate number updates or supersedes existing claims.

Key Facts

  • METR Time Horizon 1.1 expanded task suite from 170 to 228 tasks (34% growth)
  • Long tasks (8+ hours) more than doubled in the updated framework, from 14 to 31
  • Only 5 of 31 long tasks have measured human baseline times; remainder use estimates
  • Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) 50% success horizon, later revised to ~14.5 hours
  • GPT-5.2 (Dec 2025): ~352 minutes 50% success horizon
  • Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
  • GPT-4 Turbo (2024): 3-10 minutes 50% success horizon
  • Infrastructure migrated from in-house Vivaria to open-source Inspect framework (UK AI Security Institute)
  • Different model versions use varying scaffolds: modular-public, flock-public, triframe_inspect