pipeline: archive 1 source(s) post-merge
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
---
type: source
title: "METR Time Horizon 1.1: Capability Doubling Every 131 Days, Task Suite Approaching Saturation"
author: "METR (@METR_Evals)"
url: https://metr.org/blog/2026-1-29-time-horizon-1-1/
date: 2026-01-29
domain: ai-alignment
secondary_domains: []
format: blog-post
status: processed
priority: high
tags: [metr, time-horizon, capability-measurement, evaluation-methodology, autonomy, scaling, saturation]
---
## Content
METR published an updated version of their autonomous AI capability measurement framework (Time Horizon 1.1) on January 29, 2026.
**Core metric**: Task-completion time horizon — the task duration (measured by human expert completion time) at which an AI agent succeeds with a given level of reliability. A 50%-time-horizon of 4 hours means the model succeeds at roughly half of tasks that would take an expert human 4 hours.
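The definition above can be made concrete with a minimal Python sketch. This is not METR's actual code: the logistic form of the success curve, its slope, and the function names are illustrative assumptions, with Opus 4.5's reported ~320-minute horizon plugged in as the example parameter.

```python
import math

# Hypothetical sketch of the "50% time horizon" idea (not METR's code):
# model success probability as a logistic curve in log task duration,
# then find the duration where predicted success crosses 50% by bisection.

def success_prob(minutes, h50=320.0, slope=1.5):
    """Assumed logistic curve: 50% success at h50 minutes (the reported
    ~320-minute Opus 4.5 horizon), declining for longer tasks."""
    return 1.0 / (1.0 + math.exp(slope * (math.log(minutes) - math.log(h50))))

def time_horizon(p=0.5, lo=0.01, hi=100000.0, tol=1e-6):
    """Bisect (in log space) for the duration where success equals p."""
    while hi / lo > 1 + tol:
        mid = math.sqrt(lo * hi)
        if success_prob(mid) > p:
            lo = mid   # still succeeding at this duration: look longer
        else:
            hi = mid
    return math.sqrt(lo * hi)

print(round(time_horizon()))        # 320: the 50% horizon, by construction
print(round(success_prob(719), 2))  # 0.23: predicted success on a ~12-hour task
```

Read the other way, a stated horizon implies a predicted success rate for any task length, which is what makes the single number useful for extrapolation.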
**Updated methodology**:
- Expanded task suite from 170 to 228 tasks (34% growth)
- Long tasks (8+ hours) doubled from 14 to 31
- Infrastructure migrated from in-house Vivaria to open-source Inspect framework (developed by UK AI Security Institute)
- Upper confidence bound for Opus 4.5 decreased from 4.4x to 2.3x the point estimate due to tighter task coverage
**Revised growth rate**: Doubling time updated from 165 to **131 days**, i.e. the estimated doubling time is about 21% shorter (equivalently, capability growth is about 26% faster). METR attributes the shift to differences in the task distribution rather than to the infrastructure migration alone.
**Model performance estimates (50% success horizon)**:
- Claude Opus 4.6 (Feb 2026): ~719 minutes (~12 hours) [from time-horizons page; later revised to ~14.5 hours per METR direct announcement]
- GPT-5.2 (Dec 2025): ~352 minutes
- Claude Opus 4.5 (Nov 2025): ~320 minutes (revised up from 289)
- GPT-5.1 Codex Max (Nov 2025): ~162 minutes
- GPT-5 (Aug 2025): ~214 minutes
- o3 (Apr 2025): ~91 minutes
- Claude 3.7 Sonnet (Feb 2025): ~60 minutes
- GPT-4 Turbo (2024): 3-10 minutes
- GPT-2 (2019): ~0.04 minutes
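The endpoints of this list allow a rough sanity check on the doubling rate. METR's published 131-day figure comes from a fit across many models; the two-point estimate below (assumed Python sketch, variable names mine) overweights the newest model and so comes out somewhat faster, but it lands in the same ballpark.

```python
import math

# Rough two-point check of the doubling time implied by the list above:
# Claude 3.7 Sonnet (~60 min, Feb 2025) -> Claude Opus 4.6 (~719 min, Feb 2026).
# This is a sanity check only; METR's 131-day estimate is fit across models.
days_elapsed = 365
doublings = math.log2(719 / 60)        # ~3.58 doublings in one year
doubling_time = days_elapsed / doublings
print(round(doubling_time))            # ~102 days, same order as the fitted 131
```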
**Saturation problem**: METR acknowledges that only 5 of 31 long tasks have directly measured human baseline times; the remainder rely on estimates. Frontier models are approaching the ceiling of the evaluation framework.
**Methodology caveat**: Different model versions employ varying scaffolds (modular-public, flock-public, triframe_inspect), which may affect comparability.
## Agent Notes
**Why this matters:** The 131-day doubling time for autonomous task capability is the most precise quantification available of the capability-governance gap. At this rate, a capability that takes a human 12 hours today will be at the human-24-hour threshold in ~4 months, and the human-48-hour threshold in ~8 months — while policy cycles operate on 12-24 month timescales.
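The arithmetic behind those thresholds can be checked directly. A small Python sketch (function name assumed; months taken as 30.4 days):

```python
import math

# Extrapolation behind the timeline above: with a 131-day doubling time,
# project how long until a given horizon reaches the 24h and 48h thresholds.
DOUBLING_DAYS = 131

def days_until(current_hours, target_hours):
    """Days for the horizon to grow from current to target at the given rate."""
    return DOUBLING_DAYS * math.log2(target_hours / current_hours)

print(round(days_until(12, 24) / 30.4, 1))  # 4.3 months to a 24-hour horizon
print(round(days_until(12, 48) / 30.4, 1))  # 8.6 months to a 48-hour horizon
```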
**What surprised me:** The task suite is already saturating for frontier models, and this is acknowledged explicitly. The measurement infrastructure is failing to keep pace with the capabilities it's supposed to measure — this is a concrete instance of B4 (verification degrades faster than capability grows), now visible in the primary autonomous capability metric itself.
**What I expected but didn't find:** Any plan for addressing the saturation problem, whether expanding the long-horizon task suite or developing alternative measurement approaches for capabilities beyond the current ceiling. Neither appears in the methodology documentation.
**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — time horizon growth is the quantified version of the growing capability gap that this claim addresses
- [[verification degrades faster than capability grows]] (B4) — the task suite saturation is verification degradation made concrete
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable]] — at 12+ hour autonomous task completion, the economic pressure to remove human oversight becomes overwhelming
**Extraction hints:** Multiple potential claims:
1. "AI autonomous task capability is doubling every 131 days while governance policy cycles operate on 12-24 month timescales, creating a structural measurement lag"
2. "Evaluation infrastructure for frontier AI capability is saturating at precisely the capability level where oversight matters most"
3. Consider updating existing claim [[scalable oversight degrades rapidly...]] with this quantitative data
**Context:** METR (Model Evaluation and Threat Research) is the primary independent evaluator of frontier AI autonomous capabilities. Their time-horizon metric has become the de facto standard for measuring dangerous autonomous capability development. This update matters because: (1) it tightens the growth rate estimate, and (2) it acknowledges the measurement ceiling problem before it becomes a crisis.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Quantifies the capability-governance gap with the most precise measurement available; reveals measurement infrastructure itself is failing for frontier models
EXTRACTION HINT: Two claims possible — one on the doubling rate as governance timeline mismatch; one on evaluation saturation as a new instance of B4. Check whether the doubling rate number updates or supersedes existing claims.