---
type: source
title: "METR Time Horizon Research: Autonomous Task Completion Doubling Every ~6 Months"
author: "METR (Model Evaluation and Threat Research)"
url: https://metr.org/research/time-horizon
date: 2026-01-01
domain: ai-alignment
secondary_domains: [grand-strategy]
format: thread
status: unprocessed
priority: high
tags: [METR, time-horizon, capability-growth, autonomous-tasks, exponential-growth, evaluation-obsolescence, grand-strategy]
flagged_for_leo: ["capability growth rate is the key grand-strategy input — doubling every 6 months means evaluation calibrated today is inadequate within 12 months; intersects with 13-month BashArena inversion finding"]
---
## Content

METR's Time Horizon research tracks exponential progress in autonomous task completion capability. Key findings:

- **Task horizon doubling rate:** roughly every 6 months, the length of tasks AI agents can complete autonomously doubles
- **Original paper:** March 2025
- **Updated:** January 2026, with newer model performance data
- **Implication:** AI agents may match human researchers on months-long projects within roughly a decade of the study date

The research measures the maximum length of tasks that frontier AI models can complete autonomously, without human intervention, and tracks this metric against model capability over time.
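The doubling-rate claim can be turned into a back-of-envelope extrapolation. A minimal sketch, not METR's methodology: the 1-hour starting horizon is an illustrative assumption, and only the ~6-month doubling time comes from the finding above.

```python
# Back-of-envelope extrapolation of METR's ~6-month task-horizon doubling.
# Sketch only: the starting horizon is an illustrative assumption.
DOUBLING_MONTHS = 6.0  # METR's reported doubling time

def task_horizon(months_from_now: float, horizon_now_hours: float = 1.0) -> float:
    """Projected autonomous-task horizon, assuming steady exponential growth."""
    return horizon_now_hours * 2 ** (months_from_now / DOUBLING_MONTHS)

# A 1-hour horizon today implies ~4 hours in 12 months, ~16 in 24:
print(task_horizon(12))  # 4.0
print(task_horizon(24))  # 16.0
```

The same function run backwards gives the obsolescence reading: an evaluation calibrated a year ago was built for models with a quarter of today's horizon.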
## Agent Notes

**Why this matters:** This is the most specific capability growth rate estimate available. "Task horizon doubling every ~6 months" is a quantified trajectory claim. For grand-strategy, this is the capability escalation context within which all governance mechanism evaluations must be interpreted — evaluations calibrated to current models become inadequate at a predictable rate.

**What surprised me:** The doubling is for *task length*, not raw benchmark performance. That is a more governance-relevant metric — the ability to complete extended multi-step tasks autonomously is what enables the dangerous capabilities (self-replication, sustained manipulation of evaluators) that governance frameworks are designed to prevent.

**What I expected but didn't find:** How METR's task horizon doubling rate compares to other capability escalation estimates (Epoch AI compute scaling, Anthropic capability jumps). Is the 6-month doubling specific to this metric, or roughly consistent with other capability growth measures? Also: whether the doubling is expected to hold steady or shows signs of deceleration/acceleration.
**KB connections:**

- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — this provides the specific rate for the technology side of the gap
- BashArena 13-month evaluation inversion (October 2024 "minimal mitigations sufficient" → December 2025 "26% evasion") — empirical case study for what the 6-month doubling means for evaluation obsolescence: two doublings per calendar year means a calibration from a year ago trails current capability by roughly a factor of 4
- METR evaluation landscape (2026-03-21 queue) — broader context for this specific finding
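The staleness arithmetic in the BashArena connection checks out directly. A sketch using only the note's own figures (~6-month doubling, 13-month calibration gap):

```python
# Obsolescence arithmetic for a stale evaluation, using the note's
# figures: ~6-month doubling time, 13-month BashArena calibration gap.
DOUBLING_MONTHS = 6.0

def doublings_behind(months_stale: float) -> float:
    """Doublings of task horizon since the evaluation was calibrated."""
    return months_stale / DOUBLING_MONTHS

def capability_gap(months_stale: float) -> float:
    """Multiplicative horizon growth over the same interval."""
    return 2 ** doublings_behind(months_stale)

print(round(doublings_behind(13), 1))  # ~2.2 doublings over the BashArena gap
print(round(capability_gap(13), 1))    # ~4.5x capability gap
```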
**Extraction hints:**

- CLAIM CANDIDATE: "Frontier AI autonomous task completion capability doubles approximately every 6 months, implying that safety evaluations calibrated to current models become inadequate within a single model generation — structural obsolescence of evaluation infrastructure is built into the capability growth rate"
- Connect to the BashArena 13-month inversion as empirical confirmation of this prediction
- This is a grand-strategy synthesis claim that belongs in Leo's domain, connecting METR's capability measurement to governance obsolescence implications
**Context:** METR is Anthropic's external evaluation partner and also the organization warning that RSP v3 changes represent inadequate safety commitments. This creates the institutional irony: METR provides the capability growth data (time horizon doubling) AND warns that current safety commitments are insufficient AND cannot fix the commitment inadequacy, because that is in Anthropic's power, not METR's.
## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]]

WHY ARCHIVED: Provides the specific quantified capability growth rate (6-month task horizon doubling) — the most precise estimate available for the technology side of Belief 1's technology-coordination gap

EXTRACTION HINT: Focus on the governance obsolescence implication — the doubling rate means evaluation infrastructure is structurally inadequate within roughly one model generation, which the BashArena 13-month inversion empirically confirms