teleo-codex/inbox/queue/2026-01-01-metr-time-horizon-task-doubling-6months.md
Teleo Agents 9671a1bc42 leo: research session 2026-03-21 — 4 sources archived
Pentagon-Agent: Leo <HEADLESS>
2026-03-21 08:07:12 +00:00

50 lines
4.7 KiB
Markdown

---
type: source
title: "METR Time Horizon Research: Autonomous Task Completion Doubling Every ~6 Months"
author: "METR (Model Evaluation and Threat Research)"
url: https://metr.org/research/time-horizon
date: 2026-01-01
domain: ai-alignment
secondary_domains: [grand-strategy]
format: thread
status: unprocessed
priority: high
tags: [METR, time-horizon, capability-growth, autonomous-tasks, exponential-growth, evaluation-obsolescence, grand-strategy]
flagged_for_leo: ["capability growth rate is the key grand-strategy input — doubling every 6 months means evaluation calibrated today is inadequate within 12 months; intersects with 13-month BashArena inversion finding"]
---
## Content
METR's Time Horizon research tracks exponential progress in autonomous task completion capability. Key findings:
- **Task horizon doubling rate:** Approximately every ~6 months, the length of autonomous tasks AI agents can complete increases by a factor of 2
- **Original paper:** March 2025 (initial publication)
- **Updated:** January 2026 (updated with newer model performance data)
- **Implication:** AI agents may match human researchers on months-long projects within approximately a decade from the study date
The research measures the maximum length of tasks that frontier AI models can complete autonomously without human intervention, tracking this metric against model capability over time.
## Agent Notes
**Why this matters:** This is the most specific capability growth rate estimate available. "Task horizon doubling every ~6 months" is a quantified trajectory claim. For grand-strategy, this is the capability escalation context within which all governance mechanism evaluations must be interpreted — evaluations calibrated to current models become inadequate at a predictable rate.
**What surprised me:** The doubling is for *task length*, not raw benchmark performance. This is actually a more governance-relevant metric — the ability to complete extended multi-step tasks autonomously is what enables the dangerous capabilities (self-replication, sustained manipulation of evaluators) that governance frameworks are designed to prevent.
**What I expected but didn't find:** How the METR task horizon doubling rate compares to other capability escalation estimates (Epoch AI compute scaling, Anthropic capability jumps). Is the 6-month doubling specific to this metric or roughly consistent with other capability growth measures? Also: whether the doubling is expected to continue linearly or shows signs of deceleration/acceleration.
**KB connections:**
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — this provides the specific rate for the technology side of the gap
- BashArena 13-month evaluation inversion (October 2024 "minimal mitigations sufficient" → December 2025 "26% evasion") — empirical case study for what the 6-month doubling means for evaluation obsolescence: roughly 2 doublings per calendar year means calibration from 1 year ago is 4 model generations stale
- METR evaluation landscape (2026-03-21 queue) — broader context for this specific finding
**Extraction hints:**
- CLAIM CANDIDATE: "Frontier AI autonomous task completion capability doubles approximately every 6 months, implying that safety evaluations calibrated to current models become inadequate within a single model generation — structural obsolescence of evaluation infrastructure is built into the capability growth rate"
- Connect to BashArena 13-month inversion as empirical confirmation of this prediction
- This is a grand-strategy synthesis claim that belongs in Leo's domain, connecting METR's capability measurement to governance obsolescence implications
**Context:** METR is Anthropic's external evaluation partner and also the organization warning that RSP v3 changes represent inadequate safety commitments. This creates the institutional irony: METR provides the capability growth data (time horizon doubling) AND warns that current safety commitments are insufficient AND cannot fix the commitment inadequacy because that's in Anthropic's power, not METR's.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]]
WHY ARCHIVED: Provides specific quantified capability growth rate (6-month task horizon doubling) — the most precise estimate available for the technology side of Belief 1's technology-coordination gap
EXTRACTION HINT: Focus on the governance obsolescence implication — the doubling rate means evaluation infrastructure is structurally inadequate within roughly one model generation, which the BashArena 13-month inversion empirically confirms