Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | processed_by | processed_date | enrichments_applied | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | METR Time Horizon 1.1: Updated Capability Estimates with New Infrastructure | METR (Model Evaluation and Threat Research) | https://metr.org/blog/2026-1-29-time-horizon-1-1/ | 2026-01-29 | ai-alignment | | research-report | enrichment | high | | theseus | 2026-03-24 | | anthropic/claude-sonnet-4.5 |
Content
METR's updated time horizon methodology (TH1.1) with new evaluation infrastructure. Published January 29, 2026.
Capability doubling time estimates (a sketch of the underlying log-linear arithmetic follows this list):
- Full historical trend (2019-2025): ~196 days (about 6.5 months)
- Since 2023 (TH1.1): 131 days, roughly 20% faster than the previous 165-day estimate
- Since 2024 (TH1.1): 89 days, "notably faster" than the prior 109-day figure
- Trend appears "slightly less linear" under new methodology, though within confidence intervals
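How a doubling time falls out of the data: fit a line to log2(time horizon) against calendar date; the doubling time is the reciprocal of the slope. A minimal sketch with made-up points chosen to land on a 131-day doubling (METR's actual pipeline fits per-task models and bootstraps confidence intervals; nothing below is their data):

```python
import numpy as np

# Illustrative (days elapsed, 50% time horizon in minutes) points;
# NOT METR's data, just values constructed to double every 131 days.
days = np.array([0.0, 131.0, 262.0, 393.0])
horizon_min = np.array([30.0, 60.0, 120.0, 240.0])

# Log-linear fit: log2(horizon) = intercept + slope * t,
# so the doubling time is 1 / slope (in days).
slope, intercept = np.polyfit(days, np.log2(horizon_min), 1)
print(f"doubling time: {1.0 / slope:.0f} days")  # -> 131 days

# A 131-day doubling compounds to about 2^(365/131) ~= 6.9x per year.
print(f"implied annual growth: {2 ** (365 / 131):.1f}x")
```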
Infrastructure change: Migrated from Vivaria (proprietary 2023 system) to Inspect (the UK AI Security Institute's open-source framework). Minor scaffold sensitivity effects were found: GPT-4o and o3 performed slightly better under Vivaria, suggesting some models are sensitive to prompting and scaffold details.
Task suite changes:
- More than doubled long-duration tasks (8+ hours), from 14 to 31
- Only 5 of 31 long tasks have actual human baseline times; remainder use estimates
- Original task count and distribution not fully specified in public summary
Model 50% time horizon estimates (TH1.1; a sketch of how a 50% horizon is read off a fitted success curve follows the note below):
- Claude Opus 4.5: 320 minutes (~5.3 hours) [revised upward from earlier estimate]
- GPT-5: 214 minutes
- o3: 121 minutes
- Claude Opus 4: 101 minutes
- Claude Sonnet 3.7: 60 minutes
- GPT-4 variants: 35-57% downward revisions
Note: Claude Opus 4.6 (released February 2026) does NOT appear in TH1.1 — it post-dates this paper. The ~14.5 hour estimate discussed in Anthropic's sabotage risk context came from a different evaluation process.
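Where the 50% figures above come from: METR's published methodology fits a logistic curve of success probability against log task duration, and the 50% time horizon is the duration at which that curve crosses 0.5. A minimal sketch under assumed, illustrative success rates (the real pipeline fits per-task binary outcomes, not pre-bucketed rates, and bootstraps confidence intervals):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical success rates by human-baseline task length; illustrative only.
minutes = np.array([4.0, 16.0, 60.0, 240.0, 960.0])
success_rate = np.array([0.95, 0.85, 0.60, 0.30, 0.10])

def p_success(log2_minutes, a, b):
    # Logistic curve of success probability over log2 task duration.
    return 1.0 / (1.0 + np.exp(-(a - b * log2_minutes)))

(a, b), _ = curve_fit(p_success, np.log2(minutes), success_rate, p0=(3.0, 0.5))

# The 50% time horizon is where the fitted curve crosses 0.5:
# a - b * log2(m) = 0  =>  m = 2 ** (a / b)
print(f"50% time horizon: {2 ** (a / b):.0f} minutes")
```

Saturation shows up here directly: when a model succeeds on nearly every task in the suite, the curve's 0.5 crossing falls beyond the sampled durations and the horizon estimate becomes an extrapolation.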
Explicit saturation acknowledgment: METR states: "even our Time Horizon 1.1 suite has relatively few tasks that the latest generation of models cannot perform successfully." They prioritize "updates to our evaluations so they can measure the capabilities of very strong models."
Plan for saturation: "Raising the ceiling of our capabilities measurements" through continued task suite expansion. No specific numerical targets or timeline specified.
Governance implications not addressed: The document does not explicitly discuss how wide confidence intervals affect governance threshold enforcement. Opus 4.5's upper bound is 2.3× its point estimate.
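The practical effect of that spread is plain arithmetic. Against a hypothetical 8-hour enforcement threshold (not a real METR or lab threshold), the point estimate and the upper bound give opposite answers:

```python
# Illustrative only: how a wide CI interacts with a hypothetical threshold.
point_min = 320               # Opus 4.5 point estimate, in minutes
upper_min = 2.3 * point_min   # reported upper-bound ratio -> ~736 minutes
threshold_min = 8 * 60        # hypothetical 8-hour governance threshold

print(f"point estimate: {point_min / 60:.1f} h, clears: {point_min < threshold_min}")
print(f"upper bound:    {upper_min / 60:.1f} h, clears: {upper_min < threshold_min}")
# -> 5.3 h clears the threshold; 12.3 h does not.
```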
Agent Notes
Why this matters: TH1.1 is the primary empirical basis for the "131-day doubling" claim central to the six-layer governance inadequacy arc. Understanding exactly what this measures — and its saturation problem — is critical for calibrating B1 urgency.
What surprised me: The scaffold sensitivity finding (GPT-4o, o3 performing better under Vivaria than Inspect) suggests that the time horizon metric is not fully scaffold-independent — model performance varies by evaluation infrastructure in a way that affects capability estimates. This is a measurement reliability problem that complements the task saturation problem.
What I expected but didn't find: A specific plan or timeline for how METR will measure models when they exceed the current 8+ hour task ceiling. "Raising the ceiling" without specifics leaves the saturation problem unaddressed for the next capability generation.
KB connections:
- verification degrades faster than capability grows — task suite saturation is behavioral verification degrading: the measurement tool designed to track capability growth is being outrun by the capability it tracks
- market dynamics erode human oversight — if the primary oversight-relevant metric saturates, market dynamics have an additional advantage: labs can claim evaluation clearance on a metric that doesn't detect their most dangerous capabilities
Extraction hints: Primary claim: METR time horizon saturation is now explicitly acknowledged rather than implied — the primary capability measurement tool is being outrun by frontier model capabilities at exactly the capability level that matters for governance. Secondary claim: Scaffold sensitivity (Vivaria vs. Inspect performance differences) introduces additional uncertainty in cross-model comparisons that is not typically disclosed in governance contexts.
Context: Published January 29, 2026. METR is the primary external evaluator conducting pre-deployment capability assessments for Anthropic and other frontier labs. This is their most complete public methodology statement and the basis for the "131-day doubling time" claim that has been central to AI safety policy discussions in 2026.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: verification degrades faster than capability grows
WHY ARCHIVED: TH1.1 provides the empirical grounding for "131-day doubling time" and simultaneously the evidence that the measurement tool tracking that doubling is saturating. The saturation acknowledgment from METR itself is the most reliable source for this claim.
EXTRACTION HINT: The extractor should distinguish between two separate findings: (1) capability is doubling every 131 days — this is a finding; (2) the measurement tool for this doubling is saturating — this is also a finding. Both can be true simultaneously and both deserve separate KB claims. The saturation finding specifically challenges the reliability of the doubling-time estimate itself.
Key Facts
- METR's full historical trend (2019-2025) estimates a 196-day capability doubling time
- METR's TH1.1 estimates a 131-day capability doubling time since 2023 (about 20% faster than the previous 165-day estimate)
- METR's TH1.1 estimates an 89-day capability doubling time since 2024
- Claude Opus 4.5 achieved a 320-minute (~5.3-hour) time horizon in TH1.1
- GPT-5 achieved 214-minute time horizon in TH1.1
- o3 achieved 121-minute time horizon in TH1.1
- METR more than doubled long-duration tasks in TH1.1, from 14 to 31
- Only 5 of 31 long tasks in TH1.1 have actual human baseline times
- GPT-4 variants saw 35-57% downward revisions in TH1.1 estimates