---
type: source
title: "METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates"
author: "METR (@METR_Evals)"
url: https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
status: unprocessed
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
---

## Content

METR published a technical note (March 20, 2026, three days before this session) analyzing how modeling assumptions affect time horizon estimates; it identifies Opus 4.6 as the model most sensitive to these choices.

**Primary finding**: Opus 4.6 shows the largest variation across modeling approaches because it operates near the ceiling of the task suite. Results:

- 50% time horizon: approximately **1.5x variation** across reasonable modeling choices
- 80% time horizon: approximately **2x variation**
- Older models: smaller impact (more data, less extrapolation required)
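
For intuition on why the 80% horizon varies more than the 50% horizon, note that both are read off a fitted logistic curve of success probability against log task length. The sketch below uses purely illustrative parameters (a 10-hour 50% horizon and two hypothetical slopes, not METR's actual fits) to show that slope uncertainty leaves the 50% horizon untouched while spreading the 80% horizon by roughly 2x:

```python
import math

def horizon(h50_hours, slope, p):
    """Task length (hours) at which the fitted success probability equals p,
    for a logistic curve in log2(task length):
        P(success | t) = sigmoid(slope * (log2(h50) - log2(t)))
    """
    logit_p = math.log(p / (1 - p))
    return h50_hours * 2 ** (-logit_p / slope)

# Two fits that agree on the 50% horizon (10 h, illustrative) but differ
# in slope -- the kind of spread that reasonable curve-fit choices produce.
steep = {p: horizon(10, 1.0, p) for p in (0.5, 0.8)}
flat = {p: horizon(10, 0.6, p) for p in (0.5, 0.8)}

print(steep[0.5], flat[0.5])  # 50% horizons agree: 10 h each
print(steep[0.8], flat[0.8])  # 80% horizons differ by roughly 1.9x
```

Because the 80% horizon sits further down the curve, any uncertainty in the fitted slope is amplified there, consistent with the larger variation METR reports at the 80% level.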

**Three major sources of uncertainty**:

1. **Task length noise** (25-40% potential reduction): Human time estimates for the same task vary by roughly 3x, and estimates fall within roughly 4x of actual completion times, leaving substantial uncertainty about what counts as "X hours of human work."
2. **Success rate curve modeling** (up to 35% reduction): The logistic sigmoid may fail to account for unexpected failures on easy tasks, artificially flattening curve fits.
3. **Public vs. private tasks** (variable impact): Opus 4.6's estimate drops roughly 40% when public tasks are excluded, driven by its exceptional performance on RE-Bench optimization problems.
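
To see how excluding a handful of aced public tasks can move the estimate, here is a toy maximum-likelihood fit of the 50% horizon. All task data and the fixed slope are hypothetical, standing in for the RE-Bench effect; METR's real analysis fits bootstrapped logistic curves over its actual task suite:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_h50(tasks, slope=1.0):
    """Grid-search MLE for the 50% time horizon (hours), slope held fixed.
    tasks: list of (task_length_hours, succeeded) pairs."""
    def loglik(log2_h50):
        ll = 0.0
        for length, ok in tasks:
            p = sigmoid(slope * (log2_h50 - math.log2(length)))
            p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
            ll += math.log(p if ok else 1 - p)
        return ll
    grid = [i / 100 for i in range(-300, 901)]  # h50 from 0.125 h to 512 h
    return 2 ** max(grid, key=loglik)

# Hypothetical results: a core suite plus a few long public tasks the
# model aces (illustrative numbers only).
core = [(0.5, True), (1, True), (2, True), (4, True), (8, True),
        (8, False), (16, True), (16, False), (32, False), (32, False)]
aced_public = [(64, True), (64, True), (128, True)]

h_all = fit_h50(core + aced_public)
h_private = fit_h50(core)
print(h_all, h_private)  # excluding the aced tasks lowers the fitted horizon
```

A few strong successes at the long end pull the whole fitted curve rightward, so removing them drops the estimated horizon substantially, which is the mechanism behind the public-vs-private sensitivity.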

**METR's own caveat**: "Task distribution uncertainty matters more than analytical choices" and is "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.

**Structural implication**: The confidence interval for Opus 4.6's 50% time horizon spans 6 to 98 hours, a roughly 16x range. Policy or governance thresholds set on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.

## Agent Notes

**Why this matters:** This is METR doing honest epistemic accounting on its own flagship measurement tool, and the finding is that its primary metric for frontier capability carries 1.5-2x measurement uncertainty exactly where it matters most. If a governance framework used a "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. Enforceable thresholds cannot be set on a metric with that uncertainty range.

**What surprised me:** The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR independently arrived at the same conclusion, that the measurement problem is not solved, within two months of each other. These reinforce each other as a convergent finding.

**What I expected but didn't find:** Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but doesn't propose a path to measurement improvement.

**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: the measurement saturation is a concrete instantiation of this structural claim
- [[AI capability and reliability are independent dimensions]]: capability and measurement reliability are also independent; you can have a highly capable model with highly uncertain capability measurements
- [[formal verification of AI-generated proofs provides scalable oversight]]: formal verification doesn't help here because task completion doesn't admit of formal verification; this is the domain where verification is specifically hard

**Extraction hints:**

1. Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability threshold governance unenforceable"
2. This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows), now with a specific quantitative example

**Context:** Published March 20, 2026, three days before this session. METR is being proactively transparent about the limitations of its own methodology, which is intellectually honest and alarming at the same time. The note appears to be a response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier; governance cannot set enforceable thresholds on unmeasurable capabilities

EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim, distinct from the scalable oversight degradation claim; it concerns the measurement tools themselves failing, not the oversight mechanisms