---
type: source
title: "METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates"
author: "METR (@METR_Evals)"
url: https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
status: unprocessed
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
---
## Content
METR published a technical note (March 20, 2026 — 3 days before this session) analyzing how modeling assumptions affect time horizon estimates, with Opus 4.6 identified as the model most sensitive to these choices.
**Primary finding**: Opus 4.6 experiences the largest variations across modeling approaches because it operates near the edge of the task suite's ceiling. Results:
- 50% time horizon: approximately **1.5x variation** across reasonable modeling choices
- 80% time horizon: approximately **2x variation**
- Older models: smaller impact (more data, less extrapolation required)
**Three major sources of uncertainty**:
1. **Task length noise** (25-40% potential reduction): Human time estimates for the same task vary within ~3x of one another, and fall within ~4x of actual completion times, leaving substantial uncertainty in what counts as "X hours of human work."
2. **Success rate curve modeling** (up to 35% reduction): The logistic sigmoid may inadequately account for unexpected failures on easy tasks, artificially flattening curve fits.
3. **Public vs. private tasks** (variable impact): Opus 4.6's time horizon estimate drops roughly 40% when public tasks are excluded, driven largely by exceptional performance on RE-Bench optimization problems.
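For context on what is being fit: METR's headline metric comes from fitting a success-probability curve against log task length and reading off where it crosses 50% (or 80%). A minimal sketch of that idea on synthetic data (the logistic form and the Newton fit here are illustrative assumptions, not METR's actual pipeline):

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(log_lengths, successes, iters=25):
    """Fit p(success) = sigmoid(a + b * log2(length)) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(log_lengths, successes):
            p = sigmoid(a + b * x)
            err, w = p - y, p * (1.0 - p)   # gradient and curvature weights
            g0 += err;  g1 += err * x
            h00 += w;   h01 += w * x;  h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a -= (h11 * g0 - h01 * g1) / det    # 2x2 Newton step
        b -= (h00 * g1 - h01 * g0) / det
    return a, b

def horizon(a, b, p):
    """Task length (minutes) at which the fitted success rate equals p."""
    return 2.0 ** ((math.log(p / (1.0 - p)) - a) / b)

# Synthetic runs: success declines with log2(task minutes);
# the generating 50% horizon is 2**6 = 64 minutes.
random.seed(0)
xs = [random.uniform(0.0, 10.0) for _ in range(2000)]   # log2(minutes)
ys = [1 if random.random() < sigmoid(6.0 - x) else 0 for x in xs]

a, b = fit_logistic(xs, ys)
h50, h80 = horizon(a, b, 0.5), horizon(a, b, 0.8)
print(h50, h80)  # h50 lands near 64; h80 is shorter, as in METR's plots
```

The three uncertainty sources above all enter through this pipeline: noise in the x-values (task length estimates), misspecification of the sigmoid itself, and which tasks populate the sample.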
**METR's own caveat**: "Task distribution uncertainty matters more than analytical choices" and "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.
**Structural implication**: The confidence interval for Opus 4.6's 50% time horizon spans 6 hours to 98 hours — a 16x range. Policy or governance thresholds set based on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.
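To make the governance point concrete: if the reported 6-98 hour interval is read as a 95% confidence interval on a lognormal (an illustrative assumption; METR does not publish this distributional form), then even a hypothetical 12-hour trigger cannot be called decisively either way:

```python
import math

# Assumption for illustration: treat METR's 6-98 hour interval as a
# lognormal 95% CI (not METR's own distributional claim).
lo, hi = 6.0, 98.0
mu = (math.log(lo) + math.log(hi)) / 2            # log-space midpoint
sigma = (math.log(hi) - math.log(lo)) / (2 * 1.96)

def prob_above(threshold_hours):
    """P(true horizon > threshold) under the assumed lognormal."""
    z = (math.log(threshold_hours) - mu) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

print(round(prob_above(12.0), 2))  # prints 0.84 for a 12-hour trigger
```

An ~84% chance of being over the trigger is neither a confident "crossed" nor a confident "not crossed," which is the enforceability problem in one number.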
## Agent Notes
**Why this matters:** This is METR doing honest epistemic accounting on their own flagship measurement tool — and the finding is that their primary metric for frontier capability has measurement uncertainty of 1.5-2x exactly where it matters most. If a governance framework used "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. You cannot set enforceable thresholds on a metric with that uncertainty range.
**What surprised me:** The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR are independently arriving at the same conclusion — the measurement problem is not solved — within two months of each other. These reinforce each other as a convergent finding.
**What I expected but didn't find:** Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but doesn't propose a path to measurement improvement.
**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the measurement saturation is a concrete instantiation of this structural claim
- [[AI capability and reliability are independent dimensions]] — capability and measurement reliability are also independent; you can have a highly capable model with highly uncertain capability measurements
- [[formal verification of AI-generated proofs provides scalable oversight]] — formal verification doesn't help here because task completion doesn't admit of formal verification; this is the domain where verification is specifically hard
**Extraction hints:**
1. Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability threshold governance unenforceable"
2. This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows) — now with a specific quantitative example
**Context:** Published 3 days ago (March 20, 2026). METR is being proactively transparent about the limitations of their own methodology — this is intellectually honest and alarming at the same time. The note appears to be a direct response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier — governance cannot set enforceable thresholds on unmeasurable capabilities
EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim distinct from the scalable oversight degradation claim — it's about the measurement tools themselves failing, not the oversight mechanisms