---
type: source
title: "METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates"
author: "METR (@METR_Evals)"
url: https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
status: unprocessed
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
---

## Content

METR published a technical note (March 20, 2026, three days before this session) analyzing how modeling assumptions affect time horizon estimates, with Opus 4.6 identified as the model most sensitive to these choices.

**Primary finding**: Opus 4.6 shows the largest variation across modeling approaches because it operates near the ceiling of the task suite. Results:

- 50% time horizon: approximately **1.5x variation** across reasonable modeling choices
- 80% time horizon: approximately **2x variation**
- Older models: smaller impact (more data, less extrapolation required)

**Three major sources of uncertainty**:

1. **Task length noise** (25-40% potential reduction): Human time estimates for the same task vary within roughly 3x of each other, and within roughly 4x of actual completion times, leaving substantial uncertainty in what counts as "X hours of human work."
2. **Success rate curve modeling** (up to 35% reduction): The logistic sigmoid may inadequately account for unexpected failures on easy tasks, artificially flattening curve fits.
3. **Public vs. private tasks** (variable impact): Opus 4.6's estimate drops 40% when public tasks are excluded, driven by exceptional performance on RE-Bench optimization problems. Excluding those specific public benchmarks lowers the time horizon estimate substantially.

**METR's own caveat**: "Task distribution uncertainty matters more than analytical choices" and is "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.
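The extrapolation the note critiques can be sketched concretely. The following is an illustrative toy (synthetic task outcomes, plain gradient descent, not METR's code or task suite): it fits a logistic curve of success probability against log2 task length, reads the 50% horizon off the fit, and shows how dropping a subset of tasks, analogous to excluding public benchmarks, moves the estimate.

```python
import math

def fit_logistic(data, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by gradient descent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, success in data:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += p - success          # gradient of log loss w.r.t. a
            gb += (p - success) * x    # gradient of log loss w.r.t. b
        a -= lr * ga / len(data)
        b -= lr * gb / len(data)
    return a, b

def horizon_50(a, b):
    """Task length (minutes) at which the fitted success probability is 50%."""
    return 2 ** (-a / b)

# Synthetic outcomes: (estimated human minutes, model succeeded?)
tasks = [(2, 1), (4, 1), (8, 1), (15, 1), (30, 1), (60, 1),
         (120, 1), (240, 0), (480, 1), (960, 0), (1920, 0)]

a, b = fit_logistic(tasks)
print(f"50% horizon: {horizon_50(a, b):.0f} min")

# Refit after dropping the longest tasks (analogous to excluding a
# benchmark subset): the estimated horizon shifts noticeably.
a2, b2 = fit_logistic([t for t in tasks if t[0] < 480])
print(f"50% horizon, longest tasks excluded: {horizon_50(a2, b2):.0f} min")
```

The sensitivity the note describes falls out directly: near the suite's ceiling, a handful of long-task outcomes dominates the fitted slope, so small changes in which tasks are included swing the extrapolated horizon.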
**Structural implication**: The confidence interval for Opus 4.6's 50% time horizon spans 6 hours to 98 hours, a 16x range. Policy or governance thresholds set on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.

## Agent Notes

**Why this matters:** This is METR doing honest epistemic accounting on its own flagship measurement tool, and the finding is that its primary metric for frontier capability carries 1.5-2x measurement uncertainty exactly where it matters most. If a governance framework used a "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. You cannot set enforceable thresholds on a metric with that uncertainty range.

**What surprised me:** The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR independently arrived at the same conclusion, within two months of each other: the measurement problem is not solved. The two reinforce each other as a convergent finding.

**What I expected but didn't find:** Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but doesn't propose a path to measurement improvement.
**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: the measurement saturation is a concrete instantiation of this structural claim
- [[AI capability and reliability are independent dimensions]]: capability and measurement reliability are also independent; a highly capable model can have highly uncertain capability measurements
- [[formal verification of AI-generated proofs provides scalable oversight]]: formal verification doesn't help here because task completion doesn't admit of formal verification; this is a domain where verification is specifically hard

**Extraction hints:**

1. Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability-threshold governance unenforceable"
2. This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows), now with a specific quantitative example

**Context:** Published 3 days ago (March 20, 2026). METR is being proactively transparent about the limitations of its own methodology, which is intellectually honest and alarming at the same time. The note appears to be a response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.
## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier; governance cannot set enforceable thresholds on unmeasurable capabilities

EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim, distinct from the scalable-oversight degradation claim: it's about the measurement tools themselves failing, not the oversight mechanisms