diff --git a/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md b/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
new file mode 100644
index 00000000..abed7c55
--- /dev/null
+++ b/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
@@ -0,0 +1,55 @@
---
type: source
title: "METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates"
author: "METR (@METR_Evals)"
url: https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
status: processed
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
---

## Content

METR published a technical note (March 20, 2026, three days before this session) analyzing how modeling assumptions affect time horizon estimates, with Opus 4.6 identified as the model most sensitive to these choices.

**Primary finding**: Opus 4.6 shows the largest variation across modeling approaches because it operates near the ceiling of the task suite. Results:
- 50% time horizon: approximately **1.5x variation** across reasonable modeling choices
- 80% time horizon: approximately **2x variation**
- Older models: smaller impact (more data at the relevant difficulty levels, less extrapolation required)

**Three major sources of uncertainty**:
1. **Task length noise** (25-40% potential reduction): Human time estimates for the same task vary within roughly 3x, and estimates fall within roughly 4x of actual completion times, so there is substantial uncertainty in what counts as "X hours of human work."
2. **Success rate curve modeling** (up to 35% reduction): The logistic sigmoid may inadequately account for unexpected failures on easy tasks, artificially flattening curve fits.
3. **Public vs. private tasks** (variable impact): Opus 4.6's estimate drops roughly 40% when public tasks are excluded, driven by its exceptional performance on RE-Bench optimization problems.

**METR's own caveat**: "Task distribution uncertainty matters more than analytical choices" and is "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.

**Structural implication**: The confidence interval for Opus 4.6's 50% time horizon spans 6 to 98 hours, a roughly 16x range. Policy or governance thresholds set on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.
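**Illustrative sketch (not from the note):** a minimal reconstruction of the kind of estimation being described, assuming the standard METR setup of fitting a logistic curve of success probability against log2(human task length) and reading off the length at which the curve crosses 50% or 80%. The data, the `fit_logistic` and `horizon` helpers, and the exclusion rule below are all hypothetical; the truncation step is only a crude stand-in for dropping a public benchmark such as RE-Bench.

```python
# Hypothetical data and fit, illustrating the time horizon methodology.
# P(success) is modeled as sigmoid(a - b * log2(task length in hours)).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task suite: the true 50% horizon is 2**3 = 8 hours.
log_t = rng.uniform(-4, 6, 300)                   # log2(hours): ~4 min to 64 h
p_true = 1 / (1 + np.exp(log_t - 3.0))            # success prob falls with length
success = (rng.random(300) < p_true).astype(float)

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a - b*x) by gradient ascent on log-likelihood."""
    a, b = 0.0, 0.1
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(a - b * x)))
        a += lr * np.mean(y - p)                  # d(loglik)/da
        b += lr * np.mean((y - p) * -x)           # d(loglik)/db
    return a, b

def horizon(a, b, p=0.5):
    """Task length (hours) at which the fitted curve crosses probability p."""
    return 2 ** ((a - np.log(p / (1 - p))) / b)   # solve sigmoid(a - b*x) = p

a, b = fit_logistic(log_t, success)
print(f"full suite: 50% = {horizon(a, b, 0.5):.1f} h, 80% = {horizon(a, b, 0.8):.1f} h")

# Sensitivity to task inclusion: drop the longest third of tasks, so the
# model's 50% point now lies beyond the retained data and must be
# extrapolated, as for a frontier model near the suite's ceiling.
keep = log_t < np.quantile(log_t, 2 / 3)
a2, b2 = fit_logistic(log_t[keep], success[keep])
print(f"truncated:  50% = {horizon(a2, b2, 0.5):.1f} h, 80% = {horizon(a2, b2, 0.8):.1f} h")
```

Because the 80% horizon sits further out on the extrapolated tail of the fitted curve, it moves more when the fit shifts, which is consistent with the note's roughly 2x variation at 80% versus 1.5x at 50%.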
## Agent Notes

**Why this matters:** This is METR doing honest epistemic accounting on their own flagship measurement tool, and the finding is that their primary metric for frontier capability has measurement uncertainty of 1.5-2x exactly where it matters most. If a governance framework used a "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. You cannot set enforceable thresholds on a metric with that uncertainty range.

**What surprised me:** The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR are independently arriving at the same conclusion, that the measurement problem is not solved, within two months of each other. The two findings reinforce each other as a convergent result.

**What I expected but didn't find:** Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but does not propose a path to improving the measurements.

**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: the measurement saturation is a concrete instantiation of this structural claim
- [[AI capability and reliability are independent dimensions]]: capability and measurement reliability are also independent; a highly capable model can have highly uncertain capability measurements
- [[formal verification of AI-generated proofs provides scalable oversight]]: formal verification does not help here because open-ended task completion does not admit of formal verification; this is exactly the domain where verification is hard

**Extraction hints:**
1. Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability threshold governance unenforceable"
2. This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows), now with a specific quantitative example

**Context:** Published three days before this session (March 20, 2026). METR is being proactively transparent about the limitations of their own methodology, which is intellectually honest and alarming at the same time. The note appears to be a response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier; governance cannot set enforceable thresholds on unmeasurable capabilities
EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim, distinct from the scalable oversight degradation claim: it is about the measurement tools themselves failing, not the oversight mechanisms