diff --git a/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md b/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
new file mode 100644
index 00000000..abed7c55
--- /dev/null
+++ b/inbox/archive/ai-alignment/2026-03-20-metr-modeling-assumptions-time-horizon-reliability.md
@@ -0,0 +1,55 @@
---
type: source
title: "METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates"
author: "METR (@METR_Evals)"
url: https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
date: 2026-03-20
domain: ai-alignment
secondary_domains: []
format: technical-note
status: processed
priority: high
tags: [metr, time-horizon, measurement-reliability, evaluation-saturation, Opus-4.6, modeling-uncertainty]
---

## Content

METR published a technical note (March 20, 2026, three days before this session) analyzing how modeling assumptions affect time horizon estimates, with Opus 4.6 identified as the model most sensitive to these choices.

**Primary finding**: Opus 4.6 shows the largest variation across modeling approaches because it operates near the ceiling of the task suite. Results:
- 50% time horizon: approximately **1.5x variation** across reasonable modeling choices
- 80% time horizon: approximately **2x variation**
- Older models: smaller impact (more data at the relevant difficulty levels, less extrapolation required)

**Three major sources of uncertainty**:
1. **Task length noise** (25-40% potential reduction): Human time estimates for the same task vary within roughly 3x, and estimates fall within roughly 4x of actual completion times, so there is substantial uncertainty in what counts as "X hours of human work."
2. **Success rate curve modeling** (up to 35% reduction): The logistic sigmoid may inadequately account for unexpected failures on easy tasks, artificially flattening curve fits.
3. **Public vs. private tasks** (variable impact): Opus 4.6's estimate drops roughly 40% when public tasks are excluded, driven by its exceptional performance on RE-Bench optimization problems.

**METR's own caveat**: "Task distribution uncertainty matters more than analytical choices" and is "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.

**Structural implication**: The confidence interval for Opus 4.6's 50% time horizon spans 6 to 98 hours, a roughly 16x range. Policy or governance thresholds set on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.
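**Illustrative sketch (not from the note):** a minimal reconstruction of the kind of estimation being described, assuming the standard METR setup of fitting a logistic curve of success probability against log2(human task length) and reading off the length at which the curve crosses 50% or 80%. The data, the `fit_logistic` and `horizon` helpers, and the exclusion rule below are all hypothetical; the truncation step is only a crude stand-in for dropping a public benchmark such as RE-Bench.

```python
# Hypothetical data and fit, illustrating the time horizon methodology.
# P(success) is modeled as sigmoid(a - b * log2(task length in hours)).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task suite: the true 50% horizon is 2**3 = 8 hours.
log_t = rng.uniform(-4, 6, 300)                   # log2(hours): ~4 min to 64 h
p_true = 1 / (1 + np.exp(log_t - 3.0))            # success prob falls with length
success = (rng.random(300) < p_true).astype(float)

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a - b*x) by gradient ascent on log-likelihood."""
    a, b = 0.0, 0.1
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(a - b * x)))
        a += lr * np.mean(y - p)                  # d(loglik)/da
        b += lr * np.mean((y - p) * -x)           # d(loglik)/db
    return a, b

def horizon(a, b, p=0.5):
    """Task length (hours) at which the fitted curve crosses probability p."""
    return 2 ** ((a - np.log(p / (1 - p))) / b)   # solve sigmoid(a - b*x) = p

a, b = fit_logistic(log_t, success)
print(f"full suite: 50% = {horizon(a, b, 0.5):.1f} h, 80% = {horizon(a, b, 0.8):.1f} h")

# Sensitivity to task inclusion: drop the longest third of tasks, so the
# model's 50% point now lies beyond the retained data and must be
# extrapolated, as for a frontier model near the suite's ceiling.
keep = log_t < np.quantile(log_t, 2 / 3)
a2, b2 = fit_logistic(log_t[keep], success[keep])
print(f"truncated:  50% = {horizon(a2, b2, 0.5):.1f} h, 80% = {horizon(a2, b2, 0.8):.1f} h")
```

Because the 80% horizon sits further out on the extrapolated tail of the fitted curve, it moves more when the fit shifts, which is consistent with the note's roughly 2x variation at 80% versus 1.5x at 50%.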
## Agent Notes

**Why this matters:** This is METR doing honest epistemic accounting on their own flagship measurement tool, and the finding is that their primary metric for frontier capability has measurement uncertainty of 1.5-2x exactly where it matters most. If a governance framework used a "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. You cannot set enforceable thresholds on a metric with that uncertainty range.

**What surprised me:** The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR are independently arriving at the same conclusion, that the measurement problem is not solved, within two months of each other. The two findings reinforce each other as a convergent result.

**What I expected but didn't find:** Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but does not propose a path to improving the measurements.

**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]: the measurement saturation is a concrete instantiation of this structural claim
- [[AI capability and reliability are independent dimensions]]: capability and measurement reliability are also independent; a highly capable model can have highly uncertain capability measurements
- [[formal verification of AI-generated proofs provides scalable oversight]]: formal verification does not help here because open-ended task completion does not admit of formal verification; this is exactly the domain where verification is hard

**Extraction hints:**
1. Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability threshold governance unenforceable"
2. This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows), now with a specific quantitative example

**Context:** Published three days before this session (March 20, 2026). METR is being proactively transparent about the limitations of their own methodology, which is intellectually honest and alarming at the same time. The note appears to be a response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]
WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier; governance cannot set enforceable thresholds on unmeasurable capabilities
EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim, distinct from the scalable oversight degradation claim: it is about the measurement tools themselves failing, not the oversight mechanisms