| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | METR: Modeling Assumptions Create 1.5-2x Variation in Opus 4.6 Time Horizon Estimates | METR (@METR_Evals) | https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/ | 2026-03-20 | ai-alignment | | technical-note | unprocessed | high | |

Content
METR published a technical note (March 20, 2026, three days before this session) analyzing how modeling assumptions affect time horizon estimates; Opus 4.6 is identified as the model most sensitive to these choices.
Primary finding: Opus 4.6 shows the largest variation across modeling approaches because it operates near the ceiling of the task suite. Results (a sketch of how a horizon is read off the fitted curve follows this list):
- 50% time horizon: approximately 1.5x variation across reasonable modeling choices
- 80% time horizon: approximately 2x variation
- Older models: smaller impact (more data, less extrapolation required)
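A minimal sketch of where these horizon numbers come from, assuming the standard setup of fitting a two-parameter logistic to success rate as a function of log2 human task time and inverting it at p = 0.5 or p = 0.8 (the data points and starting values below are invented for illustration, not METR's):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_len, a, b):
    """P(success) as a logistic in log2(task length in minutes)."""
    return 1.0 / (1.0 + np.exp(a * (log_len - b)))

def horizon(p, a, b):
    """Invert the fit: task length (minutes) at which P(success) = p."""
    return 2.0 ** (b + np.log(1.0 / p - 1.0) / a)

# Invented (task length in minutes, observed success rate) pairs.
lengths = np.array([1, 4, 15, 60, 240, 960, 3840])
success = np.array([0.98, 0.95, 0.85, 0.70, 0.45, 0.20, 0.05])

(a, b), _ = curve_fit(logistic, np.log2(lengths), success, p0=[1.0, 8.0])
print(f"50% horizon: {horizon(0.5, a, b) / 60:.1f} h, "
      f"80% horizon: {horizon(0.8, a, b) / 60:.1f} h")
```

Every modeling choice upstream of this fit (which tasks enter, how their lengths are estimated, how the curve is parameterized) shifts the inverted horizon; that sensitivity is what the 1.5-2x variation is measuring.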
Three major sources of uncertainty:
- Task length noise (25-40% potential reduction): Human time estimates for the same task vary within roughly 3x of each other, and fall only within roughly 4x of actual completion times, leaving substantial uncertainty in what counts as "X hours of human work."
- Success rate curve modeling (up to 35% reduction): The logistic sigmoid may inadequately account for unexpected failures on easy tasks, artificially flattening the fitted curve (a lapse-rate variant is sketched after this list).
- Public vs. private tasks (variable impact): Opus 4.6's estimate drops roughly 40% when public tasks are excluded, driven by exceptional performance on RE-Bench optimization problems; much of the headline time horizon rests on those few public benchmarks.
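One standard remedy for the curve-modeling issue above, borrowed from psychometric fitting, is a lapse-rate parameter: let the fitted ceiling sit below 1.0 so that occasional failures on trivially easy tasks are absorbed by the ceiling instead of flattening the slope. The note does not say whether METR adopts this, so the sketch below is a generic illustration with invented data:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b):
    """Plain two-parameter logistic: forced toward 1.0 on the easiest tasks."""
    return 1.0 / (1.0 + np.exp(a * (x - b)))

def logistic_lapse(x, a, b, lam):
    """Logistic with lapse rate lam: the ceiling is (1 - lam), so
    unexpected failures on easy tasks do not drag the slope down."""
    return (1.0 - lam) / (1.0 + np.exp(a * (x - b)))

# Invented data with near-ceiling misses on the easiest tasks (x = log2 minutes).
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([0.93, 0.92, 0.85, 0.65, 0.40, 0.15, 0.04])

(a0, b0), _ = curve_fit(logistic, x, y, p0=[1.0, 7.0])
(a1, b1, lam), _ = curve_fit(logistic_lapse, x, y, p0=[1.0, 7.0, 0.05],
                             bounds=([0.0, 0.0, 0.0], [10.0, 15.0, 0.5]))

x50 = b1 + np.log(1.0 - 2.0 * lam) / a1  # solve (1 - lam)/(1 + e^{a(x-b)}) = 0.5
print(f"plain: slope={a0:.2f}, 50% horizon={2**b0 / 60:.1f} h")
print(f"lapse: slope={a1:.2f}, lam={lam:.2f}, 50% horizon={2**x50 / 60:.1f} h")
```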
METR's own caveat: "Task distribution uncertainty matters more than analytical choices" and "often a factor of 2 in both directions." The confidence intervals are wide because the extrapolation is genuinely uncertain.
Structural implication: The confidence interval for Opus 4.6's 50% time horizon spans 6 hours to 98 hours — a 16x range. Policy or governance thresholds set based on time horizon measurements would face enormous uncertainty about whether any specific model had crossed them.
Agent Notes
Why this matters: This is METR doing honest epistemic accounting on their own flagship measurement tool — and the finding is that their primary metric for frontier capability has measurement uncertainty of 1.5-2x exactly where it matters most. If a governance framework used "12-hour task horizon" as a trigger for mandatory evaluation requirements, METR's own methodology would produce confidence intervals spanning 6-98 hours. You cannot set enforceable thresholds on a metric with that uncertainty range.
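To make that concrete, a minimal sketch under my own assumptions (not METR's analysis): treat the reported 6-98 hour interval as a 95% CI of a lognormal distribution over the true 50% horizon, and ask how decisively a hypothetical 12-hour trigger could be adjudicated:

```python
import numpy as np
from scipy.stats import norm

# Assumption for illustration: the 6-98 h interval is a 95% CI of a
# lognormal over the true 50% time horizon.
lo, hi = 6.0, 98.0
mu = (np.log(lo) + np.log(hi)) / 2.0            # log-space midpoint (~24 h)
sigma = (np.log(hi) - np.log(lo)) / (2 * 1.96)  # 95% CI spans +/- 1.96 sigma

threshold = 12.0  # hypothetical governance trigger, in hours
p_crossed = 1.0 - norm.cdf((np.log(threshold) - mu) / sigma)
print(f"P(true horizon > {threshold:.0f} h) ~= {p_crossed:.2f}")
# Prints ~0.84: "probably crossed," with roughly a 1-in-6 chance it is not.
# Neither outcome is decisive enough to enforce a regulatory trigger on.
```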
What surprised me: The connection to RSP v3.0's admission ("the science of model evaluation isn't well-developed enough"). Anthropic and METR are independently arriving at the same conclusion — the measurement problem is not solved — within two months of each other. These reinforce each other as a convergent finding.
What I expected but didn't find: Any proposed solutions to the saturation/uncertainty problem. The note describes the problem with precision but doesn't propose a path to measurement improvement.
KB connections:
- scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — the measurement saturation is a concrete instantiation of this structural claim
- AI capability and reliability are independent dimensions — capability and measurement reliability are also independent; you can have a highly capable model with highly uncertain capability measurements
- formal verification of AI-generated proofs provides scalable oversight — formal verification doesn't help here because task completion doesn't admit of formal verification; this is the domain where verification is specifically hard
Extraction hints:
- Candidate claim: "The primary autonomous capability evaluation metric (METR time horizon) has 1.5-2x measurement uncertainty for frontier models because task suites saturate before frontier capabilities do, creating a measurement gap that makes capability threshold governance unenforceable"
- This could also be framed as an update to B4 (Belief 4: verification degrades faster than capability grows) — now with a specific quantitative example
Context: Published 3 days ago (March 20, 2026). METR is being proactively transparent about the limitations of its own methodology, which is intellectually honest and alarming at the same time. The note appears to be a response to the very wide confidence intervals in the Opus 4.6 time horizon estimate.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
WHY ARCHIVED: Direct evidence that the primary capability measurement tool has 1.5-2x uncertainty at the frontier; governance cannot set enforceable thresholds on unmeasurable capabilities
EXTRACTION HINT: The "measurement saturation" concept may deserve its own claim, distinct from the scalable oversight degradation claim: it is about the measurement tools themselves failing, not the oversight mechanisms