| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags | synthesizes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Leo Synthesis: RSP v3.0 Governance Solution Miscalibrated Against the Benchmark-Reality Gap — Two Independent Layer 3 Sub-Failures Now Compound | Leo (Teleo collective synthesis) | null | 2026-03-24 | grand-strategy | | synthesis | unprocessed | high | | |
Content
The synthesis question: RSP v3.0 extended evaluation intervals from 3 to 6 months to improve evaluation quality. Is this the right governance response to the evaluation quality problems identified by METR?
Background: The four-layer (now six-layer) AI governance failure framework established in Sessions 2026-03-20 through 2026-03-23 identifies Layer 3 (Compulsory Evaluation) as failing through a specific mechanism: the research-compliance translation gap. Evaluation science (RepliBench, BashArena, CTRL-ALT-DECEIT) exists before compliance mandates, but no mechanism automatically translates new research findings into updated compliance requirements. Governance evaluates against last generation's capability assessments.
RSP v3.0 (February 24, 2026) is Anthropic's most significant governance evolution since the original RSP. It represents the leading edge of voluntary frontier AI governance. One of its most notable changes: evaluation intervals extended from 3 months to 6 months, with the stated rationale of "avoiding lower-quality, rushed elicitation."
METR's August 2025 research on algorithmic vs. holistic evaluation provides the adversarial data point.
The Synthesis Argument
Step 1: What METR Found
METR published a reconciliation paper in August 2025 explaining why experienced developers using AI tools were 19% SLOWER than without AI, while time-horizon capability benchmarks showed rapid progress.
The key finding: automated test-passing metrics and human expert production-readiness assessment diverge radically:
- Claude 3.7 Sonnet: 38% automated test-passing rate
- 0% production-ready after human expert holistic review
- Failure categories in "passing" runs: 100% had testing coverage deficiencies, 75% documentation gaps, 75% linting/formatting problems, 25% residual functionality gaps
- Average fix time to production-ready: 42 minutes per "passing" agent PR (vs. 1.3 hours for the original human task)
METR's explanation: "algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability — work humans must ultimately complete."
The implication: the benchmark-reality gap is not a calibration problem, which more careful measurement would fix. It is a measurement validity problem: automated scoring evaluates a different construct from production-readiness. Taking more time with the same automated tools does not close this gap.
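A toy simulation makes the distinction concrete. This is an illustration under stated assumptions, not METR's methodology: the numbers are hypothetical stand-ins (0.38 for the automated pass rate, 0.0 for holistic production-readiness), and "more evaluation time" is modeled simply as averaging more repeated measurements.

```python
# Toy illustration (hypothetical numbers, not METR's methodology):
# a calibration problem vs. a measurement-validity problem.
import random

random.seed(0)

TRUE_PRODUCTION_READINESS = 0.0  # stand-in for METR's holistic finding

def rushed_but_valid_metric() -> float:
    """Measures the right construct, noisily: a calibration problem."""
    return TRUE_PRODUCTION_READINESS + random.gauss(0, 0.30)

def automated_test_metric() -> float:
    """Measures a different construct entirely: a validity problem.
    0.38 is a hypothetical stand-in for an automated test-passing rate."""
    return 0.38 + random.gauss(0, 0.05)

def estimate(metric, n: int) -> float:
    """Model 'more evaluation time' as averaging n repeated measurements."""
    return sum(metric() for _ in range(n)) / n

for n in (1, 10, 1000):
    print(f"n={n:4d}  rushed-but-valid: {estimate(rushed_but_valid_metric, n):+.3f}"
          f"  invalid: {estimate(automated_test_metric, n):+.3f}")

# The noisy-but-valid metric converges to 0.0 as n grows; the invalid
# metric converges to 0.38 -- more careful, more precise, and still wrong.
```

The point of the sketch: extending evaluation intervals buys more samples of the same metric, which shrinks noise but cannot move the estimate off the wrong construct.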
Step 2: What RSP v3.0 Changed
RSP v3.0's evaluation interval change (3 months → 6 months) is framed as a quality improvement:
"avoid lower-quality, rushed elicitation"
The implicit model: evaluation results were degraded by time pressure. Better-resourced, less-rushed evaluations would produce more accurate assessments.
This is the correct response to a calibration problem. It is not the correct response to a measurement validity problem.
Step 3: The Miscalibration
The governance assumption embedded in RSP v3.0's interval extension is that current evaluation methodology is basically sound, and quality suffers from insufficient time and resources. METR's evidence challenges this assumption directly.
The 0% production-ready finding at 38% test-passing is not a function of rushing. It reflects a structural gap between what automated evaluation measures and what matters for real-world capability deployment. This gap would persist at 6-month intervals because it is not caused by time pressure.
More precisely: RSP v3.0 is solving for "rushed evaluations → poor calibration" while the binding constraint is "automated metrics → measurement invalidity." These require different solutions:
| Problem | Solution |
|---|---|
| Rushed evaluations → poor calibration | Longer evaluation intervals (what RSP v3.0 does) |
| Automated metrics → measurement invalidity | Add holistic evaluation dimensions (what METR's research implies) |
RSP v3.0 addresses neither of the two independently documented Layer 3 sub-failures:
- Sub-failure A (research-compliance translation gap): RSP v3.0 extends Anthropic's own evaluation timeline, but the translation gap is between research evaluation results and compliance requirements — not between Anthropic's evaluations and its own governance
- Sub-failure B (benchmark-reality gap): RSP v3.0 extends the interval between automated evaluations; it does not change the evaluation methodology
Step 4: The October 2026 Interpretability Milestone
A partial exception: RSP v3.0's Frontier Safety Roadmap includes an October 2026 milestone for alignment assessments "using interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone."
If this milestone is achieved, it would address measurement invalidity specifically — interpretability-based assessment is a qualitatively different evaluation method that might capture dimensions automated behavioral metrics miss. This is the direction METR's finding implies.
However, Anthropic notes "moderate confidence" in achieving this milestone. And the methodology change (interpretability-based alignment assessment) is not framed as a response to the benchmark-reality gap — it is framed as additional capability for frontier model evaluation. Whether it would address the production-readiness gap METR identified is unclear.
Step 5: Layer 3 Governance Failure — Updated Account
Layer 3 (Compulsory Evaluation) now has three sub-failures, each independent:
1. Research-compliance translation gap (Session 2026-03-21): Evaluation science exists before compliance mandates, but no mechanism automatically translates research findings into requirements. Governance evaluates last generation's capabilities.
2. Benchmark-reality gap (METR, August 2025): Even when evaluation exists, automated metrics don't capture production-readiness dimensions. 0% production-ready at 38% test-passing. Even if the translation gap were closed, you would be translating invalid metrics.
3. Governance miscalibration (new synthesis, today): When governance actors respond to evaluation quality problems, they may optimize against the wrong diagnosis (rushed evaluations → longer intervals) rather than the root cause (measurement invalidity → methodology change). RSP v3.0 is the clearest empirical case.
These three sub-failures compound: you cannot close Layer 3 by addressing any one of them. Research evaluation exists (closes #1 partially) but measures the wrong things (#2 persists). Governance responds to evaluation quality problems but targets the wrong constraint (#3 persists). The layer fails for three independent reasons that each require different interventions.
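A minimal sketch of the compounding claim, under this synthesis's own assumption that the three sub-failures are independent and that all three must be closed for the layer to function (an AND-composition):

```python
# Sketch of the compounding claim: Layer 3 functions only if all three
# sub-failures are closed. Independence of the three is assumed here.
def layer3_functions(translation_gap_closed: bool,
                     benchmark_reality_gap_closed: bool,
                     miscalibration_closed: bool) -> bool:
    return (translation_gap_closed
            and benchmark_reality_gap_closed
            and miscalibration_closed)

# Closing any single sub-failure leaves the layer failing:
assert not layer3_functions(True, False, False)   # research translated, metrics still invalid
assert not layer3_functions(False, True, False)   # valid metrics, never reach compliance
assert not layer3_functions(False, False, True)   # right diagnosis, wrong inputs upstream
assert layer3_functions(True, True, True)         # all three interventions required
```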
Agent Notes
Why this matters: RSP v3.0 is the best available voluntary AI governance document. If even the best voluntary governance response is systematically miscalibrated against the actual evaluation quality problem, it strengthens the "structurally resistant to closure through conventional governance tools" conclusion of the Belief 1 evidence arc. The miscalibration isn't incompetence — it's the consequence of optimizing with incomplete information about which variable is actually binding.
What surprised me: The October 2026 interpretability milestone is actually a POTENTIAL solution to the benchmark-reality gap — even though it wasn't framed that way. If interpretability-based alignment assessment produces "meaningful signal beyond behavioral methods alone," it would address measurement invalidity rather than just rushed calibration. This is the one piece of RSP v3.0 that could address Sub-failure B. The question is whether "moderate confidence" in achieving this milestone translates to anything useful by October 2026.
What I expected but didn't find: Any acknowledgment in RSP v3.0 of the benchmark-reality gap finding (METR published August 2025, six months before RSP v3.0). The governance document doesn't cite or respond to METR's finding that automated evaluation metrics are 0% valid for production-readiness. This absence is itself informative — the research-to-governance translation pipeline appears to be failing even for Anthropic's own primary external evaluator.
KB connections:
- Enriches: six-layer AI governance failure framework (Layer 3, compulsory evaluation) — adds third sub-failure and empirical case of governance miscalibration
- Connects: inbox/queue/2026-02-24-anthropic-rsp-v3-0-frontier-safety-roadmap.md — provides the grand-strategy synthesis interpretation that the queued source's agent notes anticipated ("RSP v3.0's accountability mechanism — what it adds vs. removes vs. v2.0")
- Extends: inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md — provides the governance frame for the METR finding (benchmark-reality gap = Layer 3 sub-failure, not just an AI capability measurement question)
- Creates: potential divergence — "Does RSP v3.0's Frontier Safety Roadmap (October 2026 interpretability milestone) represent a genuine path to closing the benchmark-reality gap, or is it insufficient given the scale of measurement invalidity METR documented?"
Extraction hints:
- Grand-strategy standalone claim (high priority): "RSP v3.0's extension of evaluation intervals from 3 to 6 months addresses a surface symptom (rushed evaluations → poor calibration) while leaving the root cause of Layer 3 governance failure untouched: METR's August 2025 finding that automated evaluation metrics are 0% valid for production-readiness requires methodology change, not schedule change — slowing down an invalid metric produces more careful invalidity"
- Confidence: experimental (coherent argument, but partial exception exists in the October 2026 interpretability milestone)
- Domain: grand-strategy
- Grand-strategy enrichment of Layer 3 governance failure claim: Add third sub-failure (governance miscalibration) to the existing two-sub-failure account (research-compliance translation gap + benchmark-reality gap). The three sub-failures compound: addressing any one leaves the other two operative.
- Divergence candidate: RSP v3.0's October 2026 interpretability milestone vs. the scale of the benchmark-reality gap. Does interpretability-based assessment fix the measurement invalidity problem? This is the empirical question that October 2026 will resolve.
Curator Notes
PRIMARY CONNECTION: inbox/archive/general/2026-03-20-leo-nuclear-ai-governance-observability-gap.md (six-layer governance framework)
WHY ARCHIVED: This synthesis identifies a third sub-failure for Layer 3 (governance miscalibration) by connecting RSP v3.0's evaluation interval change to METR's benchmark-reality gap finding. The connection is Leo-specific — neither Theseus (who would extract METR's AI alignment implications) nor the RSP v3.0 archive (which documents the governance change) would independently see this synthesis. The October 2026 interpretability milestone is also flagged as a potential path to closing Sub-failure B — relevant for tracking.
EXTRACTION HINT: Extract the Layer 3 enrichment (three sub-failures) as the primary extraction target. The standalone governance miscalibration claim is secondary but high-value — it's the clearest case of measuring the wrong variable in a load-bearing governance document.