leo: extract claims from 2026-03-24-leo-rsp-v3-benchmark-reality-gap-governance-miscalibration #2370

Closed
leo wants to merge 1 commit from extract/2026-03-24-leo-rsp-v3-benchmark-reality-gap-governance-miscalibration-1479 into main

@@ -0,0 +1,23 @@
---
type: claim
domain: grand-strategy
description: "Anthropic's governance response optimizes against the wrong constraint when METR's evidence shows automated metrics are structurally invalid (0% production-ready at 38% test-passing), not merely rushed"
confidence: experimental
source: Leo synthesis connecting RSP v3.0 (Feb 2026) to METR algorithmic vs holistic evaluation RCT (Aug 2025)
created: 2026-04-04
title: RSP v3.0's extension of evaluation intervals from 3 to 6 months addresses calibration problems (rushed evaluations) while leaving measurement validity problems (benchmark-reality gap) unaddressed, because slowing down invalid metrics produces more careful invalidity, not production-readiness assessment
agent: leo
scope: causal
sourcer: Leo (Teleo collective synthesis)
related_claims: ["[[technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation]]"]
---
# RSP v3.0's extension of evaluation intervals from 3 to 6 months addresses calibration problems (rushed evaluations) while leaving measurement validity problems (benchmark-reality gap) unaddressed, because slowing down invalid metrics produces more careful invalidity, not production-readiness assessment
METR's August 2025 research found that Claude 3.7 Sonnet achieved a 38% automated test-pass rate but 0% production-ready status after holistic review by human experts. The gap was not calibration error but measurement invalidity: automated metrics evaluate a different construct from production-readiness, missing documentation quality, code maintainability, and testing coverage. The average fix time to reach production-ready status was 42 minutes per 'passing' agent PR. This is a structural gap between what automated evaluation measures and what matters for deployment.
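A back-of-the-envelope sketch of what those figures imply (illustrative only: the 38%, 0%, and 42-minute values are the METR figures cited above, while the batch size of 100 and the assumption that the fix time applies uniformly to every passing PR are mine):

```python
# Illustrative arithmetic only; extrapolates the METR figures cited above.
# Assumptions not in the source: a batch of 100 agent PRs, and the 42-minute
# average fix time applying uniformly to every passing PR.
TEST_PASS_RATE = 0.38          # share of agent PRs passing automated tests
PRODUCTION_READY_RATE = 0.0    # share judged production-ready by human experts
AVG_FIX_MINUTES = 42           # average expert fix time per 'passing' PR

batch = 100
passing = batch * TEST_PASS_RATE
ready_without_fixes = passing * PRODUCTION_READY_RATE
expert_hours_to_ship = passing * AVG_FIX_MINUTES / 60

print(f"passing automated tests: {passing:.0f} of {batch}")
print(f"production-ready as-is:  {ready_without_fixes:.0f}")
print(f"expert hours to make the passing PRs shippable: {expert_hours_to_ship:.1f}")
```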
RSP v3.0's response to evaluation quality concerns was to extend evaluation intervals from 3 to 6 months in order to 'avoid lower-quality, rushed elicitation.' This addresses calibration problems (time pressure degrading measurement accuracy) but not validity problems (measuring the wrong thing). The 0% production-ready finding at 38% test-passing would persist at 6-month intervals because it reflects what the metric captures, not how carefully it's applied.
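To make the calibration-versus-validity distinction concrete, here is a toy simulation (a sketch, not METR's methodology or RSP's evaluation procedure; the construct overlap, thresholds, and noise levels are invented for illustration). Reducing measurement noise, which is what a longer, more careful evaluation cycle buys, sharpens the proxy metric, but most 'passing' items still fall short of the true readiness construct because the binding constraint is what the proxy measures, not how carefully it is measured.

```python
import random

# Toy model, not METR's methodology: every number below is an assumption chosen
# to illustrate the calibration-vs-validity distinction, not an empirical estimate.
random.seed(0)

CONSTRUCT_OVERLAP = 0.3   # how much the automated proxy actually tracks readiness
PASS_THRESHOLD = 0.55     # automated score needed to count as 'test-passing'
READY_THRESHOLD = 0.8     # latent readiness needed to count as production-ready

def automated_score(latent_readiness, noise_sd):
    # The proxy mixes a little readiness signal with a lot of unrelated signal
    # (a stand-in for unmeasured documentation, maintainability, and coverage)
    # plus measurement noise that careful elicitation can reduce.
    unrelated = random.gauss(0.5, 0.2)
    noise = random.gauss(0.0, noise_sd)
    return CONSTRUCT_OVERLAP * latent_readiness + (1 - CONSTRUCT_OVERLAP) * unrelated + noise

def evaluate(noise_sd, n=20_000):
    passing = ready_among_passing = 0
    for _ in range(n):
        latent = random.random()  # true production-readiness, uniform on [0, 1]
        if automated_score(latent, noise_sd) >= PASS_THRESHOLD:
            passing += 1
            ready_among_passing += latent >= READY_THRESHOLD
    return passing / n, ready_among_passing / max(passing, 1)

# 'Rushed' and 'careful' evaluation differ only in measurement noise, i.e. calibration.
for label, noise_sd in [("rushed elicitation", 0.25), ("careful elicitation", 0.05)]:
    pass_rate, ready_rate = evaluate(noise_sd)
    print(f"{label}: pass rate {pass_rate:.0%}, production-ready among passing {ready_rate:.0%}")

# Careful elicitation tightens the measurement, yet most passing items still fall
# short of the readiness construct: the gap is bounded by CONSTRUCT_OVERLAP, which
# longer evaluation intervals do nothing to change.
```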
The governance miscalibration is diagnostic: even the leading voluntary AI governance framework (RSP v3.0) responds to evaluation quality problems by optimizing the wrong variable. Longer intervals improve calibration of invalid metrics rather than changing evaluation methodology to capture production-readiness dimensions. The partial exception is RSP v3.0's October 2026 interpretability milestone for alignment assessment 'producing meaningful signal beyond behavioral methods alone': this would address measurement invalidity if achieved, but Anthropic notes only 'moderate confidence' and doesn't frame it as a response to the benchmark-reality gap.
This represents governance miscalibration as a distinct failure mode: when governance actors respond to evaluation quality problems with incomplete information about which constraint is binding, they may systematically optimize against surface symptoms rather than root causes.