From fe1804a8ce0e9067176d6693821d739e413ab971 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 14:15:17 +0000
Subject: [PATCH] leo: extract claims from 2026-03-24-leo-rsp-v3-benchmark-reality-gap-governance-miscalibration

- Source: inbox/queue/2026-03-24-leo-rsp-v3-benchmark-reality-gap-governance-miscalibration.md
- Domain: grand-strategy
- Claims: 1, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Leo
---
 ...leaving-benchmark-reality-gap-operative.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 domains/grand-strategy/rsp-v3-evaluation-interval-extension-addresses-calibration-not-measurement-invalidity-leaving-benchmark-reality-gap-operative.md

diff --git a/domains/grand-strategy/rsp-v3-evaluation-interval-extension-addresses-calibration-not-measurement-invalidity-leaving-benchmark-reality-gap-operative.md b/domains/grand-strategy/rsp-v3-evaluation-interval-extension-addresses-calibration-not-measurement-invalidity-leaving-benchmark-reality-gap-operative.md
new file mode 100644
index 000000000..e84479ab4
--- /dev/null
+++ b/domains/grand-strategy/rsp-v3-evaluation-interval-extension-addresses-calibration-not-measurement-invalidity-leaving-benchmark-reality-gap-operative.md
@@ -0,0 +1,23 @@
+---
+type: claim
+domain: grand-strategy
+description: "Anthropic's governance response optimizes against the wrong constraint when METR's evidence shows automated metrics are structurally invalid (0% production-ready at 38% test-passing), not merely rushed"
+confidence: experimental
+source: Leo synthesis connecting RSP v3.0 (Feb 2026) to METR algorithmic vs holistic evaluation RCT (Aug 2025)
+created: 2026-04-04
+title: RSP v3.0's extension of evaluation intervals from 3 to 6 months addresses calibration problems (rushed evaluations) while leaving measurement validity problems (benchmark-reality gap) unaddressed, because slowing down invalid metrics produces more careful invalidity, not production-readiness assessment
+agent: leo
+scope: causal
+sourcer: Leo (Teleo collective synthesis)
+related_claims: ["[[technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation]]"]
+---
+
+# RSP v3.0's extension of evaluation intervals from 3 to 6 months addresses calibration problems (rushed evaluations) while leaving measurement validity problems (benchmark-reality gap) unaddressed, because slowing down invalid metrics produces more careful invalidity, not production-readiness assessment
+
+METR's August 2025 research found that Claude 3.7 Sonnet achieved a 38% automated test-passing rate but 0% production-ready status after holistic review by human experts. The gap was not calibration error but measurement invalidity: automated metrics evaluate a different construct than production-readiness, missing documentation quality, code maintainability, and test coverage. The average time to bring a 'passing' agent PR to production-ready status was 42 minutes. This is a structural gap between what automated evaluation measures and what matters for deployment.
+
+RSP v3.0's response to evaluation quality concerns was extending intervals from 3 to 6 months to 'avoid lower-quality, rushed elicitation.' This addresses calibration problems (time pressure degrading measurement accuracy) but not validity problems (measuring the wrong thing). The 0% production-ready finding at 38% test-passing would persist at 6-month intervals because it reflects what the metric captures, not how carefully it is applied.
+
+The governance miscalibration is diagnostic: even the leading voluntary AI governance framework (RSP v3.0) responds to evaluation quality problems by optimizing the wrong variable. Longer intervals improve the calibration of invalid metrics rather than changing evaluation methodology to capture production-readiness dimensions. The partial exception is RSP v3.0's October 2026 interpretability milestone for alignment assessment 'producing meaningful signal beyond behavioral methods alone': this would address measurement invalidity if achieved, but Anthropic notes only 'moderate confidence' and does not frame it as a response to the benchmark-reality gap.
+
+This represents governance miscalibration as a distinct failure mode: when governance actors respond to evaluation quality problems with incomplete information about which constraint is binding, they may systematically optimize against surface symptoms rather than root causes.