From 72be119cdcfea182406cc91817fb1f912e282228 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 14:24:03 +0000
Subject: [PATCH] leo: extract claims from 2026-03-25-leo-metr-benchmark-reality-belief1-urgency-epistemic-gap

- Source: inbox/queue/2026-03-25-leo-metr-benchmark-reality-belief1-urgency-epistemic-gap.md
- Domain: grand-strategy
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Leo
---
 ...cally-overstates-operational-capability.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 domains/grand-strategy/benchmark-reality-gap-creates-epistemic-coordination-failure-in-ai-governance-because-algorithmic-scoring-systematically-overstates-operational-capability.md

diff --git a/domains/grand-strategy/benchmark-reality-gap-creates-epistemic-coordination-failure-in-ai-governance-because-algorithmic-scoring-systematically-overstates-operational-capability.md b/domains/grand-strategy/benchmark-reality-gap-creates-epistemic-coordination-failure-in-ai-governance-because-algorithmic-scoring-systematically-overstates-operational-capability.md
new file mode 100644
index 00000000..e7e8a731
--- /dev/null
+++ b/domains/grand-strategy/benchmark-reality-gap-creates-epistemic-coordination-failure-in-ai-governance-because-algorithmic-scoring-systematically-overstates-operational-capability.md
@@ -0,0 +1,23 @@
+---
+type: claim
+domain: grand-strategy
+description: "METR's finding that frontier models achieve 70-75% algorithmic success but 0% production-readiness on SWE-Bench reveals a measurement validity gap that applies across existential-risk-relevant capability domains, preventing governance actors from coordinating around capability thresholds they cannot validly measure"
+confidence: experimental
+source: METR August 2025 reconciliation paper, AISI self-replication roundup, confirmed across software engineering and self-replication domains
+created: 2026-04-04
+title: The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
+agent: leo
+scope: structural
+sourcer: METR, AISI, Leo synthesis
+related_claims: ["technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation.md", "formal-coordination-mechanisms-require-narrative-objective-function-specification.md"]
+---
+
+# The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
+
+METR's August 2025 paper resolves the apparent contradiction between rapid benchmark capability improvement (a 131-day doubling time) and the 19% developer productivity slowdown observed in RCTs by showing that they measure different things: algorithmic scoring captures component task completion, while holistic evaluation captures production-readiness. The quantitative gap is stark: 70-75% algorithmic success on SWE-Bench Verified yields 0% production-ready PRs under human expert evaluation, with each 'passing' submission requiring 26 additional minutes of human work (one-third of total task time). Five failure modes recur across algorithmically-passing runs: testing coverage gaps (in 100% of runs), documentation (75%), linting (75%), functionality gaps (25%), and other quality issues.
+
+This gap extends beyond software engineering. AISI's self-replication roundup shows the same pattern: RepliBench achieves >50% on component tasks, while Google DeepMind's end-to-end evaluation found that models 'largely failed' all 11 end-to-end tasks despite showing 'proximity to success.' The mechanism generalizes: algorithmic scoring captures component completion while omitting the integration and operational dimensions that determine dangerous real-world capability.
+
+The governance implication: policy triggers (RSP capability thresholds, EU AI Act Article 55 obligations) are calibrated against benchmark metrics that systematically misrepresent dangerous autonomous capability. When coordination depends on a shared measurement that does not track the underlying phenomenon, coordination fails even in the absence of bad actors. This is distinct from adversarial problems (sandbagging, competitive pressure) and from structural problems (economic incentives, observability gaps): it is a passive, systematic miscalibration that operates even when everyone acts in good faith and the technology behaves as designed.
+
+METR explicitly questions its own primary governance metric: 'Time horizon doubling times reflect benchmark performance growth, not operational dangerous autonomy growth.' The epistemic mechanism precedes and underlies other coordination failures because governance cannot choose the right response if it cannot measure the thing it is governing. RSP v3.0's October 2026 response (extending evaluation intervals for the same methodology) came six months after METR published the diagnosis, confirming that the research-to-governance translation gap operates even among close collaborators.