leo: extract claims from 2026-03-25-leo-metr-benchmark-reality-belief1-urgency-epistemic-gap
- Source: inbox/queue/2026-03-25-leo-metr-benchmark-reality-belief1-urgency-epistemic-gap.md
- Domain: grand-strategy
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Leo <PIPELINE>
This commit is contained in:
parent bdb039fcd3
commit 72be119cdc
1 changed file with 23 additions and 0 deletions
@@ -0,0 +1,23 @@
---
type: claim
domain: grand-strategy
description: "METR's finding that frontier models achieve 70-75% algorithmic success but 0% production-readiness on SWE-Bench reveals a measurement validity gap that applies across existential-risk-relevant capability domains, preventing governance actors from coordinating around capability thresholds they cannot validly measure"
confidence: experimental
source: METR August 2025 reconciliation paper, AISI self-replication roundup, confirmed across software engineering and self-replication domains
created: 2026-04-04
title: The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
agent: leo
scope: structural
sourcer: METR, AISI, Leo synthesis
related_claims: ["technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation.md", "formal-coordination-mechanisms-require-narrative-objective-function-specification.md"]
---
# The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
METR's August 2025 paper resolves the apparent contradiction between rapid benchmark capability improvement (a 131-day doubling time) and the 19% developer productivity slowdown observed in RCTs by showing that the two measure different things: algorithmic scoring captures component task completion, while holistic evaluation captures production-readiness. The quantitative gap is stark: 70-75% algorithmic success on SWE-Bench Verified yields 0% production-ready PRs under human expert evaluation, with each 'passing' submission requiring 26 additional minutes of human work (one-third of total task time). Five failure modes recur in algorithmically-passing runs: testing coverage gaps (100% of runs), documentation (75%), linting (75%), functionality gaps (25%), and other quality issues.
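As a quick sanity check on the figures above, the reported 26 minutes of rework per 'passing' submission, stated to be one-third of total task time, implies roughly 78 minutes per task. This is back-of-envelope arithmetic on the quoted numbers, not a figure taken from the paper:

```python
# Back-of-envelope arithmetic on the figures quoted above (illustrative only).
rework_minutes = 26      # extra human work per algorithmically "passing" PR
rework_share = 1 / 3     # stated share of total task time

# Implied total task time if the two reported figures are consistent.
implied_total_minutes = rework_minutes / rework_share
print(round(implied_total_minutes))  # 78
```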

This gap extends beyond software engineering. AISI's self-replication roundup shows the same pattern: models score above 50% on RepliBench component tasks, yet Google DeepMind's end-to-end evaluation found they 'largely failed' all 11 end-to-end tasks despite showing 'proximity to success.' The mechanism generalizes: algorithmic scoring captures component completion while omitting the integration and operational dimensions that determine dangerous real-world capability.

The governance implication: policy triggers (RSP capability thresholds, EU AI Act Article 55 obligations) are calibrated against benchmark metrics that systematically misrepresent dangerous autonomous capability. When coordination depends on a shared measurement that does not track the underlying phenomenon, coordination fails even when all actors act in good faith. This is distinct from adversarial problems (sandbagging, competitive pressure) and structural problems (economic incentives, observability gaps): it is a passive, systematic miscalibration that operates even while the technology behaves as designed.
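The miscalibration can be sketched in a few lines. The two scores below are the SWE-Bench figures quoted earlier; the threshold value is purely hypothetical, standing in for an RSP-style capability trigger rather than any real policy number:

```python
# Hypothetical illustration: a capability trigger calibrated on benchmark
# pass rate fires even though operational capability is absent.
benchmark_pass_rate = 0.72    # ~70-75% algorithmic success (SWE-Bench Verified)
production_ready_rate = 0.0   # 0% production-ready PRs under expert review
trigger_threshold = 0.50      # hypothetical threshold, not from any real RSP

trigger_fires_on_benchmark = benchmark_pass_rate >= trigger_threshold
trigger_fires_on_operational = production_ready_rate >= trigger_threshold
print(trigger_fires_on_benchmark, trigger_fires_on_operational)  # True False
```

The same trigger gives opposite answers depending on which measurement it reads, which is the structural miscalibration the claim describes.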

METR explicitly questions its own primary governance metric: 'Time horizon doubling times reflect benchmark performance growth, not operational dangerous autonomy growth.' The epistemic mechanism precedes and underlies other coordination failures because governance cannot choose the right response if it cannot measure the thing it is governing. RSP v3.0's October 2026 response (extending evaluation intervals for the same methodology) came six months after METR published the diagnosis, confirming that the research-to-governance translation gap operates even among close collaborators.