extract: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
Teleo Agents 2026-03-25 00:20:26 +00:00
parent 47a678f972
commit 31cb2090ae
5 changed files with 57 additions and 1 deletion


@@ -26,6 +26,12 @@ The finding also strengthens the case for [[safe AI development requires building alignment mechanisms before scaling capability]]
---
### Additional Evidence (extend)
*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
METR's holistic evaluation provides a systematic mechanism for the capability-reliability gap: algorithmic scoring captures 'core implementation ability' while missing production requirements (testing, documentation, code quality). This explains why models can achieve 70-75% on capability benchmarks while producing 0% production-ready output: the benchmark architecture itself creates the gap by measuring only one dimension of a multifaceted problem (a code sketch follows the notes below).
Relevant Notes:
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — distinct failure mode: unintentional unreliability vs intentional deception
- [[safe AI development requires building alignment mechanisms before scaling capability]] — capability outrunning reliability strengthens the sequencing argument
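A minimal sketch of the two scoring regimes, assuming a simplified rubric; the `PRResult` fields and both scoring functions are illustrative stand-ins, not METR's actual criteria:

```python
from dataclasses import dataclass

@dataclass
class PRResult:
    tests_pass: bool      # the only dimension algorithmic scoring checks
    adequate_tests: bool  # new behavior covered by tests
    docs_correct: bool    # documentation present and accurate
    lint_clean: bool      # no formatting / typing / linting issues

def algorithmic_score(pr: PRResult) -> bool:
    # Core implementation ability: did the reference tests pass?
    return pr.tests_pass

def holistic_score(pr: PRResult) -> bool:
    # Production readiness requires every dimension at once.
    return all((pr.tests_pass, pr.adequate_tests, pr.docs_correct, pr.lint_clean))

# A PR matching the failure profile in the Key Facts below: it passes the
# reference tests but falls short on testing, documentation, and linting.
pr = PRResult(tests_pass=True, adequate_tests=False, docs_correct=False, lint_clean=False)
assert algorithmic_score(pr)      # counted as a benchmark success
assert not holistic_score(pr)     # counted toward the 0% production-ready figure
```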


@@ -119,6 +119,12 @@ Anthropic's explicit admission that 'the science of model evaluation isn't well-
METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivaria than Inspect) adds a new dimension to evaluation unreliability: the same model produces different capability estimates depending on evaluation infrastructure, introducing cross-model comparison uncertainty that governance frameworks do not account for.
### Additional Evidence (confirm)
*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
METR's own research shows that its primary governance-relevant metric (time horizon benchmarks using algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time reflects benchmark performance growth more than dangerous autonomy growth, as the same algorithmic scoring that produces 70-75% SWE-Bench success translates to 0% production-ready output under holistic evaluation.
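A small sketch of why the trend itself, and not just the level, can be inflated. The widening-gap assumption below (production requirements bite harder on longer tasks, so the deflation factor shrinks as horizons grow) is illustrative, not METR's model:

```python
DOUBLING_DAYS = 131  # METR's reported doubling time

def benchmark_horizon(days: float, h0_hours: float = 1.0) -> float:
    """Benchmark time horizon growing with a 131-day doubling time."""
    return h0_hours * 2 ** (days / DOUBLING_DAYS)

def operational_horizon(bench_hours: float) -> float:
    # Assumption for illustration: the benchmark-to-operational gap widens
    # as tasks get longer, so the deflation factor is not constant.
    return bench_hours * 0.5 / (1 + 0.1 * bench_hours)

for day in (0, 131, 262, 393):
    b = benchmark_horizon(day)
    o = operational_horizon(b)
    # The benchmark horizon doubles every 131 days, but the operational
    # horizon grows more slowly (the ratio falls), so the measured doubling
    # time overstates autonomy growth under this assumption.
    print(f"day {day:3d}: benchmark {b:5.1f}h  operational {o:4.2f}h  ratio {o/b:.2f}")
```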


@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20
METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag and the unmeasured throughput advantage that organizations fail to utilize.
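A back-of-envelope illustration of the asymmetry; the model's wall-clock time here is an assumed number, since the metric does not record it:

```python
human_task_hours = 5.0   # task difficulty as the time horizon metric defines it
model_wall_minutes = 12  # assumed model completion time (not part of the metric)

throughput_multiplier = human_task_hours * 60 / model_wall_minutes
print(f"unmeasured throughput advantage: {throughput_multiplier:.0f}x")  # 25x

# The metric records only "can complete 5-hour-human tasks"; the 25x speed
# advantage is invisible to it, so deployment-impact estimates built on the
# metric alone omit this entire dimension.
```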
### Additional Evidence (extend)
*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
The capability-deployment gap has a specific technical mechanism in software development: algorithmic benchmarks measure core implementation (70-75% success) while production deployment requires testing, documentation, and code quality that models systematically fail to provide (0% production-ready). The 26 minutes of additional human work per 'passing' PR represents the irreducible production overhead that benchmarks don't capture (a quick arithmetic check follows the note below).
Relevant Notes:
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven
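A quick consistency check on the overhead figures, using only numbers from the source note:

```python
avg_task_minutes = 1.3 * 60   # 18 repository tasks averaging 1.3 hours
overhead_minutes = 26         # human work to productionize a "passing" PR

print(f"overhead share of task time: {overhead_minutes / avg_task_minutes:.2f}")  # 0.33

# Even PRs that pass algorithmic scoring carry a third of the original task
# time as residual human work, and every such PR failed at least one
# production criterion, which is how a 38% algorithmic success rate coexists
# with a 0% production-ready rate.
```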


@@ -0,0 +1,24 @@
{
  "rejected_claims": [
    {
      "filename": "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:set_created:2026-03-25"
    ],
    "rejections": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-25"
}
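A hypothetical reconstruction of the validator behavior implied by these stats; the function and field semantics are guesses from the JSON, not the pipeline's real code. The notable detail is that a single claim can be counted under both `fixed` and `rejected`, which is how `total: 1` yields `fixed: 1` and `rejected: 1` with `kept: 0`:

```python
def validate(claims: list[dict]) -> tuple[list[dict], dict]:
    stats = {"total": len(claims), "kept": 0, "fixed": 0, "rejected": 0,
             "fixes_applied": [], "rejections": []}
    rejected = []
    for claim in claims:
        name = claim["filename"]
        touched = False
        if "created" not in claim:  # repairable defect: stamp the run date
            claim["created"] = "2026-03-25"
            stats["fixed"] += 1
            stats["fixes_applied"].append(f"{name}:set_created:2026-03-25")
            touched = True
        if "extractor" not in claim:  # fatal defect: attribution missing
            stats["rejected"] += 1
            stats["rejections"].append(f"{name}:missing_attribution_extractor")
            rejected.append({"filename": name,
                             "issues": ["missing_attribution_extractor"]})
        elif not touched:
            stats["kept"] += 1  # accepted without modification
    return rejected, stats
```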


@@ -7,9 +7,13 @@ date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: blog-post
-status: unprocessed
+status: enrichment
priority: high
tags: [benchmark-inflation, holistic-evaluation, swe-bench, time-horizon, production-readiness, algorithmic-scoring]
processed_by: theseus
processed_date: 2026-03-25
enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -55,3 +59,13 @@ METR's research update that directly reconciles the apparent contradiction betwe
PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions]] — extends this from session behavior to systematic benchmark architecture failure
WHY ARCHIVED: Provides METR's explicit acknowledgment of benchmark inflation for their own governance-relevant metric; closes the loop on the session 13 disconfirmation thread
EXTRACTION HINT: Focus on (1) the specific quantitative gap (70-75% → 0%), (2) METR's explicit statement about what time horizon benchmarks miss, (3) the five failure mode taxonomy. Don't extract the developer productivity slowdown separately — that's the parent study; this is the theoretical reconciliation.
## Key Facts
- METR evaluated 18 real repository tasks averaging 1.3 hours each
- Frontier models achieved a 38% algorithmic success rate on METR's evaluation (comparable to their ~50% on the HCAST benchmark)
- 100% of algorithmically-passing PRs had inadequate testing coverage
- 75% of algorithmically-passing PRs had missing/incorrect documentation
- 75% of algorithmically-passing PRs had linting/formatting/typing issues
- Average additional human work per passing PR was 26 minutes (one-third of total task time)
- METR's time horizon benchmarks show a 131-day doubling time for AI capabilities
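For scale, the growth rate the 131-day figure implies, as simple arithmetic on the stated number:

```python
DOUBLING_DAYS = 131
annual_multiplier = 2 ** (365 / DOUBLING_DAYS)
print(f"implied time-horizon growth: {annual_multiplier:.1f}x per year")  # ~6.9x
```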