extract: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
parent 4e54ad4fd2
commit 723b345c86
5 changed files with 59 additions and 1 deletion
@@ -26,6 +26,12 @@ The finding also strengthens the case for [[safe AI development requires building alignment mechanisms before scaling capability]]

---

### Additional Evidence (extend)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

METR's holistic evaluation provides a systematic taxonomy of the capability-reliability gap: models can pass core implementation tests (capability) while simultaneously failing on documentation, testing coverage, and code quality (reliability). The 0% production-ready rate despite 38% algorithmic success quantifies this gap as a 2-3x overstatement. This extends the capability≠reliability finding from session-level observations to systematic benchmark architecture failure.
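The two scoring regimes described above can be sketched in a few lines (a minimal illustration, not METR's actual harness; the field names and the sample PR profile are hypothetical):

```python
# Minimal sketch of the two scoring regimes (not METR's actual harness;
# the field names and sample PR profile below are hypothetical).

def algorithmic_pass(pr: dict) -> bool:
    # Algorithmic scoring: only the core implementation tests matter.
    return pr["core_tests_pass"]

def holistic_pass(pr: dict) -> bool:
    # Holistic scoring: also requires the reliability dimensions METR
    # found lacking (testing coverage, documentation, code quality).
    return (pr["core_tests_pass"]
            and pr["adequate_tests"]
            and pr["docs_correct"]
            and pr["lint_clean"])

# A PR matching the modal failure profile: core tests pass, every
# reliability dimension falls short.
pr = {"core_tests_pass": True, "adequate_tests": False,
      "docs_correct": False, "lint_clean": False}

print(algorithmic_pass(pr))  # True  -> counts toward the 38% figure
print(holistic_pass(pr))     # False -> counts toward the 0% figure
```

The point of the sketch is that the gap is architectural: any scorer that conjoins reliability predicates onto a capability predicate can only lower the pass rate, never raise it.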

Relevant Notes:

- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — distinct failure mode: unintentional unreliability vs intentional deception
- [[safe AI development requires building alignment mechanisms before scaling capability]] — capability outrunning reliability strengthens the sequencing argument

@@ -124,6 +124,12 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar

METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality. This contrasts sharply with pre-deployment evaluations that lack real-world task context.

### Additional Evidence (confirm)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

METR's acknowledgment that their own time horizon benchmarks (the primary governance-relevant capability metric) use algorithmic scoring that overstates operational capability by 2-3x is direct evidence, from the primary evaluator itself, that pre-deployment evaluations systematically overstate real-world risk. The 131-day doubling time reflects benchmark performance growth more than dangerous autonomy growth.
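The exponential model implied by a 131-day doubling time can be made concrete (a sketch; the 1-hour starting horizon and the 2.5x deflator, the midpoint of the note's 2-3x inflation range, are illustrative assumptions):

```python
# Sketch of the growth implied by a 131-day doubling time. The 1-hour
# starting horizon and the 2.5x deflator (midpoint of the 2-3x
# inflation range) are illustrative assumptions, not measured values.
DOUBLING_DAYS = 131

def time_horizon(h0_hours: float, days_elapsed: float) -> float:
    # Exponential growth: the horizon doubles every DOUBLING_DAYS days.
    return h0_hours * 2 ** (days_elapsed / DOUBLING_DAYS)

benchmark_horizon = time_horizon(1.0, 365)     # ~6.9 hours after a year
operational_horizon = benchmark_horizon / 2.5  # ~2.8 hours once deflated
```

Note that a constant deflator shifts the level of the curve without changing its doubling time, which is why the 131-day figure can be accurate as a growth rate while still overstating the operational autonomy level at any given date.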

@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20

METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, so the gap between theoretical capability (task completion) and deployment impact includes both adoption lag and the unmeasured throughput advantage that organizations fail to utilize.
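The asymmetry is easy to state numerically. Only the 5-hour human figure comes from the note; the 10-minute model wall-clock time below is a hypothetical illustration:

```python
# The time horizon metric records difficulty in human hours; model
# wall-clock time is invisible to it. The 10-minute figure below is a
# hypothetical illustration, not a measured value.
human_task_hours = 5.0       # difficulty as the metric measures it
model_wall_clock_min = 10.0  # hypothetical model completion time

# Throughput advantage that the metric never captures:
speedup = human_task_hours * 60 / model_wall_clock_min
print(speedup)  # 30.0 -> a 30x unmeasured throughput multiplier
```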

### Additional Evidence (extend)

*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*

The gap is not just adoption lag but also systematic benchmark overstatement. METR shows that even when AI 'capability' appears deployment-ready (70-75% success), actual production-readiness is 0% because documentation, testing, and code quality requirements go unmet. This adds a second mechanism beyond adoption lag: the capability itself is overstated by 2-3x when measured algorithmically.

Relevant Notes:

- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven

@@ -0,0 +1,24 @@
{
  "rejected_claims": [
    {
      "filename": "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:set_created:2026-03-25"
    ],
    "rejections": [
      "ai-autonomous-software-capability-benchmarks-overstate-real-world-task-completion-by-2-3x-because-algorithmic-scoring-measures-core-implementation-while-omitting-production-requirements.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-25"
}

@@ -7,9 +7,13 @@ date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: blog-post
status: unprocessed
status: enrichment
priority: high
tags: [benchmark-inflation, holistic-evaluation, swe-bench, time-horizon, production-readiness, algorithmic-scoring]
processed_by: theseus
processed_date: 2026-03-25
enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@@ -55,3 +59,15 @@ METR's research update that directly reconciles the apparent contradiction betwe

PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions]] — extends this from session behavior to systematic benchmark architecture failure

WHY ARCHIVED: Provides METR's explicit acknowledgment of benchmark inflation for their own governance-relevant metric; closes the loop on the session 13 disconfirmation thread

EXTRACTION HINT: Focus on (1) the specific quantitative gap (70-75% → 0%), (2) METR's explicit statement about what time horizon benchmarks miss, (3) the five failure mode taxonomy. Don't extract the developer productivity slowdown separately — that's the parent study; this is the theoretical reconciliation.

## Key Facts

- METR evaluated 18 real repository tasks averaging 1.3 hours each
- Frontier models achieved 38% algorithmic success on METR's real repository tasks
- 0% of algorithmically-passing PRs were production-ready under holistic evaluation
- 26 minutes of additional human work required on average per 'passing' PR (one-third of total task time)
- 100% of passing PRs had inadequate testing coverage
- 75% of passing PRs had missing or incorrect documentation
- 75% of passing PRs had linting/formatting/typing issues
- METR's time horizon benchmarks show a 131-day doubling time under algorithmic scoring
- Frontier models achieve 70-75% success on SWE-Bench Verified under algorithmic scoring
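The figures above hang together arithmetically; a quick cross-check (all numbers are taken from the list, and the implied holistic rate is an inference from the stated 2-3x inflation, not a reported value):

```python
# Cross-check the key facts against each other. All inputs come from
# the list above; the implied range at the end is an inference from
# the stated 2-3x inflation, not a reported measurement.
avg_task_minutes = 1.3 * 60   # 78 minutes per task
extra_human_minutes = 26      # rework per "passing" PR

# "one-third of total task time"
assert abs(extra_human_minutes / avg_task_minutes - 1 / 3) < 1e-9

# Deflating 70-75% SWE-Bench Verified success by the 2-3x inflation
# factor implies a holistically-adjusted rate of roughly 23-38%.
implied_low, implied_high = 0.70 / 3, 0.75 / 2
print(round(implied_low, 2), round(implied_high, 2))  # 0.23 0.38
```

The upper end of the implied range (38%) matches the algorithmic success rate observed on the harder real repository tasks, which is consistent with the 2-3x framing rather than a derivation of it.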