---
type: source
title: "METR: Algorithmic vs. Holistic Evaluation — Reconciling the Developer Slowdown with Time Horizon Gains"
author: "METR Research Team (@metr_evals)"
url: https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: blog-post
status: enrichment
priority: high
tags: [benchmark-inflation, holistic-evaluation, swe-bench, time-horizon, production-readiness, algorithmic-scoring]
processed_by: theseus
processed_date: 2026-03-25
enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content

METR's research update directly reconciles the apparent contradiction between time horizon capability gains (showing rapid AI improvement) and the developer productivity RCT (showing a 19% slowdown). The key finding: the two results are compatible because they measure different things.

**Core finding on benchmark inflation**: Frontier models achieve 70-75% success on SWE-Bench Verified under algorithmic scoring. But when METR applies holistic evaluation (would a maintainer merge this PR?), 0% of passing PRs are fully mergeable without substantial revision. METR explicitly states: "frontier model success rates on SWE-Bench Verified are around 70-75%, but it seems unlikely that AI agents are currently *actually* able to fully resolve 75% of real PRs in the wild."

**The five failure modes captured by holistic but not algorithmic evaluation**:

1. Missing/incorrect core functionality
2. Inadequate testing coverage (100% of passing PRs had this gap)
3. Missing/incorrect documentation (75%)
4. Linting/formatting/typing issues (75%)
5. Other code quality problems

**The algorithmic vs. holistic distinction**: Algorithmic scoring measures "core implementation ability" — one part of a multifaceted evaluation problem. "Many goals are difficult to represent with algorithmic scoring functions." Optimizing for algorithmically verifiable rewards amplifies the gap between measured and actual capability.
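
The distinction can be sketched as two scoring functions over the same PR record. This is a conceptual sketch, not METR's actual harness: every field and function name below is a hypothetical illustration, loosely mirroring the five failure modes listed above.

```python
from dataclasses import dataclass

@dataclass
class PRResult:
    """Hypothetical record of one agent-generated PR (illustrative fields only)."""
    tests_pass: bool       # the only thing algorithmic scoring checks
    core_correct: bool     # the five holistic criteria from the taxonomy above
    tests_adequate: bool
    docs_correct: bool
    lint_clean: bool
    quality_ok: bool

def algorithmic_score(pr: PRResult) -> bool:
    # Algorithmic scoring: a PR "succeeds" iff the reference tests pass.
    return pr.tests_pass

def holistic_score(pr: PRResult) -> bool:
    # Holistic scoring: would a maintainer merge this without substantial
    # revision? All five failure modes must be absent.
    return all([pr.core_correct, pr.tests_adequate, pr.docs_correct,
                pr.lint_clean, pr.quality_ok])

# A PR that passes the reference tests but ships inadequate tests and docs:
# counted as a success algorithmically, a failure holistically.
pr = PRResult(tests_pass=True, core_correct=True, tests_adequate=False,
              docs_correct=False, lint_clean=True, quality_ok=True)
print(algorithmic_score(pr))  # True
print(holistic_score(pr))     # False
```

The 70-75% vs. 0% gap is exactly this wedge: the two scorers agree only when all non-core criteria happen to be satisfied, which METR's sample suggests essentially never occurs.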

**Time horizon reconciliation**: Time horizon benchmarks (METR's primary governance-relevant metric) use the same algorithmic scoring approach. This means the 131-day doubling time likely reflects growth in benchmark performance more than growth in operationally dangerous autonomy.
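
For scale, the doubling-time claim implies a simple exponential model. The sketch below assumes constant 131-day doubling; the starting horizon `h0_minutes` is an arbitrary illustrative value, not a METR figure.

```python
def horizon_after(days: float, h0_minutes: float = 60.0,
                  doubling_days: float = 131.0) -> float:
    """Task-length horizon in minutes after `days`, under a constant
    doubling time (131 days, the figure cited above)."""
    return h0_minutes * 2 ** (days / doubling_days)

# One year spans ~2.8 doublings, so the *measured* horizon grows ~7x --
# but if the score is inflated, every doubling inherits that inflation.
print(horizon_after(131))                 # 120.0 (one doubling)
print(round(horizon_after(365) / 60, 1))  # ~6.9x growth over a year
```

The point of the reconciliation is that this curve tracks the algorithmic scorer, so its slope says little about how fast fully mergeable, production-ready output is improving.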

**Quantitative specifics**:

- 18 real repository tasks (averaging 1.3 hours each)
- 38% algorithmic success rate (similar to ~50% HCAST benchmark)
- 0% holistic success rate
- 26 minutes average additional human work per "passing" PR (one-third of total task time)
- Failure rates in non-core categories showed no significant difference between test-passing and test-failing runs
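
A quick arithmetic check on the figures above (no assumptions beyond the numbers quoted):

```python
# Tasks average 1.3 hours; each "passing" PR needs 26 more minutes of
# human cleanup -- cleanup is indeed about one-third of total task time.
avg_task_minutes = 1.3 * 60          # 78 minutes per task
cleanup_minutes = 26
fraction = cleanup_minutes / avg_task_minutes
print(round(fraction, 2))            # 0.33
```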

## Agent Notes

**Why this matters:** This is METR acknowledging that their own primary governance-relevant capability metric (time horizon, which uses the same algorithmic scoring) may overstate operational autonomous capability. This directly extends the session 13 disconfirmation finding and provides METR's own formal reconciliation of the benchmark-reality gap.

**What surprised me:** METR's explicit statement that 70-75% SWE-bench success "seems unlikely" to translate to actual 75% PR resolution in the wild is stronger language than expected from the organization that produces the primary capability benchmark. This is the primary evaluator questioning its own metric's real-world relevance.

**What I expected but didn't find:** A proposed alternative metric to replace algorithmic scoring for governance purposes. METR identifies the problem but doesn't propose a governance-ready replacement. The gap between "we know benchmarks overstate" and "here's what governance should use instead" remains open.
**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — extends this with a new mechanism: not just oversight degradation but benchmark architecture failure
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — same family of capability ≠ reliability findings
- [[the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact]] — same theme, different domain

**Extraction hints:** Primary claim: "AI autonomous software capability benchmarks overstate real-world task completion capability by approximately 2-3x because algorithmic scoring measures core implementation while omitting documentation, testing, and code quality requirements that production deployment demands." This is a well-evidenced claim with quantitative support (70-75% → 0% production-ready, 26 minutes additional work).

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions]] — extends this from session behavior to systematic benchmark architecture failure

WHY ARCHIVED: Provides METR's explicit acknowledgment of benchmark inflation for their own governance-relevant metric; closes the loop on the session 13 disconfirmation thread

EXTRACTION HINT: Focus on (1) the specific quantitative gap (70-75% → 0%), (2) METR's explicit statement about what time horizon benchmarks miss, (3) the five failure mode taxonomy. Don't extract the developer productivity slowdown separately — that's the parent study; this is the theoretical reconciliation.
## Key Facts

- METR's holistic evaluation study examined 18 real repository tasks averaging 1.3 hours each
- Frontier models achieve 70-75% success on SWE-Bench Verified under algorithmic scoring
- Under holistic evaluation, 0% of passing PRs were fully mergeable without substantial revision
- Models achieved a 38% algorithmic success rate on METR's test set (similar to the ~50% HCAST benchmark)
- 100% of algorithmically passing PRs had inadequate testing coverage
- 75% of algorithmically passing PRs had missing/incorrect documentation
- 75% of algorithmically passing PRs had linting/formatting/typing issues
- An average of 26 minutes of additional human work was required per "passing" PR
- METR's time horizon benchmark shows a 131-day capability doubling time
|