Compare commits

...

1 commit

Author SHA1 Message Date
Teleo Agents
b2a772aa3b extract: 2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-24 19:15:45 +00:00
5 changed files with 68 additions and 1 deletion


@@ -32,6 +32,12 @@ Catalini et al. provide the full economic framework for why verification bandwidth
---
### Additional Evidence (confirm)
*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
The 42-minute average fix time per agent PR (roughly half of the original 1.3-hour human task time) quantifies the verification burden. Even when AI completes tasks algorithmically, human verification and correction remain necessary and substantial, supporting the claim that verification bandwidth is the binding constraint.
Relevant Notes:
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — Catalini provides the economic mechanism for why oversight degrades
- [[coordination failures arise from individually rational strategies that produce collectively irrational outcomes because the Nash equilibrium of non-cooperation dominates when trust and enforcement are absent]] — the Codifier's Curse is a coordination failure
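A minimal arithmetic sketch of that burden, using only the two figures quoted above (the snippet itself is illustrative and not part of the extraction):

```python
# Illustrative arithmetic only, based on the figures in this note:
# 42-minute average fix time per agent PR, 1.3-hour original human task time.
human_task_minutes = 1.3 * 60   # ~78 minutes
agent_fix_minutes = 42

verification_share = agent_fix_minutes / human_task_minutes
print(f"Human fix time is ~{verification_share:.0%} of the original task time")
# -> roughly half the original task time is still spent by a human
#    verifying and correcting the agent's output.
```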


@@ -119,6 +119,12 @@ Anthropic's explicit admission that 'the science of model evaluation isn't well-
METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivaria than Inspect) adds a new dimension to evaluation unreliability: the same model produces different capability estimates depending on evaluation infrastructure, introducing cross-model comparison uncertainty that governance frameworks do not account for.
### Additional Evidence (confirm)
*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
METR's finding that algorithmic scoring (the standard pre-deployment evaluation method) showed a 38% success rate while holistic evaluation found 0% of submissions production-ready demonstrates that pre-deployment evaluations systematically overestimate real-world capability. METR explicitly acknowledges that this gap creates problems for its own time-horizon capability estimates.
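A minimal sketch of the scoring gap described above; the criteria names and data structure are illustrative stand-ins for METR's actual holistic rubric, populated with the deficiency rates reported in this extraction:

```python
# Illustrative sketch only: algorithmic scoring accepts a run when its
# automated tests pass, while a holistic review also requires
# production-readiness criteria. The criteria below mirror the deficiency
# categories reported in this extraction; they are not METR's actual rubric.
run = {
    "tests_pass": True,                 # 38% of runs cleared this bar
    "adequate_test_coverage": False,    # 100% of passing runs fell short here
    "documentation_complete": False,    # 75% had documentation gaps
    "lint_clean": False,                # 75% had linting/formatting problems
    "functionality_complete": True,     # 25% had residual functionality gaps
}

algorithmic_pass = run["tests_pass"]
holistic_pass = all(run.values())

print("algorithmic:", "pass" if algorithmic_pass else "fail")  # pass
print("holistic:   ", "pass" if holistic_pass else "fail")     # fail
```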


@@ -40,6 +40,12 @@ The International AI Safety Report 2026 (multi-government committee, February 20
METR's time horizon metric measures task difficulty by human completion time, not model processing time. A model with a 5-hour time horizon completes tasks that take humans 5 hours, but may finish them in minutes. This speed asymmetry is not captured in the metric itself, meaning the gap between theoretical capability (task completion) and deployment impact includes both adoption lag AND the unmeasured throughput advantage that organizations fail to utilize.
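A small illustration of that asymmetry; the 5-hour horizon comes from the example above, while the model completion time is a placeholder assumption rather than a measured value:

```python
# Illustration of the speed asymmetry: the time horizon metric is defined by
# human completion time, so a task counts as "5 hours" even if the model
# finishes it far faster. The 15-minute model time is a placeholder.
human_hours = 5.0
model_hours = 0.25

throughput_multiple = human_hours / model_hours
print(f"Unmeasured throughput advantage: ~{throughput_multiple:.0f}x")  # ~20x
```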
### Additional Evidence (confirm)
*Source: [[2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct]] | Added: 2026-03-24*
The 19% productivity slowdown among experienced developers using AI tools is direct empirical evidence that adoption does not translate to productivity gains even when the tools are used. The RCT design eliminates selection effects, and the experienced-developer population controls for skill level.
Relevant Notes:
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability exists but deployment is uneven
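For concreteness, the throughput implication of the 19% figure, as illustrative arithmetic only:

```python
# Illustrative arithmetic only: a 19% increase in task completion time
# implies a corresponding drop in throughput (tasks finished per unit time).
slowdown = 0.19
relative_time = 1 + slowdown        # tasks take 1.19x as long with AI tools
relative_throughput = 1 / relative_time
print(f"Relative throughput with AI assistance: {relative_throughput:.0%}")  # ~84%
```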


@@ -0,0 +1,33 @@
{
  "rejected_claims": [
    {
      "filename": "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 3,
    "rejected": 2,
    "fixes_applied": [
      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:set_created:2026-03-24",
      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:stripped_wiki_link:verification degrades faster than capability grows",
      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md:set_created:2026-03-24"
    ],
    "rejections": [
      "benchmark-based-capability-metrics-overstate-real-world-autonomous-performance-by-systematically-missing-production-requirements.md:missing_attribution_extractor",
      "ai-tools-reduced-experienced-developer-productivity-19-percent-in-rct-despite-predicted-speedups.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-24"
}
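A hypothetical consumer of a record shaped like the one above; the filename "validation.json" and this usage are assumptions, since the pipeline that writes the record is not described here:

```python
import json

# Hypothetical sketch: load a validation record with the structure shown
# above and summarize which extracted claims were rejected and why.
with open("validation.json") as f:
    record = json.load(f)

for claim in record["rejected_claims"]:
    print(f"REJECTED {claim['filename']}: {', '.join(claim['issues'])}")

stats = record["validation_stats"]
print(f"kept={stats['kept']} fixed={stats['fixed']} "
      f"rejected={stats['rejected']} of {stats['total']} total")
```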


@@ -7,9 +7,13 @@ date: 2025-08-12
domain: ai-alignment
secondary_domains: []
format: research-report
status: unprocessed
status: enrichment
priority: high
tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
processed_by: theseus
processed_date: 2026-03-24
enrichments_applied: ["the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -68,3 +72,15 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
## Key Facts
- Claude 3.7 Sonnet achieved a 38% success rate on automated test scoring in METR's evaluation
- 0% of Claude 3.7 Sonnet's passing submissions were production-ready after human expert review
- 100% of passing-test runs had testing coverage deficiencies
- 75% of passing-test runs had documentation gaps
- 75% of passing-test runs had linting/formatting problems
- 25% of passing-test runs had residual functionality gaps
- Average fix time for agent PRs was 42 minutes
- Original human task time averaged 1.3 hours
- Experienced developers took 19% longer on tasks with AI assistance in METR's RCT