From 14e1670e134af646f5ee93f6c35f88d988250ad1 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Thu, 26 Mar 2026 00:48:54 +0000
Subject: [PATCH] auto-fix: strip 3 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
---
 ...titutional-governance-built-on-unreliable-foundations.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index d2a7f5a0..77a6cd94 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -125,17 +125,17 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar
 METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality. This contrasts sharply with pre-deployment evaluations that lack real-world task context.
 
 ### Additional Evidence (confirm)
-*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
+*Source: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation | Added: 2026-03-25*
 
 METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges their own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, as the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.
 
 ### Additional Evidence (confirm)
-*Source: [[2026-03-26-aisle-openssl-zero-days]] | Added: 2026-03-26*
+*Source: 2026-03-26-aisle-openssl-zero-days | Added: 2026-03-26*
 
 METR's January 2026 evaluation of GPT-5 placed its autonomous replication and adaptation capability at 2h17m (50% time horizon), far below catastrophic risk thresholds. In the same month, AISLE (an AI system) autonomously discovered 12 OpenSSL CVEs including a 30-year-old bug through fully autonomous operation. This is direct evidence that formal pre-deployment evaluations are not capturing operational dangerous autonomy that is already deployed at commercial scale.
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-metr-algorithmic-vs-holistic-evaluation]] | Added: 2026-03-26*
+*Source: 2026-03-26-metr-algorithmic-vs-holistic-evaluation | Added: 2026-03-26*
 
 METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.
 
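
The auto-fixer itself is not part of this patch. A minimal sketch of how such a link-stripping pass might work, assuming claims are stored as `.md` files resolved by filename; the `strip_broken_wikilinks` helper and the `claims_dir` layout are illustrative assumptions, not the pipeline's actual code:

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")  # matches [[target]] wiki links

def strip_broken_wikilinks(text: str, claims_dir: Path) -> str:
    """Replace [[target]] with plain 'target' when no claim file exists for it."""
    def resolve(match: re.Match) -> str:
        target = match.group(1)
        # A link counts as broken if no claim file with that name exists.
        if (claims_dir / f"{target}.md").exists():
            return match.group(0)  # keep resolvable links untouched
        return target              # strip brackets from broken links
    return WIKILINK.sub(resolve, text)

# Example: rewrite one claim file in place (paths are hypothetical).
# path = Path("domains/ai-alignment/some-claim.md")
# path.write_text(strip_broken_wikilinks(path.read_text(), Path("sources")))
```

Leaving resolvable links untouched mirrors the commit's intent: only targets with no corresponding claim in the knowledge base lose their brackets.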