From aa261a5e4b068c1133044519ed54ddae118b3912 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Thu, 26 Mar 2026 00:32:46 +0000
Subject: [PATCH] auto-fix: strip 1 broken wiki link

Pipeline auto-fixer: removed [[ ]] brackets from links that don't
resolve to existing claims in the knowledge base.
---
 ...-institutional-governance-built-on-unreliable-foundations.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index 9997f1e20..f66f29d63 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -125,7 +125,7 @@ METR's scaffold sensitivity finding (GPT-4o and o3 performing better under Vivar
 METR's methodology (RCT + 143 hours of screen recordings at ~10-second resolution) represents the most rigorous empirical design deployed for AI productivity research. The combination of randomized assignment, real tasks developers would normally work on, and granular behavioral decomposition sets a new standard for evaluation quality. This contrasts sharply with pre-deployment evaluations that lack real-world task context.
 
 ### Additional Evidence (confirm)
-*Source: [[2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation]] | Added: 2026-03-25*
+*Source: 2026-03-25-metr-algorithmic-vs-holistic-evaluation-benchmark-inflation | Added: 2026-03-25*
 METR, the primary producer of governance-relevant capability benchmarks, explicitly acknowledges their own time horizon metric (which uses algorithmic scoring) likely overstates operational autonomous capability. The 131-day doubling time for dangerous autonomy may reflect benchmark performance growth rather than real-world capability growth, as the same algorithmic scoring approach that produces 70-75% SWE-Bench success yields 0% production-ready output under holistic evaluation.
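
Note for reviewers: the commit message only names the mechanism, so here is a
minimal Python sketch of what such an auto-fixer might look like. It assumes
wiki links take the form [[claim-slug]] and that a link resolves when a
matching .md filename stem exists under domains/. The function and variable
names (strip_broken_wiki_links, known_claims) are illustrative assumptions,
not the pipeline's actual code.

    import re
    from pathlib import Path

    # Matches [[target]] with no nested brackets; group 1 is the bare slug.
    WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

    def strip_broken_wiki_links(text: str, known_claims: set[str]) -> str:
        """Replace [[target]] with plain 'target' when no claim file exists."""
        def fix(match: re.Match) -> str:
            target = match.group(1)
            # Keep the brackets only if the link resolves to a known claim.
            return match.group(0) if target in known_claims else target
        return WIKI_LINK.sub(fix, text)

    # Hypothetical driver over the knowledge base, resolving by filename stem.
    base = Path("domains")
    known_claims = {p.stem for p in base.rglob("*.md")}
    for md in base.rglob("*.md"):
        original = md.read_text(encoding="utf-8")
        fixed = strip_broken_wiki_links(original, known_claims)
        if fixed != original:
            md.write_text(fixed, encoding="utf-8")

Under these assumptions, the single change in this patch is exactly what the
sketch would produce: the [[2026-03-25-metr-...]] slug has no matching file
stem, so only its brackets are removed and the link text is kept.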