From 544e0ca038b2e5245d2d1272d140d4fae13a40cf Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Fri, 20 Mar 2026 16:32:57 +0000
Subject: [PATCH] entity-batch: update 1 entity

- Applied 1 entity operation from queue
- Files: domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md

Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>
---
 ...ional-governance-built-on-unreliable-foundations.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index 784ae5dd..7c4a0f9b 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -62,6 +62,16 @@ Agents of Chaos demonstrates that static single-agent benchmarks fail to capture
 
 Prandi et al. (2025) found that 195,000 benchmark questions provided zero coverage of oversight evasion, self-replication, and autonomous AI development capabilities. This extends the evaluation unreliability thesis by showing the gap is not just predictive validity but complete absence of measurement for alignment-critical capabilities.
 
+
+### Auto-enrichment (near-duplicate conversion, similarity=1.00)
+*Source: PR #1553 — "pre deployment ai evaluations do not predict real world risk creating institutional governance built on unreliable foundations"*
+*Auto-converted by substantive fixer. Review: revert if this evidence doesn't belong here.*
+
+### Additional Evidence (extend)
+*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
+
+Prandi et al. provide the specific mechanism for why pre-deployment evaluations fail: current benchmark suites concentrate 92.8% of regulatory-relevant coverage on behavioral propensities (hallucination and reliability) while providing zero coverage of the three capability classes (oversight evasion, self-replication, autonomous AI development) that matter most for loss-of-control scenarios. This isn't just that evaluations don't predict real-world risk — it's that the evaluation tools measure dimensions orthogonal to the risks regulators care about.
+
 ---
 
 Relevant Notes: