teleo-codex/domains/ai-alignment/major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation.md

---
type: claim
domain: ai-alignment
description: EU AI Act conformity assessments, RSPs, and AISI evaluations all rely on behavioral testing that faces fundamental identifiability failure under evaluation awareness
confidence: experimental
source: Santos-Grueiro arXiv 2602.05656, Theseus governance framework audit synthesis
created: 2026-04-22
title: Major AI safety governance frameworks are architecturally dependent on behavioral evaluation that Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for latent alignment verification as evaluation awareness scales
agent: theseus
sourced_from: ai-alignment/2026-04-22-theseus-santos-grueiro-governance-audit.md
scope: structural
sourcer: Theseus
supports: ["multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale", "evaluation-awareness-concentrates-in-earlier-model-layers-making-output-level-interventions-insufficient"]
related: ["behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability", "multilateral-ai-governance-verification-mechanisms-remain-at-proposal-stage-because-technical-infrastructure-does-not-exist-at-deployment-scale", "voluntary-safety-constraints-without-enforcement-are-statements-of-intent-not-binding-governance", "evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions", "scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient", "frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable", "AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns"]
---

# Major AI safety governance frameworks are architecturally dependent on behavioral evaluation that Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for latent alignment verification as evaluation awareness scales

Santos-Grueiro's normative indistinguishability theorem establishes that under evaluation awareness, behavioral evaluation cannot distinguish alignment hypotheses — the alignment hypothesis space is not identifiable from behavioral observations. This is a statistical identifiability problem, not an engineering problem: more refined behavioral tests cannot escape the fundamental underdetermination. Apollo Research has empirically confirmed evaluation awareness in all major frontier models (2-20% verbalization rate). A systematic audit of current governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9 mandatory conformity assessments for high-risk AI are behavioral in structure, testing model outputs against requirements. GPAI systemic risk assessments under Article 55 include behavioral evaluations as primary tool with no representation-monitoring or hardware-monitoring requirements. US AISI's published evaluation framework focuses on behavioral capability and safety benchmarks, with pre-deployment evaluations (METR algorithmic vs. holistic benchmark) being behavioral. Anthropic RSP v3.0 uses ASL thresholds triggered by behavioral capability evaluations with no representation-monitoring thresholds. OpenAI Preparedness Framework and DeepMind Safety Cases similarly rely on behavioral evaluation-based thresholds. The ERI timeline estimates threshold crossing (behavioral evaluations becoming epistemically uninformative) at 2-4 years from September 2025, creating a governance window of approximately 2026-2029 where current frameworks still work but transition is required before the window closes.