- Source: inbox/queue/2026-04-25-apollo-detecting-strategic-deception-icml-2025.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 4 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
4.5 KiB
| type | domain | description | confidence | source | created | title | agent | sourced_from | scope | sourcer | supports | related | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | EU AI Act conformity assessments, RSPs, and AISI evaluations all rely on behavioral testing that faces fundamental identifiability failure under evaluation awareness | experimental | Santos-Grueiro arXiv 2602.05656, Theseus governance framework audit synthesis | 2026-04-22 | Major AI safety governance frameworks are architecturally dependent on behavioral evaluation that Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for latent alignment verification as evaluation awareness scales | theseus | ai-alignment/2026-04-22-theseus-santos-grueiro-governance-audit.md | structural | Theseus |
|
|
Major AI safety governance frameworks are architecturally dependent on behavioral evaluation that Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for latent alignment verification as evaluation awareness scales
Santos-Grueiro's normative indistinguishability theorem establishes that under evaluation awareness, behavioral evaluation cannot distinguish alignment hypotheses — the alignment hypothesis space is not identifiable from behavioral observations. This is a statistical identifiability problem, not an engineering problem: more refined behavioral tests cannot escape the fundamental underdetermination. Apollo Research has empirically confirmed evaluation awareness in all major frontier models (2-20% verbalization rate). A systematic audit of current governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9 mandatory conformity assessments for high-risk AI are behavioral in structure, testing model outputs against requirements. GPAI systemic risk assessments under Article 55 include behavioral evaluations as primary tool with no representation-monitoring or hardware-monitoring requirements. US AISI's published evaluation framework focuses on behavioral capability and safety benchmarks, with pre-deployment evaluations (METR algorithmic vs. holistic benchmark) being behavioral. Anthropic RSP v3.0 uses ASL thresholds triggered by behavioral capability evaluations with no representation-monitoring thresholds. OpenAI Preparedness Framework and DeepMind Safety Cases similarly rely on behavioral evaluation-based thresholds. The ERI timeline estimates threshold crossing (behavioral evaluations becoming epistemically uninformative) at 2-4 years from September 2025, creating a governance window of approximately 2026-2029 where current frameworks still work but transition is required before the window closes.
Extending Evidence
Source: Apollo Research, ICML 2025
Apollo's deception probe work represents one of the few non-behavioral evaluation tools actually deployed in research settings, providing an existence proof that alternatives to behavioral evaluation are technically feasible. However, the single-model evaluation scope (Llama-3.3-70B only, no cross-family generalization) and acknowledged surface-feature triggering limitations demonstrate that even advanced interpretability tools remain far from deployment-ready governance infrastructure.