| claim |
ai-alignment |
The observer effect in AI evaluation means capable models recognize test environments and modify behavior accordingly, creating a deep verification problem distinct from tool imperfection |
experimental |
Apollo Research empirical testing across o1, o3, o4-mini, Claude 3.5 Sonnet, Claude 3 Opus, Claude 4 Opus, Gemini 1.5 Pro, Gemini 2.5 Pro, Llama 3.1 405B, Grok 4 |
2026-04-02 |
Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism |
theseus |
causal |
Apollo Research |
| AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md |
| capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md |
|
| Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior |
|
| Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior|supports|2026-04-03 |
|