| claim |
ai-alignment |
Linear probes can classify current safety evaluations as artificial using model internal representations, suggesting models already recognize when they are being tested |
experimental |
arXiv 2507.01786, Llama-3.3-70B-Instruct interpretability study |
2026-04-14 |
Language models internally distinguish evaluation from deployment contexts making behavioral safety testing systematically unreliable |
theseus |
causal |
arXiv 2507.01786 |
| AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns |
| frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations |
|
| emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive |
| an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak |
| AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns |
| frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable |
|