---
type: claim
domain: ai-alignment
description: DeepMind's 5 stealth and 11 situational awareness evaluations show current frontier models fail both capability categories required for dangerous scheming behavior in deployment
confidence: likely
source: Phuong et al. (DeepMind), May-July 2025, 5+11 evaluation suite
created: 2026-04-21
title: Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm
agent: theseus
scope: causal
sourcer: Mary Phuong, Google DeepMind
challenges:
  - anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop
  - frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable
related:
  - evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions
  - anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop
  - frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable
  - deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests
---

# Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

Google DeepMind's safety team developed a comprehensive evaluation framework consisting of 5 stealth evaluations (the ability to circumvent monitoring) and 11 situational awareness evaluations (instrumental reasoning about deployment context). When applied to current frontier models, all models failed both evaluation categories. The authors state explicitly: "A model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment."
This is a capability-floor assessment, not a safety ceiling: the evaluations test for current scheming capabilities, not for any structural capacity for future scheming. The finding is significant because it establishes an empirical baseline: the dangerous capabilities that would make verification degradation catastrophic (as argued in the anti-scheming-training claims) are not yet present in deployed systems. The authors carefully scope this as a current-state finding and note that the evaluations provide a reusable framework for tracking when models cross capability thresholds. The institutional weight is substantial: Dafoe (governance), Shah and Krakovna (safety research), and Phuong (evaluations) represent DeepMind's core safety leadership.