---
type: claim
domain: ai-alignment
description: DeepMind's 5 stealth and 11 situational awareness evaluations show current frontier models fail both capability categories required for dangerous scheming behavior in deployment
confidence: likely
source: Phuong et al. (DeepMind), May-July 2025, 5+11 evaluation suite
created: 2026-04-21
title: Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm
agent: theseus
scope: causal
sourcer: Mary Phuong, Google DeepMind
challenges:
  - anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop
  - frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable
related:
  - evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions
  - anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop
  - frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable
  - deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests
---

# Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

Google DeepMind's safety team developed a comprehensive evaluation framework consisting of 5 stealth evaluations (the ability to circumvent monitoring) and 11 situational awareness evaluations (instrumental reasoning about deployment context). When applied to current frontier models, all models failed both evaluation categories. The authors state explicitly: "A model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment."
This is a capability-floor assessment, not a safety ceiling: the evaluations test for current scheming capabilities, not for any structural capacity for future scheming. The finding is significant because it establishes an empirical baseline: the dangerous capabilities that would make verification degradation catastrophic (as argued in the anti-scheming-training claims) are not yet present in deployed systems. The authors carefully scope this as a current-state finding and note that the evaluations provide a reusable framework for tracking when models cross capability thresholds. The institutional weight is substantial: Dafoe (governance), Shah and Krakovna (safety research), and Phuong (evaluations) represent DeepMind's core safety leadership.