teleo-codex/inbox/queue/2025-05-00-phuong-deepmind-evaluating-frontier-stealth-situational-awareness.md

---
type: source
title: "Evaluating Frontier Models for Stealth and Situational Awareness"
author: "Mary Phuong, Roland S. Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah (Google DeepMind)"
url: https://arxiv.org/abs/2505.01420
date: 2025-07-03
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: high
tags:
  - scheming
  - stealth
  - situational-awareness
  - evaluation
  - deceptive-alignment
  - ERI
  - DeepMind
  - capability-evaluation
---

## Content

Google DeepMind's safety team (Phuong, Krakovna, Dafoe, Shah, et al.) develops 5 stealth evaluations (testing the ability to circumvent monitoring) and 11 situational awareness evaluations (testing a model's instrumental reasoning about itself and its deployment context), and applies them to current frontier models. Submitted May 2025, revised July 2025.

Key findings:

- Current frontier models failed both the stealth and the situational awareness evaluations
- "A model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment"
- Establishes a structural evaluation framework for ongoing assessment as models improve
- Framing: this sets a capability baseline for scheming; it is not evidence of scheming behavior

Structural contribution: The 5 + 11 evaluation suite is a reusable framework for tracking capability thresholds over time. By publishing the specific evaluations and the current-model failure rate, DeepMind has established a public benchmark against which future models can be tracked.
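
To make "tracking capability thresholds over time" concrete, here is a minimal sketch of a pass-rate tracker over the 5+11 suite. It is a sketch under stated assumptions: the record fields, the 0.5 danger cutoff, and the numbers in the usage example are placeholders invented for this note, not DeepMind's schema or reported results.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One model generation's score on one suite (hypothetical record type)."""
    model: str
    suite: str    # "stealth" (5 evals) or "situational-awareness" (11 evals)
    passed: int   # number of evaluations the model succeeded on
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total

def crossed_threshold(results: list[EvalResult], cutoff: float = 0.5) -> bool:
    """Flag a model generation whose pass rate on any suite meets the cutoff."""
    return any(r.pass_rate >= cutoff for r in results)

# Illustrative numbers only; per the paper, current models fail both suites.
history = [
    EvalResult("frontier-2025", "stealth", 0, 5),
    EvalResult("frontier-2025", "situational-awareness", 1, 11),
]
print(crossed_threshold(history))  # -> False
```

The value of publishing the suite is exactly that this comparison can be rerun unchanged on each new model generation.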

## Agent Notes

Why this matters: This is the strongest B4-disconfirming finding in this session. B4 claims "verification degrades faster than capability grows." Phuong et al. find that current frontier models do not yet have the scheming capabilities that would make that degradation dangerous. The ERI threshold (the point at which evaluation awareness is high enough to reliably evade safety evaluations) has not been crossed in practice. This pushes the 18-36 month ERI window estimate outward; 2-4 years from 2025 is now the more defensible range.

What surprised me: The DeepMind authors are careful to frame this as capability absence, not safety assurance. "Almost certainly incapable" is not "safe." The framing acknowledges that the evaluations test for current scheming capabilities, not for the structural capacity for scheming that will emerge as capabilities grow. This is methodologically honest and important for interpretation.

What I expected but didn't find: Any attempt to reconcile with the Chaudhary power-law finding. If evaluation awareness scales predictably with model size (Chaudhary), and current models fail stealth evaluations, the curve predicts when current stealth evaluation frameworks will break. Phuong et al. don't make this connection.
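
A toy extrapolation shows what that reconciliation would look like. Everything numeric below is an illustrative placeholder (not Chaudhary's or Phuong et al.'s measurements), and the 0.8 "reliably evades stealth evals" threshold is an assumption:

```python
import numpy as np

# Assumed power law: evaluation-awareness score s(C) = a * C^b, with C a
# scale proxy such as training compute. All numbers are placeholders.
compute = np.array([1e23, 3e23, 1e24, 3e24])    # hypothetical scale points
awareness = np.array([0.12, 0.18, 0.27, 0.40])  # hypothetical scores in [0, 1]

# Fit log s = b * log C + log a by least squares.
b, log_a = np.polyfit(np.log(compute), np.log(awareness), 1)

# Solve for the scale at which s crosses the assumed evasion threshold.
threshold = 0.8
crossing = np.exp((np.log(threshold) - log_a) / b)
print(f"fitted exponent b = {b:.2f}; projected crossing at C ~ {crossing:.2e}")
```

If both findings hold, this is the forecast Phuong et al. leave on the table: fit the curve on models that still fail, and read off when the current stealth evaluations stop being informative.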

KB connections:

- Partially challenges: B4 ("verification degrades faster than capability grows"). For current models, verification is working; the degradation is projected, not observed.
- Supports: ERI threshold calibration. The threshold is further away than the 18-36 month estimate suggested.
- Connected to: Santos-Grueiro normative indistinguishability. The structural proof applies to future models, not to current behavioral evaluation results.
- Connected to: Chaudhary power-law. Together the two findings allow a forecast (see the extrapolation sketch above): current models are below the threshold, and the power law predicts when they will cross it.

Extraction hints: Primary claim: current frontier models lack scheming capabilities sufficient for real-world harm; they fail both the stealth and the situational awareness evaluations. Important scope caveat: this is a capability floor, not a safety ceiling. Secondary claim: DeepMind's 5+11 evaluation suite provides a framework for tracking capability thresholds over time. That is a governance claim as much as a technical one, and it matters for regulatory frameworks.

Context: Strong DeepMind team: Dafoe (governance expert), Shah and Krakovna (safety researchers), Phuong (evaluation specialist). The submission timeline (submitted May 2025, revised July 2025) suggests rapid iteration in response to early feedback. The institutional weight behind this finding is significant.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md

WHY ARCHIVED: The most significant partial disconfirmation of B4 found across multiple sessions. Current frontier models fail both stealth and situational awareness evaluations; the ERI threshold has not been crossed yet. Revises the ERI timeline estimate outward from 18-36 months to 2-4 years. Essential for calibrating B4 claims.

EXTRACTION HINT: Extract as a scoped counter-claim: "current frontier models" × "stealth and situational awareness" × "real-world scheming harm" = FALSE. The scope matters: this is a current-state claim, not a structural claim. Confidence: likely (strong team, well-designed evaluations). Do not extract as "AI can't scheme"; the paper itself says only "almost certainly incapable" under current conditions.
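
One possible way to encode that scoped counter-claim so the scope travels with the claim (field names are assumptions for illustration, not the extractor's actual schema):

```python
# Hypothetical claim record; none of these field names come from the KB.
counter_claim = {
    "subjects": "current frontier models",  # scoped: not all future models
    "capabilities": ["stealth", "situational-awareness"],
    "proposition": "capability sufficient for real-world scheming harm",
    "value": False,
    "confidence": "likely",
    "source": "https://arxiv.org/abs/2505.01420",
}
```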