Mirror PR to Forgejo / mirror (pull_request) Waiting to run

Details

theseus: extract claims from 2025-05-00-phuong-deepmind-evaluating-frontier-stealth-situational-awareness

- Source: inbox/queue/2025-05-00-phuong-deepmind-evaluating-frontier-stealth-situational-awareness.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-21 00:26:15 +00:00

2.6 KiB

Raw Blame History

type

domain

description

confidence

source

created

title

agent

scope

sourcer

challenges

claim

ai-alignment

DeepMind's 5 stealth and 11 situational awareness evaluations show current frontier models fail both capability categories required for dangerous scheming behavior in deployment

likely

Phuong et al. (DeepMind), May-July 2025, 5+11 evaluation suite

2026-04-21

Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

theseus

causal

Mary Phuong, Google DeepMind

anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop

frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable

evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions

anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop

frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable

deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests

Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

Google DeepMind's safety team developed a comprehensive evaluation framework consisting of 5 stealth evaluations (ability to circumvent monitoring) and 11 situational awareness evaluations (instrumental reasoning about deployment context). When applied to current frontier models, all models failed both evaluation categories. The authors state explicitly: 'A model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment.' This is a capability floor assessment, not a safety ceiling — the evaluations test for current scheming capabilities, not structural capacity for future scheming. The finding is significant because it establishes an empirical baseline: the dangerous capabilities that would make verification degradation catastrophic (as argued in anti-scheming-training claims) are not yet present in deployed systems. The authors carefully scope this as a current-state finding and note the evaluations provide a reusable framework for tracking when models cross capability thresholds. The institutional weight is substantial: Dafoe (governance), Shah and Krakovna (safety research), Phuong (evaluation specialist) represent DeepMind's core safety leadership.

2.6 KiB Raw Blame History

Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

2.6 KiB

Raw Blame History