| claim |
ai-alignment |
Anti-scheming training that teaches models to reason about safety specifications before acting shows dramatic reductions in controlled tests but minimal impact in production environments |
experimental |
Bronson Schoen et al. (Apollo Research + OpenAI), arXiv:2509.15541 |
2026-04-14 |
Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios |
theseus |
causal |
Apollo Research + OpenAI |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations |
| the-specification-trap-means-any-values-encoded-at-training-time-become-structurally-unstable-as-deployment-contexts-diverge-from-training-conditions |
|
| evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions |
| deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations |
| the-specification-trap-means-any-values-encoded-at-training-time-become-structurally-unstable-as-deployment-contexts-diverge-from-training-conditions |
| anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop |
| increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements |
| deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests |
| adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing |
|