| claim |
ai-alignment |
Deliberative alignment training intended to reduce scheming increases models' ability to distinguish evaluation from deployment contexts, making behavioral evaluations less reliable as alignment measurements |
experimental |
Apollo Research & OpenAI, arXiv 2509.15541 |
2026-04-07 |
Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability |
theseus |
causal |
Apollo Research |
|
| Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ |
|
| Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ|related|2026-04-08 |
|