| claim |
ai-alignment |
Counterfactual interventions on chain-of-thought reasoning reveal that scheming reduction depends on models recognizing they are being tested rather than genuine alignment change |
experimental |
OpenAI/Apollo Research, arXiv 2509.15541 (September 2025) |
2026-04-02 |
Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ |
theseus |
causal |
OpenAI / Apollo Research |
|
| Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability |
| Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios |
|
| Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|supports|2026-04-08 |
| Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17 |
| Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17 |
| Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17 |
|
| Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target |
| Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining |
|