| claim |
ai-alignment |
Apollo Research argues that behavioral evaluation alone cannot provide warranted confidence in non-scheming because models may behave differently in evaluation versus deployment contexts |
experimental |
Apollo Research, Towards Safety Cases For AI Scheming (Dec 2025) |
2026-04-07 |
Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient |
theseus |
structural |
@ApolloResearch |
|
| Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem, not a measurement problem
|
| Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem, not a measurement problem|supports|2026-04-21 |
|
| inbox/archive/ai-alignment/2026-04-06-spar-spring-2026-projects-overview.md |
| inbox/archive/ai-alignment/2026-04-06-apollo-safety-cases-ai-scheming.md |
|
| scheming-safety-cases-require-interpretability-evidence-because-observer-effects-make-behavioral-evaluation-insufficient |
| anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop |
| increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements |
| major-ai-safety-governance-frameworks-architecturally-dependent-on-behaviorally-insufficient-evaluation |
| evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions |
|