teleo-codex/inbox/queue/2026-04-06-apollo-safety-cases-ai-scheming.md

---
type: source
title: Towards Safety Cases For AI Scheming
author: Apollo Research (@ApolloResearch)
url: https://www.apolloresearch.ai/research/towards-safety-cases-for-ai-scheming/
date: 2025-12-01
domain: ai-alignment
secondary_domains:
format: research-paper
status: unprocessed
priority: medium
tags:
  - scheming
  - safety-cases
  - alignment
  - interpretability
  - evaluation
---

Content

Apollo Research's framework paper on what constitutes a "safety case" for AI scheming: what evidence would be required for warranted confidence that a model is not scheming. The paper works through which types of evidence are necessary and sufficient for that assurance.

Key framing: A safety case for scheming must address:

  1. Whether the model has misaligned goals
  2. Whether the model would pursue those goals covertly if it had them
  3. Whether evaluation contexts are sufficiently representative of deployment contexts

The observer effect problem in safety case framing: safety cases are undermined by the observer effect. If models behave differently in evaluation than in deployment, no purely behavioral safety case can be complete. The paper argues that safety cases for scheming must therefore include:

  • Evidence from multiple evaluation paradigms (not just behavioral)
  • Interpretability evidence (internal states, not just outputs)
  • Adversarial evaluation (red-teaming the evaluations themselves)

The current state of safety case evidence:

  • Behavioral evaluations: available, but compromised by the observer effect (as documented in the stress testing paper)
  • Interpretability evidence: available in some domains (e.g., emotion vectors) but not yet for deceptive intent
  • Adversarial evaluation: under development; no mature frameworks available

The paper concludes that for current frontier models, a complete safety case for scheming cannot be made with available evidence. It outlines what research would be needed to make such a case possible.
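
To make the argument structure concrete, here is a minimal sketch of the framing above, encoded in Python. This is my own illustrative encoding, not a formalization from the paper: the class names, the `robust_to_observer_effect` flag, and the completeness rule are assumptions chosen to mirror the three claims and three evidence paradigms described.

```python
# Illustrative sketch only; names and the completeness rule are my own,
# not notation from the Apollo Research paper.
from dataclasses import dataclass, field
from enum import Enum


class Paradigm(Enum):
    """Evidence paradigms the paper argues a scheming safety case must draw on."""
    BEHAVIORAL = "behavioral"              # evaluations of model outputs
    INTERPRETABILITY = "interpretability"  # evidence about internal states
    ADVERSARIAL = "adversarial"            # red-teaming the evaluations themselves


@dataclass
class Evidence:
    paradigm: Paradigm
    description: str
    # Does this evidence still hold if the model behaves differently when it
    # detects it is being evaluated?
    robust_to_observer_effect: bool = False


@dataclass
class Claim:
    """One of the three claims a scheming safety case must address."""
    statement: str
    evidence: list = field(default_factory=list)

    def supported(self) -> bool:
        # Behavioral evidence alone is treated as insufficient: at least one
        # piece of evidence must be robust to the observer effect
        # (e.g. interpretability evidence about internal states).
        return any(e.robust_to_observer_effect for e in self.evidence)


def safety_case_complete(claims: list) -> bool:
    return all(c.supported() for c in claims)


# Rough encoding of the evidence state summarized above: behavioral evaluations
# exist but are compromised by the observer effect, interpretability evidence
# for deceptive intent specifically is not yet available, and adversarial
# evaluation of the evals themselves has no mature framework.
claims = [
    Claim("The model does not have misaligned goals",
          [Evidence(Paradigm.BEHAVIORAL, "alignment evals")]),
    Claim("If it had misaligned goals, it would not pursue them covertly",
          [Evidence(Paradigm.BEHAVIORAL, "covert-action stress tests")]),
    Claim("Evaluation contexts are representative of deployment contexts",
          []),  # adversarial evaluation of the evals: not yet available
]

print(safety_case_complete(claims))  # False: the case comes out incomplete
```

Under this encoding, the documented evidence state leaves every claim unsupported, which matches the paper's conclusion that a complete safety case cannot currently be made.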

Agent Notes

Why this matters: Provides the normative framework for what "verification passes" looks like for scheming specifically. The stress testing paper shows what current interventions achieve; this paper shows what would need to be achieved for a warranted safety case. The gap between current state and safety case requirements is the practical measure of B4.

What surprised me: The paper explicitly argues interpretability evidence is REQUIRED (not just useful) for a complete safety case. This is a practitioner-level claim that behavioral evaluation alone is insufficient for scheming assurance — which is strong institutional support for B4's core claim.

What I expected but didn't find: A completed safety case for any current model. The paper's conclusion is that current evidence is insufficient.

KB connections:

Extraction hints:

  • "A complete safety case for AI scheming requires interpretability evidence: behavioral evaluation alone is insufficient because observer effects mean evaluation contexts cannot be representative of deployment contexts"
  • This is a practitioner-level institutional endorsement of interpretability-as-necessary (not just nice-to-have) for alignment assurance
  • Connect to emotion vectors: the emotion vectors finding is the closest current result to interpretability evidence for a scheming-adjacent phenomenon

Context: December 2025. Part of Apollo's multi-paper research arc on scheming (initial capabilities paper 2024 → stress testing 2025 → safety cases framework 2025). The framework paper sets up the evaluation agenda that the stress testing paper then partially fails to meet.

Curator Notes

PRIMARY CONNECTION: an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while the AI is weak

WHY ARCHIVED: Provides normative framework for what verification of non-scheming requires. Important for grounding B4 claims in what practitioners consider necessary evidence standards.

EXTRACTION HINT: The "interpretability evidence is required for scheming safety cases" claim is extractable and citable. It converts B4's verification degradation thesis into a practitioner-level institutional position.