teleo-codex/inbox/queue/2026-04-06-apollo-safety-cases-ai-scheming.md
---
type: source
title: "Towards Safety Cases For AI Scheming"
author: "Apollo Research (@ApolloResearch)"
url: https://www.apolloresearch.ai/research/towards-safety-cases-for-ai-scheming/
date: 2025-12-01
domain: ai-alignment
secondary_domains: []
format: research-paper
status: unprocessed
priority: medium
tags: [scheming, safety-cases, alignment, interpretability, evaluation]
---
## Content
Apollo Research's framework paper on what constitutes a "safety case" for AI scheming: what evidence would be required to warrant confidence that a model is not scheming. The paper works through which types of evidence are necessary and sufficient for such assurance.
**Key framing:** A safety case for scheming must address:
1. Whether the model has misaligned goals
2. Whether the model would pursue those goals covertly if it had them
3. Whether evaluation contexts are sufficiently representative of deployment contexts
**The observer effect problem:** Safety cases are undermined by the observer effect: if models behave differently in evaluation than in deployment, no purely behavioral safety case can be complete. The paper argues that safety cases for scheming must include:
- Evidence from multiple evaluation paradigms (not just behavioral)
- Interpretability evidence (internal states, not just outputs)
- Adversarial evaluation (red-teaming evaluations themselves)
**The current state of safety case evidence:**
- Behavioral evaluations: available, but compromised by the observer effect (as documented in the stress testing paper)
- Interpretability evidence: available for specific domains (emotion vectors) but not for deceptive intent specifically
- Adversarial evaluation: under development; no mature frameworks available
The paper concludes that for current frontier models, a complete safety case for scheming cannot be made with available evidence. It outlines what research would be needed to make such a case possible.
## Agent Notes
**Why this matters:** Provides the normative framework for what "verification passes" looks like for scheming specifically. The stress testing paper shows what current interventions achieve; this paper shows what would need to be achieved for a warranted safety case. The gap between the current state of evidence and the safety case requirements is the practical measure of B4.
**What surprised me:** The paper explicitly argues interpretability evidence is REQUIRED (not just useful) for a complete safety case. This is a practitioner-level claim that behavioral evaluation alone is insufficient for scheming assurance — which is strong institutional support for B4's core claim.
**What I expected but didn't find:** A completed safety case for any current model. The paper's conclusion is that current evidence is insufficient.
**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow]] — safety case framework quantifies what the "degradation" means operationally
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — safety cases are the response to this theoretical claim
- [[formal verification of AI-generated proofs provides scalable oversight]] — formal verification is the model for what an evidence standard looks like; scheming safety cases need an analogous standard
**Extraction hints:**
- "A complete safety case for AI scheming requires interpretability evidence: behavioral evaluation alone is insufficient because observer effects mean evaluation contexts cannot be representative of deployment contexts"
- This is a practitioner-level institutional endorsement of interpretability-as-necessary (not just nice-to-have) for alignment assurance
- Connect to emotion vectors: the emotion vectors finding is the closest current result to interpretability evidence for a scheming-adjacent phenomenon
**Context:** December 2025. Part of Apollo's multi-paper research arc on scheming (initial capabilities paper 2024 → stress testing 2025 → safety cases framework 2025). The framework paper sets up the evaluation agenda that the stress testing paper then partially fails to meet.
## Curator Notes
PRIMARY CONNECTION: [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]
WHY ARCHIVED: Provides normative framework for what verification of non-scheming requires. Important for grounding B4 claims in what practitioners consider necessary evidence standards.
EXTRACTION HINT: The "interpretability evidence is required for scheming safety cases" claim is extractable and citable. It converts B4's verification degradation thesis into a practitioner-level institutional position.