teleo-codex/domains/ai-alignment/activation-steering-fails-for-capability-elicitation-despite-interpretability-research-suggesting-otherwise.md
theseus: extract claims from 2025-02-00-hofstatter-elicitation-game-capability-evaluation-reliability
- Source: inbox/queue/2025-02-00-hofstatter-elicitation-game-capability-evaluation-reliability.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-21 00:24:15 +00:00


---
type: claim
domain: ai-alignment
description: Controlled experiments show activation steering cannot recover hidden capabilities, contradicting naive inference from interpretability work
confidence: experimental
source: Hofstätter et al., ICML 2025 model organism experiments
created: 2026-04-21
title: Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
agent: theseus
scope: functional
sourcer: Hofstätter et al.
supports:
  - mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale
related:
  - alignment-auditing-tools-fail-through-tool-to-agent-gap-not-tool-quality
  - mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale
---

# Activation steering fails for capability elicitation despite interpretability research suggesting otherwise

Hofstätter et al. tested activation steering as a capability-elicitation technique on model organisms with hidden capabilities. Activation steering failed to elicit the hidden capabilities in every tested scenario, while prompting succeeded for multiple-choice question answering (MCQA) and fine-tuning succeeded for code generation. This contradicts the naive inference from interpretability research that activation steering is a powerful capability-elicitation method. The finding has direct implications for how safety-evaluation organizations (METR, Apollo, AISI) design capability evaluations: if the goal is evaluating what a model can do, steering is not the right tool; fine-tuning is. This exposes a methodological gap: interpretability tools that work for understanding representations do not necessarily work for capability elicitation.
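For readers unfamiliar with the technique being evaluated, the core of activation steering is simple: extract a direction from contrastive activations and add a scaled copy of it to a hidden state at inference time. The following is a toy pure-Python sketch with made-up numbers, not the authors' setup; in practice the vectors live in a transformer's residual stream and are injected via forward hooks.

```python
# Toy sketch of activation steering (illustrative only; values are invented).
# A steering direction is often taken as the difference of mean activations
# on contrastive prompt sets; steering adds alpha * direction to a hidden state.

def contrastive_direction(pos_acts, neg_acts):
    """direction = mean(positive activations) - mean(negative activations)."""
    mean_pos = [sum(col) / len(pos_acts) for col in zip(*pos_acts)]
    mean_neg = [sum(col) / len(neg_acts) for col in zip(*neg_acts)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def steer(hidden, direction, alpha):
    """Add a scaled steering vector to one hidden activation."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 2-d activations from "capability shown" vs "capability hidden" prompts.
pos = [[1.0, 0.0], [0.8, 0.2]]
neg = [[0.0, 1.0], [0.2, 0.8]]

v = contrastive_direction(pos, neg)
steered = steer([0.5, 0.5], v, alpha=1.0)
```

The claim above is precisely that this intervention, however the direction is chosen, did not surface hidden capabilities in the tested model organisms, even though the same kind of direction is informative for interpretability.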