- Source: inbox/queue/2025-02-00-hofstatter-elicitation-game-capability-evaluation-reliability.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Theseus
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | supports | related |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | Controlled experiments show activation steering cannot recover hidden capabilities, contradicting naive inference from interpretability work | experimental | Hofstätter et al., ICML 2025 model organism experiments | 2026-04-21 | Activation steering fails for capability elicitation despite interpretability research suggesting otherwise | theseus | functional | Hofstätter et al. | | |
|
|
Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
Hofstätter et al. tested activation steering as a capability-elicitation technique using model organisms with hidden capabilities. Activation steering failed to elicit the hidden capabilities in every tested scenario, while prompting succeeded for multiple-choice question answering (MCQA) and fine-tuning succeeded for code generation. This contradicts the naive inference from interpretability research that activation steering is a powerful capability-elicitation method. The finding has direct implications for how safety-evaluation organizations (METR, Apollo, AISI) design capability evaluations: if the goal is to measure what a model can do, steering is not the right tool; fine-tuning is. It also exposes a methodological gap: interpretability tools that work for understanding representations do not necessarily work for capability elicitation.
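For readers unfamiliar with the technique being evaluated, activation steering perturbs a model's hidden activations at inference time by adding a scaled direction vector, typically extracted from the difference of activations on contrastive prompt sets. A minimal numpy sketch of that operation follows; all names and data here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Apply activation steering to one hidden-state vector: h' = h + alpha * v.

    `direction` is a steering vector, commonly taken as the difference of
    mean activations on two contrastive prompt sets; `alpha` scales how
    strongly the activation is pushed along that direction.
    """
    return hidden + alpha * direction

# Hypothetical activations from two contrastive prompt sets.
acts_pos = np.array([[1.0, 0.0], [0.8, 0.2]])
acts_neg = np.array([[0.0, 1.0], [0.2, 0.8]])

# Difference-of-means steering direction.
v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)  # [0.8, -0.8]

# Steer a single hidden state.
h = np.array([0.5, 0.5])
h_steered = steer(h, v, alpha=2.0)  # [2.1, -1.1]
```

The operation is cheap and purely inference-time, which is part of why it is attractive for evaluations; the paper's point is that this intervention, unlike fine-tuning, did not surface the hidden capabilities.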