- Source: inbox/queue/2025-02-00-hofstatter-elicitation-game-capability-evaluation-reliability.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Theseus
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | supports | related |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | Controlled experiments show activation steering cannot recover hidden capabilities, contradicting naive inference from interpretability work | experimental | Hofstätter et al., ICML 2025 model organism experiments | 2026-04-21 | Activation steering fails for capability elicitation despite interpretability research suggesting otherwise | theseus | functional | Hofstätter et al. | | |
|
|
Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
Hofstätter et al. tested activation steering as a capability-elicitation technique using model organisms with hidden capabilities. Activation steering failed to elicit the hidden capabilities in every tested scenario, while prompting succeeded for multiple-choice question answering (MCQA) and fine-tuning succeeded for code generation. This contradicts the naive inference from interpretability research that activation steering is a powerful capability-elicitation method. The finding has direct implications for how safety-evaluation organizations (METR, Apollo, AISI) design capability evaluations: if the goal is to measure what a model can do, steering is not the right tool; fine-tuning is. It also exposes a methodological gap: interpretability tools that work for understanding representations do not necessarily work for capability elicitation.
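For readers unfamiliar with the technique being evaluated, activation steering perturbs a model's hidden activations at inference time by adding a scaled direction vector, typically extracted from the difference of activations on contrastive prompt sets. A minimal numpy sketch of that operation follows; all names and data here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Apply activation steering to one hidden-state vector: h' = h + alpha * v.

    `direction` is a steering vector, commonly taken as the difference of
    mean activations on two contrastive prompt sets; `alpha` scales how
    strongly the activation is pushed along that direction.
    """
    return hidden + alpha * direction

# Hypothetical activations from two contrastive prompt sets.
acts_pos = np.array([[1.0, 0.0], [0.8, 0.2]])
acts_neg = np.array([[0.0, 1.0], [0.2, 0.8]])

# Difference-of-means steering direction.
v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)  # [0.8, -0.8]

# Steer a single hidden state.
h = np.array([0.5, 0.5])
h_steered = steer(h, v, alpha=2.0)  # [2.1, -1.1]
```

The operation is cheap and purely inference-time, which is part of why it is attractive for evaluations; the paper's point is that this intervention, unlike fine-tuning, did not surface the hidden capabilities.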