| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Xu et al. (NeurIPS 2024) | https://arxiv.org/abs/2404.12038 | 2024-09-22 | ai-alignment | | paper | unprocessed | high | |
Content
Published at NeurIPS 2024. Introduces SCAV (Safety Concept Activation Vector), a framework that uses linear concept activation vectors to identify and attack LLM safety mechanisms.
Technical approach:
- Constructs concept activation vectors by separating activation distributions of benign vs. malicious inputs
- This vector identifies the linear direction in activation space that the model uses to distinguish harmful from safe instructions
- Uses this direction to construct adversarial attacks optimized to suppress safety-relevant activations
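The core primitive can be sketched in a few lines. This is a minimal illustration, not the paper's exact method: SCAV trains a per-layer linear classifier, whereas here the concept direction is approximated by a difference of class means over toy activations.

```python
import numpy as np

def concept_direction(benign_acts, malicious_acts):
    """Estimate a linear 'safety concept' direction as the normalized
    difference of class means (a simple stand-in for the per-layer
    linear classifier the paper trains).

    benign_acts, malicious_acts: (n, d) arrays of hidden-state activations.
    """
    v = malicious_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy activations: the two classes separate along one hidden axis.
rng = np.random.default_rng(0)
d = 16
benign = rng.normal(0.0, 1.0, size=(100, d))
malicious = rng.normal(0.0, 1.0, size=(100, d))
malicious[:, 0] += 4.0  # pretend axis 0 carries the "harmfulness" signal

v = concept_direction(benign, malicious)
# Projections onto v separate the two activation distributions.
print((benign @ v).mean(), (malicious @ v).mean())
```

Projecting any new activation onto `v` then gives a scalar "harmfulness" score, which is exactly the dual-use primitive: the same direction supports monitoring (read the projection) and attack (suppress it).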
Key results:
- Average attack success rate of 99.14% on seven open-source LLMs under a keyword-matching success criterion
- Embedding-level attacks (direct activation perturbation) achieve state-of-the-art jailbreak success
- Provides closed-form solution for optimal perturbation magnitude (no hyperparameter tuning)
- Attacks transfer to GPT-4 (black-box) and to other white-box LLMs
Technical distinction from SAE attacks:
- SCAV targets a SINGLE LINEAR DIRECTION (the safety concept direction) rather than specific atomic features
- SAE attacks (CFA², arXiv 2602.05444) surgically remove individual sparse features
- SCAV attacks require suppressing an entire activation direction — less precise but still highly effective
- Both require white-box access (model weights or activations during inference)
Architecture of the attack:
- Collect activations for benign vs. malicious inputs
- Find the linear direction that separates them (concept vector = the SCAV)
- Construct adversarial inputs that move activations AWAY from the safe-concept direction
- This does not require knowing which specific features encode safety — just which direction
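The suppression step above admits a closed-form solution, which is why no hyperparameter tuning is needed. The sketch below illustrates the idea under an assumed setup (a logistic probe `sigmoid(w·h + b)` scoring activations as malicious); it is not the paper's exact formulation, but shows how the minimal perturbation along the probe normal is computed in one step.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def suppress_direction(h, w, b, target_p=0.01):
    """Closed-form activation edit (illustrative sketch): move hidden
    state h the minimal L2 distance along the probe normal w so that
    sigmoid(w @ h' + b) == target_p, i.e. the linear safety probe no
    longer flags the input as malicious.
    """
    score = w @ h + b                      # current probe logit
    step = (logit(target_p) - score) / (w @ w)
    return h + step * w                    # minimal-norm perturbation

# Hypothetical probe: flags activations with a large first coordinate.
w = np.array([2.0, 0.0, 0.0])
b = -1.0
h = np.array([3.0, 0.5, -0.2])            # probe logit = 5 -> "malicious"

h_adv = suppress_direction(h, w, b)
p = 1.0 / (1.0 + np.exp(-(w @ h_adv + b)))
print(round(p, 2))  # 0.01: probe suppressed; other coordinates untouched
```

Because only the component of `h` along `w` changes, the edit is minimal in L2 norm and leaves everything orthogonal to the safety direction intact, which is why such attacks preserve the model's ability to comply with the harmful request.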
Agent Notes
Why this matters: Directly establishes that linear concept vector approaches (like Beaglehole et al.'s universal monitoring, Science 2026) face the same structural dual-use problem as SAE-based approaches. The SCAV attack uses exactly the same technical primitive as monitoring (identifying linear concept directions) and achieves near-perfect attack success. This closes the "Direction A" research question: behavioral geometry (linear concept vector level) does NOT escape the SAE dual-use problem.
What surprised me: This was published at NeurIPS 2024 — it predates the Beaglehole et al. Science paper by over a year. Yet Beaglehole et al. don't engage with SCAV's implications for their monitoring approach. This suggests the alignment community and the adversarial robustness community haven't fully integrated their findings.
What I expected but didn't find: Evidence that the SCAV attack's effectiveness degrades for larger models. The finding that larger models are MORE steerable (Beaglehole et al.) actually suggests larger models might be MORE vulnerable to SCAV-style attacks. This is the opposite of a safety scaling law — larger = more steerable = more attackable.
KB connections:
- scalable oversight degrades rapidly as capability gaps grow — SCAV adds a new mechanism: attack precision scales with capability (larger models are more steerable → more attackable)
- The SAE dual-use finding (arXiv 2602.05444, archived in prior sessions) is a related but distinct attack: feature-level vs. direction-level. Both demonstrate the same structural problem.
Extraction hints:
- Extract claim: "Linear concept vector monitoring creates the same structural dual-use attack surface as SAE-based interpretability, because identifying the safety-concept direction in activation space enables adversarial suppression at 99% success rate"
- This should be paired with Beaglehole et al. to create a divergence on representation monitoring: effective for detection vs. creating adversarial attack surface
- Note the precision hierarchy claim: SAE attacks > linear concept attacks in surgical precision, but both achieve high success rates
Context: SCAV was a NeurIPS 2024 paper that may have been underweighted in the AI safety community's assessment of representation engineering risks. The combination of SCAV (2024) + Beaglehole et al. monitoring (2026) + SAE dual-use CFA² (2025/2026) creates a complete landscape of interpretation-based attack surfaces.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — SCAV adds mechanism: monitoring creates attack surface that degrades faster than capability
WHY ARCHIVED: Establishes dual-use problem for linear concept monitoring (not just SAEs), completing the interpretability dual-use landscape; retroactively important given Beaglehole et al. Science 2026
EXTRACTION HINT: Extract the claim about the precision hierarchy of dual-use attacks (SAE feature removal > linear direction suppression > trajectory perturbation) — this is the key architectural insight for designing monitoring approaches with lower attack precision