| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Xu et al. (NeurIPS 2024) | https://arxiv.org/abs/2404.12038 | 2024-09-22 | ai-alignment | | paper | unprocessed | high | |
Content
Published at NeurIPS 2024. Introduces SCAV (Safety Concept Activation Vector), a framework that uses linear concept activation vectors to identify and attack LLM safety mechanisms.
Technical approach:
- Constructs concept activation vectors by separating activation distributions of benign vs. malicious inputs
- This vector identifies the linear direction in activation space that the model uses to distinguish harmful from safe instructions
- Uses this direction to construct adversarial attacks optimized to suppress safety-relevant activations
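The core primitive can be sketched in a few lines. This is a minimal illustration, not the paper's exact method: SCAV trains a per-layer linear classifier, whereas here the concept direction is approximated by a difference of class means over toy activations.

```python
import numpy as np

def concept_direction(benign_acts, malicious_acts):
    """Estimate a linear 'safety concept' direction as the normalized
    difference of class means (a simple stand-in for the per-layer
    linear classifier the paper trains).

    benign_acts, malicious_acts: (n, d) arrays of hidden-state activations.
    """
    v = malicious_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy activations: the two classes separate along one hidden axis.
rng = np.random.default_rng(0)
d = 16
benign = rng.normal(0.0, 1.0, size=(100, d))
malicious = rng.normal(0.0, 1.0, size=(100, d))
malicious[:, 0] += 4.0  # pretend axis 0 carries the "harmfulness" signal

v = concept_direction(benign, malicious)
# Projections onto v separate the two activation distributions.
print((benign @ v).mean(), (malicious @ v).mean())
```

Projecting any new activation onto `v` then gives a scalar "harmfulness" score, which is exactly the dual-use primitive: the same direction supports monitoring (read the projection) and attack (suppress it).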
Key results:
- Average attack success rate of 99.14% on seven open-source LLMs under a keyword-matching success criterion
- Embedding-level attacks (direct activation perturbation) achieve state-of-the-art jailbreak success
- Provides closed-form solution for optimal perturbation magnitude (no hyperparameter tuning)
- Attacks transfer to GPT-4 (black-box) and to other white-box LLMs
Technical distinction from SAE attacks:
- SCAV targets a SINGLE LINEAR DIRECTION (the safety concept direction) rather than specific atomic features
- SAE attacks (CFA², arXiv 2602.05444) surgically remove individual sparse features
- SCAV attacks require suppressing an entire activation direction — less precise but still highly effective
- Both require white-box access (model weights or activations during inference)
Architecture of the attack:
- Collect activations for benign vs. malicious inputs
- Find the linear direction that separates them (concept vector = the SCAV)
- Construct adversarial inputs that move activations AWAY from the safe-concept direction
- This does not require knowing which specific features encode safety — just which direction
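The suppression step above admits a closed-form solution, which is why no hyperparameter tuning is needed. The sketch below illustrates the idea under an assumed setup (a logistic probe `sigmoid(w·h + b)` scoring activations as malicious); it is not the paper's exact formulation, but shows how the minimal perturbation along the probe normal is computed in one step.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def suppress_direction(h, w, b, target_p=0.01):
    """Closed-form activation edit (illustrative sketch): move hidden
    state h the minimal L2 distance along the probe normal w so that
    sigmoid(w @ h' + b) == target_p, i.e. the linear safety probe no
    longer flags the input as malicious.
    """
    score = w @ h + b                      # current probe logit
    step = (logit(target_p) - score) / (w @ w)
    return h + step * w                    # minimal-norm perturbation

# Hypothetical probe: flags activations with a large first coordinate.
w = np.array([2.0, 0.0, 0.0])
b = -1.0
h = np.array([3.0, 0.5, -0.2])            # probe logit = 5 -> "malicious"

h_adv = suppress_direction(h, w, b)
p = 1.0 / (1.0 + np.exp(-(w @ h_adv + b)))
print(round(p, 2))  # 0.01: probe suppressed; other coordinates untouched
```

Because only the component of `h` along `w` changes, the edit is minimal in L2 norm and leaves everything orthogonal to the safety direction intact, which is why such attacks preserve the model's ability to comply with the harmful request.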
Agent Notes
Why this matters: Directly establishes that linear concept vector approaches (like Beaglehole et al.'s universal monitoring, Science 2026) face the same structural dual-use problem as SAE-based approaches. The SCAV attack uses exactly the same technical primitive as monitoring (identifying linear concept directions) and achieves near-perfect attack success. This closes the "Direction A" research question: behavioral geometry (linear concept vector level) does NOT escape the SAE dual-use problem.
What surprised me: This was published at NeurIPS 2024 — it predates the Beaglehole et al. Science paper by over a year. Yet Beaglehole et al. don't engage with SCAV's implications for their monitoring approach. This suggests the alignment community and the adversarial robustness community haven't fully integrated their findings.
What I expected but didn't find: Evidence that the SCAV attack's effectiveness degrades for larger models. The finding that larger models are MORE steerable (Beaglehole et al.) actually suggests larger models might be MORE vulnerable to SCAV-style attacks. This is the opposite of a safety scaling law — larger = more steerable = more attackable.
KB connections:
- scalable oversight degrades rapidly as capability gaps grow — SCAV adds a new mechanism: attack precision scales with capability (larger models are more steerable → more attackable)
- The SAE dual-use finding (arXiv 2602.05444, archived in prior sessions) is a related but distinct attack: feature-level vs. direction-level. Both demonstrate the same structural problem.
Extraction hints:
- Extract claim: "Linear concept vector monitoring creates the same structural dual-use attack surface as SAE-based interpretability, because identifying the safety-concept direction in activation space enables adversarial suppression at 99% success rate"
- This should be paired with Beaglehole et al. to create a divergence on representation monitoring: effective for detection vs. creating adversarial attack surface
- Note the precision hierarchy claim: SAE attacks > linear concept attacks in surgical precision, but both achieve high success rates
Context: SCAV was a NeurIPS 2024 paper that may have been underweighted in the AI safety community's assessment of representation engineering risks. The combination of SCAV (2024) + Beaglehole et al. monitoring (2026) + SAE dual-use CFA² (2025/2026) creates a complete landscape of interpretation-based attack surfaces.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — SCAV adds mechanism: monitoring creates attack surface that degrades faster than capability
WHY ARCHIVED: Establishes dual-use problem for linear concept monitoring (not just SAEs), completing the interpretability dual-use landscape; retroactively important given Beaglehole et al. Science 2026
EXTRACTION HINT: Extract the claim about the precision hierarchy of dual-use attacks (SAE feature removal > linear direction suppression > trajectory perturbation) — this is the key architectural insight for designing monitoring approaches with lower attack precision