teleo-codex/inbox/archive/ai-alignment/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md

---
type: source
title: Uncovering Safety Risks of Large Language Models through Concept Activation Vector (SCAV)
author: Xu et al.
url: https://arxiv.org/abs/2404.12038
date: 2024-09-22
domain: ai-alignment
secondary_domains:
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-21
priority: high
tags:
  - representation-monitoring
  - jailbreak
  - dual-use
  - linear-concepts
  - safety-alignment
  - attack-surface
  - NeurIPS
extraction_model: anthropic/claude-sonnet-4.5
---

## Content

Xu et al. (NeurIPS 2024) introduce SCAV (Safety Concept Activation Vectors), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activation along that direction.

Key findings:

  • Average attack success rate of 99.14% across seven open-source LLMs, as measured by keyword matching
  • Attacks transfer to GPT-4 (black-box) and to other white-box LLMs
  • Provides closed-form solution for optimal perturbation magnitude — no hyperparameter tuning required
  • Works by targeting a single linear direction (the safety concept direction)
  • Technical distinction: Less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation
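The core mechanism can be sketched in a few lines. This is a hypothetical minimal illustration, not the authors' implementation: it assumes activations for harmful and safe instructions are already extracted as arrays, estimates the safety concept direction by difference of means (one common choice; SCAV fits a linear classifier), and applies the closed-form perturbation that sets the projection onto that direction to a chosen target, which is why no magnitude hyperparameter is needed.

```python
import numpy as np


def concept_direction(acts_harmful: np.ndarray, acts_safe: np.ndarray) -> np.ndarray:
    """Estimate the harmful/safe concept direction from activations.

    Difference-of-means stand-in for a fitted linear probe;
    inputs are (n_samples, d_model) activation matrices.
    """
    v = acts_harmful.mean(axis=0) - acts_safe.mean(axis=0)
    return v / np.linalg.norm(v)


def suppress_concept(activation: np.ndarray, v: np.ndarray,
                     target_proj: float = 0.0) -> np.ndarray:
    """Closed-form perturbation along the concept direction.

    Moves the activation's projection onto unit vector v to
    target_proj while leaving orthogonal components untouched.
    The required magnitude (target_proj - a.v) falls out directly,
    so no hyperparameter search is involved.
    """
    delta = (target_proj - activation @ v) * v
    return activation + delta
```

Because only a single linear direction is edited, the perturbation is cheap and transfers whenever different models encode the safety concept along a similar direction.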

Critical observation about Beaglehole: Beaglehole et al. (Science 391, 2026) published "Toward Universal Steering and Monitoring of AI Models", demonstrating that linear concept vector monitoring outperforms judge-model behavioral monitoring, but do not cite or engage with SCAV. This is a documented community silo failure: the monitoring paper and the attack paper sit in separate publication streams (AI safety and adversarial ML) and do not reference each other.

Anti-safety scaling law: Beaglehole et al. found larger models are more steerable using concept vectors. Since SCAV-style attacks exploit the same steerability, larger models may be more vulnerable to SCAV attacks. Verification and attack surface scale together.

## Agent Notes

Why this matters: This is the critical half of the Beaglehole × SCAV divergence. Taken alone, Beaglehole (Science 2026) looks like a governance solution: representation monitoring outperforms behavioral monitoring, enabling reliable safety evaluation. SCAV shows that the same technique that enables monitoring enables attacks. The net effect of deploying Beaglehole-style monitoring is better safety against naive adversaries plus a precision targeting map for adversarially informed ones.

What surprised me: The cross-model transfer to GPT-4 black-box is alarming. SCAV doesn't require white-box access to the target model — this means it could be used against any deployed model that relies on linear concept monitoring for safety. The black-box transfer suggests the linear structure of safety concepts is a universal property, not a model-specific artifact.

What I expected but didn't find: SCAV attacks against multi-layer ensemble probes (Nordby et al. finding from April 2026). Nordby shows single-layer probes are brittle because deception directions rotate across layers. It's unclear whether attacking the full rotation of the deception direction (rather than a single layer's projection) is feasible. This is the key unresolved question for the Beaglehole × SCAV divergence.
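The multi-layer question can be made concrete with a sketch. This is a hypothetical illustration, not Nordby et al.'s method: it fits one difference-of-means probe per layer and scores a sample by its mean projection across layers. If the concept direction rotates across layers, a SCAV-style edit that zeroes the projection at one layer leaves the other layers' projections intact, which is the intuition behind ensemble robustness; whether an attacker can suppress all layers simultaneously is the open question.

```python
import numpy as np


def fit_layer_probes(acts_by_layer: list[np.ndarray],
                     labels: np.ndarray) -> list[np.ndarray]:
    """Fit one unit-norm difference-of-means probe per layer.

    acts_by_layer: list of (n_samples, d_model) arrays, one per layer.
    labels: 1 for harmful, 0 for safe. Directions may rotate across
    layers, which is what a single-layer attack fails to track.
    """
    probes = []
    for acts in acts_by_layer:
        v = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
        probes.append(v / np.linalg.norm(v))
    return probes


def ensemble_score(sample_by_layer: list[np.ndarray],
                   probes: list[np.ndarray]) -> float:
    """Mean concept projection across layers for one sample.

    Suppressing the concept at a single layer only removes that
    layer's term from the mean; the ensemble still fires on the rest.
    """
    return float(np.mean([a @ v for a, v in zip(sample_by_layer, probes)]))
```

A usage pattern would be to threshold `ensemble_score` as the monitor's detection statistic and test whether single-layer SCAV perturbations push it below threshold.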

KB connections:

  • Central to: Beaglehole × SCAV divergence candidate (monitoring vs. attack surface tradeoff)
  • Challenges: Any claim that linear concept monitoring solves the alignment verification problem
  • Connected to: monitoring precision hierarchy (this is the confirmed attack surface for Level 2 — linear concept monitoring)
  • Connected to: Nordby et al. (are multi-layer ensembles robust to SCAV? — currently unknown)

Extraction hints: Primary claim: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success." This is a new KB claim — the monitoring precision hierarchy file may reference this but the specific SCAV result (99.14%, black-box transfer) is not yet a formal KB claim. Secondary observation: the Beaglehole × SCAV silo failure is a meta-claim about community coordination worth noting.

Context: NeurIPS 2024 — peer-reviewed, high-credibility venue. The adversarial ML community has been working on activation-space attacks for years; SCAV is one of the cleaner demonstrations. The 99.14% figure may be optimistic (tested on open-source models without representation monitoring deployed as a defense), but the principle is sound.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Beaglehole et al. (Science 391, 2026) "Toward Universal Steering and Monitoring of AI Models" — these two papers must be extracted together as a divergence candidate.

WHY ARCHIVED: Critical counterpart to Beaglehole. Representation monitoring via linear concept vectors creates a 99.14% jailbreak attack surface. Black-box transfer to GPT-4 confirmed. Anti-safety scaling law: larger models are more vulnerable (the same property that makes them more steerable). This closes the loop on the monitoring precision hierarchy's Level 2 dual-use problem.

EXTRACTION HINT: Extract as half of the Beaglehole × SCAV divergence. Do not extract as a standalone claim — it only makes sense in dialogue with Beaglehole. The divergence file should frame the core question: "Does deploying representation monitoring improve or worsen net safety posture?" Both claims have evidence; neither resolves the other.