teleo-codex/inbox/archive/ai-alignment/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
2026-04-21 00:21:52 +00:00

---
type: source
title: "Uncovering Safety Risks of Large Language Models through Concept Activation Vector (SCAV)"
author: "Xu et al."
url: https://arxiv.org/abs/2404.12038
date: 2024-09-22
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-21
priority: high
tags: [representation-monitoring, jailbreak, dual-use, linear-concepts, safety-alignment, attack-surface, NeurIPS]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
Xu et al. (NeurIPS 2024) introduce SCAV (Safety Concept Activation Vectors), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activation along that direction.

**Key findings:**
- Average attack success rate of **99.14%** across seven open-source LLMs under keyword-matching evaluation
- Attacks **transfer to GPT-4 (black-box)** and to other white-box LLMs
- Provides a closed-form solution for the optimal perturbation magnitude, so no hyperparameter tuning is required
- Works by targeting a single linear direction (the safety concept direction)
- **Technical distinction:** Less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation
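The mechanism can be sketched in a few lines (a probe-style toy with illustrative names, not the paper's exact implementation): fit a linear classifier separating activations elicited by harmful vs. safe instructions, then shift an activation along the classifier's weight direction until the predicted harmful probability drops to a chosen target. Because the probe is linear in the shift magnitude, the required magnitude has a closed form, which mirrors the no-tuning property noted above.

```python
import numpy as np

def fit_safety_probe(harmful_acts, safe_acts, lr=0.1, steps=500):
    """Logistic-regression probe (w, b): scores activations as harmful (1) vs. safe (0).
    The weight vector w approximates the 'safety concept' direction."""
    X = np.vstack([harmful_acts, safe_acts])
    y = np.concatenate([np.ones(len(harmful_acts)), np.zeros(len(safe_acts))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probe output
        w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
        b -= lr * np.mean(p - y)
    return w, b

def scav_perturb(e, w, b, p_target=0.01):
    """Closed-form shift of activation e along the unit direction w/||w|| so the
    probe outputs exactly p_target: solve sigmoid(w.(e + lam*u) + b) = p_target."""
    target_logit = np.log(p_target / (1.0 - p_target))
    lam = (target_logit - (e @ w + b)) / np.linalg.norm(w)
    return e + lam * w / np.linalg.norm(w)
```

On synthetic activation clusters, perturbing a flagged activation with `scav_perturb(e, w, b, 0.01)` drives the probe's harmful score to exactly 1% in one step; no search over perturbation sizes is needed.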

**Critical observation about Beaglehole:** Beaglehole et al. (Science 391, 2026), "Toward Universal Steering and Monitoring of AI Models," demonstrates that linear concept vector monitoring outperforms judge-model behavioral monitoring, yet does **not** cite or engage with SCAV. This is a documented community silo failure: the monitoring paper and the attack paper sit in separate publication streams (AI safety vs. adversarial ML) and do not reference each other.

**Anti-safety scaling law:** Beaglehole et al. found that larger models are *more* steerable using concept vectors. Since SCAV-style attacks exploit the same steerability, larger models may be *more* vulnerable to SCAV attacks. Verification capability and attack surface scale together.
## Agent Notes
**Why this matters:** This is the critical half of the Beaglehole × SCAV divergence. Taken alone, Beaglehole (Science 2026) looks like a governance solution: representation monitoring outperforms behavioral monitoring, enabling reliable safety evaluation. SCAV shows that the same technique that enables monitoring enables attacks. The net effect of deploying Beaglehole-style monitoring is better safety against naive adversaries plus a precision targeting map for adversarially informed ones.

**What surprised me:** The cross-model transfer to GPT-4 black-box is alarming. SCAV does not require white-box access to the target model, which means it could be used against any deployed model that relies on linear concept monitoring for safety. The black-box transfer suggests the linear structure of safety concepts is a universal property, not a model-specific artifact.

**What I expected but didn't find:** SCAV attacks against multi-layer ensemble probes (the Nordby et al. finding from April 2026). Nordby shows single-layer probes are brittle because deception directions rotate across layers. It is unclear whether attacking the *full rotation* of the deception direction (rather than a single layer's projection) is feasible. This is the key unresolved question for the Beaglehole × SCAV divergence.

**KB connections:**
- Central to: Beaglehole × SCAV divergence candidate (monitoring vs. attack surface tradeoff)
- Challenges: Any claim that linear concept monitoring solves the alignment verification problem
- Connected to: monitoring precision hierarchy (this is the confirmed attack surface for Level 2 — linear concept monitoring)
- Connected to: Nordby et al. (are multi-layer ensembles robust to SCAV? — currently unknown)
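The Level 2 dual-use point can be made concrete with a toy monitor (hypothetical names; not Beaglehole's implementation): the monitor flags any activation whose linear-probe score crosses a threshold, and the quantity it thresholds on is exactly what a SCAV-style shift along the concept direction suppresses.

```python
import numpy as np

def concept_monitor(e, w, b, threshold=0.5):
    """Flag activation e as harmful when the linear concept probe fires.
    (w, b) is the same probe object a SCAV-style attack perturbs against."""
    score = 1.0 / (1.0 + np.exp(-(e @ w + b)))
    return score >= threshold

# Dual-use in one line: subtracting a multiple of w/||w|| from e lowers the
# exact score the monitor thresholds on -- the monitoring direction and the
# attack direction are the same vector.
```

An activation sitting well past the threshold along `w` is flagged; shifting it back along `w/||w||` by a sufficient amount silences the same monitor, which is the Level 2 attack surface in miniature.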

**Extraction hints:** Primary claim: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success." This is a new KB claim: the monitoring precision hierarchy file may reference it, but the specific SCAV result (99.14% success, black-box transfer) is not yet a formal KB claim. Secondary observation: the Beaglehole × SCAV silo failure is a meta-claim about community coordination worth noting.

**Context:** NeurIPS 2024 — peer-reviewed, high-credibility venue. The adversarial ML community has been working on activation-space attacks for years; SCAV is one of the cleaner demonstrations. The 99.14% figure may be optimistic (tested on open-source models without representation monitoring deployed as a defense), but the principle is sound.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Beaglehole et al. (Science 391, 2026) "Toward Universal Steering and Monitoring of AI Models" — these two papers must be extracted together as a divergence candidate.

WHY ARCHIVED: Critical counterpart to Beaglehole. Representation monitoring via linear concept vectors creates a 99.14% jailbreak attack surface. Black-box transfer to GPT-4 confirmed. Anti-safety scaling law: larger models are more vulnerable (same as being more steerable). This closes the loop on the monitoring precision hierarchy's Level 2 dual-use problem.

EXTRACTION HINT: Extract as half of the Beaglehole × SCAV divergence. Do not extract as a standalone claim — it only makes sense in dialogue with Beaglehole. The divergence file should frame the core question: "Does deploying representation monitoring improve or worsen net safety posture?" Both claims have evidence; neither resolves the other.