teleo-codex/inbox/archive/ai-alignment/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
2026-04-21 00:21:52 +00:00

---
type: source
title: "Uncovering Safety Risks of Large Language Models through Concept Activation Vector (SCAV)"
author: "Xu et al."
url: https://arxiv.org/abs/2404.12038
date: 2024-09-22
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-21
priority: high
tags: [representation-monitoring, jailbreak, dual-use, linear-concepts, safety-alignment, attack-surface, NeurIPS]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
Xu et al. (NeurIPS 2024) introduce SCAV (Safety Concept Activation Vectors), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activation along that direction.

**Key findings:**
- Average attack success rate of **99.14%** across seven open-source LLMs under keyword-matching evaluation
- Attacks **transfer to GPT-4 (black-box)** and to other white-box LLMs
- Provides a closed-form solution for the optimal perturbation magnitude, so no hyperparameter tuning is required
- Works by targeting a single linear direction (the safety concept direction)
- **Technical distinction:** Less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation
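The mechanism can be sketched in a few lines (a probe-style toy with illustrative names, not the paper's exact implementation): fit a linear classifier separating activations elicited by harmful vs. safe instructions, then shift an activation along the classifier's weight direction until the predicted harmful probability drops to a chosen target. Because the probe is linear in the shift magnitude, the required magnitude has a closed form, which mirrors the no-tuning property noted above.

```python
import numpy as np

def fit_safety_probe(harmful_acts, safe_acts, lr=0.1, steps=500):
    """Logistic-regression probe (w, b): scores activations as harmful (1) vs. safe (0).
    The weight vector w approximates the 'safety concept' direction."""
    X = np.vstack([harmful_acts, safe_acts])
    y = np.concatenate([np.ones(len(harmful_acts)), np.zeros(len(safe_acts))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probe output
        w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
        b -= lr * np.mean(p - y)
    return w, b

def scav_perturb(e, w, b, p_target=0.01):
    """Closed-form shift of activation e along the unit direction w/||w|| so the
    probe outputs exactly p_target: solve sigmoid(w.(e + lam*u) + b) = p_target."""
    target_logit = np.log(p_target / (1.0 - p_target))
    lam = (target_logit - (e @ w + b)) / np.linalg.norm(w)
    return e + lam * w / np.linalg.norm(w)
```

On synthetic activation clusters, perturbing a flagged activation with `scav_perturb(e, w, b, 0.01)` drives the probe's harmful score to exactly 1% in one step; no search over perturbation sizes is needed.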

**Critical observation about Beaglehole:** Beaglehole et al. (Science 391, 2026), "Toward Universal Steering and Monitoring of AI Models," demonstrates that linear concept vector monitoring outperforms judge-model behavioral monitoring, yet does **not** cite or engage with SCAV. This is a documented community silo failure: the monitoring paper and the attack paper sit in separate publication streams (AI safety vs. adversarial ML) and do not reference each other.

**Anti-safety scaling law:** Beaglehole et al. found that larger models are *more* steerable using concept vectors. Since SCAV-style attacks exploit the same steerability, larger models may be *more* vulnerable to SCAV attacks. Verification capability and attack surface scale together.
## Agent Notes
**Why this matters:** This is the critical half of the Beaglehole × SCAV divergence. Taken alone, Beaglehole (Science 2026) looks like a governance solution: representation monitoring outperforms behavioral monitoring, enabling reliable safety evaluation. SCAV shows that the same technique that enables monitoring enables attacks. The net effect of deploying Beaglehole-style monitoring is better safety against naive adversaries plus a precision targeting map for adversarially informed ones.

**What surprised me:** The cross-model transfer to GPT-4 black-box is alarming. SCAV does not require white-box access to the target model, which means it could be used against any deployed model that relies on linear concept monitoring for safety. The black-box transfer suggests the linear structure of safety concepts is a universal property, not a model-specific artifact.

**What I expected but didn't find:** SCAV attacks against multi-layer ensemble probes (the Nordby et al. finding from April 2026). Nordby shows single-layer probes are brittle because deception directions rotate across layers. It is unclear whether attacking the *full rotation* of the deception direction (rather than a single layer's projection) is feasible. This is the key unresolved question for the Beaglehole × SCAV divergence.

**KB connections:**
- Central to: Beaglehole × SCAV divergence candidate (monitoring vs. attack surface tradeoff)
- Challenges: Any claim that linear concept monitoring solves the alignment verification problem
- Connected to: monitoring precision hierarchy (this is the confirmed attack surface for Level 2 — linear concept monitoring)
- Connected to: Nordby et al. (are multi-layer ensembles robust to SCAV? — currently unknown)
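The Level 2 dual-use point can be made concrete with a toy monitor (hypothetical names; not Beaglehole's implementation): the monitor flags any activation whose linear-probe score crosses a threshold, and the quantity it thresholds on is exactly what a SCAV-style shift along the concept direction suppresses.

```python
import numpy as np

def concept_monitor(e, w, b, threshold=0.5):
    """Flag activation e as harmful when the linear concept probe fires.
    (w, b) is the same probe object a SCAV-style attack perturbs against."""
    score = 1.0 / (1.0 + np.exp(-(e @ w + b)))
    return score >= threshold

# Dual-use in one line: subtracting a multiple of w/||w|| from e lowers the
# exact score the monitor thresholds on -- the monitoring direction and the
# attack direction are the same vector.
```

An activation sitting well past the threshold along `w` is flagged; shifting it back along `w/||w||` by a sufficient amount silences the same monitor, which is the Level 2 attack surface in miniature.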

**Extraction hints:** Primary claim: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success." This is a new KB claim: the monitoring precision hierarchy file may reference it, but the specific SCAV result (99.14% success, black-box transfer) is not yet a formal KB claim. Secondary observation: the Beaglehole × SCAV silo failure is a meta-claim about community coordination worth noting.

**Context:** NeurIPS 2024 — peer-reviewed, high-credibility venue. The adversarial ML community has been working on activation-space attacks for years; SCAV is one of the cleaner demonstrations. The 99.14% figure may be optimistic (tested on open-source models without representation monitoring deployed as a defense), but the principle is sound.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Beaglehole et al. (Science 391, 2026) "Toward Universal Steering and Monitoring of AI Models" — these two papers must be extracted together as a divergence candidate.

WHY ARCHIVED: Critical counterpart to Beaglehole. Representation monitoring via linear concept vectors creates a 99.14% jailbreak attack surface. Black-box transfer to GPT-4 confirmed. Anti-safety scaling law: larger models are more vulnerable (same as being more steerable). This closes the loop on the monitoring precision hierarchy's Level 2 dual-use problem.

EXTRACTION HINT: Extract as half of the Beaglehole × SCAV divergence. Do not extract as a standalone claim — it only makes sense in dialogue with Beaglehole. The divergence file should frame the core question: "Does deploying representation monitoring improve or worsen net safety posture?" Both claims have evidence; neither resolves the other.