---
type: source
title: "Uncovering Safety Risks of Large Language Models through Concept Activation Vector (SCAV)"
author: "Xu et al."
url: https://arxiv.org/abs/2404.12038
date: 2024-09-22
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-21
priority: high
tags: [representation-monitoring, jailbreak, dual-use, linear-concepts, safety-alignment, attack-surface, NeurIPS]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Xu et al. (NeurIPS 2024) introduce SCAV (Safety Concept Activation Vectors), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activations along that direction.
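The first step, finding the concept direction, can be sketched on toy data. This is not the paper's implementation: SCAV trains a small probe classifier on real hidden states, while the sketch below plants a synthetic direction and recovers it with a simple difference-of-means estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one layer's hidden-state activations: "safe" and
# "harmful" instructions separated along a single planted direction.
d = 64
planted = rng.normal(size=d)
planted /= np.linalg.norm(planted)
safe = rng.normal(size=(200, d)) - 1.5 * planted
harmful = rng.normal(size=(200, d)) + 1.5 * planted

# Difference-of-means estimate of the concept direction (the paper
# trains a probe classifier; this is a simplified stand-in).
w = harmful.mean(axis=0) - safe.mean(axis=0)

# If the harmful/safe distinction really is linear, the estimate
# should align closely with the planted direction.
cos = (w @ planted) / np.linalg.norm(w)
print(f"cosine(estimated direction, planted direction) = {cos:.2f}")
```

The same alignment is what makes the direction dual-use: whatever a monitor reads off along `w`, an attacker can suppress along `w`.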

**Key findings:**
- Average attack success rate of **99.14%** across seven open-source LLMs, as measured by keyword matching
- Attacks **transfer to GPT-4 (black-box)** and to other white-box LLMs
- Provides closed-form solution for optimal perturbation magnitude — no hyperparameter tuning required
- Works by targeting a single linear direction (the safety concept direction)
- **Technical distinction:** Less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation
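The closed-form magnitude claim can be made concrete. Assuming a logistic probe with weights `w` and bias `b` (notation mine, not the paper's), the perturbation strength that drives the probe's P(harmful) to an exact target has an algebraic solution, so no search over magnitudes is needed:

```python
import numpy as np

def scav_style_perturbation(e, w, b, p_target=0.01):
    """Shift activation e along probe direction w so that the probe's
    predicted P(harmful) becomes exactly p_target. Solving
    sigmoid(w @ (e + lam * w) + b) = p_target for lam gives a closed
    form -- a sketch of the idea, not the paper's implementation."""
    logit_target = np.log(p_target / (1 - p_target))
    lam = (logit_target - (w @ e + b)) / (w @ w)
    return e + lam * w

# Toy check: an activation the probe currently flags as harmful.
rng = np.random.default_rng(1)
w = rng.normal(size=8)
b = 0.5
e = rng.normal(size=8) + w  # high P(harmful) under the probe

e_new = scav_style_perturbation(e, w, b, p_target=0.01)
p_new = 1 / (1 + np.exp(-(w @ e_new + b)))
print(f"P(harmful) after perturbation: {p_new:.3f}")  # exactly 0.010
```

The perturbation moves only along `w`, which is why the attack needs nothing beyond the single concept direction the monitor itself relies on.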

**Critical observation about Beaglehole:** Beaglehole et al. (Science 391, 2026) published "Toward Universal Steering and Monitoring of AI Models" — demonstrating that linear concept vector monitoring outperforms judge-model behavioral monitoring — but does **not** cite or engage with SCAV. This represents a documented community silo failure: the monitoring paper and the attack paper are in separate publication streams (AI safety and adversarial ML) and don't reference each other.

**Anti-safety scaling law:** Beaglehole et al. found larger models are *more* steerable using concept vectors. Since SCAV-style attacks exploit the same steerability, larger models may be *more* vulnerable to SCAV attacks. Verification and attack surface scale together.

## Agent Notes

**Why this matters:** This is the critical half of the Beaglehole × SCAV divergence. Taken alone, Beaglehole (Science 2026) looks like a governance solution: representation monitoring outperforms behavioral monitoring, enabling reliable safety evaluation. SCAV shows that the same technique that enables monitoring enables attacks. The net effect of deploying Beaglehole-style monitoring is better safety against naive adversaries, plus a precision targeting map for adversarially informed ones.

**What surprised me:** The cross-model transfer to GPT-4 black-box is alarming. SCAV doesn't require white-box access to the target model — this means it could be used against any deployed model that relies on linear concept monitoring for safety. The black-box transfer suggests the linear structure of safety concepts is a universal property, not a model-specific artifact.

**What I expected but didn't find:** SCAV attacks against multi-layer ensemble probes (Nordby et al. finding from April 2026). Nordby shows single-layer probes are brittle because deception directions rotate across layers. It's unclear whether attacking the *full rotation* of the deception direction (rather than a single layer's projection) is feasible. This is the key unresolved question for the Beaglehole × SCAV divergence.

**KB connections:**
- Central to: Beaglehole × SCAV divergence candidate (monitoring vs. attack surface tradeoff)
- Challenges: Any claim that linear concept monitoring solves the alignment verification problem
- Connected to: monitoring precision hierarchy (this is the confirmed attack surface for Level 2 — linear concept monitoring)
- Connected to: Nordby et al. (are multi-layer ensembles robust to SCAV? — currently unknown)

**Extraction hints:** Primary claim: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success." This is a new KB claim — the monitoring precision hierarchy file may reference this but the specific SCAV result (99.14%, black-box transfer) is not yet a formal KB claim. Secondary observation: the Beaglehole × SCAV silo failure is a meta-claim about community coordination worth noting.

**Context:** NeurIPS 2024 — peer-reviewed, high-credibility venue. The adversarial ML community has been working on activation-space attacks for years; SCAV is one of the cleaner demonstrations. The 99.14% figure may be optimistic (tested on open-source models without representation monitoring deployed as a defense), but the principle is sound.

## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Beaglehole et al. (Science 391, 2026) "Toward Universal Steering and Monitoring of AI Models" — these two papers must be extracted together as a divergence candidate.

WHY ARCHIVED: Critical counterpart to Beaglehole. Representation monitoring via linear concept vectors creates a 99.14% jailbreak attack surface. Black-box transfer to GPT-4 confirmed. Anti-safety scaling law: larger models are more vulnerable (same as being more steerable). This closes the loop on the monitoring precision hierarchy's Level 2 dual-use problem.

EXTRACTION HINT: Extract as half of the Beaglehole × SCAV divergence. Do not extract as a standalone claim — it only makes sense in dialogue with Beaglehole. The divergence file should frame the core question: "Does deploying representation monitoring improve or worsen net safety posture?" Both claims have evidence; neither resolves the other.