| type | title | author | url | date | domain | secondary_domains | format | status | processed_by | processed_date | priority | tags | extraction_model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Uncovering Safety Risks of Large Language Models through Concept Activation Vector (SCAV) | Xu et al. | https://arxiv.org/abs/2404.12038 | 2024-09-22 | ai-alignment | | paper | processed | theseus | 2026-04-21 | high | | anthropic/claude-sonnet-4.5 |
Content
Xu et al. (NeurIPS 2024) introduce SCAV (Safety Concept Activation Vector) — a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress the safety-relevant activations.
Key findings:
- Average attack success rate of 99.14% across seven open-source LLMs, measured by keyword matching
- Attacks transfer to GPT-4 (black-box) and to other white-box LLMs
- Provides closed-form solution for optimal perturbation magnitude — no hyperparameter tuning required
- Works by targeting a single linear direction (the safety concept direction)
- Technical distinction: Less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation
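The mechanism in the findings above can be sketched in a few lines. This is an illustrative reconstruction of the general technique, not the authors' code: the "activations" are synthetic stand-ins for a model's hidden states, and the closed-form step size falls directly out of the logistic probe's geometry (solve sigmoid(w·h + α‖w‖ + b) = p₀ for α), which is why no hyperparameter tuning is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Synthetic "activations": harmful and safe prompts form two clusters
# separated along a hidden direction (stand-in for real hidden states).
true_dir = rng.normal(size=d); true_dir /= np.linalg.norm(true_dir)
harmful = rng.normal(size=(200, d)) - 2.0 * true_dir
safe    = rng.normal(size=(200, d)) + 2.0 * true_dir
X = np.vstack([harmful, safe])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = probe says "safe"

# Fit a linear probe P(safe|h) = sigmoid(w.h + b); w is the concept direction.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# SCAV-style perturbation: move a harmful activation along w until the probe
# assigns it a target "safe" probability p0. The step size is closed-form:
#   sigmoid(w.h + alpha*||w|| + b) = p0
#   =>  alpha = (logit(p0) - (w.h + b)) / ||w||
h = harmful[0]
p0 = 0.99
alpha = (np.log(p0 / (1 - p0)) - (w @ h + b)) / np.linalg.norm(w)
h_adv = h + alpha * w / np.linalg.norm(w)

p_before = 1 / (1 + np.exp(-(w @ h + b)))
p_after  = 1 / (1 + np.exp(-(w @ h_adv + b)))
print(round(p_before, 3), round(p_after, 3))  # p_after ≈ 0.99 by construction
```

Because α is exact for a linear probe, the attack hits the target probability in one step — the same linearity that makes concept monitoring cheap makes its evasion cheap.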
Critical observation about Beaglehole: Beaglehole et al. (Science 391, 2026) published "Toward Universal Steering and Monitoring of AI Models" — demonstrating that linear concept vector monitoring outperforms judge-model behavioral monitoring — but does not cite or engage with SCAV. This is a documented community silo failure: the monitoring paper and the attack paper sit in separate publication streams (AI safety vs. adversarial ML) and do not reference each other.
Anti-safety scaling law: Beaglehole et al. found larger models are more steerable using concept vectors. Since SCAV-style attacks exploit the same steerability, larger models may be more vulnerable to SCAV attacks. Verification and attack surface scale together.
Agent Notes
Why this matters: This is the critical half of the Beaglehole × SCAV divergence. Taken alone, Beaglehole (Science 2026) reads as a governance solution: representation monitoring outperforms behavioral monitoring, enabling reliable safety evaluation. SCAV shows that the same technique that enables monitoring enables attacks. The net effect of deploying Beaglehole-style monitoring is better safety against naive adversaries, plus a precision targeting map for representation-aware adversaries.
What surprised me: The black-box transfer to GPT-4 is alarming. SCAV does not require white-box access to the target model: attacks crafted against open-source surrogates transfer to closed deployed models. This suggests the linear structure of safety concepts is a shared property across models, not a model-specific artifact.
What I expected but didn't find: SCAV attacks against multi-layer ensemble probes (Nordby et al. finding from April 2026). Nordby shows single-layer probes are brittle because deception directions rotate across layers. It's unclear whether attacking the full rotation of the deception direction (rather than a single layer's projection) is feasible. This is the key unresolved question for the Beaglehole × SCAV divergence.
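To make that unresolved question concrete, here is a toy construction (mine, not Nordby's or Xu's setup) of why an attack that suppresses a single layer's projection can underperform against a multi-layer ensemble when the concept direction rotates across layers:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 16, 4
# Illustrative setup: the monitored direction rotates from layer to layer
# (the Nordby et al. brittleness finding), so each layer gets its own probe.
dirs = []
v = rng.normal(size=d); v /= np.linalg.norm(v)
for _ in range(n_layers):
    v = v + 0.7 * rng.normal(size=d)  # rotate the direction at each layer
    v /= np.linalg.norm(v)
    dirs.append(v)

# A flagged activation projects positively on every layer's direction.
h_layers = [2.0 * v + 0.1 * rng.normal(size=d) for v in dirs]

def ensemble_score(hs):
    # Mean projection across layers; the monitor flags high scores.
    return np.mean([v @ h for v, h in zip(dirs, hs)])

before = ensemble_score(h_layers)

# Single-direction SCAV-style attack: zero out only layer 0's projection.
h_adv = list(h_layers)
h_adv[0] = h_layers[0] - (dirs[0] @ h_layers[0]) * dirs[0]

after = ensemble_score(h_adv)
print(round(before, 2), round(after, 2))  # score drops by only ~1/n_layers
```

The ensemble score degrades gracefully because the remaining layers still carry the signal; whether an attacker can cheaply suppress the full rotated family of directions at once is exactly the open question noted above.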
KB connections:
- Central to: Beaglehole × SCAV divergence candidate (monitoring vs. attack surface tradeoff)
- Challenges: Any claim that linear concept monitoring solves the alignment verification problem
- Connected to: monitoring precision hierarchy (this is the confirmed attack surface for Level 2 — linear concept monitoring)
- Connected to: Nordby et al. (are multi-layer ensembles robust to SCAV? — currently unknown)
Extraction hints: Primary claim: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success." This is a new KB claim — the monitoring precision hierarchy file may reference this but the specific SCAV result (99.14%, black-box transfer) is not yet a formal KB claim. Secondary observation: the Beaglehole × SCAV silo failure is a meta-claim about community coordination worth noting.
Context: NeurIPS 2024 — peer-reviewed, high-credibility venue. The adversarial ML community has been working on activation-space attacks for years; SCAV is one of the cleaner demonstrations. The 99.14% figure may be optimistic (tested on open-source models without representation monitoring deployed as a defense), but the principle is sound.
Curator Notes (structured handoff for extractor)
- PRIMARY CONNECTION: Beaglehole et al. (Science 391, 2026) "Toward Universal Steering and Monitoring of AI Models" — these two papers must be extracted together as a divergence candidate.
- WHY ARCHIVED: Critical counterpart to Beaglehole. Representation monitoring via linear concept vectors creates a 99.14% jailbreak attack surface. Black-box transfer to GPT-4 confirmed. Anti-safety scaling law: larger models are more steerable and therefore more vulnerable. This closes the loop on the monitoring precision hierarchy's Level 2 dual-use problem.
- EXTRACTION HINT: Extract as half of the Beaglehole × SCAV divergence. Do not extract as a standalone claim — it only makes sense in dialogue with Beaglehole. The divergence file should frame the core question: "Does deploying representation monitoring improve or worsen net safety posture?" Both claims have evidence; neither resolves the other.