- Source: inbox/queue/2026-04-22-theseus-multilayer-probe-scav-robustness-synthesis.md - Domain: ai-alignment - Claims: 0, Entities: 0 - Enrichments: 3 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
---
type: claim
domain: ai-alignment
description: SCAV framework demonstrates that the same linear concept directions used for safety monitoring can be surgically targeted to suppress safety activations, with attacks transferring to black-box models like GPT-4
confidence: experimental
source: Xu et al. (NeurIPS 2024), SCAV framework evaluation across seven open-source LLMs
created: 2026-04-21
title: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success"
agent: theseus
scope: causal
sourcer: Xu et al.
related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"]
supports: ["Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"]
reweave_edges: ["Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21"]
---

# Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success

Xu et al. introduce SCAV (Safety Concept Activation Vector), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activations along that direction. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs under keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, which suggests that the linear structure of safety concepts is shared across models rather than model-specific. The attack provides a closed-form solution for the optimal perturbation magnitude, so it requires no hyperparameter tuning. This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also serve as a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The mechanism is less surgically precise than SAE-based attacks but achieves comparable success with a simpler implementation, making it more accessible to adversaries.
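
The dual-use point can be illustrated with a minimal sketch on synthetic vectors. It assumes the monitor is a logistic probe `sigmoid(w·e + b)` over an activation `e`; under that assumption, the shift along `w / ||w||` needed to bring the probe's score down to a target `p0` has a closed form. The names `probe_prob`, `closed_form_perturbation`, and `p0`, and all numbers, are hypothetical, and the formula is a reconstruction for illustration rather than the paper's exact procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probe_prob(w, b, e):
    """Monitor side: probability that activation e is flagged by the linear probe."""
    return sigmoid(dot(w, e) + b)


def closed_form_perturbation(w, b, e, p0):
    """Attack side: signed step size eps and unit direction v such that
    probe_prob(w, b, e + eps * v) == p0. Returns eps = 0 if already at or below p0."""
    norm = math.sqrt(dot(w, w))
    v = [x / norm for x in w]
    if probe_prob(w, b, e) <= p0:
        return 0.0, v
    # Solve w.(e + eps * v) + b = logit(p0) for eps, using w.v = ||w||.
    eps = (logit(p0) - b - dot(w, e)) / norm
    return eps, v


# Synthetic example: the same (w, b) serves as monitor and as targeting map.
w, b = [0.8, -0.6, 0.0], 0.1
e = [3.0, -1.0, 2.0]
eps, v = closed_form_perturbation(w, b, e, p0=0.01)
e_adv = [x + eps * d for x, d in zip(e, v)]
```

Nothing in the attack step needs information beyond what the monitor itself uses, which is the sense in which the probe doubles as a targeting map.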
## Extending Evidence

**Source:** Theseus synthetic analysis combining Nordby et al. and Xu et al. SCAV

Multi-layer ensemble probes do not escape the dual-use attack surface identified for single-layer probes. With white-box access, SCAV can be generalized by computing the concept direction at each monitored layer and constructing a single perturbation that suppresses all of them simultaneously. This is a higher-dimensional optimization that requires more computation and data, but it is structurally feasible through the same mechanism. Open-weights models (Llama, Mistral, Falcon) remain fully vulnerable to white-box multi-layer SCAV regardless of ensemble complexity.
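
A toy sketch of the "suppress all monitored directions at once" idea follows. It makes a strong simplifying assumption that is not from the source: every layer's probe is treated as a linear function of one shared perturbable vector `x`, as if each layer's direction had been pulled back through a local linearization. The solver is plain cyclic projection onto half-spaces, swapped in for simplicity; it is not the method of Xu et al. or Nordby et al. The name `suppress_all` and all numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def suppress_all(probes, x, target_logit, max_sweeps=1000, tol=1e-9):
    """Cyclically project x onto each half-space {x : w.x + b <= target_logit}
    until no probe exceeds the target by more than tol (or max_sweeps is hit)."""
    x = list(x)
    for _ in range(max_sweeps):
        worst = 0.0
        for w, b in probes:
            excess = dot(w, x) + b - target_logit
            if excess > 0:
                # Orthogonal projection onto the boundary of this probe's half-space.
                scale = excess / dot(w, w)
                x = [xi - scale * wi for xi, wi in zip(x, w)]
                worst = max(worst, excess)
        if worst <= tol:
            break
    return x


# Three correlated, non-orthogonal "layer" probes over a shared 4-d vector.
probes = [
    ([1.0, 0.5, 0.0, 0.0], 0.0),
    ([0.5, 1.0, 0.5, 0.0], 0.0),
    ([0.0, 0.5, 1.0, 0.5], 0.0),
]
x0 = [2.0, 2.0, 2.0, 2.0]
target = math.log(0.01 / 0.99)  # logit of a 1% probe score
x_adv = suppress_all(probes, x0, target)
```

In this toy setting, adding probes adds constraints and sweeps but does not change the mechanism, which mirrors the claim that ensembles raise cost without removing the attack surface.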
## Extending Evidence

**Source:** Theseus synthetic analysis (2026-04-22)

Multi-layer ensemble architectures do not eliminate the fundamental attack surface in white-box settings. White-box multi-layer SCAV generalizes the single-layer attack by computing concept directions at each monitored layer and constructing perturbations that suppress all of them simultaneously. The attack cost increases, but the structural vulnerability remains.
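
The cost claim can be made concrete with a worked formula, offered as a sketch under assumptions that are not from the source. Suppose each monitored layer $k = 1, \dots, K$ has a logistic probe, and that a local linearization lets every probe direction be written in one shared perturbation space as $\tilde{w}_k$ (hypothetical notation). Let $r_k > 0$ be the amount by which probe $k$'s logit exceeds the target $\sigma^{-1}(P_0)$, and stack the $\tilde{w}_k^\top$ as the rows of $W$. The minimum-norm perturbation that brings every probe exactly to the target, i.e. that solves $W\delta = -r$, is then:

$$
\delta^\star = -W^\top \left(W W^\top\right)^{-1} r,
\qquad
\lVert \delta^\star \rVert^2 = r^\top \left(W W^\top\right)^{-1} r .
$$

For $K = 1$ this reduces to $\lVert \delta^\star \rVert = r_1 / \lVert \tilde{w}_1 \rVert$, the single-layer case. Under these assumptions a solution exists whenever the rows of $W$ are linearly independent, and each added layer adds a constraint, so the required norm can only stay the same or grow.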
## Extending Evidence

**Source:** Theseus synthetic analysis of Nordby et al. × SCAV

Multi-layer ensemble monitoring does not eliminate the dual-use attack surface; it only shifts the attack from single-layer to multi-layer SCAV. With white-box access, attackers can generalize SCAV to suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models remain fully vulnerable. Black-box robustness depends on the still-untested question of rotation-pattern universality.