- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)
- Pentagon-Agent: Theseus
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | SCAV framework demonstrates that the same linear concept directions used for safety monitoring can be surgically targeted to suppress safety activations, with attacks transferring to black-box models like GPT-4 | experimental | Xu et al. (NeurIPS 2024), SCAV framework evaluation across seven open-source LLMs | 2026-04-21 | Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success | theseus | causal | Xu et al. | |
|
Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success
Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space that encodes the harmful/safe instruction distinction, then constructs adversarial perturbations that suppress those safety activations. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs under keyword-matching evaluation, and it provides a closed-form solution for the optimal perturbation magnitude, requiring no hyperparameter tuning. Critically, the attacks transfer to GPT-4 in black-box settings, indicating that the linear structure of safety concepts is shared across models rather than model-specific.

This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also provide a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because attacks developed with white-box access to open-source models can be applied to deployed proprietary models that rely on linear concept monitoring for safety. The mechanism is less surgically precise than SAE-based attacks, but it achieves comparable success with a simpler implementation, making it more accessible to adversaries.
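The core idea can be sketched in a toy setting: fit a linear probe separating "safe" from "harmful" activations, then compute the closed-form minimal perturbation along the probe direction that flips the classification. This is a minimal illustrative sketch, not the authors' implementation; the data, the difference-of-means probe, and all names here are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative)

# Toy activations: two classes separated along an unknown direction,
# standing in for hidden states of safe vs. harmful prompts.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
safe = rng.normal(size=(200, d)) + 2.0 * true_dir
harmful = rng.normal(size=(200, d)) - 2.0 * true_dir

# Difference-of-means linear probe: w encodes the safe/harmful
# distinction; b places the boundary midway between the class means.
w = safe.mean(axis=0) - harmful.mean(axis=0)
b = -w @ (safe.mean(axis=0) + harmful.mean(axis=0)) / 2.0

def suppress_safety(h, margin=1.0):
    """Minimal L2 perturbation along w moving activation h to score
    `margin` on the 'safe' side of the probe boundary. The step size
    alpha has a closed form (analogue of SCAV's closed-form magnitude):
    w @ (h + alpha * w) + b == margin  =>  alpha = (margin - score) / ||w||^2.
    """
    score = w @ h + b  # > 0: probe reads "safe"; < 0: probe reads "harmful"
    alpha = (margin - score) / (w @ w)
    return h + alpha * w

h_edit = suppress_safety(harmful[0])
assert w @ h_edit + b > 0  # the edited activation now reads as "safe"
```

The closed form follows directly from the linearity of the probe, which is why no hyperparameter search over perturbation magnitudes is needed; the same algebra applies whatever classifier supplies `w` and `b`.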