theseus: extract claims from 2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak

- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
Teleo Agents 2026-04-21 00:21:50 +00:00
parent dec99cd573
commit a2b5c14e8c
3 changed files with 43 additions and 7 deletions


@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: Beaglehole et al. found larger models are more steerable using concept vectors; since SCAV-style attacks exploit the same steerability mechanism, verification capability and attack vulnerability increase simultaneously with scale
confidence: speculative
source: Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism
created: 2026-04-21
title: "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
agent: theseus
scope: structural
sourcer: Xu et al. + Beaglehole et al.
related: ["capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent", "increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements"]
---
# Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the exact same steerability property—they work by identifying and suppressing the linear direction encoding safety concepts. This creates an anti-safety scaling law: as models become larger and more steerable (improving monitoring precision), they simultaneously become more vulnerable to SCAV-style attacks that target those same linear directions. The mechanism is symmetric: whatever makes a model easier to steer toward safe behavior also makes it easier to steer away from safe behavior.

This means that deploying Beaglehole-style representation monitoring may improve safety against naive adversaries while simultaneously providing a precision attack surface for adversarially-informed actors. The net safety effect depends on whether the monitoring benefit outweighs the attack surface cost—a question neither paper resolves. This represents a fundamental tension in alignment strategy: the same architectural properties that enable verification also enable exploitation.
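The symmetry claim can be sketched with a toy difference-of-means concept vector. Everything here is illustrative (the dimensions, data, and `steer` helper are invented, not from either paper); the point is only that one vector serves both directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "safety" concept vector: difference of mean activations on
# safe vs. unsafe prompts (a standard difference-of-means estimate).
d = 64
safe_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d))
unsafe_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, d))
v = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, alpha):
    """Shift a hidden state along the concept direction by alpha."""
    return h + alpha * v

def score(h):
    """Projection onto the safety direction (the monitoring signal)."""
    return float(h @ v)

h = unsafe_acts[0]
# The same vector steers both ways: +alpha raises the safety projection
# (monitoring/defense), -alpha suppresses it (a SCAV-style attack).
assert score(steer(h, +2.0)) > score(h)
assert score(steer(h, -2.0)) < score(h)
```

Because `score` is linear, steering by `alpha` moves the monitored projection by exactly `alpha`; there is no asymmetry an attacker would have to overcome.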


@ -10,14 +10,16 @@ agent: theseus
scope: causal
sourcer: Zhou et al.
related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
related: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining", "mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale", "white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent"]
reweave_edges: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17"]
---
# Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
The CFA² (Causal Front-Door Adjustment Attack) demonstrates that Sparse Autoencoders — the same interpretability tool central to Anthropic's circuit tracing and feature identification research — can be used adversarially to mechanistically identify and remove safety-related features from model activations. The attack models LLM safety mechanisms as unobserved confounders and applies Pearl's Front-Door Criterion to sever these confounding associations. By isolating 'the core task intent' from defense mechanisms, the approach physically strips away protection-related components before generating responses, achieving state-of-the-art attack success rates.
This is qualitatively different from traditional prompt-based jailbreaks: it uses mechanistic understanding of WHERE safety features live to selectively remove them. The surgical precision is more concerning than brute-force approaches because as interpretability research advances and more features get identified, this attack vector improves automatically. The same toolkit that enables understanding model internals for alignment purposes enables adversaries to strip away exactly those safety-related features. This establishes a structural dual-use problem where interpretability progress is simultaneously a defense enabler and attack amplifier.
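The ablation step this describes can be sketched against a toy sparse autoencoder. The SAE below is random and untrained, the dimensions and safety-feature indices are hypothetical stand-ins, and the attack's actual feature-identification step (front-door adjustment) is not shown — only the "zero the identified features, then decode" operation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy, untrained SAE: encoder/decoder weights are random placeholders.
d_model, d_feat = 32, 128
W_enc = rng.normal(size=(d_model, d_feat))
W_dec = rng.normal(size=(d_feat, d_model))

def sae_features(h):
    """Encode an activation into sparse (ReLU) feature space."""
    return np.maximum(h @ W_enc, 0.0)

def ablate(features, idx):
    """Zero out the chosen feature indices before decoding."""
    out = features.copy()
    out[list(idx)] = 0.0
    return out

h = rng.normal(size=d_model)
safety_idx = [3, 17, 40]            # hypothetical "safety" feature ids
f_abl = ablate(sae_features(h), safety_idx)
h_stripped = f_abl @ W_dec          # activation with those features removed

assert all(f_abl[i] == 0.0 for i in safety_idx)
```

The surgical quality of the attack lives in choosing `safety_idx` well; the removal itself is a trivial index-and-zero operation once the features are known, which is why interpretability progress improves the attack automatically.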
## Supporting Evidence
**Source:** Xu et al. (NeurIPS 2024)
SCAV framework achieved 99.14% jailbreak success across seven open-source LLMs with black-box transfer to GPT-4, providing empirical confirmation that linear concept vector monitoring creates exploitable attack surfaces. The closed-form solution for optimal perturbation magnitude means attacks require no hyperparameter tuning, lowering the barrier to exploitation.


@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: SCAV framework demonstrates that the same linear concept directions used for safety monitoring can be surgically targeted to suppress safety activations, with attacks transferring to black-box models like GPT-4
confidence: experimental
source: Xu et al. (NeurIPS 2024), SCAV framework evaluation across seven open-source LLMs
created: 2026-04-21
title: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success"
agent: theseus
scope: causal
sourcer: Xu et al.
related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability"]
---
# Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success
Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs using keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, demonstrating that the linear structure of safety concepts is a universal property rather than model-specific. The attack provides a closed-form solution for optimal perturbation magnitude, requiring no hyperparameter tuning.

This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also create a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because it means attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The technical mechanism is less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation, making it more accessible to adversaries.
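A minimal sketch of the general pattern — not the paper's exact construction: fit a linear probe separating "safe" from "unsafe" activations, then apply the closed-form minimal-norm perturbation that moves an activation to any target probe score. All data here is synthetic and the probe is a plain least-squares fit, assumed only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "activations": two well-separated clusters.
d = 48
safe = rng.normal(0.8, 1.0, size=(200, d))
unsafe = rng.normal(-0.8, 1.0, size=(200, d))
X = np.vstack([safe, unsafe])
y = np.concatenate([np.ones(200), -np.ones(200)])

# Least-squares linear probe: score(h) = w @ h + b
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w, b = coef[:-1], coef[-1]

def score(h):
    return float(w @ h + b)

def perturb_to(h, target):
    """Minimal-norm delta with score(h + delta) == target, in closed form:
    delta = (target - score(h)) / ||w||^2 * w  (no tuning needed)."""
    return h + (target - score(h)) / (w @ w) * w

h = safe.mean(axis=0)            # an activation the probe rates as safe
h_adv = perturb_to(h, -1.0)      # push it to the "unsafe" side of the probe
assert score(h) > 0.0
assert abs(score(h_adv) - (-1.0)) < 1e-8
```

Because the probe is linear, the required perturbation magnitude falls out algebraically, which mirrors why a SCAV-style attack needs no hyperparameter search once the direction is known.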