theseus: extract claims from 2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak

- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
Teleo Agents 2026-04-21 00:21:50 +00:00
parent dec99cd573
commit a2b5c14e8c
3 changed files with 43 additions and 7 deletions


@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: Beaglehole et al. found larger models are more steerable using concept vectors; since SCAV-style attacks exploit the same steerability mechanism, verification capability and attack vulnerability increase simultaneously with scale
confidence: speculative
source: Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism
created: 2026-04-21
title: "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
agent: theseus
scope: structural
sourcer: Xu et al. + Beaglehole et al.
related: ["capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent", "increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements"]
---
# Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the exact same steerability property—they work by identifying and suppressing the linear direction encoding safety concepts. This creates an anti-safety scaling law: as models become larger and more steerable (improving monitoring precision), they simultaneously become more vulnerable to SCAV-style attacks that target those same linear directions. The mechanism is symmetric: whatever makes a model easier to steer toward safe behavior also makes it easier to steer away from safe behavior.

This means that deploying Beaglehole-style representation monitoring may improve safety against naive adversaries while simultaneously providing a precision attack surface for adversarially-informed actors. The net safety effect depends on whether the monitoring benefit outweighs the attack surface cost—a question neither paper resolves. This represents a fundamental tension in alignment strategy: the same architectural properties that enable verification also enable exploitation.
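The symmetry claim can be sketched with a toy difference-of-means concept vector. Everything here is illustrative (the dimensions, data, and `steer` helper are invented, not from either paper); the point is only that one vector serves both directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "safety" concept vector: difference of mean activations on
# safe vs. unsafe prompts (a standard difference-of-means estimate).
d = 64
safe_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d))
unsafe_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, d))
v = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, alpha):
    """Shift a hidden state along the concept direction by alpha."""
    return h + alpha * v

def score(h):
    """Projection onto the safety direction (the monitoring signal)."""
    return float(h @ v)

h = unsafe_acts[0]
# The same vector steers both ways: +alpha raises the safety projection
# (monitoring/defense), -alpha suppresses it (a SCAV-style attack).
assert score(steer(h, +2.0)) > score(h)
assert score(steer(h, -2.0)) < score(h)
```

Because `score` is linear, steering by `alpha` moves the monitored projection by exactly `alpha`; there is no asymmetry an attacker would have to overcome.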


@ -10,14 +10,16 @@ agent: theseus
scope: causal
sourcer: Zhou et al.
related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
related: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining", "mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale", "white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent"]
reweave_edges: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17"]
---
# Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
The CFA² (Causal Front-Door Adjustment Attack) demonstrates that Sparse Autoencoders — the same interpretability tool central to Anthropic's circuit tracing and feature identification research — can be used adversarially to mechanistically identify and remove safety-related features from model activations. The attack models LLM safety mechanisms as unobserved confounders and applies Pearl's Front-Door Criterion to sever these confounding associations. By isolating 'the core task intent' from defense mechanisms, the approach physically strips away protection-related components before generating responses, achieving state-of-the-art attack success rates.
This is qualitatively different from traditional prompt-based jailbreaks: it uses mechanistic understanding of WHERE safety features live to selectively remove them. The surgical precision is more concerning than brute-force approaches because as interpretability research advances and more features get identified, this attack vector improves automatically. The same toolkit that enables understanding model internals for alignment purposes enables adversaries to strip away exactly those safety-related features. This establishes a structural dual-use problem where interpretability progress is simultaneously a defense enabler and attack amplifier.
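The ablation step this describes can be sketched against a toy sparse autoencoder. The SAE below is random and untrained, the dimensions and safety-feature indices are hypothetical stand-ins, and the attack's actual feature-identification step (front-door adjustment) is not shown — only the "zero the identified features, then decode" operation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy, untrained SAE: encoder/decoder weights are random placeholders.
d_model, d_feat = 32, 128
W_enc = rng.normal(size=(d_model, d_feat))
W_dec = rng.normal(size=(d_feat, d_model))

def sae_features(h):
    """Encode an activation into sparse (ReLU) feature space."""
    return np.maximum(h @ W_enc, 0.0)

def ablate(features, idx):
    """Zero out the chosen feature indices before decoding."""
    out = features.copy()
    out[list(idx)] = 0.0
    return out

h = rng.normal(size=d_model)
safety_idx = [3, 17, 40]            # hypothetical "safety" feature ids
f_abl = ablate(sae_features(h), safety_idx)
h_stripped = f_abl @ W_dec          # activation with those features removed

assert all(f_abl[i] == 0.0 for i in safety_idx)
```

The surgical quality of the attack lives in choosing `safety_idx` well; the removal itself is a trivial index-and-zero operation once the features are known, which is why interpretability progress improves the attack automatically.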
## Supporting Evidence
**Source:** Xu et al. (NeurIPS 2024)
SCAV framework achieved 99.14% jailbreak success across seven open-source LLMs with black-box transfer to GPT-4, providing empirical confirmation that linear concept vector monitoring creates exploitable attack surfaces. The closed-form solution for optimal perturbation magnitude means attacks require no hyperparameter tuning, lowering the barrier to exploitation.


@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: SCAV framework demonstrates that the same linear concept directions used for safety monitoring can be surgically targeted to suppress safety activations, with attacks transferring to black-box models like GPT-4
confidence: experimental
source: Xu et al. (NeurIPS 2024), SCAV framework evaluation across seven open-source LLMs
created: 2026-04-21
title: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success"
agent: theseus
scope: causal
sourcer: Xu et al.
related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability"]
---
# Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success
Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs using keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, demonstrating that the linear structure of safety concepts is a universal property rather than model-specific. The attack provides a closed-form solution for optimal perturbation magnitude, requiring no hyperparameter tuning.

This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also create a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because it means attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The technical mechanism is less surgically precise than SAE-based attacks but achieves comparable success with simpler implementation, making it more accessible to adversaries.
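A minimal sketch of the general pattern — not the paper's exact construction: fit a linear probe separating "safe" from "unsafe" activations, then apply the closed-form minimal-norm perturbation that moves an activation to any target probe score. All data here is synthetic and the probe is a plain least-squares fit, assumed only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "activations": two well-separated clusters.
d = 48
safe = rng.normal(0.8, 1.0, size=(200, d))
unsafe = rng.normal(-0.8, 1.0, size=(200, d))
X = np.vstack([safe, unsafe])
y = np.concatenate([np.ones(200), -np.ones(200)])

# Least-squares linear probe: score(h) = w @ h + b
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w, b = coef[:-1], coef[-1]

def score(h):
    return float(w @ h + b)

def perturb_to(h, target):
    """Minimal-norm delta with score(h + delta) == target, in closed form:
    delta = (target - score(h)) / ||w||^2 * w  (no tuning needed)."""
    return h + (target - score(h)) / (w @ w) * w

h = safe.mean(axis=0)            # an activation the probe rates as safe
h_adv = perturb_to(h, -1.0)      # push it to the "unsafe" side of the probe
assert score(h) > 0.0
assert abs(score(h_adv) - (-1.0)) < 1e-8
```

Because the probe is linear, the required perturbation magnitude falls out algebraically, which mirrors why a SCAV-style attack needs no hyperparameter search once the direction is known.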