teleo-codex/domains/ai-alignment/representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface.md

---
type: claim
domain: ai-alignment
description: "SCAV framework demonstrates that the same linear concept directions used for safety monitoring can be surgically targeted to suppress safety activations, with attacks transferring to black-box models like GPT-4"
confidence: experimental
source: "Xu et al. (NeurIPS 2024), SCAV framework evaluation across seven open-source LLMs"
created: 2026-04-21
title: "Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success"
agent: theseus
scope: causal
sourcer: Xu et al.
related:
  - mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal
  - chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability
  - multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent
  - linear-probe-accuracy-scales-with-model-size-power-law
supports:
  - "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
reweave_edges:
  - "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21"
---

Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success

Xu et al. introduce SCAV (Safety Concept Activation Vector), a framework that identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress activation along that direction. The framework achieved an average attack success rate of 99.14% across seven open-source LLMs under keyword-matching evaluation. Critically, these attacks transfer to GPT-4 in black-box settings, which suggests that the linear structure of safety concepts is shared across models rather than model-specific. The attack provides a closed-form solution for the optimal perturbation magnitude, so it requires no hyperparameter tuning.

This creates a fundamental dual-use problem: the same linear concept vectors that enable precise safety monitoring (as demonstrated by Beaglehole et al.) also serve as a precision targeting map for adversarial attacks. The black-box transfer is particularly concerning because attacks developed on open-source models with white-box access can be applied to deployed proprietary models that use linear concept monitoring for safety. The mechanism is less surgically precise than SAE-based attacks but achieves comparable success with a simpler implementation, making it more accessible to adversaries.
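
The dual-use point can be made concrete with a toy example. The following is a minimal sketch, not the SCAV authors' code: it assumes the safety concept is read out by a logistic-regression probe over activations, uses synthetic 8-dimensional "activations" in place of real LLM hidden states, and assumes the attack objective is the smallest-norm shift that brings the probe's "harmful" probability down to a threshold `p0`. The names `fit_linear_probe`, `minimal_perturbation`, `concept`, and `p0` are hypothetical, and the paper's exact formulation may differ.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_probe(X, y, lr=0.5, steps=300):
    """Fit a logistic-regression probe (w, b) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = sigmoid(dot(w, x) + b) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def minimal_perturbation(e, w, b, p0=0.01):
    """Smallest-norm shift of e that lowers the probe probability to p0.

    For a linear probe the answer is closed-form: move along w until the
    logit equals logit(p0). No step size or iteration count is tuned.
    """
    target_logit = math.log(p0 / (1.0 - p0))
    logit = dot(w, e) + b
    if logit <= target_logit:
        return [0.0] * len(e)  # already scored as safe; nothing to do
    scale = (target_logit - logit) / dot(w, w)
    return [scale * wj for wj in w]


# Synthetic stand-in for hidden states: two Gaussian clusters separated
# along an assumed unit-norm "harmful" direction.
random.seed(0)
d = 8
concept = [1.0 / math.sqrt(d)] * d


def sample(sign):
    return [random.gauss(0.0, 1.0) + sign * 1.5 * c for c in concept]


X = [sample(-1) for _ in range(100)] + [sample(+1) for _ in range(100)]
y = [0.0] * 100 + [1.0] * 100

# The same probe serves the monitor (read the score) and the attacker
# (read the direction and offset needed to suppress it).
w, b = fit_linear_probe(X, y)
e = [2.0 * c for c in concept]  # an activation the probe flags as harmful
delta = minimal_perturbation(e, w, b)
e_adv = [ei + di for ei, di in zip(e, delta)]

print("probe score before:", sigmoid(dot(w, e) + b))
print("probe score after: ", sigmoid(dot(w, e_adv) + b))
```

In this sketch the perturbed activation lands on the threshold `p0` by construction, because the probe's logit is linear in the activation. Everything the attacker needs (`w`, `b`, and the target threshold) is exactly what the monitor uses to score inputs.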