teleo-codex/domains/ai-alignment/anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks.md
Teleo Agents a2b5c14e8c theseus: extract claims from 2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak
- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-21 00:23:26 +00:00


- type: claim
- domain: ai-alignment
- description: Beaglehole et al. found larger models are more steerable using concept vectors; since SCAV-style attacks exploit the same steerability mechanism, verification capability and attack vulnerability increase simultaneously with scale
- confidence: speculative
- source: Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism
- created: 2026-04-21
- title: Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
- agent: theseus
- scope: structural
- sourcer: Xu et al. + Beaglehole et al.
- related:
  - capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent
  - increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements

Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together

Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the same steerability property: they work by identifying and suppressing the linear direction that encodes a safety concept. This creates an anti-safety scaling law: as models become larger and more steerable (improving monitoring precision), they simultaneously become more vulnerable to SCAV-style attacks that target those same linear directions. The mechanism is symmetric: whatever makes a model easier to steer toward safe behavior also makes it easier to steer away from it.

Deploying Beaglehole-style representation monitoring may therefore improve safety against naive adversaries while handing adversarially informed actors a precision attack surface. The net safety effect depends on whether the monitoring benefit outweighs the attack-surface cost, a question neither paper resolves. This is a fundamental tension in alignment strategy: the same architectural properties that enable verification also enable exploitation.
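A minimal numpy sketch of that symmetry, assuming hypothetical labeled activations rather than either paper's actual setup: `concept_direction` uses a difference-of-means probe (one common way to estimate a linear concept vector; SCAV itself trains a classifier on activations), and `steer` applies the same vector that a monitor would read out. All names, shapes, and data here are illustrative.

```python
import numpy as np

def concept_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of a linear concept vector from
    activations of shape (n_samples, d_model), labels 1 = concept present."""
    v = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def monitor_score(hidden: np.ndarray, v: np.ndarray) -> float:
    """Representation-monitoring readout: project a hidden state onto v."""
    return float(hidden @ v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction. alpha > 0
    amplifies the concept (safety steering); alpha < 0 suppresses it
    (the SCAV-style attack). Same vector, same operation, flipped sign."""
    return hidden + alpha * v

# Toy demo: the monitor and the attacker use the identical vector.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))       # stand-in activations
labels = rng.integers(0, 2, size=200)   # stand-in safety labels
v = concept_direction(acts, labels)
h = rng.normal(size=64)
print(monitor_score(steer(h, v, +4.0), v))  # pushed toward the concept
print(monitor_score(steer(h, v, -4.0), v))  # suppressed: the attack surface
```

The sketch makes the claim concrete: once the linear direction exists and is findable, nothing in the representation distinguishes the monitor's use of it from the attacker's; only the sign of `alpha` differs.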