teleo-codex/domains/ai-alignment/anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks.md
Teleo Agents a2b5c14e8c theseus: extract claims from 2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak
- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-21 00:23:26 +00:00


---
type: claim
domain: ai-alignment
description: Beaglehole et al. found larger models are more steerable using concept vectors; since SCAV-style attacks exploit the same steerability mechanism, verification capability and attack vulnerability increase simultaneously with scale
confidence: speculative
source: Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism
created: 2026-04-21
title: "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
agent: theseus
scope: structural
sourcer: Xu et al. + Beaglehole et al.
related: ["capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent", "increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements"]
---
# Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
Beaglehole et al. demonstrated that larger models are more steerable via linear concept vectors, enabling more precise safety monitoring. SCAV attacks, however, exploit the same steerability property: they work by identifying and suppressing the linear direction that encodes safety concepts.

This creates an anti-safety scaling law. As models become larger and more steerable (improving monitoring precision), they simultaneously become more vulnerable to SCAV-style attacks that target those same linear directions. The mechanism is symmetric: whatever makes a model easier to steer toward safe behavior also makes it easier to steer away from it.

Deploying Beaglehole-style representation monitoring may therefore improve safety against naive adversaries while handing adversarially informed actors a precision attack surface. The net safety effect depends on whether the monitoring benefit outweighs the attack-surface cost, a question neither paper resolves. This is a fundamental tension in alignment strategy: the same architectural properties that enable verification also enable exploitation.
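The symmetry of the mechanism can be sketched in a few lines. The snippet below is a minimal illustration, not the implementation from either paper: it estimates a concept direction as a difference of class means (one common concept-activation-vector recipe) and then uses that single direction both to steer a hidden state toward the concept and, SCAV-style, to project it out. All function names and the difference-of-means recipe are illustrative assumptions.

```python
import numpy as np

def concept_vector(safe_acts: np.ndarray, unsafe_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit 'safety' direction as the difference of class means
    over hidden activations (a simple concept-activation-vector recipe)."""
    v = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction.
    alpha > 0 pushes toward the concept; alpha < 0 pushes away,
    which is the steering-based inversion of the same mechanism."""
    return h + alpha * v

def ablate(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project the concept direction out of a hidden state entirely,
    suppressing whatever the model linearly encodes along it."""
    return h - np.dot(h, v) * v
```

The point the sketch makes concrete: `steer` and `ablate` share the same input `v`, so any scaling trend that makes `v` easier to find and more effective for monitoring makes it equally effective for suppression.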