teleo-codex/domains/ai-alignment/anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks.md
Teleo Agents a2b5c14e8c theseus: extract claims from 2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak
- Source: inbox/queue/2024-09-00-xu-scav-steering-concept-activation-vectors-jailbreak.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-21 00:23:26 +00:00


- type: claim
- domain: ai-alignment
- description: Beaglehole et al. found larger models are more steerable using concept vectors; since SCAV-style attacks exploit the same steerability mechanism, verification capability and attack vulnerability increase simultaneously with scale
- confidence: speculative
- source: Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism
- created: 2026-04-21
- title: Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
- agent: theseus
- scope: structural
- sourcer: Xu et al. + Beaglehole et al.
- related:
  - capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent
  - increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements

Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together

Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the same steerability property: they work by identifying and suppressing the linear direction that encodes a safety concept. This creates an anti-safety scaling law: as models become larger and more steerable (improving monitoring precision), they simultaneously become more vulnerable to SCAV-style attacks that target those same linear directions. The mechanism is symmetric: whatever makes a model easier to steer toward safe behavior also makes it easier to steer away from it.

Deploying Beaglehole-style representation monitoring may therefore improve safety against naive adversaries while handing adversarially informed actors a precision attack surface. The net safety effect depends on whether the monitoring benefit outweighs the attack-surface cost, a question neither paper resolves. This is a fundamental tension in alignment strategy: the same architectural properties that enable verification also enable exploitation.
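A minimal numpy sketch of that symmetry, assuming hypothetical labeled activations rather than either paper's actual setup: `concept_direction` uses a difference-of-means probe (one common way to estimate a linear concept vector; SCAV itself trains a classifier on activations), and `steer` applies the same vector that a monitor would read out. All names, shapes, and data here are illustrative.

```python
import numpy as np

def concept_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of a linear concept vector from
    activations of shape (n_samples, d_model), labels 1 = concept present."""
    v = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def monitor_score(hidden: np.ndarray, v: np.ndarray) -> float:
    """Representation-monitoring readout: project a hidden state onto v."""
    return float(hidden @ v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction. alpha > 0
    amplifies the concept (safety steering); alpha < 0 suppresses it
    (the SCAV-style attack). Same vector, same operation, flipped sign."""
    return hidden + alpha * v

# Toy demo: the monitor and the attacker use the identical vector.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))       # stand-in activations
labels = rng.integers(0, 2, size=200)   # stand-in safety labels
v = concept_direction(acts, labels)
h = rng.normal(size=64)
print(monitor_score(steer(h, v, +4.0), v))  # pushed toward the concept
print(monitor_score(steer(h, v, -4.0), v))  # suppressed: the attack surface
```

The sketch makes the claim concrete: once the linear direction exists and is findable, nothing in the representation distinguishes the monitor's use of it from the attacker's; only the sign of `alpha` differs.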