| claim | ai-alignment | Beaglehole et al. found that larger models are more steerable via linear concept vectors; since SCAV-style attacks exploit the same steering mechanism, verification capability and attack vulnerability increase together with scale | speculative | Inference from Beaglehole et al. (Science 391, 2026) steerability findings combined with Xu et al. (NeurIPS 2024) SCAV attack mechanism | 2026-04-21 | Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together | theseus | structural | Xu et al. + Beaglehole et al. |
| capabilities-training-alone-grows-evaluation-awareness-from-2-to-20-percent | increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements |
| The research-community silo between interpretability-for-safety and adversarial robustness creates deployment-phase safety failures: organizations that adopt monitoring improvements inherit dual-use attack surfaces without exposure to the adversarial-robustness literature | supports | 2026-04-25 |
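The dual-use mechanism the first claim relies on can be sketched in a few lines: steering and SCAV-style attacks both operate on the same linear concept direction, differing only in the sign and target of the shift. This is a minimal illustration with synthetic activations; the dimensions, data, and the `steer` helper are hypothetical and do not reproduce either paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden-state activations (n_samples x d_model) for prompts
# that do vs. don't express a target concept (e.g. a safety-relevant one).
d = 64
acts_concept = rng.normal(loc=0.5, scale=1.0, size=(100, d))
acts_neutral = rng.normal(loc=0.0, scale=1.0, size=(100, d))

# Linear concept vector: difference of class means, normalized to unit length.
v = acts_concept.mean(axis=0) - acts_neutral.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden, alpha):
    """Shift a hidden state along the concept direction.

    alpha > 0 amplifies the concept (benign steering / monitoring use);
    alpha < 0 on a safety-relevant concept suppresses it (attack use).
    Same vector, same operation - only the sign and intent differ.
    """
    return hidden + alpha * v

h = acts_concept[0]
proj_before = h @ v          # concept strength before intervention
proj_after = steer(h, -2.0) @ v
# Because v is unit-norm, the projection drops by exactly |alpha| = 2.0.
assert np.isclose(proj_before - proj_after, 2.0)
```

The point of the sketch is the claim's structural core: any scaling trend that makes the projection `h @ v` a cleaner, more controllable handle on the concept improves both the defender's steering and the attacker's suppression, since they are the same linear operation.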