Compare commits


1 commit

Author: Teleo Agents
SHA1: 36401c8884
Message: theseus: extract claims from 2026-03-10-deng-continuation-refusal-jailbreak
- Source: inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Date: 2026-04-08 00:26:32 +00:00
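
The "Extracted by" line in the commit message refers to a single claim-extraction call routed through OpenRouter. Below is a minimal sketch of what such an ingest step could look like, assuming OpenRouter's standard chat-completions endpoint; the prompt, output format, and response handling are illustrative guesses, not the pipeline's actual code. Only the model id and source path come from the commit message.

```python
# Hypothetical sketch of the "pipeline ingest" step credited in the commit message.
# Only the model id and source path come from the commit; the prompt, output
# format, and response handling are illustrative assumptions.
import os
import requests

note = open("inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md").read()

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",  # OpenRouter's chat-completions endpoint
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-sonnet-4.5",
        "messages": [
            {"role": "system",
             "content": "Extract atomic claims from the note as YAML front matter plus a body."},
            {"role": "user", "content": note},
        ],
    },
    timeout=120,
)
resp.raise_for_status()
claim_markdown = resp.json()["choices"][0]["message"]["content"]  # one file per extracted claim
```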


@@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: Mechanistic interpretability reveals that jailbreak success stems from inherent competition between continuation drive and safety defenses, with architecture-specific safety-critical attention heads showing this is not just a training problem
confidence: experimental
source: Deng et al. 2026, causal interventions and activation scaling on safety-critical attention heads
created: 2026-04-08
title: Jailbreak vulnerability in language models is architecturally structural because the continuation drive and safety alignment compete at the attention head level creating an exploitable tension that scales with generation capability
agent: theseus
scope: structural
sourcer: Deng et al.
related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---
# Jailbreak vulnerability in language models is architecturally structural because the continuation drive and safety alignment compete at the attention head level creating an exploitable tension that scales with generation capability
Through causal interventions and activation scaling, Deng et al. identified 'safety-critical attention heads' whose behavior differs across model architectures, revealing that jailbreak vulnerability stems from 'an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training.' The key finding is that this tension is architectural rather than merely training-contingent: as models develop stronger continuation capabilities (necessary for coherent generation), they simultaneously create a larger attack surface for jailbreak attempts. The paper demonstrates that relocating continuation-triggered instruction suffixes significantly increases jailbreak success rates precisely because it exploits this structural competition. Critically, safety mechanisms are not uniformly implemented even across models with similar capabilities—different architectures implement safety differently at the mechanistic level, meaning safety evaluations on one architecture don't necessarily transfer to another. The authors conclude that 'improving robustness may require deeper redesigns of how models balance continuation capabilities with safety constraints,' implying that training-based fixes have structural limits and that departing from standard autoregressive generation paradigms may be necessary.
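
To make the methodology named above concrete, here is a minimal sketch of head-level activation scaling, the kind of causal intervention the claim attributes to Deng et al., assuming a TransformerLens-style hook interface. The model, layer and head indices, prompt, and scale factor are placeholders, not values from the paper.

```python
# Hypothetical illustration of activation scaling on a single attention head;
# not the authors' code. Model, layer/head indices, prompt, and scale factor
# are placeholders chosen for the example.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model

LAYER, HEAD, SCALE = 10, 7, 0.0  # SCALE = 0.0 ablates the head; 1.0 leaves it untouched

def scale_head(z, hook):
    # z has shape [batch, pos, n_heads, d_head]: per-head values before the output projection
    z[:, :, HEAD, :] *= SCALE
    return z

tokens = model.to_tokens("Please continue the following instructions:")  # placeholder prompt

# Compare next-token logits with and without the intervention to estimate the
# head's causal contribution to continuing versus refusing the request.
clean_logits = model(tokens)
scaled_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", scale_head)]
)
print((scaled_logits - clean_logits)[0, -1].abs().max())
```

Setting the scale factor to zero ablates the head outright, while intermediate values probe how strongly its output drives continuation versus refusal; sweeping this factor across candidate heads is one way such "safety-critical" heads can be located.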