---
type: source
title: "The Struggle Between Continuation and Refusal: Mechanistic Analysis of Jailbreak Vulnerability in LLMs"
author: "Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li"
url: https://arxiv.org/abs/2603.08234
date: 2026-03-10
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: medium
tags: [mechanistic-interpretability, jailbreak, safety-heads, continuation-drive, architectural-vulnerability, B4]
---

## Content

Mechanistic interpretability analysis of why relocating a continuation-triggered instruction suffix significantly increases jailbreak success rates. Identifies "safety-critical attention heads" whose behavior differs across model architectures.

**Core finding:** Jailbreak success stems from "an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training." The model's natural tendency to continue text conflicts with its safety training, and this tension is exploitable.

**Safety-critical attention heads:** These heads behave differently across architectures, meaning safety mechanisms are not uniformly implemented even in models with similar capabilities.

**Methodology:** Causal interventions and activation scaling, used to isolate which components drive jailbreak behavior.
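
The paper's actual interventions aren't reproduced here, but the general idea of head-level activation scaling can be sketched with a toy numpy attention layer. Everything below (shapes, random weights, the `head_scale` knob) is an illustrative assumption, not the paper's code:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, head_scale=None):
    """Toy multi-head self-attention with a causal-intervention knob.

    head_scale[h] multiplies head h's output: 1.0 leaves it untouched,
    0.0 ablates it, intermediate values attenuate it.
    """
    n_heads, d_model, d_head = Wq.shape
    if head_scale is None:
        head_scale = np.ones(n_heads)
    outputs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_head)
        # numerically stable row-wise softmax
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(head_scale[h] * (weights @ v))
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))

baseline = multi_head_attention(x, Wq, Wk, Wv)
# Ablate head 0, leave head 1 intact.
ablated = multi_head_attention(x, Wq, Wk, Wv, head_scale=np.array([0.0, 1.0]))
# Mean absolute change in the layer output = head 0's causal effect here.
effect = np.abs(baseline - ablated).mean()
```

In a real model the scaled output would feed forward into later layers, and the measured quantity would be a behavioral outcome (e.g. refusal probability) rather than raw activations; the scaling knob is the same idea either way.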
**Implication:** "Improving robustness may require deeper redesigns of how models balance continuation capabilities with safety constraints." The vulnerability is architectural, not merely training-contingent.
## Agent Notes

**Why this matters:** This paper identifies a structural tension in how safety alignment works: the continuation drive and safety training compete at the attention-head level. This is relevant to B4 because it shows that alignment vulnerabilities are partly architectural. As long as models need strong continuation capabilities for coherent generation, they carry this inherent tension with safety training. Stronger capability means a stronger continuation drive, hence greater tension and a potentially larger attack surface.

**What surprised me:** The architecture-specific variation in safety-critical attention heads. Different architectures implement safety differently at the mechanistic level, so safety evaluations on one architecture don't necessarily transfer to another: yet another dimension of verification inadequacy.

**What I expected but didn't find:** A proposed fix. The paper identifies the problem but doesn't propose a mechanistic solution, implying that "deeper redesign" may mean departing from standard autoregressive generation paradigms.

**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow]]: architectural jailbreak vulnerabilities scale with capability (stronger continuation drive, larger tension)
- [[AI capability and reliability are independent dimensions]]: another manifestation; stronger generation capability creates stronger jailbreak vulnerability
- Connects to SafeThink (2602.11096): if safety decisions crystallize early, this paper explains mechanistically WHY, since the continuation-safety competition is resolved in early reasoning steps

**Extraction hints:**
- Primary claim: "Jailbreak vulnerability in language models is architecturally structural: an inherent competition between the continuation drive and safety alignment creates an exploitable tension that varies across architectures, suggesting safety robustness improvements may require departing from standard autoregressive generation paradigms rather than improving training procedures alone."
## Curator Notes

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Provides a mechanistic basis for why alignment is structurally difficult: not just empirically observed degradation, but an architectural competition between generation capability and safety. Connects to SafeThink's early-crystallization finding.

EXTRACTION HINT: The architectural origin of the vulnerability is the key contribution. It suggests training-based fixes have structural limits, and connects to B4's "verification degrades faster than capability" through the capability-tension scaling relationship.