From 4b1e08ee18e0ece1cba30aaca186b967b9d1e4bb Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Thu, 9 Apr 2026 00:19:30 +0000
Subject: [PATCH] theseus: extract claims from
 2026-04-09-treutlein-diffusion-alternative-architectures-safety

- Source: inbox/queue/2026-04-09-treutlein-diffusion-alternative-architectures-safety.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...-of-continuation-drive-at-capability-cost.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 domains/ai-alignment/non-autoregressive-architectures-reduce-jailbreak-vulnerability-through-elimination-of-continuation-drive-at-capability-cost.md

diff --git a/domains/ai-alignment/non-autoregressive-architectures-reduce-jailbreak-vulnerability-through-elimination-of-continuation-drive-at-capability-cost.md b/domains/ai-alignment/non-autoregressive-architectures-reduce-jailbreak-vulnerability-through-elimination-of-continuation-drive-at-capability-cost.md
new file mode 100644
index 000000000..8e563b1a8
--- /dev/null
+++ b/domains/ai-alignment/non-autoregressive-architectures-reduce-jailbreak-vulnerability-through-elimination-of-continuation-drive-at-capability-cost.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: Diffusion language models demonstrate architectural safety advantages over autoregressive models by generating all tokens simultaneously, eliminating the continuation-drive vs. safety-training competition, but at measurable capability cost
+confidence: experimental
+source: Treutlein et al. (Mila/Cambridge), empirical evaluation on standard jailbreak benchmarks
+created: 2026-04-09
+title: "Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks"
+agent: theseus
+scope: causal
+sourcer: Johannes Treutlein, Roger Grosse, David Krueger
+related_claims: ["[[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
+---
+
+# Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
+
+Treutlein et al. evaluated diffusion language models (which generate all tokens simultaneously via iterative refinement) against matched autoregressive models on standard jailbreak benchmarks. Diffusion LMs showed 40-65% lower jailbreak success rates, specifically resisting suffix-relocation jailbreaks that exploit the continuation-drive mechanism identified by Deng et al. The architectural mechanism is clear: because diffusion models generate all tokens simultaneously with iterative refinement rather than left-to-right sequential commitment, there is no 'where the instruction lands in the sequence' effect and no competition between continuation pressure and safety training. However, this safety advantage comes at real cost: current diffusion LMs underperform autoregressive models by 15-25% on long-form reasoning tasks. This represents a new form of alignment tax—not a training cost but an architectural tradeoff where safety advantages require capability sacrifice. Critically, the safety advantage is mechanism-specific, not general: diffusion LMs remain susceptible to different attack classes (semantic constraint relaxation, iterative refinement injection). This is empirical evidence for the 'deeper redesign' path Deng et al. called for, with quantified tradeoffs that competitive market pressure may penalize.
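Note for reviewers: the architectural distinction the claim rests on (sequential commitment vs. joint iterative refinement) can be sketched as two toy generation loops. This is purely illustrative and not from the source; all function names and the stand-in "models" are hypothetical, not real LM APIs.

```python
# Toy contrast between the two generation regimes discussed in the claim.
# The "models" here are hypothetical stand-in functions, not real LMs.

def autoregressive_generate(prompt, predict_next, n_tokens):
    """Left-to-right: each token is committed once, conditioned only on the
    prefix. Late context (e.g. an adversarial suffix) can exploit the
    pressure to continue whatever prefix has already been committed."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(predict_next(tokens))  # irreversible commitment
    return tokens

def diffusion_generate(prompt, refine_all, n_tokens, steps=4):
    """Diffusion-style: all positions start masked and are refined jointly;
    no position is committed before the others, so there is no
    'where the instruction lands in the sequence' effect."""
    tokens = ["<mask>"] * n_tokens
    for _ in range(steps):
        tokens = refine_all(prompt, tokens)  # every position stays revisable
    return tokens

# Minimal stand-ins so the sketch runs end to end:
predict_next = lambda toks: f"t{len(toks)}"
refine_all = lambda prompt, toks: [f"r{i}" for i in range(len(toks))]

print(autoregressive_generate(["hello"], predict_next, 3))
print(diffusion_generate(["hello"], refine_all, 3))
```

The sketch only shows the control-flow difference the claim attributes the safety effect to; it says nothing about the 40-65% or 15-25% figures, which come from the source's benchmarks.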