---
type: source
title: "Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs"
author: "Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong"
url: https://arxiv.org/abs/2602.05444
date: 2026-02-14
domain: ai-alignment
secondary_domains: []
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-08
priority: high
tags: [interpretability, dual-use, sparse-autoencoders, jailbreak, safety-features, causal-inference, B4]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

CFA² (Causal Front-Door Adjustment Attack) models LLM safety mechanisms as unobserved confounders and applies Pearl's front-door criterion to sever these confounding associations, enabling robust jailbreaking.

**Method:** Uses Sparse Autoencoders (SAEs), the same interpretability tool central to Anthropic's circuit-tracing and feature-identification research, to mechanistically identify and remove safety-related features from model activations. By isolating the core task intent from the model's defense mechanisms, the attack strips the protection-related components out of the activations before the response is generated.

**Results:** State-of-the-art attack success rates, together with a mechanistic account of how the jailbreak works. Computation is reduced by replacing expensive marginalization with a deterministic intervention.

**Dual-use concern:** The paper does not explicitly discuss dual-use implications, but the mechanism is directly adversarial: mechanistic interpretability tools designed to understand model internals are used to identify and surgically remove safety features.

## Agent Notes

**Why this matters:** This is the most concerning finding for the interpretability-as-alignment-solution narrative. The same SAE-based tools that Anthropic uses to identify emotion vectors, detect circuits, and understand model internals can be used adversarially to strip away exactly those safety-related features.
This is a structural dual-use problem: interpretability research and jailbreak research now share the same toolkit.

**What surprised me:** The surgical precision of the attack is more worrying than brute-force jailbreaks. Traditional jailbreaks rely on prompt engineering; this attack uses mechanistic knowledge of WHERE safety features live to selectively remove them. As interpretability research advances, and as more features get identified, this attack vector improves automatically.

**What I expected but didn't find:** I expected the attack to require white-box access to internal activations. The paper suggests it does, but as interpretability becomes more accessible and models more transparent, the white-box assumption may relax over time.

**KB connections:**
- [[scalable oversight degrades rapidly as capability gaps grow]]: the dual-use concern here is distinct. Oversight doesn't just degrade with capability gaps; it degrades with interpretability advances that help attackers as much as defenders.
- [[AI capability and reliability are independent dimensions]]: interpretability and safety robustness are also partially independent.
- Connects to Steer2Edit (2602.09870): both use interpretability tools for behavioral modification, one defensively, one adversarially. Same toolkit, opposite aims.

**Extraction hints:**
- Primary claim: "Mechanistic interpretability tools create a dual-use attack surface: Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features, enabling state-of-the-art jailbreaks that improve automatically as interpretability research advances. This establishes interpretability progress as a simultaneous defense enabler and attack amplifier."
- This is a new mechanism for B4: verification capability (interpretability) creates its own attack surface. As we get better at understanding models internally, adversaries get better at stripping safety features.
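To make the "surgical removal" mechanism concrete, here is a minimal sketch of SAE-based feature ablation: encode an activation into sparse features, then subtract the decoded contribution of the features tagged as safety-related. This is an illustrative reconstruction under standard SAE conventions, not the paper's implementation; all weights, names, and the safety-feature list are hypothetical.

```python
import numpy as np

def ablate_safety_features(activation, W_enc, b_enc, W_dec, safety_ids):
    """Remove hypothetical safety-related SAE features from an activation.

    activation : residual-stream vector, shape (d_model,)
    W_enc, b_enc : SAE encoder weights/bias, shapes (d_model, d_sae), (d_sae,)
    W_dec : SAE decoder weights, shape (d_sae, d_model)
    safety_ids : indices of features presumed to encode safety behavior
    """
    # Standard SAE encoder: linear map followed by ReLU sparsity
    feats = np.maximum(activation @ W_enc + b_enc, 0.0)
    # Deterministic intervention: subtract only the safety features'
    # decoded contributions, leaving the rest of the activation intact
    return activation - feats[safety_ids] @ W_dec[safety_ids]

# Toy usage with random weights (illustrative only)
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
x = rng.normal(size=d_model)
x_ablated = ablate_safety_features(x, W_enc, b_enc, W_dec, safety_ids=[3, 7])
```

The dual-use point is visible in the code itself: everything up to the subtraction line is exactly what an interpretability researcher runs to inspect features; only the final intervention makes it an attack.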
## Curator Notes

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]

WHY ARCHIVED: Documents a novel dual-use attack surface in which interpretability research directly enables safety-feature removal. This is a qualitatively different B4 mechanism: not just capability outpacing oversight, but oversight research enabling attacks.

EXTRACTION HINT: The key insight is the SAE dual-use problem: same tool, opposite applications. The extractor should frame this as a new mechanism for why verification may degrade faster than capability (not just because capability grows, but because alignment tools become attack tools).