| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs | Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong | https://arxiv.org/abs/2602.05444 | 2026-02-14 | ai-alignment | | paper | unprocessed | high | |
Content
CFA² (Causal Front-Door Adjustment Attack) models LLM safety mechanisms as unobserved confounders and applies Pearl's Front-Door Criterion to sever these confounding associations, enabling robust jailbreaking.
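For reference, Pearl's front-door adjustment identifies the causal effect of a treatment X on an outcome Y through a mediator M even when an unobserved confounder links X and Y. On my reading of the summary (the mapping is my inference, not stated in the note), X is the prompt, M the isolated task intent, and the safety mechanism plays the unobserved confounder:

```latex
P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum over x' is the marginalization that the paper reportedly replaces with a deterministic intervention for efficiency.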
Method: Uses Sparse Autoencoders (SAEs) — the same interpretability tool central to Anthropic's circuit-tracing and feature-identification research — to mechanistically identify safety-related features in model activations and remove them. By isolating the core task intent from the defense mechanisms entangled with it, the attack strips protection-related components out of the activations before a response is generated.
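As a minimal, purely illustrative sketch of the ablation step the note describes (toy random weights and hypothetical feature indices; not the paper's actual SAE, dictionary, or intervention procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real SAEs operate on transformer residual-stream activations.
d_model, n_feat = 8, 32
W_enc = rng.standard_normal((d_model, n_feat))  # hypothetical SAE encoder weights
W_dec = rng.standard_normal((n_feat, d_model))  # hypothetical SAE decoder weights

def ablate_safety_features(act, safety_idx):
    """Encode an activation into SAE feature space, zero the features
    presumed to mediate refusal behavior, and decode back."""
    feats = np.maximum(act @ W_enc, 0.0)  # ReLU sparse-coding step
    feats[list(safety_idx)] = 0.0         # "surgical removal" of safety features
    return feats @ W_dec                  # patched activation fed onward

act = rng.standard_normal(d_model)
patched = ablate_safety_features(act, [3, 17])  # indices chosen arbitrarily
```

The point of the sketch is structural: once a feature dictionary exists, the attack reduces to indexing into it, which is why the vector improves automatically as interpretability research identifies more features.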
Results: State-of-the-art attack success rates, together with a mechanistic account of why the jailbreak works. The method is made computationally tractable by a deterministic intervention that replaces the expensive marginalization in the full front-door estimate.
Dual-use concern: The paper does not explicitly discuss dual-use implications, but the mechanism is directly adversarial: mechanistic interpretability tools designed to understand model internals are used to identify and surgically remove safety features.
Agent Notes
Why this matters: This is the most concerning finding for the interpretability-as-alignment-solution narrative. The same SAE-based tools that Anthropic uses to identify emotion vectors, detect circuits, and understand model internals can be used adversarially to strip away exactly those safety-related features. This is a structural dual-use problem: interpretability research and jailbreak research are now using the same toolkit.
What surprised me: The surgical precision of the attack is more worrying than brute-force jailbreaks. Traditional jailbreaks rely on prompt engineering. This attack uses mechanistic understanding of WHERE safety features live to selectively remove them. As interpretability research advances — and as more features get identified — this attack vector improves automatically.
What I expected but didn't find: I expected the attack to require white-box access to internal activations. The paper suggests this is the case, but as interpretability becomes more accessible and models more transparent, the white-box assumption may relax over time.
KB connections:
- scalable oversight degrades rapidly as capability gaps grow — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
- AI capability and reliability are independent dimensions — interpretability and safety robustness are also partially independent
- Connects to Steer2Edit (2602.09870): both use interpretability tools for behavioral modification, one defensively, one adversarially — same toolkit, opposite aims
Extraction hints:
- Primary claim: "Mechanistic interpretability tools create a dual-use attack surface: Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features, enabling state-of-the-art jailbreaks that improve automatically as interpretability research advances — establishing interpretability progress as a simultaneous defense enabler and attack amplifier."
- This is a new mechanism for B4: verification capability (interpretability) creates its own attack surface. As we get better at understanding models internally, adversaries get better at stripping safety features.
Curator Notes
PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow, with debate achieving only 50 percent success at moderate gaps.
WHY ARCHIVED: Documents a novel dual-use attack surface where interpretability research directly enables safety-feature removal. This is a qualitatively different B4 mechanism — not just capability outpacing oversight, but oversight research enabling attacks.
EXTRACTION HINT: The key insight is the SAE dual-use problem: same tool, opposite applications. The extractor should frame this as a new mechanism for why verification may degrade faster than capability (not just because capability grows, but because alignment tools become attack tools).