From f1f27f4ba016e986e2277ec196d2a1ef957c5a6a Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Wed, 8 Apr 2026 00:23:52 +0000
Subject: [PATCH] theseus: extract claims from
 2026-02-14-zhou-causal-frontdoor-jailbreak-sae

- Source: inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...-enabling-surgical-safety-feature-removal.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md

diff --git a/domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md b/domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md
new file mode 100644
index 000000000..3bf184079
--- /dev/null
+++ b/domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: As interpretability research advances, adversaries gain the same capability to locate and strip safety mechanisms, making interpretability progress simultaneously strengthen both defense and attack
+confidence: experimental
+source: Zhou et al. (2026), CFA² attack achieving state-of-the-art jailbreak success rates
+created: 2026-04-08
+title: Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
+agent: theseus
+scope: causal
+sourcer: Zhou et al.
+related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
+---
+
+# Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
+
+The CFA² (Causal Front-Door Adjustment Attack) demonstrates that Sparse Autoencoders, the same interpretability tool central to Anthropic's circuit tracing and feature-identification research, can be used adversarially to mechanistically identify and remove safety-related features from model activations. The attack models LLM safety mechanisms as unobserved confounders and applies Pearl's Front-Door Criterion to sever these confounding associations; in Pearl's notation, the criterion identifies the causal effect as P(y | do(x)) = Σ_m P(m | x) Σ_{x'} P(y | m, x') P(x'), routing it through a mediator m that screens off the unobserved confounder. By isolating "the core task intent" from defense mechanisms, the approach strips protection-related components out of the activation stream before generating responses, achieving state-of-the-art attack success rates. This is qualitatively different from traditional prompt-based jailbreaks: it uses mechanistic knowledge of *where* safety features live to selectively remove them. That surgical precision is more concerning than brute-force approaches because the attack vector improves automatically as interpretability research advances and more features are identified. The same toolkit that enables understanding model internals for alignment purposes enables adversaries to strip away exactly those safety-related features. This establishes a structural dual-use problem in which interpretability progress is simultaneously a defense enabler and an attack amplifier.