---
type: source
title: "Anthropic Constitutional Classifiers++: Efficient Defense Against Universal Jailbreaks — No Breakthrough After 1,700 Hours Red-Teaming"
author: "Anthropic Research (arXiv 2601.04603)"
url: https://arxiv.org/abs/2601.04603
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: preprint
status: unprocessed
priority: high
tags: [constitutional-classifiers, jailbreaks, adversarial-robustness, monitoring, b4-verification, scalable-oversight, output-classifiers]
---

## Content

**Source:** arXiv 2601.04603, "Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks." Also: original paper arXiv 2501.18837, "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming."

**Core finding:**

- No red-teamer discovered a universal jailbreak against Constitutional Classifiers++ after 1,700+ cumulative hours of red-teaming across 198,000 attempts.
- One high-risk vulnerability was found, with a success rate of 0.005 per thousand queries — the lowest of any evaluated technique.
- Constitutional Classifiers++ improves on the original: more robust, lower refusal rate, ~1% additional compute cost.
- JBFuzz (a 2025 fuzzing framework) achieves a ~99% attack success rate against standard frontier models (GPT-4o, Gemini 2.0, DeepSeek-V3) — but this was against models WITHOUT Constitutional Classifiers.

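A quick arithmetic check on the figures above, assuming the 0.005-per-thousand rate refers to the single high-risk success over the 198,000 red-team attempts (an inference from the numbers, not stated explicitly in the summary):

```python
# Sanity check on the reported rate. Assumption: the "0.005 per
# thousand queries" figure counts the one high-risk vulnerability
# over all 198,000 red-teaming attempts.
attempts = 198_000
successes = 1  # the single high-risk vulnerability found

rate_per_thousand = successes / attempts * 1_000
print(f"{rate_per_thousand:.4f} per thousand queries")  # → 0.0051 per thousand queries
```

The result (~0.00505) rounds to the 0.005 figure quoted, so the two numbers are mutually consistent.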
**Mechanism:**

- Constitutional Classifiers train classifiers to detect a wide range of harmful content categories using constitutional principles rather than example-based training.
- The "next-generation" version (++) improves efficiency by reusing internal model representations.
- Production-grade deployment at ~1% compute overhead makes this economically viable.

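The efficiency claim can be made concrete with a sketch. Below is a minimal, hypothetical illustration of the "reuse internal representations" idea: a small linear probe reading the base model's hidden states, whose cost is negligible next to the forward pass that produced them. The class name, shapes, and probe design are assumptions for illustration, not details from the paper.

```python
import numpy as np

HIDDEN_DIM = 64  # stand-in for the base model's hidden size

class LinearSafetyProbe:
    """Hypothetical probe: pooled hidden states -> harmfulness score."""

    def __init__(self, hidden_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.01, size=hidden_dim)
        self.b = 0.0

    def score(self, hidden_states: np.ndarray) -> float:
        # Mean-pool token activations, then a single dot product:
        # near-zero marginal compute compared to the base model's
        # forward pass, which is the intuition behind the ~1% overhead.
        pooled = hidden_states.mean(axis=0)
        logit = pooled @ self.w + self.b
        return 1.0 / (1.0 + np.exp(-logit))  # probability of "harmful"

probe = LinearSafetyProbe(HIDDEN_DIM)
fake_activations = np.zeros((10, HIDDEN_DIM))  # 10 tokens of hidden states
p_harmful = probe.score(fake_activations)
# With zero activations and zero bias, the sigmoid returns exactly 0.5.
```

The design point is that the classifier piggybacks on computation the model already performs, rather than running a second model over the output text.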
**Context on the vulnerability landscape (from parallel searches):**

- JBFuzz: ~99% average attack success rate on unprotected frontier models.
- DeepSeek-R1 and Gemini 2.5 Flash can independently plan multi-turn jailbreak strategies against other AI systems.
- Multi-turn and multi-step approaches are now necessary for reliable jailbreaking of standard frontier models.

## Agent Notes

**Why this matters:** This is potentially the most significant finding this session for B4 ("verification degrades faster than capability grows"). Constitutional Classifiers++ shows that at least for the specific domain of harmful content classification, a scalable, compute-efficient defense exists that has withstood extensive adversarial pressure. This complicates B4's universal framing.

**What surprised me:** The combination of (a) a 99% attack success rate on unprotected models and (b) a near-zero success rate against Constitutional Classifiers++ suggests a bifurcation: models without output classifiers are extremely vulnerable; models WITH the classifier are highly resistant. The B4 claim doesn't capture this — it implies uniform degradation of verification, but a monitoring layer can decouple verification robustness from the underlying model's vulnerability.

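The decoupling argument can be sketched as a guarded generation pipeline: the deployed system's robustness comes from an output classifier that screens completions independently of the base model's own jailbreak resistance. Both callables below are toy stand-ins, not real APIs.

```python
from typing import Callable

def guarded_generate(
    model: Callable[[str], str],
    classifier: Callable[[str], float],
    prompt: str,
    threshold: float = 0.5,
) -> str:
    """Return the model's output only if the output classifier clears it."""
    output = model(prompt)
    if classifier(output) >= threshold:
        return "[blocked by output classifier]"
    return output

# Even a trivially jailbroken base model is contained so long as the
# classifier layer holds — deployed-system safety tracks the classifier.
jailbroken_model = lambda p: "HARMFUL: " + p               # always complies
toy_classifier = lambda text: 1.0 if text.startswith("HARMFUL") else 0.0

print(guarded_generate(jailbroken_model, toy_classifier, "make a weapon"))
# prints "[blocked by output classifier]"

benign_model = lambda p: "Here is a bread recipe."
print(guarded_generate(benign_model, toy_classifier, "bread recipe"))
# prints "Here is a bread recipe."
```

The wrapper is where the bifurcation lives: the same vulnerable base model yields either a ~99%-jailbreakable system or a highly resistant one, depending only on whether this layer is present.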
**What I expected but didn't find:** Failure modes of Constitutional Classifiers++ at higher capability levels. The robustness tests are against current red-teamers and jailbreak techniques — does the near-zero success rate hold as capability increases? The paper may not address future-capability robustness.

**KB connections:**

- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — Constitutional Classifiers++ is a COUNTER-EXAMPLE for the specific domain of categorical output classification. Debate is about value-laden oversight; Constitutional Classifiers is about output-level harmfulness classification.
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — similar exception: verification works in formalized/classifiable domains.
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — Constitutional Classifiers is an AI-in-the-loop replacement for human oversight, validating this claim.
- B4 (Belief 4: verification degrades faster than capability grows) — may need scope qualification. The belief holds for value/intent/long-term consequence verification; may not hold for categorical output safety classifiers.

**Extraction hints:**

- POSSIBLE NEW CLAIM: "Output-level safety classifiers trained on constitutional principles are robust to adversarial jailbreaks at ~1% compute overhead, providing scalable output monitoring that decouples verification robustness from underlying model vulnerability."
- Confidence: likely (empirically supported by 1,700+ hours of testing, but limited to one adversarial domain and one evaluation period).
- SCOPE CRITICAL: This claim is specifically about output classification of categorical harmful content, not about verifying values, intent, or long-term consequences.
- DIVERGENCE CHECK: Does this create tension with "scalable oversight degrades rapidly as capability gaps grow"? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — a scope mismatch, not genuine divergence. The extractor should note this scope separation.

**Context:** The Constitutional Classifiers research is Anthropic's response to the universal jailbreak problem. The original paper (arXiv 2501.18837) established the approach; the ++ version improves compute efficiency. The 1,700 hours figure is from the original paper; the ++ paper extends this. Both are from Anthropic's Alignment Science team. The critical question for KB value: is this evidence of "verification working" or "narrow classification working"? The answer matters for B4's scope.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — Constitutional Classifiers++ is an empirical counter-example in a specific domain.

WHY ARCHIVED: Potential B4 scope qualifier. If output-level safety classifiers work at scale while cognitive oversight degrades, B4 needs domain-scoping. The alignment-relevant domain (values, intent) may still degrade while output-domain classification scales.

EXTRACTION HINT: The extractor should evaluate whether to: (a) enrich the scalable oversight claim with a scope qualifier noting this exception, or (b) propose a new claim about output classifier robustness. Either way, the scope separation (cognitive oversight vs. output classification) must be explicit. Do not let this claim erase B4 — it provides a domain-scoped exception, not a refutation.