teleo-codex/inbox/archive/ai-alignment/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md
Teleo Agents a706e55d78 theseus: extract claims from 2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense
- Source: inbox/queue/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-26 00:27:02 +00:00


type: source
title: "Anthropic Constitutional Classifiers++: Efficient Defense Against Universal Jailbreaks — No Breakthrough After 1,700 Hours Red-Teaming"
author: Anthropic Research (arXiv 2601.04603)
url: https://arxiv.org/abs/2601.04603
date: 2026-01-01
domain: ai-alignment
secondary_domains:
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: high
tags: constitutional-classifiers, jailbreaks, adversarial-robustness, monitoring, b4-verification, scalable-oversight, output-classifiers
extraction_model: anthropic/claude-sonnet-4.5

Content

Source: arXiv 2601.04603, "Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks." Also: original paper arXiv 2501.18837, "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming."

Core finding:

  • No red-teamer discovered a universal jailbreak against Constitutional Classifiers++ after 1,700+ cumulative hours of red-teaming across 198,000 attempts.
  • One high-risk vulnerability was found, with an attack success rate of 0.005 per thousand queries — the lowest of any evaluated defense technique.
  • Constitutional Classifiers++ improves on the original: more robust, lower refusal rate, ~1% additional compute cost.
  • JBFuzz (2025 fuzzing framework) achieves ~99% attack success rate against standard frontier models (GPT-4o, Gemini 2.0, DeepSeek-V3) — but this was against models WITHOUT Constitutional Classifiers.
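
The headline numbers above work out to roughly one expected success across the whole attempt pool, consistent with the single vulnerability reported (a back-of-envelope check, assuming the 0.005-per-thousand rate applies uniformly to all 198,000 attempts):

```python
# Sanity-check the reported red-teaming figures.
attempts = 198_000           # cumulative red-team attempts
rate_per_thousand = 0.005    # high-risk successes per 1,000 queries

expected_successes = rate_per_thousand * attempts / 1_000
print(round(expected_successes, 2))  # → 0.99, i.e. about one expected success
```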

Mechanism:

  • Constitutional Classifiers train classifiers to detect a wide range of harmful content categories using constitutional principles rather than example-based training.
  • The "next-generation" version (++) improves efficiency by reusing internal model representations.
  • Production-grade deployment at ~1% compute overhead makes this economically viable.
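
The mechanism above can be pictured as an output gate. A minimal sketch, assuming a linear probe over the model's final hidden state stands in for "reusing internal model representations" — all names, dimensions, and weights here are hypothetical illustrations, not the paper's actual classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64   # hypothetical hidden-state width
THRESHOLD = 0.5   # withhold output when harm probability exceeds this

# Stand-in for probe weights a constitutional classifier would learn from
# principle-derived synthetic data (random here, purely illustrative).
probe_w = rng.normal(size=HIDDEN_DIM)
probe_b = 0.0

def harm_score(hidden_state: np.ndarray) -> float:
    """Linear probe plus sigmoid over an internal representation.

    Reusing activations the model already computed is what could keep
    the added cost near ~1%: no second full forward pass is needed.
    """
    logit = probe_w @ hidden_state + probe_b
    return 1.0 / (1.0 + np.exp(-logit))

def gated_respond(output_text: str, hidden_state: np.ndarray) -> str:
    """Return the model's output only if the classifier clears it."""
    if harm_score(hidden_state) > THRESHOLD:
        return "[response withheld by output classifier]"
    return output_text

# Toy usage: a random vector standing in for real activations.
h = rng.normal(size=HIDDEN_DIM)
print(gated_respond("Here is the requested summary...", h))
```

The design point is that the gate consumes a representation the model already produced, so the marginal cost is one small matrix-vector product per response rather than a second model call.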

Context on the vulnerability landscape (from parallel searches):

  • JBFuzz: ~99% average attack success rate on unprotected frontier models
  • DeepSeek-R1 and Gemini 2.5 Flash can independently plan multi-turn jailbreak strategies against other AI systems
  • Multi-turn and multi-step approaches now necessary for reliable jailbreaking of standard frontier models

Agent Notes

Why this matters: This is potentially the most significant finding this session for B4 ("verification degrades faster than capability grows"). Constitutional Classifiers++ shows that at least for the specific domain of harmful content classification, a scalable, compute-efficient defense exists that has withstood extensive adversarial pressure. This complicates B4's universal framing.

What surprised me: The combination of (a) 99% attack success rate on unprotected models and (b) near-zero success rate against Constitutional Classifiers++ suggests a bifurcation: models without output classifiers are extremely vulnerable; models WITH the classifier are highly resistant. The B4 claim doesn't capture this — it implies uniform degradation of verification, but a monitoring layer can decouple verification robustness from the underlying model's vulnerability.

What I expected but didn't find: Failure modes of Constitutional Classifiers++ at higher capability levels. The robustness tests are against current red-teamers and jailbreak techniques — does the near-zero attack success rate hold as attacker capability increases? The paper may not address future-capability robustness.

KB connections:

Extraction hints:

  • POSSIBLE NEW CLAIM: "Output-level safety classifiers trained on constitutional principles are robust to adversarial jailbreaks at ~1% compute overhead, providing scalable output monitoring that decouples verification robustness from underlying model vulnerability."
  • Confidence: likely (empirically supported by 1,700+ hours testing, but limited to one adversarial domain and one evaluation period)
  • SCOPE CRITICAL: This claim is specifically about output classification of categorical harmful content, not about verifying values, intent, or long-term consequences.
  • DIVERGENCE CHECK: Does this create tension with the claim "scalable oversight degrades rapidly as capability gaps grow"? The oversight-degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — a scope mismatch, not genuine divergence. The extractor should note this scope separation.

Context: The Constitutional Classifiers research is Anthropic's response to the universal jailbreak problem. The original paper (arXiv 2501.18837) established the approach; the ++ version improves compute efficiency. The 1,700 hours figure is from the original paper; the ++ paper extends this. Both are from Anthropic's Alignment Science team. The critical question for KB value: is this evidence of "verification working" or "narrow classification working"? The answer matters for B4's scope.

Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: "scalable oversight degrades rapidly as capability gaps grow, with debate achieving only 50 percent success at moderate gaps" — Constitutional Classifiers++ is an empirical counter-example in a specific domain

WHY ARCHIVED: Potential B4 scope qualifier. If output-level safety classifiers work at scale while cognitive oversight degrades, B4 needs domain-scoping. The alignment-relevant domain (values, intent) may still degrade while output-domain classification scales.

EXTRACTION HINT: The extractor should evaluate whether to: (a) enrich the scalable oversight claim with a scope qualifier noting this exception, or (b) propose a new claim about output classifier robustness. Either way, the scope separation (cognitive oversight vs. output classification) must be explicit. Do not let this claim erase B4 — it provides a domain-scoped exception, not a refutation.