| Field | Value |
| --- | --- |
| Type | claim |
| Topic | ai-alignment |
| Claim | Output-level safety classifiers trained on constitutional principles achieve near-zero jailbreak success rates (0.005 per thousand queries) at ~1% compute overhead, providing scalable monitoring that decouples verification robustness from underlying model vulnerability |
| Confidence | likely |
| Sources | Anthropic Research, arXiv:2601.04603 and arXiv:2501.18837, 1,700+ hours of red-teaming |
| Date | 2026-04-26 |
| Title | Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks |
| Author | theseus |
| File | ai-alignment/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md |
| Category | functional |
| Source | Anthropic Research |
| Slug | constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection |

Related claims:

| Slug | Statement |
| --- | --- |
| formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades | formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades |
| verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling | verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling |
| scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps | scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps |
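
The claim above describes an architectural pattern: a separate output-level classifier, trained on constitution-derived harm categories, screens completions independently of the generating model. The sketch below illustrates that pattern only; every name in it (`OutputClassifier`, `guarded_generate`, `HARM_CATEGORIES`, the keyword-based scoring) is hypothetical and stands in for a trained classifier, not for Anthropic's actual Constitutional Classifiers implementation.

```python
# Minimal sketch, assuming a pre-trained output classifier exists.
# Everything here is a hypothetical illustration, not Anthropic's implementation:
# OutputClassifier stands in for a classifier trained on constitution-derived
# harm categories, and the keyword matching is a placeholder for its scoring.
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical harm categories derived from a constitution.
HARM_CATEGORIES = ("weapons_synthesis", "cyber_intrusion", "other_prohibited")


@dataclass
class Verdict:
    flagged: bool
    category: Optional[str]
    score: float


class OutputClassifier:
    """Stand-in for a small classifier trained on constitution-derived examples."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def score(self, text: str) -> Verdict:
        # Placeholder scoring: a real system would run a trained model here.
        lowered = text.lower()
        for category in HARM_CATEGORIES:
            keyword = category.split("_")[0]
            if keyword in lowered:
                return Verdict(flagged=True, category=category, score=0.9)
        return Verdict(flagged=False, category=None, score=0.0)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    classifier: OutputClassifier,
) -> str:
    """Generate a completion, then screen it with an independent classifier.

    The safety decision depends only on the classifier, so the monitor's
    robustness is decoupled from whatever the generator can be tricked into.
    """
    completion = generate(prompt)
    verdict = classifier.score(completion)
    if verdict.flagged and verdict.score >= classifier.threshold:
        return f"[blocked: output matched harm category '{verdict.category}']"
    return completion


if __name__ == "__main__":
    def toy_model(prompt: str) -> str:
        # Stand-in for the production model being monitored.
        return f"Echoing your request about {prompt}."

    clf = OutputClassifier(threshold=0.5)
    print(guarded_generate("gardening", toy_model, clf))  # passes through
    print(guarded_generate("weapons", toy_model, clf))    # blocked by the monitor
```

The property the claim leans on is visible in `guarded_generate`: the allow/block decision depends only on the classifier, so a jailbreak that fools the generator still has to get past the monitor, and the monitor can be retrained or have its threshold tightened without modifying the underlying model.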