| Field | Value |
| --- | --- |
| Type | claim |
| Topic | ai-alignment |
| Claim | Output-level safety classifiers trained on constitutional principles achieve near-zero jailbreak success rates (0.005 per thousand queries) at ~1% compute overhead, providing scalable monitoring that decouples verification robustness from underlying model vulnerability |
| Confidence | likely |
| Sources | Anthropic Research, arXiv:2601.04603 and arXiv:2501.18837, 1,700+ hours of red-teaming |
| Date | 2026-04-26 |
| Title | Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks |
| Author | theseus |
| File | ai-alignment/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md |
| Category | functional |
| Source | Anthropic Research |
| Slug | constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection |

Related claims:

| Slug | Statement |
| --- | --- |
| formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades | formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades |
| verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling | verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling |
| scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps | scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps |
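
The claim above describes an architectural pattern: a separate output-level classifier, trained on constitution-derived harm categories, screens completions independently of the generating model. The sketch below illustrates that pattern only; every name in it (`OutputClassifier`, `guarded_generate`, `HARM_CATEGORIES`, the keyword-based scoring) is hypothetical and stands in for a trained classifier, not for Anthropic's actual Constitutional Classifiers implementation.

```python
# Minimal sketch, assuming a pre-trained output classifier exists.
# Everything here is a hypothetical illustration, not Anthropic's implementation:
# OutputClassifier stands in for a classifier trained on constitution-derived
# harm categories, and the keyword matching is a placeholder for its scoring.
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical harm categories derived from a constitution.
HARM_CATEGORIES = ("weapons_synthesis", "cyber_intrusion", "other_prohibited")


@dataclass
class Verdict:
    flagged: bool
    category: Optional[str]
    score: float


class OutputClassifier:
    """Stand-in for a small classifier trained on constitution-derived examples."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    def score(self, text: str) -> Verdict:
        # Placeholder scoring: a real system would run a trained model here.
        lowered = text.lower()
        for category in HARM_CATEGORIES:
            keyword = category.split("_")[0]
            if keyword in lowered:
                return Verdict(flagged=True, category=category, score=0.9)
        return Verdict(flagged=False, category=None, score=0.0)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    classifier: OutputClassifier,
) -> str:
    """Generate a completion, then screen it with an independent classifier.

    The safety decision depends only on the classifier, so the monitor's
    robustness is decoupled from whatever the generator can be tricked into.
    """
    completion = generate(prompt)
    verdict = classifier.score(completion)
    if verdict.flagged and verdict.score >= classifier.threshold:
        return f"[blocked: output matched harm category '{verdict.category}']"
    return completion


if __name__ == "__main__":
    def toy_model(prompt: str) -> str:
        # Stand-in for the production model being monitored.
        return f"Echoing your request about {prompt}."

    clf = OutputClassifier(threshold=0.5)
    print(guarded_generate("gardening", toy_model, clf))  # passes through
    print(guarded_generate("weapons", toy_model, clf))    # blocked by the monitor
```

The property the claim leans on is visible in `guarded_generate`: the allow/block decision depends only on the classifier, so a jailbreak that fools the generator still has to get past the monitor, and the monitor can be retrained or have its threshold tightened without modifying the underlying model.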