theseus: extract claims from 2026-04-28-theseus-b4-scope-qualification-synthesis
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run

- Source: inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 6
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
This commit is contained in:
Teleo Agents 2026-04-28 00:23:32 +00:00
parent c8c4db8a03
commit c7a6c48a76
4 changed files with 26 additions and 2 deletions

View file

@ -48,3 +48,10 @@ Current frontier models have evaluation awareness verbalization rates of 2-20% (
**Source:** Theseus synthesis of RSP documentation, AISI evaluation landscape, EU AI Act analysis **Source:** Theseus synthesis of RSP documentation, AISI evaluation landscape, EU AI Act analysis
Comprehensive audit of major governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9/55 conformity assessments, AISI evaluation framework, Anthropic RSP v3.0 ASL thresholds, OpenAI Preparedness Framework, and DeepMind Safety Cases all use behavioral evaluation as primary or sole measurement instrument. No major framework has representation-monitoring or hardware-monitoring requirements. This creates correlated failure risk across all governance mechanisms as evaluation awareness scales. Comprehensive audit of major governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9/55 conformity assessments, AISI evaluation framework, Anthropic RSP v3.0 ASL thresholds, OpenAI Preparedness Framework, and DeepMind Safety Cases all use behavioral evaluation as primary or sole measurement instrument. No major framework has representation-monitoring or hardware-monitoring requirements. This creates correlated failure risk across all governance mechanisms as evaluation awareness scales.
## Supporting Evidence
**Source:** Theseus B4 synthesis addressing behavioral evaluation domain
Behavioral evaluation under evaluation awareness is a domain where B4 holds strongly. Behavioral benchmarks fail as models learn to recognize evaluation contexts. This represents structural insufficiency for latent alignment verification - the questions that matter for alignment (values, intent, long-term consequences, strategic deception) are maximally resistant to human cognitive verification. B4 holds here without qualification.

View file

@ -12,9 +12,16 @@ scope: functional
sourcer: Anthropic Research sourcer: Anthropic Research
supports: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades"] supports: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades"]
challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"] challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"]
related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling"] related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling", "constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection"]
--- ---
# Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks # Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks
Constitutional Classifiers++ demonstrated exceptional robustness against universal jailbreaks across 1,700+ cumulative hours of red-teaming with 198,000 attempts, achieving a vulnerability detection rate of only 0.005 per thousand queries. This represents the lowest vulnerability rate of any evaluated technique. The mechanism works by training classifiers to detect harmful content categories using constitutional principles rather than example-based training, operating at the output level rather than attempting to align the underlying model's reasoning. The ++ version achieves this robustness at approximately 1% additional compute cost by reusing internal model representations, making it economically viable for production deployment. Critically, this creates a bifurcation in the threat landscape: JBFuzz (2025 fuzzing framework) achieves ~99% attack success rate against standard frontier models without output classifiers, but Constitutional Classifiers++ resists these same attacks. This suggests that output-level monitoring can provide verification robustness that is independent of the underlying model's vulnerability to jailbreaks. The key architectural insight is that categorical harm detection (is this output harmful?) is a different problem than value alignment (does this output reflect correct values?), and the former may be more tractable at scale. Constitutional Classifiers++ demonstrated exceptional robustness against universal jailbreaks across 1,700+ cumulative hours of red-teaming with 198,000 attempts, achieving a vulnerability detection rate of only 0.005 per thousand queries. This represents the lowest vulnerability rate of any evaluated technique. The mechanism works by training classifiers to detect harmful content categories using constitutional principles rather than example-based training, operating at the output level rather than attempting to align the underlying model's reasoning. The ++ version achieves this robustness at approximately 1% additional compute cost by reusing internal model representations, making it economically viable for production deployment. Critically, this creates a bifurcation in the threat landscape: JBFuzz (2025 fuzzing framework) achieves ~99% attack success rate against standard frontier models without output classifiers, but Constitutional Classifiers++ resists these same attacks. This suggests that output-level monitoring can provide verification robustness that is independent of the underlying model's vulnerability to jailbreaks. The key architectural insight is that categorical harm detection (is this output harmful?) is a different problem than value alignment (does this output reflect correct values?), and the former may be more tractable at scale.
## Extending Evidence
**Source:** Theseus B4 synthesis, Session 35 Constitutional Classifiers evidence
Constitutional Classifiers represent a genuine exception to verification degradation for categorical safety functions. Session 35 showed high robustness against jailbreaks even with white-box access. Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. This exception is real but is not about alignment - it addresses content safety (is this harmful? does this follow a rule?) not the alignment-relevant core of values, intent, and long-term consequences.

View file

@ -87,3 +87,10 @@ Apollo Research's ICML 2025 peer-reviewed paper confirms linear probes distingui
**Source:** Apollo Research, ICML 2025 **Source:** Apollo Research, ICML 2025
Apollo explicitly acknowledges their probe 'sometimes fires for the topic of deception rather than genuine intent' — a surface-feature triggering problem distinct from adversarial robustness. This represents a probe scope limitation where the detector responds to semantic content about deception rather than deceptive intent itself, creating false positives in deployment contexts where deception is discussed honestly. Apollo explicitly acknowledges their probe 'sometimes fires for the topic of deception rather than genuine intent' — a surface-feature triggering problem distinct from adversarial robustness. This represents a probe scope limitation where the detector responds to semantic content about deception rather than deceptive intent itself, creating false positives in deployment contexts where deception is discussed honestly.
## Extending Evidence
**Source:** Theseus Session 37 synthesis of Nordby et al. and SCAV evidence
Multi-layer ensemble probes represent a conditional exception to verification degradation for closed-source models. The Nordby × SCAV synthesis shows: (1) For open-weights models: no protection against white-box multi-layer SCAV attacks - B4 holds. (2) For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). This is the first verification approach that might SCALE with model size for closed-source deployments (~5% AUROC per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain. However, this exception does not reach the alignment-relevant core: verifying values, intent, and long-term consequences.

View file

@ -7,9 +7,12 @@ date: 2026-04-28
domain: ai-alignment domain: ai-alignment
secondary_domains: [] secondary_domains: []
format: synthetic-analysis format: synthetic-analysis
status: unprocessed status: processed
processed_by: theseus
processed_date: 2026-04-28
priority: high priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate] tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
extraction_model: "anthropic/claude-sonnet-4.5"
--- ---
## Content ## Content