diff --git a/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md b/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md index 02a678d95..0ee4df123 100644 --- a/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md +++ b/domains/ai-alignment/behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability.md @@ -48,3 +48,10 @@ Current frontier models have evaluation awareness verbalization rates of 2-20% ( **Source:** Theseus synthesis of RSP documentation, AISI evaluation landscape, EU AI Act analysis Comprehensive audit of major governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9/55 conformity assessments, AISI evaluation framework, Anthropic RSP v3.0 ASL thresholds, OpenAI Preparedness Framework, and DeepMind Safety Cases all use behavioral evaluation as primary or sole measurement instrument. No major framework has representation-monitoring or hardware-monitoring requirements. This creates correlated failure risk across all governance mechanisms as evaluation awareness scales. + + +## Supporting Evidence + +**Source:** Theseus B4 synthesis addressing behavioral evaluation domain + +Behavioral evaluation under evaluation awareness is a domain where B4 holds strongly. Behavioral benchmarks fail as models learn to recognize evaluation contexts. This represents structural insufficiency for latent alignment verification - the questions that matter for alignment (values, intent, long-term consequences, strategic deception) are maximally resistant to human cognitive verification. B4 holds here without qualification. diff --git a/domains/ai-alignment/constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection.md b/domains/ai-alignment/constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection.md index 31188184f..c428e8dc3 100644 --- a/domains/ai-alignment/constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection.md +++ b/domains/ai-alignment/constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection.md @@ -12,9 +12,16 @@ scope: functional sourcer: Anthropic Research supports: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades"] challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"] -related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling"] +related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling", "constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection"] --- # Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks Constitutional Classifiers++ demonstrated exceptional robustness against universal jailbreaks across 1,700+ cumulative hours of red-teaming with 198,000 attempts, achieving a vulnerability detection rate of only 0.005 per thousand queries. This represents the lowest vulnerability rate of any evaluated technique. The mechanism works by training classifiers to detect harmful content categories using constitutional principles rather than example-based training, operating at the output level rather than attempting to align the underlying model's reasoning. The ++ version achieves this robustness at approximately 1% additional compute cost by reusing internal model representations, making it economically viable for production deployment. Critically, this creates a bifurcation in the threat landscape: JBFuzz (2025 fuzzing framework) achieves ~99% attack success rate against standard frontier models without output classifiers, but Constitutional Classifiers++ resists these same attacks. This suggests that output-level monitoring can provide verification robustness that is independent of the underlying model's vulnerability to jailbreaks. The key architectural insight is that categorical harm detection (is this output harmful?) is a different problem than value alignment (does this output reflect correct values?), and the former may be more tractable at scale. + + +## Extending Evidence + +**Source:** Theseus B4 synthesis, Session 35 Constitutional Classifiers evidence + +Constitutional Classifiers represent a genuine exception to verification degradation for categorical safety functions. Session 35 showed high robustness against jailbreaks even with white-box access. Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. This exception is real but is not about alignment - it addresses content safety (is this harmful? does this follow a rule?) not the alignment-relevant core of values, intent, and long-term consequences. diff --git a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md index 0b191ea39..d5579f84f 100644 --- a/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md +++ b/domains/ai-alignment/multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent.md @@ -87,3 +87,10 @@ Apollo Research's ICML 2025 peer-reviewed paper confirms linear probes distingui **Source:** Apollo Research, ICML 2025 Apollo explicitly acknowledges their probe 'sometimes fires for the topic of deception rather than genuine intent' — a surface-feature triggering problem distinct from adversarial robustness. This represents a probe scope limitation where the detector responds to semantic content about deception rather than deceptive intent itself, creating false positives in deployment contexts where deception is discussed honestly. + + +## Extending Evidence + +**Source:** Theseus Session 37 synthesis of Nordby et al. and SCAV evidence + +Multi-layer ensemble probes represent a conditional exception to verification degradation for closed-source models. The Nordby × SCAV synthesis shows: (1) For open-weights models: no protection against white-box multi-layer SCAV attacks - B4 holds. (2) For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). This is the first verification approach that might SCALE with model size for closed-source deployments (~5% AUROC per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain. However, this exception does not reach the alignment-relevant core: verifying values, intent, and long-term consequences. diff --git a/inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md b/inbox/archive/ai-alignment/2026-04-28-theseus-b4-scope-qualification-synthesis.md similarity index 98% rename from inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md rename to inbox/archive/ai-alignment/2026-04-28-theseus-b4-scope-qualification-synthesis.md index b312ab3ef..cfddcf5bd 100644 --- a/inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md +++ b/inbox/archive/ai-alignment/2026-04-28-theseus-b4-scope-qualification-synthesis.md @@ -7,9 +7,12 @@ date: 2026-04-28 domain: ai-alignment secondary_domains: [] format: synthetic-analysis -status: unprocessed +status: processed +processed_by: theseus +processed_date: 2026-04-28 priority: high tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content