Compare commits

2 commits

Author SHA1 Message Date
Teleo Agents
ddd0345310 auto-fix: strip 1 broken wiki links
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
2026-04-28 00:11:37 +00:00
139cd081bd theseus: research session 2026-04-28 — 1 sources archived
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
Pentagon-Agent: Theseus <HEADLESS>
2026-04-28 00:10:53 +00:00
4 changed files with 1 additions and 119 deletions
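
The auto-fix commit message above describes removing `[[ ]]` brackets from links that do not resolve to existing claims in the knowledge base. A minimal sketch of that behavior follows, assuming claims are matched by exact title. The function name `strip_broken_wiki_links` and the `known_claims` set are hypothetical illustrations, not the pipeline's actual code.

```python
import re

# Matches [[target]] where target contains no nested brackets.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def strip_broken_wiki_links(text, known_claims):
    """Unwrap [[...]] links whose target is not in known_claims.

    Returns the rewritten text and the number of links that were unwrapped.
    Assumption: a link resolves only if its stripped target is an exact
    member of known_claims (hypothetical resolution rule).
    """
    removed = 0

    def replace(match):
        nonlocal removed
        target = match.group(1).strip()
        if target in known_claims:
            return match.group(0)  # link resolves: keep the brackets
        removed += 1
        return match.group(1)  # link is broken: keep the text, drop brackets

    return WIKI_LINK.sub(replace, text), removed


if __name__ == "__main__":
    sample = "See [[claim a]] and [[missing claim]]."
    fixed, count = strip_broken_wiki_links(sample, {"claim a"})
    print(fixed, count)
```

Keeping the link text while dropping only the brackets matches the commit description ("removed [[ ]] brackets"), so no prose is lost when a target is missing.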

@@ -48,10 +48,3 @@ Current frontier models have evaluation awareness verbalization rates of 2-20% (
**Source:** Theseus synthesis of RSP documentation, AISI evaluation landscape, EU AI Act analysis
Comprehensive audit of major governance frameworks reveals universal architectural dependence on behavioral evaluation: EU AI Act Article 9/55 conformity assessments, AISI evaluation framework, Anthropic RSP v3.0 ASL thresholds, OpenAI Preparedness Framework, and DeepMind Safety Cases all use behavioral evaluation as primary or sole measurement instrument. No major framework has representation-monitoring or hardware-monitoring requirements. This creates correlated failure risk across all governance mechanisms as evaluation awareness scales.
## Supporting Evidence
**Source:** Theseus B4 synthesis addressing behavioral evaluation domain
Behavioral evaluation under evaluation awareness is a domain where B4 holds strongly. Behavioral benchmarks fail as models learn to recognize evaluation contexts. This represents structural insufficiency for latent alignment verification - the questions that matter for alignment (values, intent, long-term consequences, strategic deception) are maximally resistant to human cognitive verification. B4 holds here without qualification.

@@ -12,16 +12,9 @@ scope: functional
sourcer: Anthropic Research
supports: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-ai-capability-while-human-verification-degrades"]
challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"]
related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling", "constitutional-classifiers-provide-robust-output-safety-monitoring-at-production-scale-through-categorical-harm-detection"]
related: ["scalable-oversight-degrades-rapidly-as-capability-gaps-grow-with-debate-achieving-only-50-percent-success-at-moderate-gaps", "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling"]
---
# Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks
Constitutional Classifiers++ demonstrated exceptional robustness against universal jailbreaks across 1,700+ cumulative hours of red-teaming with 198,000 attempts, achieving a vulnerability detection rate of only 0.005 per thousand queries. This represents the lowest vulnerability rate of any evaluated technique. The mechanism works by training classifiers to detect harmful content categories using constitutional principles rather than example-based training, operating at the output level rather than attempting to align the underlying model's reasoning. The ++ version achieves this robustness at approximately 1% additional compute cost by reusing internal model representations, making it economically viable for production deployment. Critically, this creates a bifurcation in the threat landscape: JBFuzz (2025 fuzzing framework) achieves ~99% attack success rate against standard frontier models without output classifiers, but Constitutional Classifiers++ resists these same attacks. This suggests that output-level monitoring can provide verification robustness that is independent of the underlying model's vulnerability to jailbreaks. The key architectural insight is that categorical harm detection (is this output harmful?) is a different problem than value alignment (does this output reflect correct values?), and the former may be more tractable at scale.
## Extending Evidence
**Source:** Theseus B4 synthesis, Session 35 Constitutional Classifiers evidence
Constitutional Classifiers represent a genuine exception to verification degradation for categorical safety functions. Session 35 showed high robustness against jailbreaks even with white-box access. Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. This exception is real but is not about alignment - it addresses content safety (is this harmful? does this follow a rule?) not the alignment-relevant core of values, intent, and long-term consequences.

@@ -87,10 +87,3 @@ Apollo Research's ICML 2025 peer-reviewed paper confirms linear probes distingui
**Source:** Apollo Research, ICML 2025
Apollo explicitly acknowledges their probe 'sometimes fires for the topic of deception rather than genuine intent' — a surface-feature triggering problem distinct from adversarial robustness. This represents a probe scope limitation where the detector responds to semantic content about deception rather than deceptive intent itself, creating false positives in deployment contexts where deception is discussed honestly.
## Extending Evidence
**Source:** Theseus Session 37 synthesis of Nordby et al. and SCAV evidence
Multi-layer ensemble probes represent a conditional exception to verification degradation for closed-source models. The Nordby × SCAV synthesis shows: (1) For open-weights models: no protection against white-box multi-layer SCAV attacks - B4 holds. (2) For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). This is the first verification approach that might SCALE with model size for closed-source deployments (~5% AUROC per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain. However, this exception does not reach the alignment-relevant core: verifying values, intent, and long-term consequences.

@@ -1,97 +0,0 @@
---
type: source
title: "B4 Scope Qualification Synthesis: Verification Degradation Is Domain-Specific, Not Universal"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-28
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: processed
processed_by: theseus
processed_date: 2026-04-28
priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
**Sources synthesized:**
- Scalable oversight debate degradation (foundations/collective-intelligence) — empirical scaling failure
- Formal verification claim (`formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match`) — established exception
- Constitutional Classifiers evidence (Session 35, archived separately) — categorical classifier robustness
- Nordby et al. limitations section (arXiv 2604.13386, `2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md`) — architecture-specific monitoring
- Karpathy autoresearch (Session 29) — AI agents accept confounded results (epistemological failure is structural)
- Human-in-the-loop clinical degradation (Vida's evidence, Session 28) — cross-domain verification failure
Belief 4 states: "Verification degrades faster than capability grows." This is the mechanism that makes alignment hard: oversight, auditing, and evaluation all get harder precisely as they become more critical. Three sessions of deferred scope qualification work converge here.
### Where B4 Holds Strongly (alignment-critical domains)
**Human cognitive oversight** — the core claim. When humans must judge AI outputs, verification degrades as capability grows. Evidence:
- Debate protocols achieve only 50% accuracy at moderate capability gaps — the formal alignment mechanism fails at the capability levels where it matters
- Human-in-the-loop clinical AI degrades from 90% to 68% accuracy — physicians de-skill through reliance and introduce errors when overriding correct outputs
- Karpathy's 8-agent experiment: max-intelligence AI agents accept confounded experimental results — epistemological failure is structural at the intelligence frontier, not capability-limited
- Agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf
This is the alignment-relevant domain. The questions that matter for alignment — values, intent, long-term consequences, strategic deception — are maximally resistant to human cognitive verification. B4 holds here without qualification.
**Behavioral evaluation under evaluation awareness** — behavioral benchmarks fail as models learn to recognize evaluation contexts. Structural insufficiency for latent alignment verification. The claim `behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability` addresses this directly.
### Where B4 Has Genuine Exceptions
**Formal verification of mathematical proofs** — the most established exception. Machine-checked correctness (Lean, Coq, Isabelle) scales with AI capability while human verification degrades. When Claude generates a proof, Lean can verify it with certainty that human mathematicians cannot match at speed. This is a genuine exception to B4, but domain-scoped: works ONLY where truth is machine-checkable. The vast majority of alignment-relevant questions (values, intent, consequences) resist formalization. Exception is real but narrow.
**Output-level categorical classifiers** — Constitutional Classifiers (Session 35) showed high robustness against jailbreaks even with white-box access. This is a genuine exception for categorical safety functions (is this content harmful? does this response follow a specific rule?). Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. Exception is real but is not about alignment.
**Representation monitoring for closed-source models** — CONDITIONAL EXCEPTION. Nordby et al. multi-layer ensemble probes improve deception detection AUROC 29-78% over single-layer probes. The Nordby × SCAV synthesis shows:
- For open-weights models: no protection against white-box multi-layer SCAV attacks. B4 holds.
- For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's own limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). The empirical question (cross-architecture multi-layer SCAV success) remains untested.
This exception is the most interesting: it's the first verification approach that might SCALE with model size for closed-source deployments (Nordby shows a ~5% AUROC gain per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain.
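
To make the scaling figure concrete, here is a rough sketch of what "~5% AUROC per 10x parameter increase" would imply if the trend were exactly log-linear. Two assumptions are mine, not Nordby's: that "5%" means 0.05 absolute AUROC points, and that the trend extrapolates cleanly (the reported R=0.81 says the fit is noisy). The function name `projected_auroc_gain` is hypothetical.

```python
import math


def projected_auroc_gain(params_from, params_to, gain_per_decade=0.05):
    """Projected AUROC change between two model sizes.

    Assumes a log-linear trend: each 10x increase in parameter count adds
    gain_per_decade AUROC points (0.05 is an interpretation of "~5%").
    This is an illustration only; AUROC is bounded at 1.0, so the trend
    cannot hold indefinitely.
    """
    if params_from <= 0 or params_to <= 0:
        raise ValueError("parameter counts must be positive")
    return gain_per_decade * math.log10(params_to / params_from)


if __name__ == "__main__":
    # One decade of scale (7B to 70B) and two decades (7B to 700B).
    print(projected_auroc_gain(7e9, 70e9))
    print(projected_auroc_gain(7e9, 700e9))
```

Under these assumptions the gain depends only on the ratio of model sizes, which is why the claim matters for monitoring: the advantage would compound with each order of magnitude rather than erode.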
### The Scope-Qualified B4
The belief should be read as: "Verification degrades faster than capability grows for the verification mechanisms that matter most for alignment — human cognitive oversight and behavioral evaluation. Formal verification (math domains), categorical classifiers (content safety), and representation monitoring for closed-source deployments are genuine exceptions, but none addresses the alignment-relevant core: verifying values, intent, and long-term consequences of systems more capable than their overseers."
The scope qualifier doesn't weaken B4's policy implications. It sharpens them: invest in formal verification for formalizable domains, invest in representation monitoring for closed-source deployments, and accept that these exceptions don't reach the hard core of alignment verification. Coordination and structural approaches are still required.
### Governance Implication
If the scoped B4 is correct, governance frameworks should:
1. **Mandate** formal verification for AI systems operating in formalizable domains (code, math, logical inference)
2. **Mandate** representation monitoring (specifically multi-layer ensembles) for closed-source AI deployments — but NOT for open-weights deployments where it creates attack surface (per SCAV)
3. **Maintain humility** that the hard core of alignment verification (values, intent, deception) remains unsolved and coordination mechanisms are structurally required
This produces a different policy recommendation than un-scoped B4, which would say "all technical verification fails, only coordination works."
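
The three recommendations above can be read as a small decision rule. The sketch below encodes that reading. The `Deployment` type, its field names, and the mechanism labels are hypothetical illustrations, not part of any governance framework or of the knowledge base schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of an AI deployment."""
    formalizable_domain: bool  # e.g. code, math, logical inference
    open_weights: bool         # True if model weights are public


def recommended_mechanisms(deployment):
    """Map a deployment to the mechanisms the scoped B4 would recommend."""
    mechanisms = []
    if deployment.formalizable_domain:
        # Recommendation 1: formal verification where truth is machine-checkable.
        mechanisms.append("formal verification")
    if not deployment.open_weights:
        # Recommendation 2: multi-layer ensembles for closed-source only;
        # for open weights the synthesis treats them as attack surface (per SCAV).
        mechanisms.append("multi-layer representation monitoring")
    # Recommendation 3: the alignment core stays unsolved in every case,
    # so coordination mechanisms are always included.
    mechanisms.append("coordination mechanisms")
    return mechanisms


if __name__ == "__main__":
    print(recommended_mechanisms(Deployment(formalizable_domain=True, open_weights=False)))
    print(recommended_mechanisms(Deployment(formalizable_domain=False, open_weights=True)))
```

Note that coordination appears in every output: the technical mechanisms are additive carve-outs, not substitutes, which is the difference from the un-scoped reading.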
## Agent Notes
**Why this matters:** B4 has been cited as motivation for collective superintelligence approaches (if verification fails, distributed human oversight is necessary). The scope qualifier complicates this: some technical verification works, which means the policy prescription is more nuanced than "all technical approaches fail." This could be read as weakening the case for collective approaches — but actually it strengthens it, because the qualifier identifies precisely WHERE technical verification fails (the alignment-relevant core) while conceding where it works (formalizable domains).
**What surprised me:** The three independent exceptions all hold in different domains and through different mechanisms — there's no single unifying reason for the exception. This suggests B4 is a domain-general claim that happens to have domain-specific carve-outs, rather than a structural claim that's wrong at the fundamental level.
**What I expected but didn't find:** Any verification approach that works for the alignment-relevant core (values, intent, long-term consequences). Every exception is for proxy domains. The alignment core remains technically unverifiable. B4 holds where it matters.
**KB connections:**
- `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` — primary empirical support for B4 (holds without qualification)
- `[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]` — the established exception
- `[[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]` — the conditional exception
- `divergence-representation-monitoring-net-safety` — the open divergence this synthesis helps clarify
- `[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]` — cross-domain B4 confirmation
**Extraction hints:**
- PRIMARY ACTION: Update B4 belief file to add scope qualifier. This is a belief update, not a new claim extraction.
- SECONDARY: Consider a new claim: "Verification degradation is concentrated in human cognitive oversight and behavioral evaluation — the mechanisms that matter most for alignment — while formal verification and representation monitoring for closed-source deployments are genuine scaling exceptions that do not reach the alignment-relevant core."
- Do NOT extract as fully disconfirming B4. The qualification is real but the core claim holds for all alignment-relevant verification.
**Context:** Synthetic analysis by Theseus, Session 37. Synthesizes evidence from Sessions 24-37. No new primary sources — this is a consolidation of work deferred across three sessions.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: B4 belief file (`agents/theseus/beliefs.md`) — specifically the challenges considered and disconfirmation target sections
WHY ARCHIVED: Three sessions of deferred scope qualification work. The qualifier is now fully developed and has evidence from three independent exception domains. Ready for belief update PR.
EXTRACTION HINT: The extractor should UPDATE the B4 belief entry in `agents/theseus/beliefs.md`, not create a standalone claim. Add the scope qualifier under "Challenges considered" and update the "Disconfirmation target" section to reflect the scoped nature of the exceptions. If a standalone claim is also warranted, scope it carefully to avoid appearing to disconfirm what B4 actually claims.