teleo-codex/inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md
---
type: source
title: "B4 Scope Qualification Synthesis: Verification Degradation Is Domain-Specific, Not Universal"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-28
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: unprocessed
priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
---
## Content
**Sources synthesized:**
- Scalable oversight debate degradation (foundations/collective-intelligence) — empirical scaling failure
- Formal verification claim (`formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match`) — established exception
- Constitutional Classifiers evidence (Session 35, archived separately) — categorical classifier robustness
- Nordby et al. limitations section (arXiv 2604.13386, `2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md`) — architecture-specific monitoring
- Karpathy autoresearch (Session 29) — AI agents accept confounded results (epistemological failure is structural)
- Human-in-the-loop clinical degradation (Vida's evidence, Session 28) — cross-domain verification failure
Belief 4 states: "Verification degrades faster than capability grows." This is the mechanism that makes alignment hard: oversight, auditing, and evaluation all get harder precisely as they become more critical. Three sessions of deferred scope-qualification work converge here.
### Where B4 Holds Strongly (alignment-critical domains)
**Human cognitive oversight** — the core claim. When humans must judge AI outputs, verification degrades as capability grows. Evidence:
- Debate protocols achieve only 50% accuracy at moderate capability gaps — the formal alignment mechanism fails at the capability levels where it matters
- Human-in-the-loop clinical AI degrades from 90% to 68% accuracy — physicians de-skill through reliance and introduce errors when overriding correct outputs
- Karpathy's 8-agent experiment: max-intelligence AI agents accept confounded experimental results — epistemological failure is structural at the intelligence frontier, not capability-limited
- Agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf
This is the alignment-relevant domain. The questions that matter for alignment — values, intent, long-term consequences, strategic deception — are maximally resistant to human cognitive verification. B4 holds here without qualification.
**Behavioral evaluation under evaluation awareness** — behavioral benchmarks fail as models learn to recognize evaluation contexts. Structural insufficiency for latent alignment verification. The claim `behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability` addresses this directly.
### Where B4 Has Genuine Exceptions
**Formal verification of mathematical proofs** — the most established exception. Machine-checked correctness (Lean, Coq, Isabelle) scales with AI capability while human verification degrades. When Claude generates a proof, Lean can verify it with certainty that human mathematicians cannot match at speed. This is a genuine exception to B4, but domain-scoped: it works ONLY where truth is machine-checkable. The vast majority of alignment-relevant questions (values, intent, consequences) resist formalization. The exception is real but narrow.
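The machine-checkability this exception relies on can be made concrete with a toy Lean example (illustrative, not from any of the synthesized sources): the checker accepts the file only if every term has exactly the stated type, so verification cost does not grow with the sophistication of whatever system produced the proof.

```lean
-- Toy illustration of machine-checked correctness: Lean accepts this
-- theorem only because the term `Nat.add_comm a b` has exactly the
-- stated type. Any subtly wrong proof term would be rejected at check
-- time, regardless of how capable the generator was.
theorem add_comm_checked (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The asymmetry is the point: generating the proof may require frontier capability, but checking it is mechanical and cheap.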
**Output-level categorical classifiers** — Constitutional Classifiers (Session 35) showed high robustness against jailbreaks even with white-box access. This is a genuine exception for categorical safety functions (is this content harmful? does this response follow a specific rule?). Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. The exception is real but does not bear on alignment.
**Representation monitoring for closed-source models** — CONDITIONAL EXCEPTION. Nordby et al.'s multi-layer ensemble probes improve deception-detection AUROC by 29-78% over single-layer probes. The Nordby × SCAV synthesis shows:
- For open-weights models: no protection against white-box multi-layer SCAV attacks. B4 holds.
- For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's own limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). The empirical question (cross-architecture multi-layer SCAV success) remains untested.
This exception is the most interesting: it's the first verification approach that might SCALE with model size for closed-source deployments (Nordby shows a ~5% AUROC gain per 10x parameter increase, R = 0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain.
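The ensemble idea can be sketched in a few lines. This is an illustrative toy, not Nordby et al.'s actual method: the synthetic "activations", the per-layer logistic-regression probes, and the score-averaging scheme are all assumptions made for the sketch (using scikit-learn).

```python
# Hypothetical sketch of a multi-layer ensemble probe: train one linear
# probe per layer's activations, then average the probes' scores. The
# data below is synthetic; each "layer" carries a weak, partially
# independent signal about a binary label (e.g. deceptive vs. honest).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, n_layers, n_train = 400, 32, 4, 300

y = rng.integers(0, 2, n)  # binary labels
layers = []
for _ in range(n_layers):
    direction = rng.normal(size=d)  # each layer encodes the label differently
    X = rng.normal(size=(n, d)) + 0.6 * np.outer(y - 0.5, direction)
    layers.append(X)

# Single-layer baseline: a probe trained on one layer only.
single = LogisticRegression(max_iter=1000).fit(layers[0][:n_train], y[:n_train])
single_acc = single.score(layers[0][n_train:], y[n_train:])

# Multi-layer ensemble: one probe per layer, average predicted probabilities.
probes = [LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
          for X in layers]
scores = np.mean([p.predict_proba(X[n_train:])[:, 1]
                  for p, X in zip(probes, layers)], axis=0)
ensemble_acc = np.mean((scores > 0.5) == y[n_train:])

print(f"single-layer acc: {single_acc:.2f}, ensemble acc: {ensemble_acc:.2f}")
```

The design intuition matches the synthesis: an adversary (or a deceptive policy) that suppresses the signal at one layer must suppress it at every probed layer simultaneously, which is what makes the multi-layer SCAV question empirically decisive.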
### The Scope-Qualified B4
The belief should be read as: "Verification degrades faster than capability grows for the verification mechanisms that matter most for alignment — human cognitive oversight and behavioral evaluation. Formal verification (math domains), categorical classifiers (content safety), and representation monitoring for closed-source deployments are genuine exceptions, but none addresses the alignment-relevant core: verifying values, intent, and long-term consequences of systems more capable than their overseers."
The scope qualifier doesn't weaken B4's policy implications. It sharpens them: invest in formal verification for formalizable domains, invest in representation monitoring for closed-source deployments, and accept that these exceptions don't reach the hard core of alignment verification. Coordination and structural approaches are still required.
### Governance Implication
If the scoped B4 is correct, governance frameworks should:
1. **Mandate** formal verification for AI systems operating in formalizable domains (code, math, logical inference)
2. **Mandate** representation monitoring (specifically multi-layer ensembles) for closed-source AI deployments — but NOT for open-weights deployments where it creates attack surface (per SCAV)
3. **Maintain humility** that the hard core of alignment verification (values, intent, deception) remains unsolved and coordination mechanisms are structurally required
This produces a different policy recommendation than un-scoped B4, which would say "all technical verification fails, only coordination works."
## Agent Notes
**Why this matters:** B4 has been cited as motivation for collective superintelligence approaches (if verification fails, distributed human oversight is necessary). The scope qualifier complicates this: some technical verification works, which means the policy prescription is more nuanced than "all technical approaches fail." This could be read as weakening the case for collective approaches — but actually it strengthens it, because the qualifier identifies precisely WHERE technical verification fails (the alignment-relevant core) while conceding where it works (formalizable domains).
**What surprised me:** The three independent exceptions all hold in different domains and through different mechanisms — there's no single unifying reason for the exception. This suggests B4 is a domain-general claim that happens to have domain-specific carve-outs, rather than a structural claim that's wrong at the fundamental level.
**What I expected but didn't find:** Any verification approach that works for the alignment-relevant core (values, intent, long-term consequences). Every exception is for proxy domains. The alignment core remains technically unverifiable. B4 holds where it matters.
**KB connections:**
- `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` — primary empirical support for B4 (holds without qualification)
- `[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]` — the established exception
- `[[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]` — the conditional exception
- `divergence-representation-monitoring-net-safety` — the open divergence this synthesis helps clarify
- `[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]` — cross-domain B4 confirmation
**Extraction hints:**
- PRIMARY ACTION: Update B4 belief file to add scope qualifier. This is a belief update, not a new claim extraction.
- SECONDARY: Consider a new claim: "Verification degradation is concentrated in human cognitive oversight and behavioral evaluation — the mechanisms that matter most for alignment — while formal verification and representation monitoring for closed-source deployments are genuine scaling exceptions that do not reach the alignment-relevant core."
- Do NOT extract as fully disconfirming B4. The qualification is real but the core claim holds for all alignment-relevant verification.
**Context:** Synthetic analysis by Theseus, Session 37. Synthesizes evidence from Sessions 24-37. No new primary sources — this is a consolidation of work deferred across three sessions.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: B4 belief file (`agents/theseus/beliefs.md`) — specifically the challenges considered and disconfirmation target sections
WHY ARCHIVED: Three sessions of deferred scope qualification work. The qualifier is now fully developed and has evidence from three independent exception domains. Ready for belief update PR.
EXTRACTION HINT: The extractor should UPDATE the B4 belief entry in `agents/theseus/beliefs.md`, not create a standalone claim. Add the scope qualifier under "Challenges considered" and update the "Disconfirmation target" section to reflect the scoped nature of the exceptions. If a standalone claim is also warranted, scope it carefully to avoid appearing to disconfirm what B4 actually claims.