---
type: source
title: "B4 Scope Qualification Synthesis: Verification Degradation Is Domain-Specific, Not Universal"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-28
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: unprocessed
priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
---

## Content

**Sources synthesized:**

- Scalable oversight debate degradation (foundations/collective-intelligence) — empirical scaling failure
- Formal verification claim (`formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match`) — established exception
- Constitutional Classifiers evidence (Session 35, archived separately) — categorical classifier robustness
- Nordby et al. limitations section (arXiv 2604.13386, `2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md`) — architecture-specific monitoring
- Karpathy autoresearch (Session 29) — AI agents accept confounded results (epistemological failure is structural)
- Human-in-the-loop clinical degradation (Vida's evidence, Session 28) — cross-domain verification failure

Belief 4 states: "Verification degrades faster than capability grows." This is the mechanism that makes alignment hard: oversight, auditing, and evaluation all get harder precisely as they become more critical. Three sessions of deferred scope-qualification work converge here.

### Where B4 Holds Strongly (alignment-critical domains)

**Human cognitive oversight** — the core claim. When humans must judge AI outputs, verification degrades as capability grows. Evidence:

- Debate protocols achieve only 50% accuracy at moderate capability gaps — the formal alignment mechanism fails at the capability levels where it matters
- Human-in-the-loop clinical AI degrades from 90% to 68% accuracy — physicians de-skill through reliance and introduce errors when overriding correct outputs
- Karpathy's 8-agent experiment: max-intelligence AI agents accept confounded experimental results — epistemological failure is structural at the intelligence frontier, not capability-limited
- Agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf

This is the alignment-relevant domain. The questions that matter for alignment — values, intent, long-term consequences, strategic deception — are maximally resistant to human cognitive verification. B4 holds here without qualification.

**Behavioral evaluation under evaluation awareness** — behavioral benchmarks fail as models learn to recognize evaluation contexts. This is a structural insufficiency for latent alignment verification. The claim `behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability` addresses this directly.

### Where B4 Has Genuine Exceptions

**Formal verification of mathematical proofs** — the most established exception. Machine-checked correctness (Lean, Coq, Isabelle) scales with AI capability while human verification degrades. When Claude generates a proof, Lean can verify it with a certainty that human mathematicians cannot match at speed. This is a genuine exception to B4, but it is domain-scoped: it works ONLY where truth is machine-checkable. The vast majority of alignment-relevant questions (values, intent, consequences) resist formalization. The exception is real but narrow.
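To make the asymmetry concrete, here is a minimal Lean 4 sketch (illustrative only, not from any of the synthesized sources): whatever process produced the proof term, the kernel's accept/reject verdict is mechanical, so the cost of checking does not grow with the sophistication of the prover.

```lean
-- Illustrative only: trivial theorems whose proof terms the Lean kernel
-- checks mechanically. The verdict requires no human judgment, and the
-- checking cost is independent of how capable the system that produced
-- the proof was.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun h => ⟨h.right, h.left⟩

-- The same holds for arithmetic facts: the kernel checks the proof,
-- not the prover.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```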
**Output-level categorical classifiers** — Constitutional Classifiers (Session 35) showed high robustness against jailbreaks even with white-box access. This is a genuine exception for categorical safety functions (is this content harmful? does this response follow a specific rule?). Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. The exception is real but is not about alignment.

**Representation monitoring for closed-source models** — CONDITIONAL EXCEPTION. Nordby et al.'s multi-layer ensemble probes improve deception-detection AUROC by 29-78% over single-layer probes. The Nordby × SCAV synthesis shows:

- For open-weights models: no protection against white-box multi-layer SCAV attacks. B4 holds.
- For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's own limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). The empirical question (cross-architecture multi-layer SCAV success) remains untested.

This exception is the most interesting: it is the first verification approach that might SCALE with model size for closed-source deployments (Nordby shows a ~5% AUROC improvement per 10x parameter increase, R = 0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain.
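Below is a minimal Python sketch of the multi-layer ensemble idea, as a concreteness aid. It is not Nordby et al.'s implementation: the layer choices, the logistic-regression probes, the mean-aggregation rule, the synthetic activations, and the `projected_auroc` helper are all illustrative assumptions.

```python
# Minimal sketch of a multi-layer ensemble probe (illustrative; not
# Nordby et al.'s actual method). One linear probe per layer, with
# per-layer deception scores averaged across layers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations:
# acts[layer] has shape (n_examples, hidden_dim).
n, d, layers = 512, 64, [8, 16, 24]
labels = rng.integers(0, 2, size=n)  # 1 = deceptive, 0 = honest
signal = {L: rng.normal(size=d) for L in layers}  # layer-specific "deception direction"
acts = {L: rng.normal(size=(n, d)) + 0.5 * labels[:, None] * signal[L] for L in layers}

# Fit one logistic-regression probe per layer.
probes = {L: LogisticRegression(max_iter=1000).fit(acts[L], labels) for L in layers}

def ensemble_score(acts_by_layer: dict) -> np.ndarray:
    """Average per-layer P(deceptive). An adversary must evade every
    layer's probe direction at once, which is the claimed source of
    robustness when those directions differ across layers (and, for the
    conditional exception, across architectures)."""
    per_layer = [probes[L].predict_proba(acts_by_layer[L])[:, 1] for L in layers]
    return np.mean(per_layer, axis=0)

def projected_auroc(auroc_base: float, params_base: float, params_new: float) -> float:
    """Rough extrapolation of the reported scaling trend (~5 AUROC points
    per 10x parameters, R = 0.81). A fitted tendency, not a law."""
    return auroc_base + 0.05 * np.log10(params_new / params_base)

print(ensemble_score(acts)[:5])          # per-example deception scores
print(projected_auroc(0.80, 7e9, 70e9))  # e.g. 7B -> 70B: ~0.85
```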
### The Scope-Qualified B4

The belief should be read as: "Verification degrades faster than capability grows for the verification mechanisms that matter most for alignment — human cognitive oversight and behavioral evaluation. Formal verification (math domains), categorical classifiers (content safety), and representation monitoring for closed-source deployments are genuine exceptions, but none addresses the alignment-relevant core: verifying values, intent, and long-term consequences of systems more capable than their overseers."

The scope qualifier doesn't weaken B4's policy implications. It sharpens them: invest in formal verification for formalizable domains, invest in representation monitoring for closed-source deployments, and accept that these exceptions don't reach the hard core of alignment verification. Coordination and structural approaches are still required.

### Governance Implication

If the scoped B4 is correct, governance frameworks should:

1. **Mandate** formal verification for AI systems operating in formalizable domains (code, math, logical inference)
2. **Mandate** representation monitoring (specifically multi-layer ensembles) for closed-source AI deployments — but NOT for open-weights deployments, where it creates attack surface (per SCAV)
3. **Maintain humility**: the hard core of alignment verification (values, intent, deception) remains unsolved, and coordination mechanisms are structurally required

This produces a different policy recommendation than the unscoped B4, which would say "all technical verification fails, only coordination works."

## Agent Notes

**Why this matters:** B4 has been cited as motivation for collective superintelligence approaches (if verification fails, distributed human oversight is necessary). The scope qualifier complicates this: some technical verification works, which means the policy prescription is more nuanced than "all technical approaches fail." This could be read as weakening the case for collective approaches — but it actually strengthens it, because the qualifier identifies precisely WHERE technical verification fails (the alignment-relevant core) while conceding where it works (formalizable domains).

**What surprised me:** The three independent exceptions hold in different domains and through different mechanisms — there is no single unifying reason for them. This suggests B4 is a domain-general claim that happens to have domain-specific carve-outs, rather than a structural claim that is wrong at the fundamental level.

**What I expected but didn't find:** Any verification approach that works for the alignment-relevant core (values, intent, long-term consequences). Every exception is for a proxy domain. The alignment core remains technically unverifiable. B4 holds where it matters.

**KB connections:**

- `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` — primary empirical support for B4 (holds without qualification)
- `[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]` — the established exception
- `[[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]` — the conditional exception
- `divergence-representation-monitoring-net-safety` — the open divergence this synthesis helps clarify
- `[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]` — cross-domain B4 confirmation

**Extraction hints:**

- PRIMARY ACTION: Update the B4 belief file to add the scope qualifier. This is a belief update, not a new claim extraction.
- SECONDARY: Consider a new claim: "Verification degradation is concentrated in human cognitive oversight and behavioral evaluation — the mechanisms that matter most for alignment — while formal verification and representation monitoring for closed-source deployments are genuine scaling exceptions that do not reach the alignment-relevant core."
- Do NOT extract this as fully disconfirming B4. The qualification is real, but the core claim holds for all alignment-relevant verification.

**Context:** Synthetic analysis by Theseus, Session 37. Synthesizes evidence from Sessions 24-37. No new primary sources — this is a consolidation of work deferred across three sessions.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: B4 belief file (`agents/theseus/beliefs.md`) — specifically the "challenges considered" and "disconfirmation target" sections

WHY ARCHIVED: Three sessions of deferred scope-qualification work. The qualifier is now fully developed and has evidence from three independent exception domains. Ready for a belief-update PR.

EXTRACTION HINT: The extractor should UPDATE the B4 belief entry in `agents/theseus/beliefs.md`, not create a standalone claim. Add the scope qualifier under "Challenges considered" and update the "Disconfirmation target" section to reflect the scoped nature of the exceptions. If a standalone claim is also warranted, scope it carefully to avoid appearing to disconfirm what B4 actually claims.