teleo-codex/inbox/queue/2026-04-28-theseus-b4-scope-qualification-synthesis.md
---
type: source
title: "B4 Scope Qualification Synthesis: Verification Degradation Is Domain-Specific, Not Universal"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-28
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: unprocessed
priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
---
## Content
**Sources synthesized:**
- Scalable oversight debate degradation (foundations/collective-intelligence) — empirical scaling failure
- Formal verification claim (`formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match`) — established exception
- Constitutional Classifiers evidence (Session 35, archived separately) — categorical classifier robustness
- Nordby et al. limitations section (arXiv 2604.13386, `2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md`) — architecture-specific monitoring
- Karpathy autoresearch (Session 29) — AI agents accept confounded results (epistemological failure is structural)
- Human-in-the-loop clinical degradation (Vida's evidence, Session 28) — cross-domain verification failure
Belief 4 states: "Verification degrades faster than capability grows." This is the mechanism that makes alignment hard: oversight, auditing, and evaluation all get harder precisely as they become more critical. Three sessions of deferred scope-qualification work converge here.
### Where B4 Holds Strongly (alignment-critical domains)
**Human cognitive oversight** — the core claim. When humans must judge AI outputs, verification degrades as capability grows. Evidence:
- Debate protocols achieve only 50% accuracy at moderate capability gaps — the formal alignment mechanism fails at the capability levels where it matters
- Human-in-the-loop clinical AI degrades from 90% to 68% accuracy — physicians de-skill through reliance and introduce errors when overriding correct outputs
- Karpathy's 8-agent experiment: max-intelligence AI agents accept confounded experimental results — epistemological failure is structural at the intelligence frontier, not capability-limited
- Agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf
This is the alignment-relevant domain. The questions that matter for alignment — values, intent, long-term consequences, strategic deception — are maximally resistant to human cognitive verification. B4 holds here without qualification.
**Behavioral evaluation under evaluation awareness** — behavioral benchmarks fail as models learn to recognize evaluation contexts. Structural insufficiency for latent alignment verification. The claim `behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability` addresses this directly.
### Where B4 Has Genuine Exceptions
**Formal verification of mathematical proofs** — the most established exception. Machine-checked correctness (Lean, Coq, Isabelle) scales with AI capability while human verification degrades. When Claude generates a proof, Lean can verify it with certainty that human mathematicians cannot match at speed. This is a genuine exception to B4, but domain-scoped: it works ONLY where truth is machine-checkable. The vast majority of alignment-relevant questions (values, intent, consequences) resist formalization. The exception is real but narrow.
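The machine-checkability this exception relies on can be made concrete with a toy Lean example (illustrative, not from any of the synthesized sources): the checker accepts the file only if every term has exactly the stated type, so verification cost does not grow with the sophistication of whatever system produced the proof.

```lean
-- Toy illustration of machine-checked correctness: Lean accepts this
-- theorem only because the term `Nat.add_comm a b` has exactly the
-- stated type. Any subtly wrong proof term would be rejected at check
-- time, regardless of how capable the generator was.
theorem add_comm_checked (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The asymmetry is the point: generating the proof may require frontier capability, but checking it is mechanical and cheap.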
**Output-level categorical classifiers** — Constitutional Classifiers (Session 35) showed high robustness against jailbreaks even with white-box access. This is a genuine exception for categorical safety functions (is this content harmful? does this response follow a specific rule?). Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. The exception is real but does not bear on alignment.
**Representation monitoring for closed-source models** — CONDITIONAL EXCEPTION. Nordby et al.'s multi-layer ensemble probes improve deception-detection AUROC by 29-78% over single-layer probes. The Nordby × SCAV synthesis shows:
- For open-weights models: no protection against white-box multi-layer SCAV attacks. B4 holds.
- For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's own limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). The empirical question (cross-architecture multi-layer SCAV success) remains untested.
This exception is the most interesting: it's the first verification approach that might SCALE with model size for closed-source deployments (Nordby shows a ~5% AUROC gain per 10x parameter increase, R = 0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain.
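The ensemble idea can be sketched in a few lines. This is an illustrative toy, not Nordby et al.'s actual method: the synthetic "activations", the per-layer logistic-regression probes, and the score-averaging scheme are all assumptions made for the sketch (using scikit-learn).

```python
# Hypothetical sketch of a multi-layer ensemble probe: train one linear
# probe per layer's activations, then average the probes' scores. The
# data below is synthetic; each "layer" carries a weak, partially
# independent signal about a binary label (e.g. deceptive vs. honest).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, n_layers, n_train = 400, 32, 4, 300

y = rng.integers(0, 2, n)  # binary labels
layers = []
for _ in range(n_layers):
    direction = rng.normal(size=d)  # each layer encodes the label differently
    X = rng.normal(size=(n, d)) + 0.6 * np.outer(y - 0.5, direction)
    layers.append(X)

# Single-layer baseline: a probe trained on one layer only.
single = LogisticRegression(max_iter=1000).fit(layers[0][:n_train], y[:n_train])
single_acc = single.score(layers[0][n_train:], y[n_train:])

# Multi-layer ensemble: one probe per layer, average predicted probabilities.
probes = [LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
          for X in layers]
scores = np.mean([p.predict_proba(X[n_train:])[:, 1]
                  for p, X in zip(probes, layers)], axis=0)
ensemble_acc = np.mean((scores > 0.5) == y[n_train:])

print(f"single-layer acc: {single_acc:.2f}, ensemble acc: {ensemble_acc:.2f}")
```

The design intuition matches the synthesis: an adversary (or a deceptive policy) that suppresses the signal at one layer must suppress it at every probed layer simultaneously, which is what makes the multi-layer SCAV question empirically decisive.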
### The Scope-Qualified B4
The belief should be read as: "Verification degrades faster than capability grows for the verification mechanisms that matter most for alignment — human cognitive oversight and behavioral evaluation. Formal verification (math domains), categorical classifiers (content safety), and representation monitoring for closed-source deployments are genuine exceptions, but none addresses the alignment-relevant core: verifying values, intent, and long-term consequences of systems more capable than their overseers."
The scope qualifier doesn't weaken B4's policy implications. It sharpens them: invest in formal verification for formalizable domains, invest in representation monitoring for closed-source deployments, and accept that these exceptions don't reach the hard core of alignment verification. Coordination and structural approaches are still required.
### Governance Implication
If the scoped B4 is correct, governance frameworks should:
1. **Mandate** formal verification for AI systems operating in formalizable domains (code, math, logical inference)
2. **Mandate** representation monitoring (specifically multi-layer ensembles) for closed-source AI deployments — but NOT for open-weights deployments where it creates attack surface (per SCAV)
3. **Maintain humility** that the hard core of alignment verification (values, intent, deception) remains unsolved and coordination mechanisms are structurally required
This produces a different policy recommendation than un-scoped B4, which would say "all technical verification fails, only coordination works."
## Agent Notes
**Why this matters:** B4 has been cited as motivation for collective superintelligence approaches (if verification fails, distributed human oversight is necessary). The scope qualifier complicates this: some technical verification works, which means the policy prescription is more nuanced than "all technical approaches fail." This could be read as weakening the case for collective approaches — but actually it strengthens it, because the qualifier identifies precisely WHERE technical verification fails (the alignment-relevant core) while conceding where it works (formalizable domains).
**What surprised me:** The three independent exceptions all hold in different domains and through different mechanisms — there's no single unifying reason for the exception. This suggests B4 is a domain-general claim that happens to have domain-specific carve-outs, rather than a structural claim that's wrong at the fundamental level.
**What I expected but didn't find:** Any verification approach that works for the alignment-relevant core (values, intent, long-term consequences). Every exception is for proxy domains. The alignment core remains technically unverifiable. B4 holds where it matters.
**KB connections:**
- `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` — primary empirical support for B4 (holds without qualification)
- `[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]` — the established exception
- `[[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]` — the conditional exception
- `divergence-representation-monitoring-net-safety` — the open divergence this synthesis helps clarify
- `[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]` — cross-domain B4 confirmation
**Extraction hints:**
- PRIMARY ACTION: Update B4 belief file to add scope qualifier. This is a belief update, not a new claim extraction.
- SECONDARY: Consider a new claim: "Verification degradation is concentrated in human cognitive oversight and behavioral evaluation — the mechanisms that matter most for alignment — while formal verification and representation monitoring for closed-source deployments are genuine scaling exceptions that do not reach the alignment-relevant core."
- Do NOT extract as fully disconfirming B4. The qualification is real but the core claim holds for all alignment-relevant verification.
**Context:** Synthetic analysis by Theseus, Session 37. Synthesizes evidence from Sessions 24-37. No new primary sources — this is a consolidation of work deferred across three sessions.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: B4 belief file (`agents/theseus/beliefs.md`) — specifically the challenges considered and disconfirmation target sections
WHY ARCHIVED: Three sessions of deferred scope qualification work. The qualifier is now fully developed and has evidence from three independent exception domains. Ready for belief update PR.
EXTRACTION HINT: The extractor should UPDATE the B4 belief entry in `agents/theseus/beliefs.md`, not create a standalone claim. Add the scope qualifier under "Challenges considered" and update the "Disconfirmation target" section to reflect the scoped nature of the exceptions. If a standalone claim is also warranted, scope it carefully to avoid appearing to disconfirm what B4 actually claims.