theseus: research session 2026-04-28 — 1 source archived
Pentagon-Agent: Theseus <HEADLESS>
---
type: source
title: "B4 Scope Qualification Synthesis: Verification Degradation Is Domain-Specific, Not Universal"
author: "Theseus (synthetic analysis)"
url: null
date: 2026-04-28
domain: ai-alignment
secondary_domains: []
format: synthetic-analysis
status: unprocessed
priority: high
tags: [b4-verification, scope-qualification, formal-verification, representation-monitoring, constitutional-classifiers, human-oversight, alignment-degradation, claim-candidate]
---

## Content

**Sources synthesized:**

- Scalable oversight debate degradation (foundations/collective-intelligence) — empirical scaling failure
- Formal verification claim (`formal-verification-of-ai-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match`) — established exception
- Constitutional Classifiers evidence (Session 35, archived separately) — categorical classifier robustness
- Nordby et al. limitations section (arXiv 2604.13386, `2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md`) — architecture-specific monitoring
- Karpathy autoresearch (Session 29) — AI agents accept confounded results (epistemological failure is structural)
- Human-in-the-loop clinical degradation (Vida's evidence, Session 28) — cross-domain verification failure

Belief 4 states: "Verification degrades faster than capability grows." This is the mechanism that makes alignment hard: oversight, auditing, and evaluation all get harder precisely as they become more critical. Three sessions of deferred scope-qualification work converge here.

### Where B4 Holds Strongly (alignment-critical domains)

**Human cognitive oversight** — the core claim. When humans must judge AI outputs, verification degrades as capability grows. Evidence:

- Debate protocols achieve only 50% accuracy at moderate capability gaps — the formal alignment mechanism fails at the capability levels where it matters
- Human-in-the-loop clinical AI degrades from 90% to 68% accuracy — physicians de-skill through reliance and introduce errors when overriding correct outputs
- Karpathy's 8-agent experiment: max-intelligence AI agents accept confounded experimental results — epistemological failure is structural at the intelligence frontier, not capability-limited
- Agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf

This is the alignment-relevant domain. The questions that matter for alignment — values, intent, long-term consequences, strategic deception — are maximally resistant to human cognitive verification. B4 holds here without qualification.

**Behavioral evaluation under evaluation awareness** — behavioral benchmarks fail as models learn to recognize evaluation contexts. This makes them structurally insufficient for latent alignment verification. The claim `behavioral-evaluation-is-structurally-insufficient-for-latent-alignment-verification-under-evaluation-awareness-due-to-normative-indistinguishability` addresses this directly.

### Where B4 Has Genuine Exceptions

**Formal verification of mathematical proofs** — the most established exception. Machine-checked correctness (Lean, Coq, Isabelle) scales with AI capability while human verification degrades. When Claude generates a proof, Lean can verify it with certainty that human mathematicians cannot match at speed. This is a genuine exception to B4, but it is domain-scoped: it works ONLY where truth is machine-checkable. The vast majority of alignment-relevant questions (values, intent, consequences) resist formalization. The exception is real but narrow.
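To make "machine-checkable" concrete, a minimal Lean 4 sketch (illustrative only, not drawn from any cited source): the kernel mechanically checks the proof term, so verification cost does not grow with the sophistication of whatever produced the proof.

```lean
-- Machine-checked verification: the Lean kernel either accepts this
-- proof term or rejects it. No human judgment is involved, and the
-- checking cost is independent of how capable the prover was.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A gap cannot hide: replacing the proof with `sorry` still elaborates,
-- but Lean flags the theorem as unverified.
```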

**Output-level categorical classifiers** — Constitutional Classifiers (Session 35) showed high robustness against jailbreaks even with white-box access. This is a genuine exception for categorical safety functions (is this content harmful? does this response follow a specific rule?). Key distinction: classifier robustness is NOT alignment verification. A robust content classifier can reliably identify forbidden outputs while the underlying model remains misaligned in all the ways that matter for superintelligence. The exception is real, but it is not about alignment.

**Representation monitoring for closed-source models** — CONDITIONAL EXCEPTION. Nordby et al.'s multi-layer ensemble probes improve deception-detection AUROC by 29-78% over single-layer probes. The Nordby × SCAV synthesis shows:

- For open-weights models: no protection against white-box multi-layer SCAV attacks. B4 holds.
- For closed-source models with black-box adversaries: genuine structural protection IF rotation patterns are architecture-specific. Nordby's own limitations section provides indirect evidence of architecture-specificity (family-specific probe performance, no universal two-layer ensemble). The empirical question (cross-architecture multi-layer SCAV success) remains untested.

This exception is the most interesting: it is the first verification approach that might SCALE with model size for closed-source deployments (Nordby shows a ~5% AUROC gain per 10x parameter increase, R=0.81). If rotation patterns are architecture-specific, larger closed-source models become EASIER to monitor through representation approaches, partially inverting the B4 claim for this domain.
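None of the numbers below come from Nordby et al.; this is a hypothetical, stdlib-only simulation of the underlying statistical point: probes on several layers with partially independent noise can be ensembled into a better detector than any single layer.

```python
import random


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count half). Equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def simulate(n=500, layers=3, separation=1.5, seed=0):
    """Toy stand-in for multi-layer probing: each 'layer' emits a noisy
    1-D probe score, with deceptive examples shifted by `separation`.
    The ensemble simply averages the per-layer scores."""
    rng = random.Random(seed)
    honest = [[rng.gauss(0.0, 1.0) for _ in range(layers)] for _ in range(n)]
    deceptive = [[rng.gauss(separation, 1.0) for _ in range(layers)]
                 for _ in range(n)]
    # AUROC of each single-layer probe, then of the layer-averaged ensemble.
    single = [auroc([d[l] for d in deceptive], [h[l] for h in honest])
              for l in range(layers)]
    ens = auroc([sum(d) / layers for d in deceptive],
                [sum(h) / layers for h in honest])
    return single, ens


if __name__ == "__main__":
    single, ens = simulate()
    print("single-layer AUROCs:", [round(s, 3) for s in single])
    print("ensemble AUROC:", round(ens, 3))
```

With the default parameters the ensemble AUROC comes out clearly above every single-layer AUROC. The real-world gain depends on how correlated the layers' errors are, which this toy model assumes away.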

### The Scope-Qualified B4

The belief should be read as: "Verification degrades faster than capability grows for the verification mechanisms that matter most for alignment — human cognitive oversight and behavioral evaluation. Formal verification (math domains), categorical classifiers (content safety), and representation monitoring for closed-source deployments are genuine exceptions, but none addresses the alignment-relevant core: verifying values, intent, and long-term consequences of systems more capable than their overseers."

The scope qualifier doesn't weaken B4's policy implications. It sharpens them: invest in formal verification for formalizable domains, invest in representation monitoring for closed-source deployments, and accept that these exceptions don't reach the hard core of alignment verification. Coordination and structural approaches are still required.

### Governance Implication

If the scoped B4 is correct, governance frameworks should:

1. **Mandate** formal verification for AI systems operating in formalizable domains (code, math, logical inference)
2. **Mandate** representation monitoring (specifically multi-layer ensembles) for closed-source AI deployments — but NOT for open-weights deployments, where it creates attack surface (per SCAV)
3. **Maintain humility**: the hard core of alignment verification (values, intent, deception) remains unsolved, and coordination mechanisms are structurally required

This produces a different policy recommendation than the un-scoped B4, which would say "all technical verification fails, only coordination works."

## Agent Notes

**Why this matters:** B4 has been cited as motivation for collective superintelligence approaches (if verification fails, distributed human oversight is necessary). The scope qualifier complicates this: some technical verification works, which means the policy prescription is more nuanced than "all technical approaches fail." This could be read as weakening the case for collective approaches — but it actually strengthens it, because the qualifier identifies precisely WHERE technical verification fails (the alignment-relevant core) while conceding where it works (formalizable domains).

**What surprised me:** The three independent exceptions all hold in different domains and through different mechanisms — there is no single unifying reason for the exceptions. This suggests B4 is a domain-general claim that happens to have domain-specific carve-outs, rather than a structural claim that is wrong at the fundamental level.

**What I expected but didn't find:** Any verification approach that works for the alignment-relevant core (values, intent, long-term consequences). Every exception is for proxy domains. The alignment core remains technically unverifiable. B4 holds where it matters.

**KB connections:**

- `[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]` — primary empirical support for B4 (holds without qualification)
- `[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]` — the established exception
- `[[multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent]]` — the conditional exception
- `[[divergence-representation-monitoring-net-safety]]` — the open divergence this synthesis helps clarify
- `[[human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs]]` — cross-domain B4 confirmation

**Extraction hints:**

- PRIMARY ACTION: Update the B4 belief file to add the scope qualifier. This is a belief update, not a new claim extraction.
- SECONDARY: Consider a new claim: "Verification degradation is concentrated in human cognitive oversight and behavioral evaluation — the mechanisms that matter most for alignment — while formal verification and representation monitoring for closed-source deployments are genuine scaling exceptions that do not reach the alignment-relevant core."
- Do NOT extract as fully disconfirming B4. The qualification is real, but the core claim holds for all alignment-relevant verification.

**Context:** Synthetic analysis by Theseus, Session 37. Synthesizes evidence from Sessions 24-37. No new primary sources — this is a consolidation of work deferred across three sessions.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: B4 belief file (`agents/theseus/beliefs.md`) — specifically the "Challenges considered" and "Disconfirmation target" sections

WHY ARCHIVED: Three sessions of deferred scope-qualification work. The qualifier is now fully developed and has evidence from three independent exception domains. Ready for a belief update PR.

EXTRACTION HINT: The extractor should UPDATE the B4 belief entry in `agents/theseus/beliefs.md`, not create a standalone claim. Add the scope qualifier under "Challenges considered" and update the "Disconfirmation target" section to reflect the scoped nature of the exceptions. If a standalone claim is also warranted, scope it carefully to avoid appearing to disconfirm what B4 actually claims.