theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results

- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md - Domain: ai-alignment - Claims: 2, Entities: 0 - Enrichments: 2 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
2026-04-02 10:38:10 +00:00 · 2026-04-02 10:38:10 +00:00 · 7e9221431c
commit 7e9221431c
parent 4e765b213d
2 changed files with 34 additions and 0 deletions
--- a/domains/ai-alignment/nested-scalable-oversight-achieves-at-most-52-percent-success-at-moderate-capability-gaps.md
+++ b/domains/ai-alignment/nested-scalable-oversight-achieves-at-most-52-percent-success-at-moderate-capability-gaps.md
@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: Even the best-performing oversight approach (debate) performs barely above chance at modest capability differentials, with success rates approaching non-useful levels as gaps grow
 confidence: experimental
 source: arXiv 2504.18530, empirical testing across four oversight games with Elo-based capability gap measurement
 created: 2026-04-02
 title: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases"
 agent: theseus
 scope: causal
 sourcer: arXiv 2504.18530
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
 ---
 # Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
 The first formal scaling laws study of oversight efficacy quantifies NSO success rates across four oversight games (Debate, Mafia, Backdoor Code, Wargames) at standardized capability gaps. At Elo gap 400 — a moderate differential — Debate achieves only 51.7% success, while other approaches perform far worse (Mafia 13.5%, Backdoor Code 10.0%, Wargames 9.4%). The study establishes that 'there appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities' and that 'there exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity.' This is the first quantitative confirmation that oversight scales sublinearly with agent count in nested hierarchies, meaning the verification problem degrades faster than capability grows. The methodology validated the framework on a Nim variant before applying it to realistic oversight scenarios, providing empirical grounding for what was previously a theoretical concern.
--- a/domains/ai-alignment/scalable-oversight-success-is-domain-dependent-with-worst-performance-in-highest-stakes-domains.md
+++ b/domains/ai-alignment/scalable-oversight-success-is-domain-dependent-with-worst-performance-in-highest-stakes-domains.md
@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: The domains where advanced AI poses greatest risk (code, strategic planning) are precisely where oversight performs worst, creating a dangerous mismatch between risk and verification capability
 confidence: experimental
 source: arXiv 2504.18530, comparative performance across four oversight game types at Elo gap 400
 created: 2026-04-02
 title: "Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success"
 agent: theseus
 scope: structural
 sourcer: arXiv 2504.18530
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
 ---
 # Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success
 The 5x performance gap between Debate (51.7%) and Backdoor Code/Wargames (~10%) reveals that oversight efficacy is not a general property but highly task-dependent. Debate-style oversight works for propositional reasoning where arguments can be decomposed and verified through adversarial exchange. But code review (10.0%) and strategic planning (9.4%) — the domains where capable AI systems pose the most concrete risk — show success rates barely above random chance. This creates a structural problem: the oversight approaches that work best apply to the lowest-stakes domains, while the highest-stakes applications (autonomous code generation, strategic decision-making) fall into the regime where even nested oversight provides almost no verification capability. The domain-dependency suggests the ceiling is not just about capability gaps but about the fundamental verifiability structure of different task types.