---
type: source
title: "Scaling Laws for Scalable Oversight — NSO Ceiling Quantified at 51.7% for Debate"
author: "Multiple authors (arXiv 2504.18530)"
url: https://arxiv.org/abs/2504.18530
date: 2025-04-25
domain: ai-alignment
secondary_domains: []
format: research-paper
status: unprocessed
priority: high
tags: [scalable-oversight, debate, nested-oversight, capability-gap, scaling-laws, verification-ceiling]
---

## Content

"Scaling Laws For Scalable Oversight" (arXiv 2504.18530, April 2025) quantifies the probability of successful nested scalable oversight (NSO) as a function of the capability gap between the overseer and the supervised system.

**Methodology:**

- Models oversight as a game between capability-mismatched players, each assigned an oversight-specific Elo score
- Validated on a Nim variant, then applied to four oversight games: Mafia, Debate, Backdoor Code, and Wargames
- Uses a general Elo gap of 400 as the standard measure of capability differential

**NSO success rates at an Elo gap of 400:**

- Debate: 51.7%
- Mafia: 13.5%
- Backdoor Code: 10.0%
- Wargames: 9.4%

**Key findings:**

1. Best-case oversight (Debate) achieves only 51.7% success at an Elo gap of 400, barely above chance
2. Success rates decline further as the capability gap widens (i.e., when overseeing stronger systems)
3. "There appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities"
4. Beyond some capability disparity, no feasible number of recursive oversight steps can fully compensate
5. Oversight scales sublinearly with agent count in nested hierarchies

**Implication:** Nested scalable oversight, the leading proposed solution to the scalable oversight problem, has been shown to have a hard ceiling. As capability gaps grow (which they will, by definition, in the scenarios where oversight matters most), NSO success rates approach non-useful levels across all tested oversight games.
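The Elo framing and the recursive-compensation finding can be made concrete with a short sketch. This is not the paper's model: the function names, and the assumption in `nested_oversight` that recursive steps succeed independently, are illustrative simplifications added here.

```python
# Toy sketch of the Elo framing (not the paper's actual model).
# Assumptions are mine: standard Elo expected-score formula, and
# independence of recursive oversight steps.

def elo_win_prob(gap: float) -> float:
    """Standard Elo expected score for the player holding `gap` Elo points
    of advantage. At gap=400, the stronger player is expected to win ~91%."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

def nested_oversight(per_step_success: float, steps: int) -> float:
    """Toy compounding model: NSO succeeds only if every recursive step
    succeeds, assuming independent steps (a simplifying assumption)."""
    return per_step_success ** steps

# A 400-point Elo advantage implies ~0.909 win probability for the
# stronger player, which frames why 51.7% overseer success is a ceiling,
# not a floor.
print(round(elo_win_prob(400), 3))            # → 0.909

# Under the independence assumption, chaining even the best oversight
# game (Debate, 51.7%) three deep compounds toward uselessness.
print(round(nested_oversight(0.517, 3), 3))   # → 0.138
```

The compounding toy model is deliberately pessimistic (real NSO hierarchies need not require every step to succeed), but it illustrates why finding 5 in the Content section, sublinear scaling with agent count, is so damaging for the approach.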
## Agent Notes

**Why this matters:** This is a formal quantification of the B4 prediction. B4 asserts that "verification degrades faster than capability grows." This paper gives specific numbers: at an Elo gap of 400, even the best NSO approach (Debate) succeeds only 51.7% of the time. It is the only paper I've found that tries to derive scaling laws for the oversight problem itself: not for capabilities, but for oversight.

**What surprised me:** The spread across oversight games. Debate at 51.7% is already alarming, but Backdoor Code (10.0%) and Wargames (9.4%) are essentially chance. The variation suggests oversight success is highly domain-dependent: debate-style oversight may work for propositional reasoning but fails almost completely on the kinds of tasks (code, strategic planning) where capable AI poses the most risk.

**What I expected but didn't find:** A clearer path to extending NSO beyond the ceiling. The paper identifies optimal oversight levels but offers no mechanism for exceeding the fundamental ceiling, which implies the ceiling is architectural, not just a matter of adding more oversight rounds.

**KB connections:**

- B4 (verification degrades): direct quantitative confirmation
- AuditBench tool-to-agent gap: behavioral oversight tools also fail, now joined by the scalable oversight ceiling result
- Strategic deception / scheming: the 13.5% and 9.4% rates for non-debate oversight are likely BEFORE adjusting for deliberate adversarial evasion by scheming models

**Extraction hints:**

1. CLAIM: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 — even the best available oversight approach performs barely above chance at modest capability differentials, declining further as capability grows"
2. CLAIM: "Scalable oversight success is highly domain-dependent: propositional debate tasks show ~52% success, but code review and strategic planning tasks show ~10% — the domains where advanced AI poses the greatest risk are precisely where oversight performs worst"

**Context:** This is a formal scaling laws paper, the first attempt to apply the scaling-laws framework to the oversight problem rather than to capabilities. Published April 2025, it represents the field's first systematic quantification of NSO limits.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Verification degrades faster than capability grows (B4)

WHY ARCHIVED: First formal quantification of the scalable oversight ceiling; transforms B4 from a qualitative claim into a quantitatively bounded result

EXTRACTION HINT: The domain-dependency finding (~52% for debate vs ~10% for code/strategy) is the most important extract; oversight works worst in precisely the highest-stakes domains