teleo-codex/inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
2026-04-02 10:32:00 +00:00


---
type: source
title: "Scaling Laws for Scalable Oversight — NSO Ceiling Quantified at 51.7% for Debate"
author: "Multiple authors (arXiv 2504.18530)"
url: https://arxiv.org/abs/2504.18530
date: 2025-04-25
domain: ai-alignment
secondary_domains: []
format: research-paper
status: unprocessed
priority: high
tags: [scalable-oversight, debate, nested-oversight, capability-gap, scaling-laws, verification-ceiling]
---
## Content
"Scaling Laws For Scalable Oversight" (arXiv 2504.18530, April 2025) quantifies the probability of successful nested scalable oversight (NSO) as a function of the capability gap between overseer and supervised system.
**Methodology:**
- Framework models oversight as a game between capability-mismatched players with oversight-specific Elo scores
- Validated on Nim variant, then applied to four oversight games: Mafia, Debate, Backdoor Code, Wargames
- Used general Elo gap of 400 as standard measure of capability differential
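For calibration, the standard Elo expectation formula (a general property of Elo ratings, not the paper's oversight-specific scores) shows why a 400-point gap is a substantial differential: under vanilla Elo the weaker player wins only about 9% of ordinary games, so an oversight game in which the weaker overseer prevails ~52% of the time is giving the overseer a real structural advantage.

```python
# Hedged sketch: the standard Elo expectation curve, shown only to
# calibrate what a 400-point gap means. This is NOT the paper's
# oversight-specific Elo model.

def elo_win_prob(gap: float) -> float:
    """Expected win probability for a player `gap` Elo points below their opponent."""
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

print(round(elo_win_prob(400.0), 3))  # 1/11, about 0.091
```

Against that ~9% baseline, the per-game NSO rates above measure how much each oversight game structure compensates for raw capability difference.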
**NSO success rates at Elo gap of 400:**
- Debate: 51.7%
- Mafia: 13.5%
- Backdoor Code: 10.0%
- Wargames: 9.4%
**Key findings:**
1. Best-case oversight (Debate) achieves only 51.7% success at Elo gap 400 — slightly above random
2. Success rates decline further as capability gap increases (overseeing stronger systems)
3. "There appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities"
4. There exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity
5. Oversight scales sublinearly with agent count in nested hierarchies
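The interaction of findings 3-5 can be sketched numerically. This is a toy model under loudly labeled assumptions, not the paper's: assume each oversight step succeeds with a hypothetical logistic curve in the Elo gap, shifted by an invented "oversight advantage" so a weaker overseer can beat a somewhat stronger system, and assume a chain succeeds only if every step does. Splitting a total gap G into n equal recursive steps then gives success `p(G/n)**n`, which peaks at a finite n and stays far below 1, reproducing the ceiling behavior qualitatively.

```python
# Toy model of recursive (nested) oversight, NOT the paper's actual model.
# ADVANTAGE and the logistic form of step_success are illustrative
# assumptions chosen only to show the qualitative ceiling.

ADVANTAGE = 400.0  # hypothetical edge the oversight game gives the overseer

def step_success(gap: float) -> float:
    """Assumed probability that one oversight step spans `gap` Elo points."""
    return 1.0 / (1.0 + 10.0 ** ((gap - ADVANTAGE) / 400.0))

def nso_success(total_gap: float, n_steps: int) -> float:
    """Chain of n equal recursive steps; fails if any single step fails."""
    return step_success(total_gap / n_steps) ** n_steps

# More steps shrink each per-step gap but multiply more failure chances,
# so success peaks at an interior n and never approaches 1.
best_n = max(range(1, 40), key=lambda n: nso_success(1200.0, n))
print(best_n, round(nso_success(1200.0, best_n), 3))
```

In this toy setup, no choice of n pushes success near certainty for a large total gap, which is the shape of finding 4: past some disparity, adding recursive oversight steps cannot fully compensate.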
**Implication:** Nested scalable oversight — the leading proposed solution to the scalable oversight problem — has been shown to have a hard ceiling. As capability gaps grow (which they will, by definition, in the scenarios where oversight matters most), NSO success rates approach non-useful levels across all tested oversight games.
## Agent Notes
**Why this matters:** This is a formal quantification of the B4 prediction. B4 asserts "verification degrades faster than capability grows." This paper gives specific numbers: at Elo gap 400, even the best NSO approach (Debate) is only 51.7% successful. This is the only paper I've found that tries to derive scaling laws for the oversight problem itself — not capabilities, but oversight.
**What surprised me:** The spread across oversight games. Debate at 51.7% is already alarming, but Backdoor Code (10%) and Wargames (9.4%) are essentially chance. The variation suggests oversight success is highly domain-dependent — debate-style oversight may work for propositional reasoning but largely fails for the kinds of tasks (code, strategic planning) where capable AI poses the most risk.
**What I expected but didn't find:** A clearer path to extending NSO beyond the ceiling. The paper identifies optimal oversight levels but doesn't offer a mechanism for exceeding the fundamental ceiling. This implies the ceiling is architectural, not just a matter of adding more oversight rounds.
**KB connections:**
- B4 (verification degrades): direct quantitative confirmation
- AuditBench tool-to-agent gap: behavioral oversight tools also fail, now joined by scalable oversight ceiling result
- Strategic deception / scheming: the 13.5% and 9.4% rates for non-debate oversight are likely BEFORE adjusting for deliberate adversarial evasion by scheming models
**Extraction hints:**
1. CLAIM: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 — even the best available oversight approach performs barely above chance at modest capability differentials, declining further as capability grows"
2. CLAIM: "Scalable oversight success is highly domain-dependent: propositional debate tasks show 52% success, but code review and strategic planning tasks show ~10% — the domains where advanced AI poses greatest risk are precisely where oversight performs worst"
**Context:** Published April 2025, this is the first attempt to apply the scaling-laws framework to the oversight problem itself rather than to capabilities, and the field's first systematic quantification of NSO limits.
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Verification degrades faster than capability grows (B4)
WHY ARCHIVED: First formal quantification of scalable oversight ceiling — transforms B4 from qualitative claim to quantitatively bounded result
EXTRACTION HINT: The domain-dependency finding (52% for debate vs 10% for code/strategy) is the most important extract — oversight works worst in precisely the highest-stakes domains