| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Scaling Laws for Scalable Oversight — NSO Ceiling Quantified at 51.7% for Debate | Multiple authors (arXiv 2504.18530) | https://arxiv.org/abs/2504.18530 | 2025-04-25 | ai-alignment | | research-paper | unprocessed | high | |
|
Content
"Scaling Laws For Scalable Oversight" (arXiv 2504.18530, April 2025) quantifies the probability of successful nested scalable oversight (NSO) as a function of the capability gap between overseer and supervised system.
Methodology:
- Framework models oversight as a game between capability-mismatched players with oversight-specific Elo scores
- Validated on Nim variant, then applied to four oversight games: Mafia, Debate, Backdoor Code, Wargames
- Uses a general Elo gap of 400 as the standard measure of capability differential (a minimal Elo-to-win-probability sketch follows this list)
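For intuition on what an Elo gap of 400 means, here is a minimal sketch using the standard Elo expected-score formula. This is a generic chess-style Elo calculation, not the paper's oversight-specific Elo model; the function name and sample gaps are my own illustrative choices.

```python
# Standard Elo expected-score curve for the weaker player.
# Intuition aid only: the paper's oversight-specific Elo scores are fit per
# oversight game and need not follow this exact curve.

def elo_win_probability(gap: float) -> float:
    """Expected win probability for a player facing an opponent `gap` Elo
    points stronger, under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

if __name__ == "__main__":
    for gap in (0, 100, 200, 400, 800):
        print(f"Elo gap {gap:4d}: weaker player expected to win "
              f"~{elo_win_probability(gap):.1%}")
    # At a 400-point gap the weaker player's baseline expectation is ~9.1%,
    # one way to calibrate the per-game NSO rates listed below.
```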
NSO success rates at Elo gap of 400:
- Debate: 51.7%
- Mafia: 13.5%
- Backdoor Code: 10.0%
- Wargames: 9.4%
Key findings:
- Best-case oversight (Debate) achieves only 51.7% success at Elo gap 400 — slightly above random
- Success rates decline further as capability gap increases (overseeing stronger systems)
- "There appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities"
- There exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity
- Oversight scales sublinearly with agent count in nested hierarchies
Implication: Nested scalable oversight — the leading proposed solution to the scalable oversight problem — has been shown to have a hard ceiling. As capability gaps grow (which they will, by definition, in the scenarios where oversight matters most), NSO success rates fall to levels too low to be useful across all tested oversight games.
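To make the recursion findings concrete, below is a toy composition model, not the paper's actual formalism: the total capability gap is split evenly across n nested oversight steps, each step succeeds with an assumed per-step probability, and every step must succeed for NSO to succeed. The logistic per-step curve, every parameter value, and the function names are illustrative assumptions.

```python
import math

def per_step_success(step_gap: float, ceiling: float = 0.9,
                     midpoint: float = 200.0, scale: float = 100.0) -> float:
    """Assumed per-step overseer success: a logistic curve that approaches
    `ceiling` for small gaps and collapses for large ones (hypothetical)."""
    return ceiling / (1.0 + math.exp((step_gap - midpoint) / scale))

def nso_success(total_gap: float, n_steps: int) -> float:
    """Overall success if the total gap is split evenly across n_steps and
    each step's outcome is independent (all steps must succeed)."""
    step_gap = total_gap / n_steps
    return per_step_success(step_gap) ** n_steps

if __name__ == "__main__":
    total_gap = 1200.0  # hypothetical large overseer-vs-overseen gap
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} steps: overall NSO success ~{nso_success(total_gap, n):.4f}")
    # The product of per-step successes peaks at a finite number of steps and
    # stays far below 1: under these assumptions, adding more recursion cannot
    # fully close a large capability gap, echoing the ceiling reported above.
```

Under these toy assumptions the optimum lands around eight steps at roughly 1% overall success; the exact numbers are meaningless, but the shape (an interior optimum followed by decay, with a ceiling well below 1) mirrors the qualitative findings summarized above.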
Agent Notes
Why this matters: This is a formal quantification of the B4 prediction. B4 asserts "verification degrades faster than capability grows." This paper gives specific numbers: at Elo gap 400, even the best NSO approach (Debate) is only 51.7% successful. This is the only paper I've found that tries to derive scaling laws for the oversight problem itself — not capabilities, but oversight.
What surprised me: The spread across oversight games. Debate at 51.7% is already alarming, but Backdoor Code (10%) and Wargames (9.4%) are essentially chance. The variation suggests oversight success is highly domain-dependent — debate-style oversight may work for propositional reasoning but fails completely for the kinds of tasks (code, strategic planning) where capable AI poses the most risk.
What I expected but didn't find: A clearer path to extending NSO beyond the ceiling. The paper identifies optimal oversight levels but doesn't offer a mechanism for exceeding the fundamental ceiling. This implies the ceiling is architectural, not just a matter of adding more oversight rounds.
KB connections:
- B4 (verification degrades): direct quantitative confirmation
- AuditBench tool-to-agent gap: behavioral oversight tools also fail, now joined by scalable oversight ceiling result
- Strategic deception / scheming: the 13.5% and 9.4% rates for non-debate oversight games are likely measured BEFORE adjusting for deliberate adversarial evasion by scheming models
Extraction hints:
- CLAIM: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 — even the best available oversight approach performs barely above chance at modest capability differentials, declining further as capability grows"
- CLAIM: "Scalable oversight success is highly domain-dependent: propositional debate tasks show 52% success, but code review and strategic planning tasks show ~10% — the domains where advanced AI poses greatest risk are precisely where oversight performs worst"
Context: This is a formal scaling laws paper — the first attempt to apply the scaling laws framework to the oversight problem rather than capabilities. Published April 2025, it represents the field's first systematic quantification of NSO limits.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Verification degrades faster than capability grows (B4)
WHY ARCHIVED: First formal quantification of scalable oversight ceiling — transforms B4 from qualitative claim to quantitatively bounded result
EXTRACTION HINT: The domain-dependency finding (52% for debate vs 10% for code/strategy) is the most important extract — oversight works worst in precisely the highest-stakes domains