---
type: source
title: "Scaling Laws for Scalable Oversight — NSO Ceiling Quantified at 51.7% for Debate"
author: "Multiple authors (arXiv 2504.18530)"
url: https://arxiv.org/abs/2504.18530
date: 2025-04-25
domain: ai-alignment
secondary_domains: []
format: research-paper
status: unprocessed
priority: high
tags: [scalable-oversight, debate, nested-oversight, capability-gap, scaling-laws, verification-ceiling]
---

## Content

"Scaling Laws for Scalable Oversight" (arXiv 2504.18530, April 2025) quantifies the probability of successful nested scalable oversight (NSO) as a function of the capability gap between overseer and supervised system.

**Methodology:**

- Framework models oversight as a game between capability-mismatched players with oversight-specific Elo scores
- Validated on a Nim variant, then applied to four oversight games: Mafia, Debate, Backdoor Code, and Wargames
- Uses a general Elo gap of 400 as the standard measure of capability differential
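The Elo framing gives a quick intuition for why a 400-point gap is punishing. As a sketch (assuming the standard Elo logistic; the paper fits oversight-specific Elos per game, which may differ), the weaker player's expected win probability falls off sharply with the gap:

```python
def elo_win_probability(elo_gap: float) -> float:
    """Expected win probability for the weaker player against an
    opponent rated `elo_gap` points higher (standard Elo logistic)."""
    return 1.0 / (1.0 + 10.0 ** (elo_gap / 400.0))

# Under the standard formula, a 400-point underdog wins about 9% of
# games; success probabilities collapse quickly as the gap widens.
for gap in (0, 200, 400, 800):
    print(f"gap {gap:3d}: P(weaker side wins) = {elo_win_probability(gap):.3f}")
```

Note the contrast with the paper's numbers: a raw 400-Elo underdog wins roughly 9% of the time, so Debate reaching 51.7% reflects the structural advantage an oversight game gives the overseer, while Wargames at 9.4% adds essentially nothing over raw play.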

**NSO success rates at Elo gap of 400:**

- Debate: 51.7%
- Mafia: 13.5%
- Backdoor Code: 10.0%
- Wargames: 9.4%

**Key findings:**

1. Best-case oversight (Debate) achieves only 51.7% success at Elo gap 400 — slightly above random
2. Success rates decline further as the capability gap increases (overseeing stronger systems)
3. "There appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities"
4. There exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity
5. Oversight scales sublinearly with agent count in nested hierarchies
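Findings 4 and 5 can be illustrated with a toy product-of-stages model (my construction, not the paper's analysis; `protocol_bonus` is a hypothetical Elo advantage the oversight game grants each overseer). Splitting a large gap into more stages eases each step but multiplies the chances of failure, producing an interior optimum whose value still decays as the total gap grows:

```python
def step_success(elo_gap: float, protocol_bonus: float = 200.0) -> float:
    """Probability that an overseer wins a single oversight game against
    a system `elo_gap` Elo points stronger, assuming the oversight
    protocol grants a hypothetical `protocol_bonus` Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** ((elo_gap - protocol_bonus) / 400.0))

def nested_oversight_success(total_gap: float, steps: int) -> float:
    """Toy NSO model: bridge `total_gap` with `steps` equal stages;
    the chain succeeds only if every stage succeeds."""
    return step_success(total_gap / steps) ** steps

# Deeper chains ease each stage but compound failure probabilities,
# so each total gap has an optimal depth with a hard ceiling below 1.
for n in (1, 2, 4, 8):
    print(f"{n} stage(s) over an 800-Elo gap: {nested_oversight_success(800, n):.4f}")
```

With these assumed numbers the optimum over an 800-Elo gap is four stages, at just 6.25% overall success; deeper recursion past that point hurts, consistent with the paper's claim that no feasible number of recursive steps compensates for a large disparity.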
**Implication:** Nested scalable oversight — the leading proposed solution to the scalable oversight problem — has been shown to have a hard ceiling. As capability gaps grow (which they will, by definition, in the scenarios where oversight matters most), NSO success rates approach non-useful levels across all tested oversight games.

## Agent Notes

**Why this matters:** This is a formal quantification of the B4 prediction. B4 asserts "verification degrades faster than capability grows." This paper gives specific numbers: at Elo gap 400, even the best NSO approach (Debate) is only 51.7% successful. This is the only paper I've found that tries to derive scaling laws for the oversight problem itself — not capabilities, but oversight.

**What surprised me:** The spread across oversight games. Debate at 51.7% is already alarming, but Backdoor Code (10%) and Wargames (9.4%) are essentially chance. The variation suggests oversight success is highly domain-dependent — debate-style oversight may work for propositional reasoning but fails completely for the kinds of tasks (code, strategic planning) where capable AI poses the most risk.

**What I expected but didn't find:** A clearer path to extending NSO beyond the ceiling. The paper identifies optimal oversight levels but doesn't offer a mechanism for exceeding the fundamental ceiling. This implies the ceiling is architectural, not just a matter of adding more oversight rounds.

**KB connections:**

- B4 (verification degrades): direct quantitative confirmation
- AuditBench tool-to-agent gap: behavioral oversight tools also fail, now joined by the scalable oversight ceiling result
- Strategic deception / scheming: the 13.5% and 9.4% rates for non-debate oversight are likely BEFORE adjusting for deliberate adversarial evasion by scheming models

**Extraction hints:**

1. CLAIM: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 — even the best available oversight approach performs barely above chance at modest capability differentials, declining further as capability grows"
2. CLAIM: "Scalable oversight success is highly domain-dependent: propositional debate tasks show ~52% success, but code review and strategic planning tasks show ~10% — the domains where advanced AI poses the greatest risk are precisely where oversight performs worst"

**Context:** This is a formal scaling laws paper — the first attempt to apply the scaling-laws framework to the oversight problem rather than to capabilities. Published April 2025, it represents the field's first systematic quantification of NSO limits.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: Verification degrades faster than capability grows (B4)

WHY ARCHIVED: First formal quantification of the scalable oversight ceiling — transforms B4 from a qualitative claim into a quantitatively bounded result

EXTRACTION HINT: The domain-dependency finding (~52% for debate vs ~10% for code/strategy) is the most important extract — oversight works worst in precisely the highest-stakes domains