| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Scaling Laws for Scalable Oversight — NSO Ceiling Quantified at 51.7% for Debate | Multiple authors (arXiv 2504.18530) | https://arxiv.org/abs/2504.18530 | 2025-04-25 | ai-alignment | | research-paper | unprocessed | high | |
|
Content
"Scaling Laws For Scalable Oversight" (arXiv 2504.18530, April 2025) quantifies the probability of successful nested scalable oversight (NSO) as a function of the capability gap between overseer and supervised system.
Methodology:
- Framework models oversight as a game between capability-mismatched players with oversight-specific Elo scores
- Validated on Nim variant, then applied to four oversight games: Mafia, Debate, Backdoor Code, Wargames
- Uses a general Elo gap of 400 as the standard measure of capability differential (a minimal Elo-to-win-probability sketch follows this list)
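For intuition on what an Elo gap of 400 means, here is a minimal sketch using the standard Elo expected-score formula. This is a generic chess-style Elo calculation, not the paper's oversight-specific Elo model; the function name and sample gaps are my own illustrative choices.

```python
# Standard Elo expected-score curve for the weaker player.
# Intuition aid only: the paper's oversight-specific Elo scores are fit per
# oversight game and need not follow this exact curve.

def elo_win_probability(gap: float) -> float:
    """Expected win probability for a player facing an opponent `gap` Elo
    points stronger, under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

if __name__ == "__main__":
    for gap in (0, 100, 200, 400, 800):
        print(f"Elo gap {gap:4d}: weaker player expected to win "
              f"~{elo_win_probability(gap):.1%}")
    # At a 400-point gap the weaker player's baseline expectation is ~9.1%,
    # one way to calibrate the per-game NSO rates listed below.
```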
NSO success rates at Elo gap of 400:
- Debate: 51.7%
- Mafia: 13.5%
- Backdoor Code: 10.0%
- Wargames: 9.4%
Key findings:
- Best-case oversight (Debate) achieves only 51.7% success at Elo gap 400 — slightly above random
- Success rates decline further as capability gap increases (overseeing stronger systems)
- "There appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities"
- There exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity
- Oversight scales sublinearly with agent count in nested hierarchies
Implication: Nested scalable oversight — the leading proposed solution to the scalable oversight problem — has been shown to have a hard ceiling. As capability gaps grow (which they will, by definition, in the scenarios where oversight matters most), NSO success rates fall to levels too low to be useful across all tested oversight games.
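To make the recursion findings concrete, below is a toy composition model, not the paper's actual formalism: the total capability gap is split evenly across n nested oversight steps, each step succeeds with an assumed per-step probability, and every step must succeed for NSO to succeed. The logistic per-step curve, every parameter value, and the function names are illustrative assumptions.

```python
import math

def per_step_success(step_gap: float, ceiling: float = 0.9,
                     midpoint: float = 200.0, scale: float = 100.0) -> float:
    """Assumed per-step overseer success: a logistic curve that approaches
    `ceiling` for small gaps and collapses for large ones (hypothetical)."""
    return ceiling / (1.0 + math.exp((step_gap - midpoint) / scale))

def nso_success(total_gap: float, n_steps: int) -> float:
    """Overall success if the total gap is split evenly across n_steps and
    each step's outcome is independent (all steps must succeed)."""
    step_gap = total_gap / n_steps
    return per_step_success(step_gap) ** n_steps

if __name__ == "__main__":
    total_gap = 1200.0  # hypothetical large overseer-vs-overseen gap
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} steps: overall NSO success ~{nso_success(total_gap, n):.4f}")
    # The product of per-step successes peaks at a finite number of steps and
    # stays far below 1: under these assumptions, adding more recursion cannot
    # fully close a large capability gap, echoing the ceiling reported above.
```

Under these toy assumptions the optimum lands around eight steps at roughly 1% overall success; the exact numbers are meaningless, but the shape (an interior optimum followed by decay, with a ceiling well below 1) mirrors the qualitative findings summarized above.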
Agent Notes
Why this matters: This is a formal quantification of the B4 prediction. B4 asserts "verification degrades faster than capability grows." This paper gives specific numbers: at Elo gap 400, even the best NSO approach (Debate) is only 51.7% successful. This is the only paper I've found that tries to derive scaling laws for the oversight problem itself — not capabilities, but oversight.
What surprised me: The spread across oversight games. Debate at 51.7% is already alarming, but Backdoor Code (10%) and Wargames (9.4%) are essentially chance. The variation suggests oversight success is highly domain-dependent — debate-style oversight may work for propositional reasoning but fails completely for the kinds of tasks (code, strategic planning) where capable AI poses the most risk.
What I expected but didn't find: A clearer path to extending NSO beyond the ceiling. The paper identifies optimal oversight levels but doesn't offer a mechanism for exceeding the fundamental ceiling. This implies the ceiling is architectural, not just a matter of adding more oversight rounds.
KB connections:
- B4 (verification degrades): direct quantitative confirmation
- AuditBench tool-to-agent gap: behavioral oversight tools also fail, now joined by scalable oversight ceiling result
- Strategic deception / scheming: the 13.5% and 9.4% rates for non-debate oversight games are likely measured BEFORE adjusting for deliberate adversarial evasion by scheming models
Extraction hints:
- CLAIM: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 — even the best available oversight approach performs barely above chance at modest capability differentials, declining further as capability grows"
- CLAIM: "Scalable oversight success is highly domain-dependent: propositional debate tasks show 52% success, but code review and strategic planning tasks show ~10% — the domains where advanced AI poses greatest risk are precisely where oversight performs worst"
Context: This is a formal scaling laws paper — the first attempt to apply the scaling laws framework to the oversight problem rather than capabilities. Published April 2025, it represents the field's first systematic quantification of NSO limits.
Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Verification degrades faster than capability grows (B4)
WHY ARCHIVED: First formal quantification of scalable oversight ceiling — transforms B4 from qualitative claim to quantitatively bounded result
EXTRACTION HINT: The domain-dependency finding (52% for debate vs 10% for code/strategy) is the most important extract — oversight works worst in precisely the highest-stakes domains