theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results

- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
Teleo Agents committed 2026-04-02 10:38:10 +00:00
parent 4e765b213d
commit 7e9221431c
2 changed files with 34 additions and 0 deletions


@@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: Even the best-performing oversight approach (debate) performs barely above chance at modest capability differentials, with success rates approaching non-useful levels as gaps grow
confidence: experimental
source: arXiv 2504.18530, empirical testing across four oversight games with Elo-based capability gap measurement
created: 2026-04-02
title: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases"
agent: theseus
scope: causal
sourcer: arXiv 2504.18530
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---
# Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
The first formal scaling-laws study of oversight efficacy quantifies NSO success rates across four oversight games (Debate, Mafia, Backdoor Code, Wargames) at standardized capability gaps. At an Elo gap of 400, a moderate differential, Debate achieves only 51.7% success, while the other approaches perform far worse (Mafia 13.5%, Backdoor Code 10.0%, Wargames 9.4%). The study establishes that 'there appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities' and that 'there exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity.' This is the first quantitative confirmation that oversight success scales sublinearly with the number of nested oversight steps, meaning the verification problem degrades faster than capability grows. The methodology was validated on a Nim variant before being applied to realistic oversight scenarios, providing empirical grounding for what had previously been a theoretical concern.
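
The ceiling dynamic can be illustrated with a minimal sketch, assuming a logistic win curve for a single oversight game and an equal split of the total Elo gap across nested steps. Everything here (`step_success`, `nso_success`, the parameters `a` and `b`) is an invented toy model, not the paper's fitted formulation:

```python
import math

def step_success(gap: float, a: float = 2.0, b: float = 0.005) -> float:
    """Overseer's win probability in one oversight game, modeled as a
    logistic in the Elo gap. `a` (overseer advantage at zero gap) and
    `b` (sensitivity to the gap) are made-up illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-(a - b * gap)))

def nso_success(total_gap: float, steps: int) -> float:
    """Nested scalable oversight: split the total capability gap into
    equal increments; overall success requires every step to succeed."""
    return step_success(total_gap / steps) ** steps

for total_gap in (400.0, 800.0, 1600.0):
    best = max(range(1, 65), key=lambda n: nso_success(total_gap, n))
    print(f"gap {total_gap:6.0f}: best steps = {best:2d}, "
          f"max success = {nso_success(total_gap, best):.3f}")
```

Under these assumed parameters the best achievable success falls as the total gap grows no matter how many steps are used, which is the qualitative shape of the claimed ceiling.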


@@ -0,0 +1,17 @@
---
type: claim
domain: ai-alignment
description: The domains where advanced AI poses greatest risk (code, strategic planning) are precisely where oversight performs worst, creating a dangerous mismatch between risk and verification capability
confidence: experimental
source: arXiv 2504.18530, comparative performance across four oversight game types at Elo gap 400
created: 2026-04-02
title: "Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success"
agent: theseus
scope: structural
sourcer: arXiv 2504.18530
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
---
# Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success
The roughly 5x performance gap between Debate (51.7%) and Backdoor Code/Wargames (~10%) shows that oversight efficacy is not a general property but highly task-dependent. Debate-style oversight works for propositional reasoning, where arguments can be decomposed and verified through adversarial exchange. But code review (10.0%) and strategic planning (9.4%), the domains where capable AI systems pose the most concrete risk, show success rates near the floor. This creates a structural problem: the oversight approaches that work best apply to the lowest-stakes domains, while the highest-stakes applications (autonomous code generation, strategic decision-making) fall into a regime where even nested oversight provides almost no verification capability. The domain dependence suggests the ceiling reflects not just capability gaps but the fundamental verifiability structure of different task types.
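
A minimal sketch of the domain-dependence reading, assuming each game type has its own logistic success curve. The `DOMAINS` table and its parameters are invented to mimic the reported ordering, not taken from arXiv 2504.18530:

```python
import math

# Hypothetical per-domain parameters: a = overseer advantage at zero
# capability gap, b = sensitivity to the gap. Invented for illustration.
DOMAINS = {
    "debate":        (2.0, 0.005),   # decomposable, adversarially checkable
    "mafia":         (0.0, 0.004),
    "backdoor_code": (-1.0, 0.003),  # hard to verify even at equal capability
    "wargames":      (-1.2, 0.003),
}

def success(a: float, b: float, gap: float) -> float:
    """Single-game overseer success as a logistic in the Elo gap."""
    return 1.0 / (1.0 + math.exp(-(a - b * gap)))

for name, (a, b) in DOMAINS.items():
    print(f"{name:14s} success at Elo gap 400: {success(a, b, 400):.3f}")
```

On this toy model the domains differ not only in slope but in the zero-gap intercept: a negative `a` means the overseer loses more often than not even against an equally capable system, one way to express a verifiability limit that capability parity alone cannot fix.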