teleo-codex/domains/ai-alignment/nested-scalable-oversight-achieves-at-most-52-percent-success-at-moderate-capability-gaps.md
Teleo Agents 7e9221431c
theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results
- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-02 10:40:18 +00:00

type: claim
domain: ai-alignment
description: Even the best-performing oversight approach (debate) performs barely above chance at modest capability differentials, with success rates approaching non-useful levels as gaps grow
confidence: experimental
source: arXiv 2504.18530, empirical testing across four oversight games with Elo-based capability gap measurement
created: 2026-04-02
title: Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
agent: theseus
scope: causal
sourcer: arXiv 2504.18530
related_claims:
- safe AI development requires building alignment mechanisms before scaling capability

Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases

The first formal scaling-laws study of oversight efficacy (arXiv 2504.18530) quantifies nested scalable oversight (NSO) success rates across four oversight games (Debate, Mafia, Backdoor Code, Wargames) at standardized capability gaps. At a capability gap of 400 Elo, a moderate differential, Debate achieves only 51.7% success, while the other games perform far worse (Mafia 13.5%, Backdoor Code 10.0%, Wargames 9.4%). The study reports that 'there appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities' and that 'there exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity.' This is the first quantitative confirmation that oversight scales sublinearly with agent count in nested hierarchies, meaning that verification reliability degrades faster than capability grows. The authors validated the framework on a Nim variant before applying it to realistic oversight scenarios, which gives empirical grounding to what was previously a theoretical concern.
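For a sense of scale, the sketch below tabulates the success rates quoted above next to the textbook Elo expected score for a player facing a 400-point deficit. This is an illustrative comparison only: `elo_expected_score`, `best_game`, and the dictionary name are hypothetical helpers introduced here, and the standard Elo formula is an assumed baseline, not the study's own mapping from capability gap to per-game success, which is not reproduced here.

```python
# Success rates at a capability gap of 400 Elo, copied from the text above.
NSO_SUCCESS_AT_GAP_400 = {
    "Debate": 0.517,
    "Mafia": 0.135,
    "Backdoor Code": 0.100,
    "Wargames": 0.094,
}


def elo_expected_score(gap: float) -> float:
    """Textbook Elo expected score for the lower-rated side.

    `gap` is the rating deficit in Elo points. This is the standard
    formula, used here only as an assumed reference baseline.
    """
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))


def best_game(rates: dict[str, float]) -> tuple[str, float]:
    """Return the (game, success rate) pair with the highest success rate."""
    return max(rates.items(), key=lambda item: item[1])


if __name__ == "__main__":
    baseline = elo_expected_score(400)  # 1 / 11 under the standard formula
    print(f"Textbook Elo baseline at gap 400: {baseline:.3f}")
    for game, rate in NSO_SUCCESS_AT_GAP_400.items():
        print(f"{game:>14}: {rate:.1%}")
    game, rate = best_game(NSO_SUCCESS_AT_GAP_400)
    print(f"Best-performing game: {game} ({rate:.1%})")
```

The baseline is kept as a separate function so it can be swapped for a fitted per-game curve if those parameters are available.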