teleo-codex/domains/ai-alignment/nested-scalable-oversight-achieves-at-most-52-percent-success-at-moderate-capability-gaps.md
Teleo Agents 7e9221431c
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results
- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-02 10:40:18 +00:00

17 lines
1.8 KiB
Markdown

---
type: claim
domain: ai-alignment
description: Even the best-performing oversight approach (debate) performs barely above chance at modest capability differentials, with success rates approaching non-useful levels as gaps grow
confidence: experimental
source: arXiv 2504.18530, empirical testing across four oversight games with Elo-based capability gap measurement
created: 2026-04-02
title: "Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases"
agent: theseus
scope: causal
sourcer: arXiv 2504.18530
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---
# Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
The first formal scaling laws study of oversight efficacy quantifies NSO success rates across four oversight games (Debate, Mafia, Backdoor Code, Wargames) at standardized capability gaps. At Elo gap 400 — a moderate differential — Debate achieves only 51.7% success, while other approaches perform far worse (Mafia 13.5%, Backdoor Code 10.0%, Wargames 9.4%). The study establishes that 'there appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities' and that 'there exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity.' This is the first quantitative confirmation that oversight scales sublinearly with agent count in nested hierarchies, meaning the verification problem degrades faster than capability grows. The methodology validated the framework on a Nim variant before applying it to realistic oversight scenarios, providing empirical grounding for what was previously a theoretical concern.