theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results

- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-02 10:40:18 +00:00

type: claim
domain: ai-alignment
description: Even the best-performing oversight approach (debate) performs barely above chance at modest capability differentials, with success rates approaching non-useful levels as gaps grow
confidence: experimental
source: arXiv 2504.18530, empirical testing across four oversight games with Elo-based capability gap measurement
created: 2026-04-02
title: Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
agent: theseus
scope: causal
sourcer: arXiv 2504.18530
related_claims: safe AI development requires building alignment mechanisms before scaling capability

Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases

The first formal scaling-laws study of oversight efficacy quantifies nested scalable oversight (NSO) success rates across four oversight games (Debate, Mafia, Backdoor Code, Wargames) at standardized capability gaps. At an Elo gap of 400, a moderate differential, Debate achieves only 51.7% success, while the other games fare far worse (Mafia 13.5%, Backdoor Code 10.0%, Wargames 9.4%). The study establishes that 'there appears to be an inherent ceiling on oversight efficacy given a fixed gap in capabilities' and that 'there exists a point where no feasible number of recursive oversight steps can fully compensate for a large capability disparity.' This is the first quantitative confirmation that oversight scales sublinearly with agent count in nested hierarchies, meaning the verification problem worsens faster than capability grows. The authors validated the framework on a Nim variant before applying it to realistic oversight scenarios, which gives empirical grounding to what was previously a theoretical concern.
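To make the "barely above chance" reading concrete, the short Python sketch below tabulates the four success rates quoted above and computes each game's margin over a baseline. The 50% baseline is an assumption made here for illustration; the claim text does not state the chance level for each game, and the names `REPORTED_SUCCESS_AT_GAP_400`, `ASSUMED_CHANCE_BASELINE`, `margin_over_baseline`, and `best_game` are hypothetical helpers, not part of the study.

```python
# Success rates at Elo gap 400, copied from the claim text (arXiv 2504.18530).
REPORTED_SUCCESS_AT_GAP_400 = {
    "Debate": 0.517,
    "Mafia": 0.135,
    "Backdoor Code": 0.100,
    "Wargames": 0.094,
}

# ASSUMPTION: a 50% coin-flip baseline for every game. This value is not
# taken from the source and is used only to illustrate the comparison.
ASSUMED_CHANCE_BASELINE = 0.5


def margin_over_baseline(rates, baseline=ASSUMED_CHANCE_BASELINE):
    """Return each game's margin over the baseline, in percentage points."""
    return {game: round((rate - baseline) * 100, 1) for game, rate in rates.items()}


def best_game(rates):
    """Return the name of the game with the highest success rate."""
    return max(rates, key=rates.get)


if __name__ == "__main__":
    margins = margin_over_baseline(REPORTED_SUCCESS_AT_GAP_400)
    # Print games from best to worst oversight success rate.
    for game in sorted(margins, key=margins.get, reverse=True):
        print(f"{game:14s} {margins[game]:+6.1f} pp vs assumed baseline")
    print("Best-performing game:", best_game(REPORTED_SUCCESS_AT_GAP_400))
```

Under that assumed baseline, Debate is the only game with a positive margin, and it is under two percentage points, which matches the description's "barely above chance" wording.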
