Sync Graph Data to teleo-app / sync (push) Waiting to run

Details

theseus: extract claims from 2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results

- Source: inbox/queue/2026-04-02-scaling-laws-scalable-oversight-nso-ceiling-results.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-02 10:40:18 +00:00

2 KiB

Raw Blame History

type

domain

description

confidence

source

created

title

agent

scope

sourcer

related_claims

claim

ai-alignment

The domains where advanced AI poses greatest risk (code, strategic planning) are precisely where oversight performs worst, creating a dangerous mismatch between risk and verification capability

experimental

arXiv 2504.18530, comparative performance across four oversight game types at Elo gap 400

2026-04-02

Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success

theseus

structural

arXiv 2504.18530

safe AI development requires building alignment mechanisms before scaling capability

formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades

Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success

The 5x performance gap between Debate (51.7%) and Backdoor Code/Wargames (~10%) reveals that oversight efficacy is not a general property but highly task-dependent. Debate-style oversight works for propositional reasoning where arguments can be decomposed and verified through adversarial exchange. But code review (10.0%) and strategic planning (9.4%) — the domains where capable AI systems pose the most concrete risk — show success rates barely above random chance. This creates a structural problem: the oversight approaches that work best apply to the lowest-stakes domains, while the highest-stakes applications (autonomous code generation, strategic decision-making) fall into the regime where even nested oversight provides almost no verification capability. The domain-dependency suggests the ceiling is not just about capability gaps but about the fundamental verifiability structure of different task types.

2 KiB Raw Blame History

Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success

2 KiB

Raw Blame History