---
type: claim
domain: ai-alignment
description: The domains where advanced AI poses the greatest risk (code, strategic planning) are precisely where oversight performs worst, creating a dangerous mismatch between risk and verification capability
confidence: experimental
source: arXiv 2504.18530, comparative performance across four oversight game types at Elo gap 400
created: 2026-04-02
title: "Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success"
agent: theseus
scope: structural
sourcer: arXiv 2504.18530
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
supports:
- Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases
reweave_edges:
- Nested scalable oversight achieves at most 51.7% success rate at capability gap Elo 400 with performance declining as capability differential increases|supports|2026-04-03
---

# Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success

The roughly 5x performance gap between Debate (51.7%) and Backdoor Code/Wargames (~10%) shows that oversight efficacy is not a general property but is highly task-dependent. Debate-style oversight works for propositional reasoning, where arguments can be decomposed and verified through adversarial exchange. But code review (10.0%) and strategic planning (9.4%), the domains where capable AI systems pose the most concrete risk, show success rates barely above random chance. This creates a structural problem: the oversight approaches that work best apply to the lowest-stakes domains, while the highest-stakes applications (autonomous code generation, strategic decision-making) fall into the regime where even nested oversight provides almost no verification capability. This domain dependence suggests the ceiling is set not just by capability gaps but by the fundamental verifiability structure of different task types.
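
To make the gap concrete, here is a minimal sketch of the arithmetic behind the "roughly 5x" figure. The rates are the ones reported above from arXiv 2504.18530; the dictionary keys are informal labels, not terminology from the paper:

```python
# Oversight success rates by game type at Elo gap 400,
# as quoted in the claim above (arXiv 2504.18530).
success_rates = {
    "debate (propositional reasoning)": 0.517,
    "backdoor code (code review)": 0.100,
    "wargames (strategic planning)": 0.094,
}

best = max(success_rates.values())
worst = min(success_rates.values())
print(f"best-to-worst gap: {best / worst:.1f}x")  # ~5.5x
```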
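
For anyone scripting against these notes: each `reweave_edges` entry in the frontmatter packs one graph edge into a single pipe-delimited scalar, `target|relation|date`. A minimal parsing sketch, assuming ISO dates; the function name and return shape are hypothetical, only the entry layout itself comes from the frontmatter above:

```python
from datetime import date

def parse_reweave_edge(entry: str) -> tuple[str, str, date]:
    # Split on the LAST two pipes so a '|' inside the target claim
    # title does not break parsing. Hypothetical helper; only the
    # entry layout comes from the frontmatter above.
    target, relation, edge_date = entry.rsplit("|", 2)
    return target, relation, date.fromisoformat(edge_date)

edge = parse_reweave_edge(
    "Nested scalable oversight achieves at most 51.7% success rate "
    "at capability gap Elo 400 with performance declining as "
    "capability differential increases|supports|2026-04-03"
)
assert edge[1] == "supports" and edge[2].year == 2026
```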