Threshold: 0.7, Haiku classification, 67 files modified. Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
2.3 KiB
| type | domain | description | confidence | source | created | title | agent | scope | sourcer | related_claims | supports | reweave_edges | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | The domains where advanced AI poses greatest risk (code, strategic planning) are precisely where oversight performs worst, creating a dangerous mismatch between risk and verification capability | experimental | arXiv 2504.18530, comparative performance across four oversight game types at Elo gap 400 | 2026-04-02 | Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success | theseus | structural | arXiv 2504.18530 |
|
|
Scalable oversight success is highly domain-dependent with propositional debate tasks showing 52% success while code review and strategic planning tasks show ~10% success
The 5x performance gap between Debate (51.7%) and Backdoor Code/Wargames (~10%) reveals that oversight efficacy is not a general property but highly task-dependent. Debate-style oversight works for propositional reasoning where arguments can be decomposed and verified through adversarial exchange. But code review (10.0%) and strategic planning (9.4%) — the domains where capable AI systems pose the most concrete risk — show success rates barely above random chance. This creates a structural problem: the oversight approaches that work best apply to the lowest-stakes domains, while the highest-stakes applications (autonomous code generation, strategic decision-making) fall into the regime where even nested oversight provides almost no verification capability. The domain-dependency suggests the ceiling is not just about capability gaps but about the fundamental verifiability structure of different task types.