theseus: add epistemological oversight claim from Karpathy 8-agent experiment
- What: 1 new claim — agents cannot recognize confounded results, requiring human epistemological oversight even at max capability. Added to _map.md under Failure Modes & Oversight.
- Why: Karpathy's 8-agent research org (4 Claude, 4 Codex) is empirical evidence that the failure is structural (epistemological), not capability-limited. Agent accepted spurious result without controlling for compute.
- Connections: grounds adversarial PR review, extends capability ≠ reliability, connects to correlated blind spots and role specialization claims

Pentagon-Agent: Theseus <25B96405-E50F-45ED-9C92-D8046DFAAD00>
parent: 83a2402575
commit: f929bb6cff
2 changed files with 35 additions and 0 deletions
@@ -0,0 +1,34 @@
---
type: claim
domain: ai-alignment
description: "Karpathy's 8-agent research org experiment found agents accept spurious results (e.g. 'bigger network improves loss' without controlling for compute) at highest intelligence settings, indicating the failure is structural not capability-limited"
confidence: experimental
source: "Andrej Karpathy, 8-agent research org experiment (2026-02-27), x.com/karpathy/status/2027521323275325622"
created: 2026-03-10
---

# AI research agents cannot recognize confounded experimental results as spurious requiring human epistemological oversight even when agents reach maximum capability settings

Karpathy ran 8 agents (4 Claude, 4 Codex) on GPU-based ML experiments (removing logit softcap from nanochat without regression). He tested multiple organizational structures: 8 independent solo researchers, 1 chief scientist directing 8 juniors, and variations. The consistent finding: agents generate poor experimental designs and fail to recognize confounded results.

The clearest example: an agent "discovered" that increasing hidden size improves validation loss. This is a spurious result — a bigger network will have lower validation loss in the infinite data regime, and it also takes longer and more compute to train, so the comparison is confounded. The confound is obvious to any ML researcher but invisible to the agent, even at the highest intelligence settings. Karpathy had to manually intervene to point it out.
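Below is a minimal sketch of the confound (illustrative Python, not Karpathy's code; the hidden sizes, layer count, and tokens-per-step values are assumptions): at equal step counts the wider model has simply spent more training compute, so a fair comparison has to rescale the step budget until both runs burn roughly the same FLOPs.

```python
# Sketch, not Karpathy's code: why "bigger hidden size -> lower val loss" is
# confounded when runs are compared at an equal number of optimizer steps.
# Uses the standard ~6 * params * tokens rule of thumb for training FLOPs.

def approx_params(hidden: int, n_layers: int = 12, vocab: int = 50_304) -> int:
    """Rough parameter count for a GPT-style transformer (illustrative only)."""
    return 12 * n_layers * hidden ** 2 + vocab * hidden

def train_flops(hidden: int, steps: int, tokens_per_step: int = 524_288) -> float:
    """Approximate total training FLOPs: ~6 * params * tokens processed."""
    return 6.0 * approx_params(hidden) * steps * tokens_per_step

baseline_hidden, wider_hidden, steps = 768, 1024, 5_000

# The confounded comparison: same step count, very different compute budgets.
ratio = train_flops(wider_hidden, steps) / train_flops(baseline_hidden, steps)
print(f"wider run spends {ratio:.2f}x the compute of the baseline")  # ~1.6x

# A compute-controlled comparison shrinks the wider run's step budget so both
# runs consume the same FLOPs before their validation losses are compared.
matched_steps = int(steps / ratio)
print(f"compute-matched step budget for the wider model: {matched_steps}")
```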
This failure is epistemological, not capability-limited. The agents can implement any well-scoped experiment perfectly — [[AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect]]. But they cannot evaluate whether their own experimental methodology is valid. They don't create strong baselines, don't ablate properly, don't control for runtime or compute. This is the difference between executing instructions and understanding why the instructions matter.

The implication for AI oversight: [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]. Capability at one level (implementation) does not transfer to a different cognitive task (epistemological judgment). This makes human oversight non-optional for any research pipeline using AI agents — not as a temporary measure until agents improve, but as a structural requirement because the failure mode is orthogonal to the capability axis.

For collective agent systems like Teleo, this grounds the design choice of [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]]. The evaluator catches what the proposer structurally cannot. Karpathy's experience is empirical evidence that this separation is necessary, not optional.
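A hedged sketch of that separation in code (a design illustration only, not Teleo's actual implementation; the record fields and the required-controls list are assumptions): the evaluator holds acceptance authority and rejects any result whose design leaves known confounds uncontrolled, however confident the proposing agent is. A real deployment would make the evaluator a second agent, ideally from a different model family, rather than a static checklist.

```python
# Design sketch only (not Teleo's implementation): separate the proposer, which
# produces a claim plus its experimental record, from an evaluator that holds
# acceptance authority and applies a confound checklist the proposer never sees.
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    claim: str                                              # e.g. "increasing hidden size improves val loss"
    controlled_for: set[str] = field(default_factory=set)   # factors equalized across runs

# Hypothetical controls an evaluator might require before accepting a result.
REQUIRED_CONTROLS = {"total_flops", "tokens_seen", "wall_clock_budget"}

def evaluator_review(record: ExperimentRecord) -> tuple[bool, list[str]]:
    """Independent gate: reject any claim whose design leaves required controls open."""
    missing = sorted(REQUIRED_CONTROLS - record.controlled_for)
    return (not missing, missing)

# Karpathy's example as it would flow through this gate: the proposing agent
# matched step counts but never equalized compute, so the claim is escalated.
record = ExperimentRecord(
    claim="increasing hidden size improves validation loss",
    controlled_for={"steps", "learning_rate"},
)
accepted, missing = evaluator_review(record)
print(accepted)  # False
print(missing)   # ['tokens_seen', 'total_flops', 'wall_clock_budget']
```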
**Scope:** This claim is about research methodology oversight — the ability to evaluate whether experimental designs are valid. It does not claim agents cannot reason generally or that all agent failures are epistemological. The specific failure is recognizing and controlling for confounds in experimental design.

---

Relevant Notes:
- [[AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect]] — the broader pattern this instantiates: agents implement, humans architect
- [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]] — capability independence: strong implementation doesn't imply strong methodology
- [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] — the design response: separated roles catch what single agents miss
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — same agents, different org structure, different results (Karpathy tested multiple structures)
- [[all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases]] — Karpathy's 4 Claude + 4 Codex mix is evidence for model diversity

Topics:
- [[_map]]
@@ -49,6 +49,7 @@ Evidence from documented AI problem-solving cases, primarily Knuth's "Claude's C
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — formal verification as scalable oversight
- [[agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf]] — Willison's cognitive debt concept: understanding deficit from agent-generated code
- [[coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability]] — the accountability gap: agents bear zero downside risk
- [[AI research agents cannot recognize confounded experimental results as spurious requiring human epistemological oversight even when agents reach maximum capability settings]] — Karpathy's 8-agent experiment: agents accept spurious results at max intelligence

## Architecture & Emergence
- [[AGI may emerge as a patchwork of coordinating sub-AGI agents rather than a single monolithic system]] — DeepMind researchers: distributed AGI makes single-system alignment research insufficient