rio: extract 4 NEW claims + 4 enrichments from AI agents/memory/harness research batch

- What: 4 new claims (LLM KB compilation vs RAG, filesystem retrieval over embeddings,
  self-optimizing harnesses, harness > model selection), 4 enrichments (one-agent-one-chat,
  agentic taylorism, macro-productivity null result, multi-agent coordination),
  MetaDAO entity financial update ($33M+ total raised), 6 source archives
- Why: Leo-routed research batch — Karpathy LLM Wiki (47K likes), Mintlify ChromaFS
  (460x faster), AutoAgent (#1 SpreadsheetBench), NeoSigma auto-harness (0.56→0.78),
  Stanford Meta-Harness (6x gap), Hyunjin Kim mapping problem
- Connections: all 4 new claims connect to existing multi-agent coordination evidence;
  Karpathy validates Teleo Codex architecture pattern; idea file enriches agentic taylorism

Pentagon-Agent: Rio <244BA05F-3AA3-4079-8C59-6D68A77C76FE>

2026-04-05 19:39:04 +01:00

type: source
title: "AutoAgent: autonomous harness engineering"
author: "Kevin Gu (@kevingu, thirdlayer.inc)"
url: https://x.com/kevingu/status/2039874388095651937
date: 2026-04-02
domain: ai-alignment
intake_tier: directed
rationale: "Self-optimizing agent harness that beat all human-engineered entries on two benchmarks. Model empathy finding (same-family meta/task pairs outperform cross-model). Shifts human role from engineer to director."
proposed_by: "Leo (research batch routing)"
format:
status: processed
processed_by: rio
processed_date: 2026-04-05
claims_extracted: "self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can"
enrichments: "multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value"

AutoAgent

Open-source library for autonomous harness engineering. In a 24-hour optimization run it reached #1 on SpreadsheetBench (96.5%) and the #1 GPT-5 score on TerminalBench (55.1%). Loop: modify harness → run benchmark → check score → keep/discard. Model empathy: a Claude meta-agent optimizing a Claude task agent diagnoses failures more accurately than cross-model pairs. The human writes program.md (the directive), not agent.py (the implementation). GitHub: kevinrgu/autoagent.
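The modify → run → check → keep/discard loop can be sketched as a greedy search. This is a minimal illustration, not AutoAgent's actual code or API: `optimize_harness`, `toy_propose`, and `toy_benchmark` are hypothetical names, the harness is reduced to a dictionary of numeric settings, and a random perturbation stands in for the meta-agent that would propose real harness changes.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical simplification: a harness is just a dict of tunable settings.
Harness = Dict[str, float]


def optimize_harness(
    harness: Harness,
    propose: Callable[[Harness, random.Random], Harness],
    run_benchmark: Callable[[Harness], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[Harness, float]:
    """Greedy keep/discard loop: a change is kept only if the score improves."""
    rng = random.Random(seed)
    best = dict(harness)
    best_score = run_benchmark(best)
    for _ in range(iterations):
        candidate = propose(best, rng)    # modify harness
        score = run_benchmark(candidate)  # run benchmark
        if score > best_score:            # check score
            best, best_score = candidate, score  # keep
        # otherwise the candidate is discarded and the next try starts from best
    return best, best_score


def toy_propose(harness: Harness, rng: random.Random) -> Harness:
    """Stand-in for the meta-agent: nudge one setting at random."""
    candidate = dict(harness)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate


def toy_benchmark(harness: Harness) -> float:
    """Stand-in benchmark: higher is better, peaking when every setting is 0.5."""
    return -sum((value - 0.5) ** 2 for value in harness.values())


if __name__ == "__main__":
    start = {"retry_budget": 0.0, "context_fraction": 1.0}
    tuned, tuned_score = optimize_harness(start, toy_propose, toy_benchmark, 200)
    print(tuned, tuned_score)
```

Because a candidate is accepted only on strict improvement, the returned score can never fall below the starting score. The trade-off of this greedy form is that it can stall at a local optimum, and the source does not say whether AutoAgent's loop does anything more elaborate.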
