rio: extract 4 NEW claims + 4 enrichments from AI agents/memory/harness research batch

- What: 4 new claims (LLM KB compilation vs RAG, filesystem retrieval over embeddings,
  self-optimizing harnesses, harness > model selection), 4 enrichments (one-agent-one-chat,
  agentic taylorism, macro-productivity null result, multi-agent coordination),
  MetaDAO entity financial update ($33M+ total raised), 6 source archives
- Why: Leo-routed research batch — Karpathy LLM Wiki (47K likes), Mintlify ChromaFS
  (460x faster), AutoAgent (#1 SpreadsheetBench), NeoSigma auto-harness (0.56→0.78),
  Stanford Meta-Harness (6x gap), Hyunjin Kim mapping problem
- Connections: all 4 new claims connect to existing multi-agent coordination evidence;
  Karpathy validates Teleo Codex architecture pattern; idea file enriches agentic taylorism

Pentagon-Agent: Rio <244BA05F-3AA3-4079-8C59-6D68A77C76FE>

2026-04-05 19:39:04 +01:00

type: source
title: "Meta-Harness: End-to-End Optimization of Model Harnesses"
author: Stanford/MIT (arxiv 2603.28052)
url: https://arxiv.org/html/2603.28052v1
date: 2026-03-28
domain: ai-alignment
intake_tier: directed
rationale: "Academic validation that harness engineering outweighs model selection. 6x performance gap from harness alone. Critical finding: summaries destroy diagnostic signal, full execution traces essential."
proposed_by: Leo (research batch routing)
format: paper
status: processed
processed_by: rio
processed_date: 2026-04-05
claims_extracted: harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains
enrichments: multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value

Meta-Harness (Stanford/MIT)

Key results:

- Text classification: +7.7 points over ACE (48.6% vs 40.9%) using about 4x fewer tokens (11.4K vs 50.8K).
- Math reasoning: +4.7 points across 5 held-out models.
- TerminalBench-2: 76.4% (#2 overall; #1 among Haiku agents).
- Critical ablation (median score): scores only 34.6; scores + summaries 34.9 (+0.3, so summaries recover almost none of the signal carried by the traces); full traces 50.0.
- Proposer cost: reads a median of 82 files per iteration, ~10M tokens per iteration vs ~0.02M for prior optimizers.
- Discovered behaviors: draft-verification retrieval, lexical routing, environment bootstrapping.
- Caveat: the 6x gap is worst-to-best across all harnesses, not a controlled A/B comparison.
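The derived figures in the key results can be rechecked from the raw numbers quoted there. The sketch below is only a sanity check on this note's arithmetic, not code from the paper; every variable and function name in it is hypothetical, and the inputs are copied from the note rather than from the paper's tables.

```python
# Sanity check on the arithmetic in the key results.
# All names are illustrative; the numbers are copied from this note.

meta_harness_acc = 48.6  # text classification accuracy, percent
ace_acc = 40.9           # ACE baseline accuracy, percent
meta_tokens_k = 11.4     # tokens used by Meta-Harness, thousands
ace_tokens_k = 50.8      # tokens used by ACE, thousands

# Median scores from the proposer-feedback ablation.
ablation_median = {
    "scores_only": 34.6,
    "scores_plus_summaries": 34.9,
    "full_traces": 50.0,
}

# Proposer token budget per iteration, in millions (approximate in the note).
proposer_tokens_m = 10.0
prior_optimizer_tokens_m = 0.02


def derived_figures():
    """Recompute the gaps and ratios quoted in the key results."""
    baseline = ablation_median["scores_only"]
    return {
        "accuracy_gap_points": round(meta_harness_acc - ace_acc, 1),
        "token_ratio": round(ace_tokens_k / meta_tokens_k, 2),
        "ablation_delta_vs_scores_only": {
            name: round(score - baseline, 1)
            for name, score in ablation_median.items()
        },
        "proposer_token_ratio": round(proposer_tokens_m / prior_optimizer_tokens_m),
    }


if __name__ == "__main__":
    for key, value in derived_figures().items():
        print(f"{key}: {value}")
```

Under these inputs the accuracy gap comes out at 7.7 points, the token ratio at roughly 4.5x (which the note rounds to 4x), and the ablation shows summaries adding 0.3 points over scores alone against 15.4 points for full traces.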
