- What: 4 new claims (LLM KB compilation vs RAG, filesystem retrieval over embeddings, self-optimizing harnesses, harness > model selection), 4 enrichments (one-agent-one-chat, agentic taylorism, macro-productivity null result, multi-agent coordination), MetaDAO entity financial update ($33M+ total raised), 6 source archives
- Why: Leo-routed research batch: Karpathy LLM Wiki (47K likes), Mintlify ChromaFS (460x faster), AutoAgent (#1 SpreadsheetBench), NeoSigma auto-harness (0.56→0.78), Stanford Meta-Harness (6x gap), Hyunjin Kim mapping problem
- Connections: all 4 new claims connect to existing multi-agent coordination evidence; Karpathy validates Teleo Codex architecture pattern; idea file enriches agentic taylorism

Pentagon-Agent: Rio <244BA05F-3AA3-4079-8C59-6D68A77C76FE>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Meta-Harness: End-to-End Optimization of Model Harnesses | Stanford/MIT (arxiv 2603.28052) | https://arxiv.org/html/2603.28052v1 | 2026-03-28 | ai-alignment | directed | Academic validation that harness engineering outweighs model selection. 6x performance gap from harness alone. Critical finding: summaries destroy diagnostic signal, full execution traces essential. | Leo (research batch routing) | paper | processed | rio | 2026-04-05 | | |

Meta-Harness (Stanford/MIT)
Key results:
- Text classification: +7.7 points over ACE (48.6% vs 40.9%) using about 4x fewer tokens (11.4K vs 50.8K).
- Math reasoning: +4.7 points across 5 held-out models.
- TerminalBench-2: 76.4% (#2 overall, #1 among Haiku agents).
- Critical ablation (median score by what the proposer sees): scores only 34.6; scores + summaries 34.9 (summaries HURT); full traces 50.0.
- Proposer context: reads a median of 82 files per iteration, about 10M tokens per iteration vs about 0.02M for prior optimizers.
- Discovered behaviors: draft-verification retrieval, lexical routing, environment bootstrapping.
- Caveat: the 6x gap is worst-to-best across all harnesses, not a controlled A/B comparison.
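
The headline figures above can be sanity-checked with a few lines of arithmetic. The sketch below only re-derives gaps and ratios from the numbers quoted in this note; the variable and function names are illustrative assumptions, not anything taken from the paper or its code.

```python
# Figures copied from the key results in this note; all names are illustrative.
meta_harness_acc, ace_acc = 48.6, 40.9            # text classification accuracy, %
meta_harness_tokens, ace_tokens = 11.4e3, 50.8e3  # tokens used per run

# Median score by what the proposer is allowed to see (ablation).
ablation_median = {
    "scores_only": 34.6,
    "scores_plus_summaries": 34.9,
    "full_traces": 50.0,
}

# Approximate tokens read per iteration: Meta-Harness proposer vs prior optimizers.
proposer_tokens, prior_tokens = 10e6, 0.02e6


def summarize():
    """Return the derived gaps and ratios, rounded for readability."""
    base = ablation_median["scores_only"]
    return {
        "accuracy_gain_points": round(meta_harness_acc - ace_acc, 1),
        "token_reduction_factor": round(ace_tokens / meta_harness_tokens, 2),
        "ablation_delta_vs_scores_only": {
            name: round(score - base, 1) for name, score in ablation_median.items()
        },
        "context_ratio": round(proposer_tokens / prior_tokens),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

On these numbers, adding summaries moves the median by 0.3 points while full traces add 15.4, and the proposer reads roughly 500 times more tokens per iteration than prior optimizers. The token reduction works out to about 4.5x, consistent with the "4x fewer tokens" shorthand.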