---
type: source
title: "Self-improving agentic systems with auto-evals"
author: "Gauri Gupta & Ritvik Kapila (NeoSigma)"
url: https://x.com/gauri__gupta/status/2039173240204243131
date: 2026-03-31
domain: ai-alignment
intake_tier: directed
rationale: "Four-phase self-improvement loop: failure mining → eval clustering → optimization → regression gate. Score 0.56→0.78 on fixed model. Complements AutoAgent with production-oriented approach."
proposed_by: "Leo (research batch routing)"
format: tweet
status: processed
processed_by: rio
processed_date: 2026-04-05
claims_extracted:
- "self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can"
enrichments:
- "multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value"
---

# NeoSigma auto-harness

Four-phase outer loop on production traffic: (A) failure mining from execution traces, (B) eval clustering by root cause (29+ clusters discovered automatically), (C) optimization of prompts, tools, context, and workflow, (D) regression gate (≥80% on the regression suite and no validation degradation). Score rose from a 0.560 baseline to 0.780 after 18 batches and 96 experiments. The model was fixed at GPT-5.4, so the gains come purely from harness changes. The regression suite grew from 0 to 17 test cases. GitHub: neosigmaai/auto-harness.
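
The four phases can be sketched as a single batch step. This is a minimal illustration, not the neosigmaai/auto-harness API: every name here (`Trace`, `mine_failures`, `cluster_by_root_cause`, `passes_gate`, `run_batch`) is hypothetical, the `root_cause` label is assumed to be given rather than discovered automatically as in the source, and the optimizer and scorers are passed in as plain callables.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    """One execution trace from production traffic (hypothetical shape)."""
    task_id: str
    passed: bool
    root_cause: str = ""  # assumed pre-labeled; the source clusters automatically


def mine_failures(traces):
    """Phase A: keep only the traces that failed."""
    return [t for t in traces if not t.passed]


def cluster_by_root_cause(failures):
    """Phase B: group failures that share a root cause."""
    clusters = defaultdict(list)
    for trace in failures:
        clusters[trace.root_cause].append(trace)
    return dict(clusters)


def passes_gate(regression_score, val_before, val_after, threshold=0.80):
    """Phase D: at least `threshold` on the regression suite, no validation drop."""
    return regression_score >= threshold and val_after >= val_before


def run_batch(harness, traces, propose, score_regression, score_validation):
    """Run one batch and return the harness that survives the gate.

    `propose(harness, cause, examples)` stands in for phase C: it returns a
    candidate harness with changed prompts, tools, context, or workflow.
    """
    best = harness
    best_val = score_validation(best)
    clusters = cluster_by_root_cause(mine_failures(traces))
    for cause, examples in clusters.items():
        candidate = propose(best, cause, examples)
        candidate_val = score_validation(candidate)
        if passes_gate(score_regression(candidate), best_val, candidate_val):
            best, best_val = candidate, candidate_val
    return best
```

Calling `run_batch` once per batch, with the returned harness fed into the next call, gives the outer loop. The sketch does not model how the regression suite itself grows over time; in a fuller version, fixed failures would presumably be added to it after each accepted change.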