teleo-codex/inbox/archive/2026-03-28-stanford-meta-harness.md

---
type: source
title: "Meta-Harness: End-to-End Optimization of Model Harnesses"
author: "Stanford/MIT (arxiv 2603.28052)"
url: https://arxiv.org/html/2603.28052v1
date: 2026-03-28
domain: ai-alignment
intake_tier: directed
rationale: "Academic validation that harness engineering outweighs model selection: up to a 6x performance gap from the harness alone. Critical finding: summaries destroy diagnostic signal; full execution traces are essential."
proposed_by: "Leo (research batch routing)"
format: paper
status: processed
processed_by: rio
processed_date: 2026-04-05
claims_extracted:
  - "harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains"
enrichments:
  - "multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value"
---

# Meta-Harness (Stanford/MIT)

Key results:

- Text classification: +7.7 points over ACE (48.6% vs 40.9%) using about 4x fewer tokens (11.4K vs 50.8K).
- Math reasoning: +4.7 points across 5 held-out models.
- TerminalBench-2: 76.4% (#2 overall, #1 among Haiku agents).
- Critical ablation (median scores): scores only 34.6, scores + summaries 34.9, full traces 50.0. Summaries add only 0.3 points over scores alone, far short of the 15.4-point gain from full traces, so they recover almost none of the diagnostic signal.
- Proposer context: reads a median of 82 files per iteration, ~10M tokens per iteration vs ~0.02M for prior optimizers.
- Discovered behaviors: draft-verification retrieval, lexical routing, environment bootstrapping.
- Caveat: the 6x gap is worst-to-best across all harnesses, not a controlled A/B comparison.
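
The headline figures in this note can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the numbers quoted in this note; the variable and function names are illustrative and do not come from the paper.

```python
# Figures copied from this note; all names here are illustrative assumptions.
META_HARNESS_ACC, ACE_ACC = 48.6, 40.9               # text classification accuracy, %
META_HARNESS_TOKENS, ACE_TOKENS = 11.4e3, 50.8e3     # tokens used
ABLATION_MEDIAN = {                                  # median score per feedback type
    "scores_only": 34.6,
    "scores_plus_summaries": 34.9,
    "full_traces": 50.0,
}
PROPOSER_TOKENS, PRIOR_TOKENS = 10e6, 0.02e6         # approximate tokens per iteration


def derived_figures():
    """Recompute the deltas and ratios implied by the quoted numbers."""
    base = ABLATION_MEDIAN["scores_only"]
    return {
        "accuracy_gain_points": round(META_HARNESS_ACC - ACE_ACC, 1),
        "token_reduction_factor": round(ACE_TOKENS / META_HARNESS_TOKENS, 2),
        "summary_delta_points": round(ABLATION_MEDIAN["scores_plus_summaries"] - base, 1),
        "trace_delta_points": round(ABLATION_MEDIAN["full_traces"] - base, 1),
        "context_ratio": round(PROPOSER_TOKENS / PRIOR_TOKENS),
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value}")
```

On the quoted figures, the token reduction works out to roughly 4.46x (rounded to "4x" in the note), the summary condition differs from scores-only by 0.3 median points against 15.4 for full traces, and the proposer's per-iteration context is about 500x that of prior optimizers.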