teleo-codex/domains/ai-alignment/harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains.md

---
type: claim
domain: ai-alignment
secondary_domains: collective-intelligence
description: Stanford Meta-Harness paper shows a single harness change can produce a 6x performance gap on the same model and benchmark, with their automated harness optimizer achieving +7.7 points and 4x fewer tokens versus state-of-the-art, ranking #1 on multiple benchmarks
confidence: likely
source: Stanford/MIT, 'Meta-Harness: End-to-End Optimization of Model Harnesses' (March 2026, arxiv 2603.28052); Alex Prompter tweet (609 likes); Lior Alexander tweet; elvis/omarsar tweet
created: 2026-04-05
depends_on: self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can
sourced_from: inbox/archive/2026-03-28-stanford-meta-harness.md
---

Harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains

Stanford and MIT's Meta-Harness paper (March 2026) establishes that the harness — the code determining what to store, retrieve, and show to the model — often matters as much as or more than the model itself. A single harness change can produce "a 6x performance gap on the same benchmark."

Key results

Text Classification (Online Learning):

  • Meta-Harness: 48.6% accuracy vs. ACE (state-of-the-art context management): 40.9%
  • +7.7 point improvement using 4x fewer context tokens (11.4K vs 50.8K)
  • Matched the best prior text optimizers' performance with 0.1x the evaluations (4 vs. 60 proposals)
  • Out-of-distribution evaluation on 9 unseen datasets: +2.9 points over ACE (73.1% vs 70.2%)

Retrieval-Augmented Math Reasoning:

  • Single discovered harness improved IMO-level problem solving by 4.7 points on average across 5 held-out models
  • Transferability demonstrated across models not seen during search

TerminalBench-2 Agentic Coding:

  • 76.4% pass rate on Opus 4.6 (#2 among all agents)
  • #1 among Claude Haiku 4.5 agents (37.6% vs next-best 35.5%)
  • Surpassed the hand-engineered baseline, Terminus-KIRA

The critical finding: execution traces matter, summaries don't

An ablation study quantified the value of different information access:

| Information access | Median accuracy | Best accuracy |
| --- | --- | --- |
| Scores only | 34.6 | 41.3 |
| Scores + LLM summaries | 34.9 | 38.7 |
| Full execution traces | 50.0 | 56.7 |

LLM-generated summaries actually degraded performance compared to scores-only. "Information compression destroys signal needed for harness engineering." The proposer reads a median of 82 files per iteration, referencing over 20 prior candidates — operating at ~10 million tokens per iteration versus ~0.02 million for prior text optimizers.

This has a direct implication for agent system design: summarization-based approaches to managing agent memory and context may be destroying the diagnostic signal needed for system improvement. Full execution traces, despite their cost, contain information that summaries cannot recover.
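The ablation's point can be made concrete with a toy example. This is my own illustration (the `Step`/`Trace` types and the failing run are invented): a one-line summary preserves the verdict but discards the exact error text that a harness fix would need to target.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # what the agent did (e.g. a shell command)
    observation: str  # what the environment returned

@dataclass
class Trace:
    steps: list
    score: float

def summarize(trace: Trace) -> str:
    # Lossy compression: keeps the verdict, drops the diagnostic detail.
    return f"{len(trace.steps)} steps, score={trace.score}"

def full_trace_view(trace: Trace) -> str:
    # Lossless but token-expensive: every action/observation pair survives.
    return "\n".join(f"> {s.action}\n{s.observation}" for s in trace.steps)

# A failing run whose root cause lives only in the raw observations:
trace = Trace(
    steps=[
        Step("pip install requests", "ERROR: No network access"),
        Step("python solve.py", "ModuleNotFoundError: requests"),
    ],
    score=0.0,
)
```

Here `summarize(trace)` yields `"2 steps, score=0.0"`: a proposer reading only that (or only the score) cannot infer that the fix is to vendor the dependency rather than install it, while the full trace states it outright. That asymmetry is the "compression destroys signal" claim in miniature.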

Discovered behaviors

The Meta-Harness system discovered non-obvious harness strategies:

  • Draft-verification retrieval — using a draft label to retrieve targeted counterexamples rather than generic neighbors (text classification)
  • Lexical routing — assigning problems to subject-specific retrieval policies with domain-specific reranking (math)
  • Environment bootstrapping — a single pre-execution shell command gathering OS and package info, eliminating 2-4 exploratory agent turns (coding)
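The environment-bootstrapping behavior is easy to sketch. This is a speculative reconstruction of the idea, not the discovered harness: the exact shell command (`BOOTSTRAP_CMD`) and the `bootstrap_context` helper are my assumptions; the principle is that one pre-execution probe replaces several exploratory agent turns.

```python
import subprocess

# Hypothetical bootstrap probe: one shell command run BEFORE the first
# agent turn, gathering OS and package info the agent would otherwise
# spend 2-4 turns discovering interactively.
BOOTSTRAP_CMD = (
    "uname -sr && python3 --version 2>&1 && pip list 2>/dev/null | head -n 20"
)

def bootstrap_context() -> str:
    result = subprocess.run(
        ["sh", "-c", BOOTSTRAP_CMD],
        capture_output=True, text=True, timeout=30,
    )
    # Prepend the probe's output to the agent's initial context.
    return f"Environment info (gathered before the first agent turn):\n{result.stdout}"
```

Front-loading this into the system prompt trades a few hundred tokens for the turns the agent would have burned running `uname` and `pip list` itself.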

The TerminalBench-2 search log showed sophisticated causal reasoning: after regressions from confounded interventions, the proposer explicitly identified confounds, isolated variables, and pivoted to purely additive modifications.

Challenges

The "6x gap" headline is from a worst-to-best comparison across all possible harnesses, not a controlled A/B test against a reasonable baseline. The practical improvement over state-of-the-art baselines is meaningful but more modest (+7.7 points, +4.7 points). The paper's strongest claim — that harness matters as much as the model — is well-supported, but the headline number is more dramatic than the typical improvement a practitioner would see.


Relevant Notes:

Topics: