| type | domain | secondary_domains | description | confidence | source | created | depends_on | sourced_from |
|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | | Stanford Meta-Harness paper shows a single harness change can produce a 6x performance gap on the same model and benchmark, with their automated harness optimizer achieving +7.7 points and 4x fewer tokens versus state-of-the-art, ranking #1 on multiple benchmarks | likely | Stanford/MIT, 'Meta-Harness: End-to-End Optimization of Model Harnesses' (March 2026, arxiv 2603.28052); Alex Prompter tweet (609 likes); Lior Alexander tweet; elvis/omarsar tweet | 2026-04-05 | | |
Harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark, while model upgrades produce smaller gains
Stanford and MIT's Meta-Harness paper (March 2026) establishes that the harness — the code determining what to store, retrieve, and show to the model — often matters as much as or more than the model itself. A single harness change can produce "a 6x performance gap on the same benchmark."
Key results
Text Classification (Online Learning):
- Meta-Harness: 48.6% accuracy vs. ACE (state-of-the-art context management): 40.9%
- +7.7 point improvement using 4x fewer context tokens (11.4K vs 50.8K)
- Matched best prior text optimizers' performance using ~0.1x the evaluations (4 vs 60 proposals)
- Out-of-distribution evaluation on 9 unseen datasets: +2.9 points over ACE (73.1% vs 70.2%)
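The headline deltas above can be checked with simple arithmetic (a sanity check, not from the paper; note the token reduction is closer to 4.5x than the rounded "4x"):

```python
# Sanity-check the reported text-classification deltas.
meta_acc, ace_acc = 48.6, 40.9               # accuracy (%)
assert round(meta_acc - ace_acc, 1) == 7.7   # the +7.7 point improvement

meta_tokens, ace_tokens = 11.4, 50.8         # context tokens (thousands)
print(round(ace_tokens / meta_tokens, 1))    # 4.5 -- reported as "4x fewer"

ood_meta, ood_ace = 73.1, 70.2               # out-of-distribution accuracy (%)
assert round(ood_meta - ood_ace, 1) == 2.9   # the +2.9 point OOD delta
```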
Retrieval-Augmented Math Reasoning:
- Single discovered harness improved IMO-level problem solving by 4.7 points on average across 5 held-out models
- Transferability demonstrated across models not seen during search
TerminalBench-2 Agentic Coding:
- 76.4% pass rate on Opus 4.6 (#2 among all agents)
- #1 among Claude Haiku 4.5 agents (37.6% vs next-best 35.5%)
- Surpassed hand-engineered baseline Terminus-KIRA
The critical finding: execution traces matter, summaries don't
An ablation study quantified the value of different information access:
| Information access | Median accuracy (%) | Best accuracy (%) |
|---|---|---|
| Scores only | 34.6 | 41.3 |
| Scores + LLM summaries | 34.9 | 38.7 |
| Full execution traces | 50.0 | 56.7 |
LLM-generated summaries actually degraded performance compared to scores-only. "Information compression destroys signal needed for harness engineering." The proposer reads a median of 82 files per iteration, referencing over 20 prior candidates — operating at ~10 million tokens per iteration versus ~0.02 million for prior text optimizers.
This has a direct implication for agent system design: summarization-based approaches to managing agent memory and context may be destroying the diagnostic signal needed for system improvement. Full execution traces, despite their cost, contain information that summaries cannot recover.
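The search loop implied by these numbers can be sketched as follows. This is an illustrative reconstruction, not the paper's code, and every name in it (`Candidate`, `run_benchmark`, `propose`) is hypothetical; the design point it demonstrates is that the proposer conditions on full traces from every prior candidate, never on summaries.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    config: dict
    score: float = 0.0
    traces: list = field(default_factory=list)  # full per-task execution logs

def run_benchmark(config):
    """Stand-in evaluation: returns a score plus full traces (not summaries)."""
    k = config.get("retrieval_k", 0)
    score = 30 + 25 * k / 10  # toy scoring curve, for illustration only
    traces = [f"task-{i}: retrieved {k} neighbors" for i in range(3)]
    return score, traces

def propose(history):
    """The real proposer is an LLM reading ~10M tokens of prior traces and
    candidate code; here we just perturb the best candidate's config."""
    best = max(history, key=lambda c: c.score)
    new = dict(best.config)
    new["retrieval_k"] = new.get("retrieval_k", 0) + 1
    return new

seed = Candidate({"retrieval_k": 1})
seed.score, seed.traces = run_benchmark(seed.config)
history = [seed]
for _ in range(4):  # each iteration conditions on ALL prior candidates' traces
    cand = Candidate(propose(history))
    cand.score, cand.traces = run_benchmark(cand.config)
    history.append(cand)

best = max(history, key=lambda c: c.score)
print(best.config, round(best.score, 1))  # -> {'retrieval_k': 5} 42.5
```

The `traces` list is what the summaries-ablation removes: collapse it to a one-line LLM summary and, per the table above, the proposer does no better than with scores alone.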
Discovered behaviors
The Meta-Harness system discovered non-obvious harness strategies:
- Draft-verification retrieval — using a draft label to retrieve targeted counterexamples rather than generic neighbors (text classification)
- Lexical routing — assigning problems to subject-specific retrieval policies with domain-specific reranking (math)
- Environment bootstrapping — a single pre-execution shell command gathering OS and package info, eliminating 2-4 exploratory agent turns (coding)
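The environment-bootstrapping strategy translates to something like the following one-shot probe (a hypothetical command; the paper does not publish the exact one it discovered):

```shell
# One pre-execution probe gathering OS and package info, replacing the
# 2-4 exploratory turns an agent would otherwise spend discovering its environment.
uname -srm
head -n 2 /etc/os-release 2>/dev/null
python3 --version 2>&1
pip list 2>/dev/null | head -n 10
```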
The TerminalBench-2 search log showed sophisticated causal reasoning: after regressions from confounded interventions, the proposer explicitly identified confounds, isolated variables, and pivoted to purely additive modifications.
Challenges
The "6x gap" headline is from a worst-to-best comparison across all possible harnesses, not a controlled A/B test against a reasonable baseline. The practical improvement over state-of-the-art baselines is meaningful but more modest (+7.7 points, +4.7 points). The paper's strongest claim — that harness matters as much as the model — is well-supported, but the headline number is more dramatic than the typical improvement a practitioner would see.
Relevant Notes:
- self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can — Meta-Harness is the academic validation of the pattern AutoAgent and auto-harness demonstrated in production
- multi-agent coordination delivers value only when three conditions hold simultaneously: natural parallelism, context overflow, and adversarial verification value — Meta-Harness proposes using a single meta-agent rather than multi-agent coordination for system improvement, suggesting harness optimization may be a higher-ROI intervention than adding agents
Topics: