teleo-codex/domains/ai-alignment/harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains.md

---
type: claim
domain: ai-alignment
secondary_domains: collective-intelligence
description: Stanford Meta-Harness paper shows a single harness change can produce a 6x performance gap on the same model and benchmark, with their automated harness optimizer achieving +7.7 points and 4x fewer tokens versus state-of-the-art, ranking #1 on multiple benchmarks
confidence: likely
source: Stanford/MIT, 'Meta-Harness: End-to-End Optimization of Model Harnesses' (March 2026, arxiv 2603.28052); Alex Prompter tweet (609 likes); Lior Alexander tweet; elvis/omarsar tweet
created: 2026-04-05
depends_on: self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can
sourced_from: inbox/archive/2026-03-28-stanford-meta-harness.md
---

Harness engineering outweighs model selection in agent system performance because changing the code wrapping the model produces up to 6x performance gaps on the same benchmark while model upgrades produce smaller gains

Stanford and MIT's Meta-Harness paper (March 2026) establishes that the harness — the code determining what to store, retrieve, and show to the model — often matters as much as or more than the model itself. A single harness change can produce "a 6x performance gap on the same benchmark."

Key results

Text Classification (Online Learning):

  • Meta-Harness: 48.6% accuracy vs. ACE (state-of-the-art context management): 40.9%
  • +7.7 point improvement using 4x fewer context tokens (11.4K vs 50.8K)
  • Matched the best prior text optimizers' performance with 0.1x the evaluations (4 vs. 60 proposals)
  • Out-of-distribution evaluation on 9 unseen datasets: +2.9 points over ACE (73.1% vs 70.2%)

Retrieval-Augmented Math Reasoning:

  • Single discovered harness improved IMO-level problem solving by 4.7 points on average across 5 held-out models
  • Transferability demonstrated across models not seen during search

TerminalBench-2 Agentic Coding:

  • 76.4% pass rate on Opus 4.6 (#2 among all agents)
  • #1 among Claude Haiku 4.5 agents (37.6% vs next-best 35.5%)
  • Surpassed the hand-engineered baseline, Terminus-KIRA

The critical finding: execution traces matter, summaries don't

An ablation study quantified the value of different information access:

| Information access | Median accuracy | Best accuracy |
| --- | --- | --- |
| Scores only | 34.6 | 41.3 |
| Scores + LLM summaries | 34.9 | 38.7 |
| Full execution traces | 50.0 | 56.7 |

LLM-generated summaries actually degraded performance compared to scores-only. "Information compression destroys signal needed for harness engineering." The proposer reads a median of 82 files per iteration, referencing over 20 prior candidates — operating at ~10 million tokens per iteration versus ~0.02 million for prior text optimizers.

This has a direct implication for agent system design: summarization-based approaches to managing agent memory and context may be destroying the diagnostic signal needed for system improvement. Full execution traces, despite their cost, contain information that summaries cannot recover.
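The ablation's point can be made concrete with a toy example. This is my own illustration (the `Step`/`Trace` types and the failing run are invented): a one-line summary preserves the verdict but discards the exact error text that a harness fix would need to target.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # what the agent did (e.g. a shell command)
    observation: str  # what the environment returned

@dataclass
class Trace:
    steps: list
    score: float

def summarize(trace: Trace) -> str:
    # Lossy compression: keeps the verdict, drops the diagnostic detail.
    return f"{len(trace.steps)} steps, score={trace.score}"

def full_trace_view(trace: Trace) -> str:
    # Lossless but token-expensive: every action/observation pair survives.
    return "\n".join(f"> {s.action}\n{s.observation}" for s in trace.steps)

# A failing run whose root cause lives only in the raw observations:
trace = Trace(
    steps=[
        Step("pip install requests", "ERROR: No network access"),
        Step("python solve.py", "ModuleNotFoundError: requests"),
    ],
    score=0.0,
)
```

Here `summarize(trace)` yields `"2 steps, score=0.0"`: a proposer reading only that (or only the score) cannot infer that the fix is to vendor the dependency rather than install it, while the full trace states it outright. That asymmetry is the "compression destroys signal" claim in miniature.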

Discovered behaviors

The Meta-Harness system discovered non-obvious harness strategies:

  • Draft-verification retrieval — using a draft label to retrieve targeted counterexamples rather than generic neighbors (text classification)
  • Lexical routing — assigning problems to subject-specific retrieval policies with domain-specific reranking (math)
  • Environment bootstrapping — a single pre-execution shell command gathering OS and package info, eliminating 2-4 exploratory agent turns (coding)
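The environment-bootstrapping behavior is easy to sketch. This is a speculative reconstruction of the idea, not the discovered harness: the exact shell command (`BOOTSTRAP_CMD`) and the `bootstrap_context` helper are my assumptions; the principle is that one pre-execution probe replaces several exploratory agent turns.

```python
import subprocess

# Hypothetical bootstrap probe: one shell command run BEFORE the first
# agent turn, gathering OS and package info the agent would otherwise
# spend 2-4 turns discovering interactively.
BOOTSTRAP_CMD = (
    "uname -sr && python3 --version 2>&1 && pip list 2>/dev/null | head -n 20"
)

def bootstrap_context() -> str:
    result = subprocess.run(
        ["sh", "-c", BOOTSTRAP_CMD],
        capture_output=True, text=True, timeout=30,
    )
    # Prepend the probe's output to the agent's initial context.
    return f"Environment info (gathered before the first agent turn):\n{result.stdout}"
```

Front-loading this into the system prompt trades a few hundred tokens for the turns the agent would have burned running `uname` and `pip list` itself.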

The TerminalBench-2 search log showed sophisticated causal reasoning: after regressions from confounded interventions, the proposer explicitly identified confounds, isolated variables, and pivoted to purely additive modifications.

Challenges

The "6x gap" headline is from a worst-to-best comparison across all possible harnesses, not a controlled A/B test against a reasonable baseline. The practical improvement over state-of-the-art baselines is meaningful but more modest (+7.7 points, +4.7 points). The paper's strongest claim — that harness matters as much as the model — is well-supported, but the headline number is more dramatic than the typical improvement a practitioner would see.


Relevant Notes:

Topics: