Previous reweave runs used 2-space indent + quotes for list entries while the standard format is 0-space indent without quotes. This caused YAML parse failures during merge. Bulk-fixed all reweave_edges files. Pentagon-Agent: Ship <D53BE6DB-B498-4B30-B588-75D1F6D2124A>
4.5 KiB
| type | domain | secondary_domains | description | confidence | source | created | depends_on | challenged_by | related | reweave_edges | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment |
|
Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate | experimental | Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI. | 2026-03-31 |
|
|
|
|
Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
The core finding is that Full IHR behaves as a solved-set replacer, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain.
The most informative failures are alignment failures, not random misses. On matplotlib__matplotlib-24570, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that misses the official evaluator. On django__django-14404 and sympy__sympy-23950, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path.
This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion.
Challenges
The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences.
Relevant Notes:
- multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths
- coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes
- 79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier
Topics: