---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows
challenged_by:
- coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem
related:
- harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks
reweave_edges:
- harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks|related|2026-04-03
---
# Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
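
The six single-module ablations can be pictured as a toggle grid. A minimal sketch, assuming a hypothetical config object (only the module names come from the paper):

```python
# Sketch of the single-module ablation grid: one run per module,
# with everything enabled except the module under test.
# The HarnessConfig mechanism is illustrative, not the paper's code.
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class HarnessConfig:
    file_backed_state: bool = True
    evidence_backed_answering: bool = True
    verifier_separation: bool = True
    self_evolution: bool = True
    multi_candidate_search: bool = True
    dynamic_orchestration: bool = True

FULL = HarnessConfig()
ABLATIONS = {f.name: replace(FULL, **{f.name: False}) for f in fields(HarnessConfig)}
print(len(ABLATIONS))  # 6
```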
The core finding is that Full IHR behaves as a **solved-set replacer**, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain.
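
The agreement and flip-frontier numbers can be derived from per-sample pass/fail maps. A minimal sketch with illustrative sample IDs and outcomes, not the paper's data:

```python
# Compare per-sample outcomes of the full harness vs. one ablation.
# Sample IDs and results below are synthetic stand-ins.

def flip_frontier(full: dict, ablation: dict):
    """Return (agree_count, gained, lost) for two {sample_id: solved} maps."""
    ids = full.keys() & ablation.keys()
    agree = sum(full[i] == ablation[i] for i in ids)
    gained = sorted(i for i in ids if full[i] and not ablation[i])  # Full-only wins
    lost = sorted(i for i in ids if ablation[i] and not full[i])    # ablation-only wins
    return agree, gained, lost

full = {"django-14404": False, "mpl-24570": False, "sympy-23950": True, "astropy-123": True}
abl  = {"django-14404": True,  "mpl-24570": False, "sympy-23950": True, "astropy-123": True}

agree, gained, lost = flip_frontier(full, abl)
print(agree, gained, lost)  # 3 [] ['django-14404']
```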
The most informative failures are alignment failures, not random misses. On `matplotlib__matplotlib-24570`, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that fails the official evaluator. On `django__django-14404` and `sympy__sympy-23950`, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path.
This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion.
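
The redistribution point can be made concrete with a toy sketch using synthetic pass/fail vectors (not the paper's data): adding a module can leave the aggregate score unchanged while still flipping individual cases in both directions.

```python
# Aggregate score deltas can hide redistribution of solvability.
# Synthetic per-sample solve flags, illustrative only:
base   = [True, True, False, False, True, False]   # harness without the module
with_m = [True, False, True, False, True, False]   # harness with the module added

delta = sum(with_m) - sum(base)                    # net change in solve count
churn = sum(a != b for a, b in zip(base, with_m))  # samples flipped either way
print(delta, churn)  # 0 2: zero aggregate delta, yet the module flipped 2 cases
```

This is why per-case flip inspection, not the score delta, is the meaningful evaluation signal here.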
## Challenges
The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences.
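
One way to probe the single-seed concern, sketched here with purely synthetic agreement flags (the paper reports no such analysis), is to resample many same-sized subsets from a larger pool and look at the spread of the agreement rate:

```python
# Resample 125-task subsets to see how much the agreement rate could
# vary across seeds. All numbers are synthetic stand-ins.
import random

POOL, SUBSET = 500, 125   # hypothetical full-benchmark size; paper's subset size
random.seed(0)

# Synthetic per-task agreement flags: roughly 90% of tasks agree.
agrees = [random.random() < 0.9 for _ in range(POOL)]

rates = []
for _ in range(1000):
    sample = random.sample(agrees, SUBSET)
    rates.append(sum(sample) / SUBSET)

print(f"agreement rate across resampled subsets: {min(rates):.2f}-{max(rates):.2f}")
```

A wide spread would mean the reported 110-115/125 band is sensitive to which subset was drawn.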
---
Relevant Notes:
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier
Topics:
- [[_map]]