teleo-codex/domains/ai-alignment/harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md

43 lines
No EOL
4.9 KiB
Markdown

---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows
challenged_by:
- coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem
related:
- harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart
reweave_edges:
- harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks|related|2026-04-03
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart|related|2026-04-17
---
# Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
The core finding is that Full IHR behaves as a **solved-set replacer**, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain.
The most informative failures are alignment failures, not random misses. On `matplotlib__matplotlib-24570`, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that misses the official evaluator. On `django__django-14404` and `sympy__sympy-23950`, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path.
This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion.
## Challenges
The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences.
---
Relevant Notes:
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier
Topics:
- [[_map]]