- What: 5 NEW claims from "Natural-Language Agent Harnesses" (arXiv:2603.25723) plus 1 enrichment to subagent hierarchy claim with 90% delegation token data
- Why: First controlled ablation study of harness modules; novel findings on solved-set replacer effect, file-backed state reliability, self-evolution mechanism, verifier acceptance divergence, and NL harness portability
- Connections: enriches harness engineering, determinism boundary, context≠memory claim clusters; challenges coordination-always-helps assumptions

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | domain | secondary_domains | description | confidence | source | created | depends_on |
|---|---|---|---|---|---|---|---|
| claim | ai-alignment | | Ablation study shows file-backed state improves both SWE-bench (+1.6pp) and OSWorld (+5.5pp) while maintaining the lowest overhead profile among tested modules; its value is process structure not score gain | experimental | Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI. | 2026-03-31 | |

File-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart
Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic) — the only module to show consistent positive gains across both benchmarks without high variance.
The module enforces three properties:
- Externalized — state is written to artifacts rather than held only in transient context
- Path-addressable — later stages reopen the exact object by path
- Compaction-stable — state survives truncation, restart, and delegation
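The three properties above can be sketched in a few lines. This is a minimal illustration, not the module Pan et al. evaluated: the `FileBackedState` class, the `state/task.json` layout, and the sample state values are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile
from pathlib import Path


class FileBackedState:
    """Hypothetical helper: keeps agent state in a JSON file, not in context."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, state: dict) -> Path:
        # Externalized: write to a temp file, then atomically replace the
        # target so a crash mid-write never leaves a half-written state file.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2, sort_keys=True)
        os.replace(tmp, self.path)
        return self.path

    def load(self) -> dict:
        # Path-addressable: any later stage reopens the exact object by path.
        with self.path.open(encoding="utf-8") as fh:
            return json.load(fh)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        state_path = Path(workdir) / "state" / "task.json"
        # Illustrative state values only; not taken from the paper.
        FileBackedState(state_path).save({"step": 3, "patch": "candidate-1"})
        # Compaction-stable: a fresh object (simulating a restart or a
        # delegated child) holds nothing in memory and still recovers the
        # full state from the path alone.
        return FileBackedState(state_path).load()
```

The design choice worth noting is that the second `FileBackedState` instance shares nothing with the first except the path, which is the only thing a truncated context or a child agent would need to carry.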
Its gains are modest in absolute terms, but its mechanism is distinct from the other modules. File-backed state and evidence-backed answering mainly improve process structure: they leave durable external signatures (task histories, manifests, analysis sidecars) that improve auditability, handoff discipline, and trace quality more directly than they improve semantic repair ability.
On OSWorld, the file-backed state effect is amplified because the baseline already involves a structured harness (OS-Symphony). The migration study (RQ3) confirms this: migrated NLAH runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.
The case study of mwaskom__seaborn-3069 illustrates the mechanism: under file-backed state, the workspace leaves a durable spine consisting of a parent response, append-only task history, and manifest entries for the promoted patch artifact. The child handoff and artifact lineage become explicit, helping the solver keep one patch surface and one verification story.
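The "durable spine" in this case study can be approximated as an append-only history plus a manifest of promoted artifacts. The sketch below is an assumption about what such a workspace might look like; the file names `task_history.jsonl` and `manifest.json`, the event fields, and the `child-1` label are hypothetical and are not taken from the paper.

```python
import json
import tempfile
from pathlib import Path


def append_history(workspace: Path, event: dict) -> None:
    # Append-only: mode "a" adds a line and never rewrites earlier entries.
    with (workspace / "task_history.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def promote_artifact(workspace: Path, artifact: str, produced_by: str) -> None:
    # The manifest records which artifact was promoted and who produced it,
    # so the lineage of the patch stays explicit across handoffs.
    manifest_path = workspace / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = []
    manifest.append({"artifact": artifact, "produced_by": produced_by})
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    append_history(workspace, {"event": "promote", "artifact": artifact})


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        append_history(ws, {"event": "delegate", "to": "child-1"})
        (ws / "patch.diff").write_text("placeholder diff\n", encoding="utf-8")
        promote_artifact(ws, "patch.diff", produced_by="child-1")
        history_text = (ws / "task_history.jsonl").read_text(encoding="utf-8")
        history = [json.loads(line) for line in history_text.splitlines()]
        manifest = json.loads((ws / "manifest.json").read_text(encoding="utf-8"))
        return history, manifest
```

With a single promoted entry in the manifest, a later stage has one patch surface to verify, and the history shows the delegation that produced it.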
Challenges
The +1.6pp on SWE-bench is within noise for 125 samples. The stronger signal is the process trace analysis, not the score delta. Whether file-backed state helps primarily by preventing state loss (defensive value) or by enabling new solution strategies (offensive value) is not cleanly separated by the ablation design.
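A quick back-of-envelope check shows why the score delta is weak evidence. The sketch below converts each delta into a task count and compares it with a worst-case binomial standard error. The baseline accuracy is not stated here, so the code assumes p = 0.5, which maximizes the standard error; the true value is smaller if accuracy is far from 50%, and a paired comparison on the same tasks could be tighter still. The function names are illustrative.

```python
import math


def tasks_equivalent(delta_pp: float, n: int) -> float:
    # Convert a percentage-point delta into a number of benchmark tasks.
    return delta_pp / 100 * n


def worst_case_se_pp(n: int) -> float:
    # Binomial standard error sqrt(p * (1 - p) / n), in percentage points,
    # evaluated at p = 0.5 (an assumption: this is the largest possible value).
    return 100 * math.sqrt(0.25 / n)


swe_tasks = tasks_equivalent(1.6, 125)   # about 2 tasks
osworld_tasks = tasks_equivalent(5.5, 36)  # about 2 tasks
swe_se = worst_case_se_pp(125)           # about 4.5pp
osworld_se = worst_case_se_pp(36)        # about 8.3pp
```

Under these assumptions both deltas correspond to roughly two tasks and sit within one worst-case standard error of zero, which is consistent with treating the process trace analysis as the stronger signal.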
Relevant Notes:
- long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing — file-backed state is the architectural embodiment of this distinction: it externalizes memory to durable artifacts rather than relying on context window as pseudo-memory
- context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching — file-backed state as described by Pan et al. is the production implementation of context-file-as-OS: path-addressable, externalized, compaction-stable
- production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file — the file-backed module's three properties (externalized, path-addressable, compaction-stable) represent exactly the kind of dedicated memory engineering that takes 24% of codebase
Topics: