m3taversal 0fa4836b34 theseus: extract 5 claims + 1 enrichment from Pan et al. NLAH paper

- What: 5 NEW claims from "Natural-Language Agent Harnesses" (arXiv:2603.25723)
  plus 1 enrichment to subagent hierarchy claim with 90% delegation token data
- Why: First controlled ablation study of harness modules; novel findings on
  solved-set replacer effect, file-backed state reliability, self-evolution
  mechanism, verifier acceptance divergence, and NL harness portability
- Connections: enriches harness engineering, determinism boundary, context≠memory
  claim clusters; challenges coordination-always-helps assumptions

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>

2026-03-31 10:35:01 +01:00

4.1 KiB


type: claim
domain: ai-alignment
secondary_domains: collective-intelligence
description: Ablation study shows file-backed state improves both SWE-bench (+1.6pp) and OSWorld (+5.5pp) while maintaining the lowest overhead profile among tested modules; its value is process structure, not score gain
confidence: experimental
source: Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI.
created: 2026-03-31
depends_on:
- long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing
- context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching

File-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart

Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic) — the only module to show consistent positive gains across both benchmarks without high variance.

The module enforces three properties:

- Externalized: state is written to artifacts rather than held only in transient context
- Path-addressable: later stages reopen the exact object by path
- Compaction-stable: state survives truncation, restart, and delegation
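
As a rough illustration of the three properties, here is a minimal Python sketch. It is not the module from Pan et al.; the file layout (`state/task.json`), the field names, and the helpers `save_state`, `load_state`, and `demo` are all hypothetical, chosen only to show state being written out, handed off by path, and reopened after the in-memory copy is gone.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> Path:
    """Externalized: write state to a file instead of keeping it only in context."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename it into place,
    # so an interrupted write does not leave a half-written state file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2, sort_keys=True)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)
    return path


def load_state(path: Path) -> dict:
    """Path-addressable: a later stage reopens the exact object by its path."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def demo(workspace: Path) -> dict:
    # Hypothetical task record; the keys are illustrative assumptions.
    state_path = save_state(
        workspace / "state" / "task.json",
        {"task_id": "example-task", "stage": "patch-drafted", "attempts": 1},
    )
    # Compaction-stable: simulate truncation, restart, or delegation by
    # passing only the path string to the next stage.
    handoff = str(state_path)
    return load_state(Path(handoff))
```

The write-then-rename step is a design choice for this sketch rather than something the paper specifies: it keeps the file at the advertised path either fully old or fully new, which is what a later stage needs if it reopens the path after a restart.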

Its gains are modest in absolute terms, but its mechanism differs from that of most other modules. File-backed state and evidence-backed answering mainly improve process structure: they leave durable external signatures (task histories, manifests, analysis sidecars) that improve auditability, handoff discipline, and trace quality more directly than they improve semantic repair ability.

On OSWorld, the file-backed state effect is amplified because the baseline already involves a structured harness (OS-Symphony). The migration study (RQ3) confirms this: migrated NLAH runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.

The case study of mwaskom__seaborn-3069 illustrates the mechanism: under file-backed state, the workspace leaves a durable spine consisting of a parent response, append-only task history, and manifest entries for the promoted patch artifact. The child handoff and artifact lineage become explicit, helping the solver keep one patch surface and one verification story.
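
The "durable spine" in this case study can be pictured with a small Python sketch. The paper does not publish this layout; the file names (`task_history.jsonl`, `manifest.json`), the entry fields, and the functions below are assumptions made purely for illustration of an append-only history plus a manifest of promoted artifacts.

```python
import hashlib
import json
from pathlib import Path


def append_history(workspace: Path, event: dict) -> None:
    """Append one event as a JSON line; earlier lines are never rewritten."""
    workspace.mkdir(parents=True, exist_ok=True)
    with (workspace / "task_history.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def promote_artifact(workspace: Path, artifact: Path, produced_by: str) -> dict:
    """Record a promoted artifact in the manifest with its path and content hash."""
    manifest_path = workspace / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    else:
        manifest = {"artifacts": []}
    entry = {
        "path": str(artifact.relative_to(workspace)),
        "sha256": hashlib.sha256(artifact.read_bytes()).hexdigest(),
        "produced_by": produced_by,  # hypothetical lineage field
    }
    manifest["artifacts"].append(entry)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    # The promotion itself is also logged, so history and manifest agree.
    append_history(workspace, {"event": "promoted", **entry})
    return entry


def read_history(workspace: Path) -> list[dict]:
    """Reload the full history, e.g. after a restart or in a child agent."""
    lines = (workspace / "task_history.jsonl").read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line]
```

In a layout like this, a child handoff only needs the workspace path: the manifest names the single promoted patch and its hash, and the history records who produced it, which is one way to keep "one patch surface and one verification story".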

Challenges

The +1.6pp on SWE-bench is within noise for 125 samples. The stronger signal is the process trace analysis, not the score delta. Whether file-backed state helps primarily by preventing state loss (defensive value) or by enabling new solution strategies (offensive value) is not cleanly separated by the ablation design.
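
To make the noise point concrete, the sketch below converts each reported delta into a task count and compares it with a simple binomial standard error. The sample sizes and deltas come from the source line above; the baseline solve rate of 0.5 is an assumption used only as the worst case for variance, and the single-proportion formula ignores that the ablation runs share the same tasks, so treat the numbers as a rough guide.

```python
import math


def tasks_for_delta(delta_pp: float, n: int) -> float:
    """Number of tasks that a percentage-point delta represents on n samples."""
    return delta_pp / 100 * n


def standard_error_pp(p: float, n: int) -> float:
    """Binomial standard error of a solve rate p on n samples, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)


# Reported deltas and sample sizes from the claim's source line.
swe_tasks = round(tasks_for_delta(1.6, 125))  # SWE-bench Verified
osw_tasks = round(tasks_for_delta(5.5, 36))   # OSWorld

# Assumed baseline solve rate of 0.5 (worst case for variance, not from the paper).
swe_se = standard_error_pp(0.5, 125)
osw_se = standard_error_pp(0.5, 36)

print(f"SWE-bench: {swe_tasks} tasks, SE about {swe_se:.2f}pp")
print(f"OSWorld:   {osw_tasks} tasks, SE about {osw_se:.2f}pp")
```

Both deltas work out to roughly two tasks (2/125 = 1.6pp, 2/36 ≈ 5.6pp), and under the assumed baseline each is smaller than one standard error, which is why the process trace analysis carries more weight than the score differences.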

Relevant Notes:

long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing — file-backed state is the architectural embodiment of this distinction: it externalizes memory to durable artifacts rather than relying on context window as pseudo-memory
context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching — file-backed state as described by Pan et al. is the production implementation of context-file-as-OS: path-addressable, externalized, compaction-stable
production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file — the file-backed module's three properties (externalized, path-addressable, compaction-stable) represent exactly the kind of dedicated memory engineering that takes 24% of codebase

Topics:

_map
