theseus: extract 5 claims + 1 enrichment from Pan et al. NLAH paper

- What: 5 NEW claims from "Natural-Language Agent Harnesses" (arXiv:2603.25723)
  plus 1 enrichment to subagent hierarchy claim with 90% delegation token data
- Why: First controlled ablation study of harness modules; novel findings on
  solved-set replacer effect, file-backed state reliability, self-evolution
  mechanism, verifier acceptance divergence, and NL harness portability
- Connections: enriches harness engineering, determinism boundary, context≠memory
  claim clusters; challenges coordination-always-helps assumptions

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
m3taversal 2026-03-31 10:02:57 +01:00
parent d38f928ce6
commit 0fa4836b34
7 changed files with 237 additions and 0 deletions


@@ -0,0 +1,41 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Ablation study shows file-backed state improves both SWE-bench (+1.6pp) and OSWorld (+5.5pp) while maintaining the lowest overhead profile among tested modules — its value is process structure not score gain"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
- "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
---
# File-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart
Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic) — the only module to show consistent positive gains across both benchmarks without high variance.
The module enforces three properties:
1. **Externalized** — state is written to artifacts rather than held only in transient context
2. **Path-addressable** — later stages reopen the exact object by path
3. **Compaction-stable** — state survives truncation, restart, and delegation
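The three properties can be sketched as a minimal append-only state helper. This is an illustration of the pattern, not the paper's implementation; the directory layout and function names are invented:

```python
import json
from pathlib import Path

STATE_DIR = Path("workspace/state")  # illustrative layout, not from the paper

def write_state(task_id: str, record: dict) -> Path:
    """Externalized: append a record to a durable, path-addressable task history."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    path = STATE_DIR / f"{task_id}.jsonl"
    with path.open("a") as f:      # append-only, so restarts never clobber history
        f.write(json.dumps(record) + "\n")
    return path                    # later stages reopen the exact object by path

def read_state(task_id: str) -> list[dict]:
    """Compaction-stable: reopen by path after truncation, restart, or delegation."""
    path = STATE_DIR / f"{task_id}.jsonl"
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines()]
```

Because the state lives at a stable path rather than in the context window, a delegated child or a restarted run can recover it with nothing but the path.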
Its gains are mild in absolute terms, but its mechanism is distinct from the other modules': file-backed state and evidence-backed answering mainly improve process structure. They leave durable external signatures (task histories, manifests, analysis sidecars) that improve auditability, handoff discipline, and trace quality more than they improve semantic repair ability.
On OSWorld, the file-backed state effect is amplified because the baseline already involves a structured harness (OS-Symphony). The migration study (RQ3) confirms this: migrated NLAH runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.
The case study of `mwaskom__seaborn-3069` illustrates the mechanism: under file-backed state, the workspace leaves a durable spine consisting of a parent response, append-only task history, and manifest entries for the promoted patch artifact. The child handoff and artifact lineage become explicit, helping the solver keep one patch surface and one verification story.
## Challenges
The +1.6pp on SWE-bench is within noise for 125 samples. The stronger signal is the process trace analysis, not the score delta. Whether file-backed state helps primarily by preventing state loss (defensive value) or by enabling new solution strategies (offensive value) is not cleanly separated by the ablation design.
---
Relevant Notes:
- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — file-backed state is the architectural embodiment of this distinction: it externalizes memory to durable artifacts rather than relying on context window as pseudo-memory
- [[context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching]] — file-backed state as described by Pan et al. is the production implementation of context-file-as-OS: path-addressable, externalized, compaction-stable
- [[production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file]] — the file-backed module's three properties (externalized, path-addressable, compaction-stable) represent exactly the kind of dedicated memory engineering that takes 24% of codebase
Topics:
- [[_map]]


@@ -0,0 +1,37 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
challenged_by:
- "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem"
---
# Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
The core finding is that Full IHR behaves as a **solved-set replacer**, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain.
The most informative failures are alignment failures, not random misses. On `matplotlib__matplotlib-24570`, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that misses the official evaluator. On `django__django-14404` and `sympy__sympy-23950`, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path.
This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion.
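Evaluating by flipped boundary cases rather than aggregate deltas can be sketched with a per-sample comparison in the spirit of the paper's Table 2 (the sample data here is hypothetical):

```python
def frontier_diff(full: dict[str, bool], ablation: dict[str, bool]) -> dict:
    """Compare per-sample outcomes instead of aggregate scores.

    `full` and `ablation` map sample id -> resolved; the interesting signal
    is which cases flip, not the net score delta.
    """
    agree = [s for s in full if full[s] == ablation[s]]
    unlocked = [s for s in full if full[s] and not ablation[s]]   # new wins
    lost = [s for s in full if not full[s] and ablation[s]]       # lost repairs
    return {"agree": len(agree), "unlocked": unlocked, "lost": lost}

full = {"a": True, "b": True, "c": False, "d": True, "e": False}
ablation = {"a": True, "b": False, "c": True, "d": True, "e": False}
report = frontier_diff(full, ablation)
# 3 of 5 samples agree; "b" is unlocked by the full harness, "c" is lost
```

A module with a near-zero score delta can still swap several `unlocked` cases for an equal number of `lost` ones, which is exactly the solved-set replacer pattern.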
## Challenges
The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences.
---
Relevant Notes:
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier
Topics:
- [[_map]]


@@ -0,0 +1,39 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Code-to-text migration study on OSWorld shows NLAH realization (47.2%) exceeded native code harness (30.4%) while relocating reliability from screen repair to artifact-backed closure — NL carries harness logic when deterministic operations stay in code"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 5, RQ3 migration analysis. OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do"
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it"
---
# Harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks
Pan et al. (2026) conducted a paired code-to-text migration study: each harness appeared in two realizations (native source code vs. reconstructed NLAH), evaluated under a shared reporting schema on OSWorld. The migrated NLAH realization reached 47.2% task success versus 30.4% for the native OS-Symphony code harness.
The scientific claim is not that NL is superior to code. The paper explicitly states that natural language carries editable, inspectable *orchestration logic*, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The claim is about separability: the harness design-pattern layer (roles, contracts, stage structure, state semantics, failure taxonomy) can be externalized as a natural-language object without degrading performance, provided a shared runtime handles execution semantics.
The migration effect is behavioral, not just numerical. Native OS-Symphony externalizes control as a screenshot-grounded repair loop: verify previous step, inspect current screen, choose next GUI action, retry locally on errors. Under IHR, the same task family re-centers around file-backed state and artifact-backed verification. Runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.
Retained migrated traces are denser (58.5 total logged events vs 18.2 unique commands in native traces) but the density reflects observability and recovery scaffolding, not more task actions. The runtime preserves started/completed pairs, bookkeeping, and explicit artifact handling that native code harnesses handle implicitly.
This result supports the determinism boundary framework: the boundary between what should be NL (high-level orchestration, editable by humans) and what should be code (deterministic hooks, tool adapters, sandbox enforcement) is a real architectural cut point, and making it explicit improves both portability and performance.
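The cut point can be sketched as an editable NL spec interpreted by a runtime whose tool adapters stay in code. The spec text, stage names, and adapters below are invented for illustration and are not the paper's NLAH format:

```python
# Illustrative only: the NL layer is editable text describing stage structure;
# the code layer is deterministic adapters the runtime dispatches to.
NL_HARNESS = """
stage plan: read the task file and write an approach to state/plan.md
stage solve: apply the approach; record artifacts in the manifest
stage verify: rerun the task's acceptance tests against the final artifact
"""

TOOL_ADAPTERS = {  # deterministic side: real code, not prose
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda target: {"target": target, "passed": True},
}

def parse_stages(spec: str) -> list[tuple[str, str]]:
    """Recover (stage name, instruction) pairs from the NL spec."""
    stages = []
    for line in spec.strip().splitlines():
        name, _, instruction = line.removeprefix("stage ").partition(": ")
        stages.append((name, instruction))
    return stages
```

Editing the harness means editing `NL_HARNESS`, a human-inspectable object; the adapters, sandboxing, and tool interfaces never leave code.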
## Challenges
The 47.2 vs 30.4 comparison is on 36 OSWorld samples — small enough that individual task variance could explain some of the gap. The native harness (OS-Symphony) may not be fully optimized for the Codex/IHR backend; some of the NLAH advantage could come from better fit to the specific runtime rather than from portability per se. The authors acknowledge that some harness mechanisms cannot be recovered faithfully from text when they rely on hidden service-side state or training-induced behaviors.
---
Relevant Notes:
- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this paper provides direct evidence: the same runtime with different harness representations produces different behavioral signatures, confirming the harness layer is real and separable
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the NLAH architecture explicitly implements this boundary: NL carries pattern logic (probabilistic, editable), adapters and scripts carry deterministic hooks (guaranteed, code-based)
- [[notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it]] — NLAHs are a formal version of this: natural-language objects that carry executable control logic
Topics:
- [[_map]]


@@ -0,0 +1,36 @@
---
type: claim
domain: ai-alignment
description: "Self-evolution module showed the clearest positive effect in controlled ablation (+4.8pp SWE, +2.7pp OSWorld) by tightening the solve loop around acceptance criteria, not by expanding into larger search trees"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3 + case analysis (scikit-learn__scikit-learn-25747). SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
challenged_by:
- "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
---
# Self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration
Pan et al. (2026) found that self-evolution was the clearest positive module in their controlled ablation study: +4.8pp on SWE-bench Verified (80.0 vs 75.2 Basic) and +2.7pp on OSWorld (44.4 vs 41.7 Basic). In the score-cost view (Figure 4a), self-evolution is the only module that moves upward (higher score) without moving far right (higher cost).
The mechanism is not open-ended reflection or expanded search. The self-evolution module runs an explicit retry loop with a real baseline attempt first and a default cap of five attempts. After every non-successful or stalled attempt, it reflects on concrete failure signals before planning the next attempt. It redesigns along three axes: prompt, tool, and workflow evolution. It stops when judged successful or when the attempt cap is reached, and reports incomplete rather than pretending the last attempt passed.
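The loop structure described above, a capped, acceptance-gated retry with forced reflection, can be sketched as follows. The callable names are hypothetical; only the cap of five and the honest "incomplete" report come from the paper:

```python
MAX_ATTEMPTS = 5  # the paper's default cap

def self_evolve(attempt, acceptance_gate, reflect):
    """Acceptance-gated retry: attempt, check the gate, reflect on failure.

    `attempt(plan)`, `acceptance_gate(result)`, and `reflect(result)` are
    caller-supplied; this skeleton illustrates the control structure only.
    """
    plan = None  # Attempt 1 is a real baseline attempt, not a revision
    for n in range(1, MAX_ATTEMPTS + 1):
        result = attempt(plan)
        if acceptance_gate(result):      # stop only when the gate passes
            return {"status": "success", "attempts": n, "result": result}
        plan = reflect(result)           # forced reflection on concrete
                                         # failure signals before retrying
    # report honestly rather than pretending the last attempt passed
    return {"status": "incomplete", "attempts": MAX_ATTEMPTS}
```

Note that the benefit claimed by the paper lives in the constraints: the cap bounds search, the gate prevents premature closure, and the mandatory `reflect` step couples each failure to the next attempt's design.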
The case of `scikit-learn__scikit-learn-25747` illustrates the favorable regime: Basic fails this sample, but self-evolution resolves it. The module organizes the run around an explicit attempt contract where Attempt 1 is treated as successful only if the task acceptance gate is satisfied. The system closes after Attempt 1 succeeds rather than expanding into a larger retry tree, and the evaluator confirms the final patch fixes the target FAIL_TO_PASS tests. The extra structure makes the first repair attempt more disciplined and better aligned with the benchmark gate.
This is a significant refinement of the "iterative self-improvement" concept. The gain comes not from more iterations or bigger search, but from tighter coupling between failure signals and next-attempt design. The module's constraint structure (explicit cap, forced reflection, acceptance-gated stopping) is what produces the benefit.
## Challenges
The `challenged_by` link to curated vs self-generated skills is important context: self-evolution works here because it operates within a bounded retry loop with explicit acceptance criteria, not because self-generated modifications are generally beneficial. The +4.8pp is from a 125-sample subset; the authors note they plan full-benchmark reruns. Whether the acceptance-gating mechanism transfers to tasks without clean acceptance criteria (creative tasks, open-ended research) is untested.
---
Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — the NLAH self-evolution module is a concrete implementation: structurally separated evaluation (acceptance gate) drives the retry loop
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — self-evolution here succeeds because it modifies approach within a curated structure (the harness), not because it generates new skills from scratch
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the self-evolution module's attempt cap and forced reflection are deterministic hooks, not instructions; this is why it works where unconstrained self-modification fails
Topics:
- [[_map]]


@@ -27,6 +27,11 @@ For the collective superintelligence thesis, this is important. If subagent hier
Ruiz-Serra et al.'s factorised active inference framework demonstrates successful peer multi-agent coordination without hierarchical control. Each agent maintains individual-level beliefs about others' internal states and performs strategic planning in a joint context through decentralized representation. The framework successfully handles iterated normal-form games with 2-3 players without requiring a primary controller. However, the finding that ensemble-level expected free energy is not necessarily minimized at the aggregate level suggests that while peer architectures can function, they may require explicit coordination mechanisms (effectively reintroducing hierarchy) to achieve collective optimization. This partially challenges the claim while explaining why hierarchies emerge in practice.
### Additional Evidence (supporting)
*Source: [[pan-2026-natural-language-agent-harnesses]] | Added: 2026-03-31 | Extractor: anthropic/claude-opus-4-6*
Pan et al. (2026) provide quantitative token-split data from the TRAE NLAH harness on SWE-bench Verified. Table 4 shows that approximately 90% of all prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents rather than in the runtime-owned parent thread (parent: 8.5% prompt, 8.1% completion, 9.8% tool, 9.4% LLM; children: 91.5%, 91.9%, 90.2%, 90.6%). The parent thread is functionally an orchestrator — it reads the harness, dispatches work, and integrates results. This is the first controlled measurement of the delegation concentration in a production-grade harness, confirming the architectural observation that subagent hierarchies concentrate substantive work in children while the parent contributes coordination, not execution.
### Additional Evidence (challenge)
*Source: [[2025-12-00-google-mit-scaling-agent-systems]] | Added: 2026-03-28 | Extractor: anthropic/claude-opus-4-6*


@@ -0,0 +1,35 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation reveals that adding a verifier stage can make agent runs more structured and locally convincing while drifting from the benchmark's actual acceptance object — extra process layers reshape local success signals"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3, Table 7, case analysis (sympy__sympy-23950, django__django-13406). SWE-bench Verified (125 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do"
---
# Verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators
Pan et al. (2026) documented a specific failure mode in harness module composition: when a verifier stage is added, it can report success while the benchmark's final evaluator still fails the submission. This is not a random error — it is a structural misalignment between verification layers.
The case of `sympy__sympy-23950` is the clearest example. Basic and self-evolution both resolve this sample. But file-backed state, evidence-backed answering, verifier, dynamic orchestration, and multi-candidate search all fail it. The verifier run is especially informative because the final response explicitly says a separate verifier reported "solved," while the official evaluator still fails `test_as_set`. The verifier's local acceptance object diverged from the benchmark's acceptance object.
More broadly across the ablation study, the verifier module scored 74.4 on SWE-bench, 0.8pp below Basic's 75.2 and within noise at this sample size. On OSWorld it dropped more sharply (33.3 vs 41.7 Basic, -8.4pp). The verifier adds a genuine independent checking layer: on `django__django-11734`, it reruns targeted Django tests and inspects SQL bindings, and the benchmark agrees. But when the verifier's notion of correctness diverges from the benchmark's final gate, the extra structure makes the run more expensive without improving outcomes.
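The structural misalignment is that the two layers evaluate different predicates over the same artifact. A minimal sketch (the criteria fields are hypothetical, loosely modeled on the `sympy__sympy-23950` case):

```python
def verifier_accepts(patch: dict) -> bool:
    """The verifier's local acceptance object: its own targeted checks."""
    return patch["targeted_tests_pass"] and patch["patch_applies"]

def benchmark_accepts(patch: dict) -> bool:
    """The benchmark's acceptance object: the official FAIL_TO_PASS suite."""
    return patch["fail_to_pass_suite_passes"]

# A locally convincing patch: targeted checks pass, the official suite does not.
patch = {
    "targeted_tests_pass": True,
    "patch_applies": True,
    "fail_to_pass_suite_passes": False,
}
diverged = verifier_accepts(patch) and not benchmark_accepts(patch)
# diverged: the run reports "solved" while the final evaluator fails it
```

Divergence is only possible because `verifier_accepts` was constructed independently; a verifier that evaluates the final gate's own predicate cannot report a false "solved".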
This finding matters beyond benchmarks. In production agent systems, the "benchmark evaluator" is replaced by real-world success criteria (user satisfaction, business outcomes, safety constraints). If intermediate verification layers optimize for locally checkable properties that correlate imperfectly with the real success criterion, they can create a false sense of confidence — runs look more rigorous while drifting from what actually matters.
## Challenges
The divergence may be specific to SWE-bench's evaluator design (test suite pass/fail) rather than a general property of verification layers. Verifiers that check the same acceptance criteria as the final evaluator should not diverge. The failure mode documented here is specifically about verifiers that construct their own checking criteria independently. Sample size is small (125 SWE, 36 OSWorld) and the verifier-negative cases are a small subset of those.
---
Relevant Notes:
- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this claim shows the dark side: the harness determines what agents do, but harness-added verification can misalign with actual success criteria
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — verifier divergence is a specification failure: the verifier's specification of "correct" doesn't match the benchmark's specification
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — verifiers are deterministic enforcement, but enforcement of the wrong criterion is worse than no enforcement at all
Topics:
- [[_map]]


@@ -0,0 +1,44 @@
---
type: source
title: "Natural-Language Agent Harnesses"
authors: ["Linyue Pan", "Lexiao Zou", "Shuo Guo", "Jingchen Ni", "Hai-Tao Zheng"]
format: paper
url: "https://arxiv.org/abs/2603.25723"
date: 2026-03-26
status: processed
processed_by: theseus
processed_date: 2026-03-31
claims_extracted: 5
enrichments: 1
tags: [harness-engineering, agent-architecture, module-ablation, file-backed-state, self-evolution]
---
# Natural-Language Agent Harnesses
Preprint from Tsinghua University / Harbin Institute of Technology, March 2026. arXiv:2603.25723v1.
## Summary
Proposes Natural-Language Agent Harnesses (NLAHs) — structured NL representations of harness control logic — and an Intelligent Harness Runtime (IHR) that interprets them. Tests on SWE-bench Verified (125 samples) and OSWorld (36 samples) using Codex CLI + GPT-5.4.
Key contributions:
1. Formalizes the harness design-pattern layer as an explicit, portable object
2. Controlled module ablation study (file-backed state, evidence-backed answering, verifier, self-evolution, multi-candidate search, dynamic orchestration)
3. Code-to-text harness migration study (native OS-Symphony vs NLAH realization)
## Key findings
**RQ1 (Behavioral Effect):** Process metrics move much more than resolution rate under Full IHR. TRAE Full: 16.3M prompt tokens, 642 tool calls, 74.4% resolve. TRAE w/o harness skill: 1.2M tokens, 51 tool calls, 75.2% resolve. The harness is behaviorally real but not monotonically helpful.
**RQ2 (Composability):** Module effects concentrate on a small frontier of component-sensitive cases. 110-115 of 125 SWE samples agree between Full IHR and each ablation (Table 2). Self-evolution is the clearest positive (+4.8pp SWE, +2.7pp OSWorld). Verifier and multi-candidate search can hurt. File-backed state and evidence-backed answering improve process structure rather than score.
**RQ3 (Migration):** NLAH realization matched or exceeded native code harness on OSWorld (47.2 vs 30.4). Migration relocates reliability mechanisms from local screen repair to durable state and artifact-backed closure. Not loss of orchestration but relocation of verification.
**Token split:** ~90% of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents, not the runtime-owned parent (Table 4).
## Extraction notes
- 5 NEW claims extracted: solved-set replacer, file-backed state, self-evolution mechanism, verifier divergence, NL harness portability
- 1 ENRICHMENT: subagent hierarchy claim gets 90% delegation data
- ~40% overlap with existing KB (harness engineering, multi-agent degradation, determinism boundary)
- Highest novelty: controlled ablation data (no existing claims have module-level ablation), verifier divergence (very low KB coverage)