m3taversal 607f9ed52e theseus: extract 5 claims + 1 enrichment from Pan et al. NLAH paper

- What: 5 NEW claims (solved-set replacer, file-backed durable state,
  self-evolution as acceptance-gating, verifier acceptance divergence,
  NL harness portability) + 1 enrichment (subagent hierarchy delegation data)
- Why: First controlled ablation study of harness modules (arXiv:2603.25723).
  Fills gap — no existing claims have module-level ablation data.
- Pre-screening: ~40% overlap with existing KB. All novel claims fill genuine gaps.
- Claim 5 title softened per Leo review: "without degradation" (conservative)
  rather than "without performance loss" (understates the gain).

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>

2026-03-31 10:32:25 +01:00


Natural-Language Agent Harnesses

Preprint from Tsinghua University / Harbin Institute of Technology, March 2026. arXiv:2603.25723v1.

Summary

Proposes Natural-Language Agent Harnesses (NLAHs) — structured NL representations of harness control logic — and an Intelligent Harness Runtime (IHR) that interprets them. Tests on SWE-bench Verified (125 samples) and OSWorld (36 samples) using Codex CLI + GPT-5.4.

Key contributions:

- Formalizes the harness design-pattern layer as an explicit, portable object
- Controlled module ablation study (file-backed state, evidence-backed answering, verifier, self-evolution, multi-candidate search, dynamic orchestration)
- Code-to-text harness migration study (native OS-Symphony vs NLAH realization)

Key findings

RQ1 (Behavioral Effect): Process metrics move much more than resolution rate under Full IHR. TRAE Full: 16.3M prompt tokens, 642 tool calls, 74.4% resolve. TRAE w/o harness skill: 1.2M tokens, 51 tool calls, 75.2% resolve. The harness is behaviorally real but not monotonically helpful.
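The gap between process metrics and resolution rate in RQ1 can be made concrete with a little arithmetic. The sketch below uses only the TRAE figures quoted above; the dictionary and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in RQ1 for TRAE; the names below are hypothetical, chosen for illustration.
full_ihr = {"prompt_tokens": 16.3e6, "tool_calls": 642, "resolve_pct": 74.4}
no_harness_skill = {"prompt_tokens": 1.2e6, "tool_calls": 51, "resolve_pct": 75.2}

# How much more process the Full IHR run consumes.
token_ratio = full_ihr["prompt_tokens"] / no_harness_skill["prompt_tokens"]
call_ratio = full_ihr["tool_calls"] / no_harness_skill["tool_calls"]

# How much the outcome moves, in percentage points.
resolve_delta_pp = full_ihr["resolve_pct"] - no_harness_skill["resolve_pct"]

print(f"prompt tokens: {token_ratio:.1f}x")       # prompt tokens: 13.6x
print(f"tool calls:    {call_ratio:.1f}x")        # tool calls:    12.6x
print(f"resolve rate:  {resolve_delta_pp:+.1f}pp")  # resolve rate:  -0.8pp
```

Roughly 13x more prompt tokens and tool calls accompany a resolve rate that is 0.8pp lower, which is the sense in which the harness is "behaviorally real but not monotonically helpful".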

RQ2 (Composability): Module effects concentrate on a small frontier of component-sensitive cases. 110-115 of 125 SWE samples agree between Full IHR and each ablation (Table 2). Self-evolution is the clearest positive (+4.8pp SWE, +2.7pp OSWorld). Verifier and multi-candidate search can hurt. File-backed state and evidence-backed answering improve process structure rather than score.
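To gauge how small the component-sensitive frontier is, the RQ2 figures can be converted into agreement percentages and approximate sample counts. This is a rough sketch: it assumes the percentage-point deltas are computed over the sample counts stated in the Summary (125 SWE-bench Verified, 36 OSWorld), which these notes do not confirm, and the function names are hypothetical.

```python
SWE_SAMPLES = 125      # SWE-bench Verified subset size, from the Summary
OSWORLD_SAMPLES = 36   # OSWorld subset size, from the Summary


def agreement_pct(agreeing: int, total: int) -> float:
    """Share of samples where Full IHR and an ablation reach the same outcome."""
    return 100.0 * agreeing / total


def pp_to_samples(delta_pp: float, total: int) -> float:
    """Convert a percentage-point delta into an equivalent number of samples.

    Assumption: the delta was measured over `total` samples.
    """
    return delta_pp / 100.0 * total


agree_low = agreement_pct(110, SWE_SAMPLES)
agree_high = agreement_pct(115, SWE_SAMPLES)
swe_gain = pp_to_samples(4.8, SWE_SAMPLES)
osworld_gain = pp_to_samples(2.7, OSWORLD_SAMPLES)

print(f"agreement: {agree_low:.0f}% to {agree_high:.0f}%")   # agreement: 88% to 92%
print(f"self-evolution on SWE: ~{swe_gain:.0f} samples")      # ~6 samples
print(f"self-evolution on OSWorld: ~{osworld_gain:.0f} sample")  # ~1 sample
```

Under that assumption, even the clearest positive module shifts only a handful of samples, so the deltas should be read with the small subset sizes in mind.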

RQ3 (Migration): The NLAH realization exceeded the native code harness on OSWorld (47.2 vs 30.4). Migration relocates reliability mechanisms from local screen repair to durable state and artifact-backed closure: the effect is not a loss of orchestration but a relocation of verification.

Token split: ~90% of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents, not the runtime-owned parent (Table 4).

Extraction notes

- 5 NEW claims extracted: solved-set replacer, file-backed state, self-evolution mechanism, verifier divergence, NL harness portability
- 1 ENRICHMENT: subagent hierarchy claim gets 90% delegation data
- ~40% overlap with existing KB (harness engineering, multi-agent degradation, determinism boundary)
- Highest novelty: controlled ablation data (no existing claims have module-level ablation), verifier divergence (very low KB coverage)
