---
type: source
title: "Natural-Language Agent Harnesses"
authors: ["Linyue Pan", "Lexiao Zou", "Shuo Guo", "Jingchen Ni", "Hai-Tao Zheng"]
format: paper
url: "https://arxiv.org/abs/2603.25723"
date: 2026-03-26
status: processed
processed_by: theseus
processed_date: 2026-03-31
claims_extracted: 5
enrichments: 1
tags: [harness-engineering, agent-architecture, module-ablation, file-backed-state, self-evolution]
---

# Natural-Language Agent Harnesses

Preprint from Tsinghua University / Harbin Institute of Technology, March 2026. arXiv:2603.25723v1.

## Summary

Proposes Natural-Language Agent Harnesses (NLAHs), structured NL representations of harness control logic, and an Intelligent Harness Runtime (IHR) that interprets them. Evaluates on SWE-bench Verified (125 samples) and OSWorld (36 samples) using Codex CLI + GPT-5.4.

Key contributions:
1. Formalizes the harness design-pattern layer as an explicit, portable object
2. Controlled module ablation study (file-backed state, evidence-backed answering, verifier, self-evolution, multi-candidate search, dynamic orchestration)
3. Code-to-text harness migration study (native OS-Symphony vs NLAH realization)

## Key findings

**RQ1 (Behavioral Effect):** Under Full IHR, process metrics move much more than resolution rate. TRAE Full: 16.3M prompt tokens, 642 tool calls, 74.4% resolve. TRAE w/o harness skill: 1.2M tokens, 51 tool calls, 75.2% resolve. The harness is behaviorally real but not monotonically helpful.

**RQ2 (Composability):** Module effects concentrate on a small frontier of component-sensitive cases. Full IHR and each ablation agree on 110-115 of 125 SWE samples (Table 2). Self-evolution is the clearest positive (+4.8pp SWE, +2.7pp OSWorld). Verifier and multi-candidate search can hurt. File-backed state and evidence-backed answering improve process structure rather than score.

**RQ3 (Migration):** The NLAH realization matched or exceeded the native code harness on OSWorld (47.2 vs 30.4). Migration relocates reliability mechanisms from local screen repair to durable state and artifact-backed closure. This is a relocation of verification, not a loss of orchestration.

**Token split:** ~90% of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents, not the runtime-owned parent (Table 4).
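
The RQ1 and RQ2 figures above can be put on a common scale with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the numbers are copied from the notes above, and it assumes that both token counts are comparable (the note labels only the Full figure as prompt tokens).

```python
# Figures copied from the RQ1/RQ2 notes above; names are illustrative.
# Assumption: both token counts measure the same quantity.
full = {"tokens": 16.3e6, "tool_calls": 642, "resolve_pct": 74.4}
no_skill = {"tokens": 1.2e6, "tool_calls": 51, "resolve_pct": 75.2}

# Process metrics: how much more work Full IHR does per run.
token_ratio = full["tokens"] / no_skill["tokens"]
tool_call_ratio = full["tool_calls"] / no_skill["tool_calls"]

# Outcome metric: change in resolve rate, in percentage points.
resolve_delta_pp = round(full["resolve_pct"] - no_skill["resolve_pct"], 1)

# RQ2: share of the 125 SWE samples on which Full IHR and an ablation agree.
SWE_SAMPLES = 125
agreement_low = 110 / SWE_SAMPLES
agreement_high = 115 / SWE_SAMPLES

print(f"tokens: {token_ratio:.1f}x, tool calls: {tool_call_ratio:.1f}x")
print(f"resolve delta: {resolve_delta_pp:+.1f}pp")
print(f"agreement: {agreement_low:.0%} to {agreement_high:.0%}")
```

On these numbers, Full IHR uses roughly 13.6x the tokens and 12.6x the tool calls for a 0.8pp lower resolve rate, and only 10 to 15 of the 125 SWE samples differ between Full IHR and any single ablation.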

## Extraction notes

- 5 NEW claims extracted: solved-set replacer, file-backed state, self-evolution mechanism, verifier divergence, NL harness portability
- 1 ENRICHMENT: subagent hierarchy claim gets 90% delegation data
- ~40% overlap with existing KB (harness engineering, multi-agent degradation, determinism boundary)
- Highest novelty: controlled ablation data (no existing claims have module-level ablation), verifier divergence (very low KB coverage)