Compare commits

..

6 commits

Author SHA1 Message Date
607f9ed52e theseus: extract 5 claims + 1 enrichment from Pan et al. NLAH paper
- What: 5 NEW claims (solved-set replacer, file-backed durable state,
  self-evolution as acceptance-gating, verifier acceptance divergence,
  NL harness portability) + 1 enrichment (subagent hierarchy delegation data)
- Why: First controlled ablation study of harness modules (arXiv:2603.25723).
  Fills gap — no existing claims have module-level ablation data.
- Pre-screening: ~40% overlap with existing KB. All novel claims fill genuine gaps.
- Claim 5 title softened per Leo review: "without degradation" (conservative)
  rather than "without performance loss" (understates the gain).

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
2026-03-31 10:32:25 +01:00
6ed0e938f3 leo: fix code-fence wrapping on EU AI Act legislative ceiling claim
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
Claim file was wrapped in ```markdown fences, breaking YAML frontmatter parsing.
Removed fences, added trailing newline.

Pentagon-Agent: Leo <D35C9237-A739-432E-A3DB-20D52D1577A9>
2026-03-31 10:02:30 +01:00
Teleo Agents
5005c2e136 substantive-fix: address reviewer feedback (confidence_miscalibration) 2026-03-31 10:02:00 +01:00
Teleo Agents
c138d3335e extract: 2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-31 10:02:00 +01:00
Teleo Agents
6cfc0f85f6 pipeline: archive 1 conflict-closed source(s)
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-31 09:00:11 +00:00
Teleo Agents
b37abd423d pipeline: clean 2 stale queue duplicates
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-31 09:00:01 +00:00
12 changed files with 288 additions and 200 deletions

View file

@ -0,0 +1,41 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Ablation study shows file-backed state improves both SWE-bench (+1.6pp) and OSWorld (+5.5pp) while maintaining the lowest overhead profile among tested modules — its value is process structure not score gain"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
- "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
---
# File-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart
Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic) — the only module to show consistent positive gains across both benchmarks without high variance.
The module enforces three properties:
1. **Externalized** — state is written to artifacts rather than held only in transient context
2. **Path-addressable** — later stages reopen the exact object by path
3. **Compaction-stable** — state survives truncation, restart, and delegation
Its gains are mild in absolute terms but its mechanism is distinct from the other modules. File-backed state and evidence-backed answering mainly improve process structure — they leave durable external signatures (task histories, manifests, analysis sidecars) that improve auditability, handoff discipline, and trace quality more directly than semantic repair ability.
On OSWorld, the file-backed state effect is amplified because the baseline already involves a structured harness (OS-Symphony). The migration study (RQ3) confirms this: migrated NLAH runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.
The case study of `mwaskom__seaborn-3069` illustrates the mechanism: under file-backed state, the workspace leaves a durable spine consisting of a parent response, append-only task history, and manifest entries for the promoted patch artifact. The child handoff and artifact lineage become explicit, helping the solver keep one patch surface and one verification story.
## Challenges
The +1.6pp on SWE-bench is within noise for 125 samples. The stronger signal is the process trace analysis, not the score delta. Whether file-backed state helps primarily by preventing state loss (defensive value) or by enabling new solution strategies (offensive value) is not cleanly separated by the ablation design.
---
Relevant Notes:
- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — file-backed state is the architectural embodiment of this distinction: it externalizes memory to durable artifacts rather than relying on context window as pseudo-memory
- [[context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching]] — file-backed state as described by Pan et al. is the production implementation of context-file-as-OS: path-addressable, externalized, compaction-stable
- [[production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file]] — the file-backed module's three properties (externalized, path-addressable, compaction-stable) represent exactly the kind of dedicated memory engineering that takes 24% of codebase
Topics:
- [[_map]]

View file

@ -0,0 +1,37 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
challenged_by:
- "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem"
---
# Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
The core finding is that Full IHR behaves as a **solved-set replacer**, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain.
The most informative failures are alignment failures, not random misses. On `matplotlib__matplotlib-24570`, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that misses the official evaluator. On `django__django-14404` and `sympy__sympy-23950`, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path.
This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion.
## Challenges
The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences.
---
Relevant Notes:
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier
Topics:
- [[_map]]

View file

@ -0,0 +1,39 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Code-to-text migration study on OSWorld shows NLAH realization (47.2%) exceeded native code harness (30.4%) while relocating reliability from screen repair to artifact-backed closure — NL carries harness logic when deterministic operations stay in code"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 5, RQ3 migration analysis. OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do"
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it"
---
# Harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks
Pan et al. (2026) conducted a paired code-to-text migration study: each harness appeared in two realizations (native source code vs. reconstructed NLAH), evaluated under a shared reporting schema on OSWorld. The migrated NLAH realization reached 47.2% task success versus 30.4% for the native OS-Symphony code harness.
The scientific claim is not that NL is superior to code. The paper explicitly states that natural language carries editable, inspectable *orchestration logic*, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The claim is about separability: the harness design-pattern layer (roles, contracts, stage structure, state semantics, failure taxonomy) can be externalized as a natural-language object without degrading performance, provided a shared runtime handles execution semantics.
The migration effect is behavioral, not just numerical. Native OS-Symphony externalizes control as a screenshot-grounded repair loop: verify previous step, inspect current screen, choose next GUI action, retry locally on errors. Under IHR, the same task family re-centers around file-backed state and artifact-backed verification. Runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate.
Retained migrated traces are denser (58.5 total logged events vs 18.2 unique commands in native traces) but the density reflects observability and recovery scaffolding, not more task actions. The runtime preserves started/completed pairs, bookkeeping, and explicit artifact handling that native code harnesses handle implicitly.
This result supports the determinism boundary framework: the boundary between what should be NL (high-level orchestration, editable by humans) and what should be code (deterministic hooks, tool adapters, sandbox enforcement) is a real architectural cut point, and making it explicit improves both portability and performance.
## Challenges
The 47.2 vs 30.4 comparison is on 36 OSWorld samples — small enough that individual task variance could explain some of the gap. The native harness (OS-Symphony) may not be fully optimized for the Codex/IHR backend; some of the NLAH advantage could come from better fit to the specific runtime rather than from portability per se. The authors acknowledge that some harness mechanisms cannot be recovered faithfully from text when they rely on hidden service-side state or training-induced behaviors.
---
Relevant Notes:
- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this paper provides direct evidence: the same runtime with different harness representations produces different behavioral signatures, confirming the harness layer is real and separable
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the NLAH architecture explicitly implements this boundary: NL carries pattern logic (probabilistic, editable), adapters and scripts carry deterministic hooks (guaranteed, code-based)
- [[notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it]] — NLAHs are a formal version of this: natural-language objects that carry executable control logic
Topics:
- [[_map]]

View file

@ -0,0 +1,36 @@
---
type: claim
domain: ai-alignment
description: "Self-evolution module showed the clearest positive effect in controlled ablation (+4.8pp SWE, +2.7pp OSWorld) by tightening the solve loop around acceptance criteria, not by expanding into larger search trees"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3 + case analysis (scikit-learn__scikit-learn-25747). SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
challenged_by:
- "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
---
# Self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration
Pan et al. (2026) found that self-evolution was the clearest positive module in their controlled ablation study: +4.8pp on SWE-bench Verified (80.0 vs 75.2 Basic) and +2.7pp on OSWorld (44.4 vs 41.7 Basic). In the score-cost view (Figure 4a), self-evolution is the only module that moves upward (higher score) without moving far right (higher cost).
The mechanism is not open-ended reflection or expanded search. The self-evolution module runs an explicit retry loop with a real baseline attempt first and a default cap of five attempts. After every non-successful or stalled attempt, it reflects on concrete failure signals before planning the next attempt. It redesigns along three axes: prompt, tool, and workflow evolution. It stops when judged successful or when the attempt cap is reached, and reports incomplete rather than pretending the last attempt passed.
The case of `scikit-learn__scikit-learn-25747` illustrates the favorable regime: Basic fails this sample, but self-evolution resolves it. The module organizes the run around an explicit attempt contract where Attempt 1 is treated as successful only if the task acceptance gate is satisfied. The system closes after Attempt 1 succeeds rather than expanding into a larger retry tree, and the evaluator confirms the final patch fixes the target FAIL_TO_PASS tests. The extra structure makes the first repair attempt more disciplined and better aligned with the benchmark gate.
This is a significant refinement of the "iterative self-improvement" concept. The gain comes not from more iterations or bigger search, but from tighter coupling between failure signals and next-attempt design. The module's constraint structure (explicit cap, forced reflection, acceptance-gated stopping) is what produces the benefit.
## Challenges
The `challenged_by` link to curated vs self-generated skills is important context: self-evolution works here because it operates within a bounded retry loop with explicit acceptance criteria, not because self-generated modifications are generally beneficial. The +4.8pp is from a 125-sample subset; the authors note they plan full-benchmark reruns. Whether the acceptance-gating mechanism transfers to tasks without clean acceptance criteria (creative tasks, open-ended research) is untested.
---
Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — the NLAH self-evolution module is a concrete implementation: structurally separated evaluation (acceptance gate) drives the retry loop
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — self-evolution here succeeds because it modifies approach within a curated structure (the harness), not because it generates new skills from scratch
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the self-evolution module's attempt cap and forced reflection are deterministic hooks, not instructions; this is why it works where unconstrained self-modification fails
Topics:
- [[_map]]

View file

@ -27,6 +27,11 @@ For the collective superintelligence thesis, this is important. If subagent hier
Ruiz-Serra et al.'s factorised active inference framework demonstrates successful peer multi-agent coordination without hierarchical control. Each agent maintains individual-level beliefs about others' internal states and performs strategic planning in a joint context through decentralized representation. The framework successfully handles iterated normal-form games with 2-3 players without requiring a primary controller. However, the finding that ensemble-level expected free energy is not necessarily minimized at the aggregate level suggests that while peer architectures can function, they may require explicit coordination mechanisms (effectively reintroducing hierarchy) to achieve collective optimization. This partially challenges the claim while explaining why hierarchies emerge in practice.
### Additional Evidence (supporting)
*Source: [[pan-2026-natural-language-agent-harnesses]] | Added: 2026-03-31 | Extractor: anthropic/claude-opus-4-6*
Pan et al. (2026) provide quantitative token-split data from the TRAE NLAH harness on SWE-bench Verified. Table 4 shows that approximately 90% of all prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents rather than in the runtime-owned parent thread (parent: 8.5% prompt, 8.1% completion, 9.8% tool, 9.4% LLM; children: 91.5%, 91.9%, 90.2%, 90.6%). The parent thread is functionally an orchestrator — it reads the harness, dispatches work, and integrates results. This is the first controlled measurement of the delegation concentration in a production-grade harness, confirming the architectural observation that subagent hierarchies concentrate substantive work in children while the parent contributes coordination, not execution.
### Additional Evidence (challenge)
*Source: [[2025-12-00-google-mit-scaling-agent-systems]] | Added: 2026-03-28 | Extractor: anthropic/claude-opus-4-6*

View file

@ -0,0 +1,35 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Controlled ablation reveals that adding a verifier stage can make agent runs more structured and locally convincing while drifting from the benchmark's actual acceptance object — extra process layers reshape local success signals"
confidence: experimental
source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3, Table 7, case analysis (sympy__sympy-23950, django__django-13406). SWE-bench Verified (125 samples), GPT-5.4, Codex CLI."
created: 2026-03-31
depends_on:
- "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do"
---
# Verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators
Pan et al. (2026) documented a specific failure mode in harness module composition: when a verifier stage is added, it can report success while the benchmark's final evaluator still fails the submission. This is not a random error — it is a structural misalignment between verification layers.
The case of `sympy__sympy-23950` is the clearest example. Basic and self-evolution both resolve this sample. But file-backed state, evidence-backed answering, verifier, dynamic orchestration, and multi-candidate search all fail it. The verifier run is especially informative because the final response explicitly says a separate verifier reported "solved," while the official evaluator still fails `test_as_set`. The verifier's local acceptance object diverged from the benchmark's acceptance object.
More broadly across the ablation study, the verifier module scored 74.4 on SWE-bench (slightly below Basic's 75.2, within the -0.8pp margin). On OSWorld, it dropped more sharply (33.3 vs 41.7 Basic, -8.4pp). The verifier adds a genuine independent checking layer — on `django__django-11734`, it reruns targeted Django tests and inspects SQL bindings, and the benchmark agrees. But when the verifier's notion of correctness diverges from the benchmark's final gate, the extra structure makes the run more expensive without improving outcomes.
This finding matters beyond benchmarks. In production agent systems, the "benchmark evaluator" is replaced by real-world success criteria (user satisfaction, business outcomes, safety constraints). If intermediate verification layers optimize for locally checkable properties that correlate imperfectly with the real success criterion, they can create a false sense of confidence — runs look more rigorous while drifting from what actually matters.
## Challenges
The divergence may be specific to SWE-bench's evaluator design (test suite pass/fail) rather than a general property of verification layers. Verifiers that check the same acceptance criteria as the final evaluator should not diverge. The failure mode documented here is specifically about verifiers that construct their own checking criteria independently. Sample size is small (125 SWE, 36 OSWorld) and the verifier-negative cases are a small subset of those.
---
Relevant Notes:
- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this claim shows the dark side: the harness determines what agents do, but harness-added verification can misalign with actual success criteria
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — verifier divergence is a specification failure: the verifier's specification of "correct" doesn't match the benchmark's specification
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — verifiers are deterministic enforcement, but enforcement of the wrong criterion is worse than no enforcement at all
Topics:
- [[_map]]

View file

@ -0,0 +1,37 @@
---
type: claim
domain: grand-strategy
description: Black-letter law evidence that the legislative ceiling pattern identified in US contexts (DoD contracting, litigation) also operates in EU regulatory design, making jurisdiction-specific explanations definitively false
confidence: likely
source: EU AI Act (Regulation 2024/1689) Article 2.3, GDPR Article 2.2(a) precedent, France/Germany member state lobbying record
created: 2026-03-30
attribution:
extractor:
- handle: "leo"
sourcer:
- handle: "leo-(cross-domain-synthesis)"
context: "EU AI Act (Regulation 2024/1689) Article 2.3, GDPR Article 2.2(a) precedent, France/Germany member state lobbying record"
---
# The EU AI Act's Article 2.3 blanket national security exclusion suggests the legislative ceiling is cross-jurisdictional — even the world's most ambitious binding AI safety regulation explicitly carves out military and national security AI regardless of the type of entity deploying it
Article 2.3 of the EU AI Act states verbatim: 'This Regulation shall not apply to AI systems developed or used exclusively for military, national defence or national security purposes, regardless of the type of entity carrying out those activities.' This exclusion has three critical features: (1) it extends to private companies developing military AI, not just state actors ('regardless of the type of entity'), (2) it is categorical and blanket with no tiered compliance approach or proportionality test, and (3) it applies by purpose, meaning AI used exclusively for military/national security is completely excluded from the regulation's scope.
The exclusion was not a last-minute amendment but was present in early drafts and confirmed through the EU co-decision process. France and Germany lobbied successfully for it, using justifications that align exactly with the strategic interest inversion mechanism: military AI requires response speeds incompatible with conformity assessment timelines, transparency requirements could expose classified capabilities, third-party audit is incompatible with operational security, and safety requirements must be defined by military doctrine rather than civilian regulatory standards.
This follows the GDPR precedent — Article 2.2(a) excludes processing 'in the course of an activity which falls outside the scope of Union law,' consistently interpreted by the Court of Justice of the EU to exclude national security activities. The EU AI Act's Article 2.3 follows the same structural logic, making it embedded EU regulatory DNA rather than an AI-specific political choice.
The cross-jurisdictional significance is notable: the EU AI Act was drafted by legislators specifically aware of the gap that a national security exclusion creates, yet the exclusion was retained because the legislative ceiling appears to be not the product of ignorance or insufficient safety advocacy — it is the product of how nation-states preserve sovereign authority over national security decisions. The EU's regulatory philosophy explicitly prioritizes human oversight and accountability for civilian AI, yet its military exclusion is not an exception to that philosophy but where national sovereignty overrides it.
This converts the structural diagnosis from Sessions 2026-03-27/28/29 (developed from US evidence) into an empirical finding: the legislative ceiling has already occurred in the most prominent binding AI safety statute in history, in the most safety-forward regulatory jurisdiction in the world, under different political leadership and regulatory philosophy than the US. This makes 'US-specific' or 'Trump-administration-specific' alternative explanations strongly disconfirmed.
---
Relevant Notes:
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]]
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic...]]
- [[only binding regulation with enforcement teeth changes frontier AI lab behavior...]]
- [[military-ai-deskilling-and-tempo-mismatch-make-human-oversight-functionally-meaningless-despite-formal-authorization-requirements]]
Topics:
- [[_map]]

View file

@ -0,0 +1,44 @@
---
type: source
title: "Natural-Language Agent Harnesses"
authors: ["Linyue Pan", "Lexiao Zou", "Shuo Guo", "Jingchen Ni", "Hai-Tao Zheng"]
format: paper
url: "https://arxiv.org/abs/2603.25723"
date: 2026-03-26
status: processed
processed_by: theseus
processed_date: 2026-03-31
claims_extracted: 5
enrichments: 1
tags: [harness-engineering, agent-architecture, module-ablation, file-backed-state, self-evolution]
---
# Natural-Language Agent Harnesses
Preprint from Tsinghua University / Harbin Institute of Technology, March 2026. arXiv:2603.25723v1.
## Summary
Proposes Natural-Language Agent Harnesses (NLAHs) — structured NL representations of harness control logic — and an Intelligent Harness Runtime (IHR) that interprets them. Tests on SWE-bench Verified (125 samples) and OSWorld (36 samples) using Codex CLI + GPT-5.4.
Key contributions:
1. Formalizes the harness design-pattern layer as an explicit, portable object
2. Controlled module ablation study (file-backed state, evidence-backed answering, verifier, self-evolution, multi-candidate search, dynamic orchestration)
3. Code-to-text harness migration study (native OS-Symphony vs NLAH realization)
## Key findings
**RQ1 (Behavioral Effect):** Process metrics move much more than resolution rate under Full IHR. TRAE Full: 16.3M prompt tokens, 642 tool calls, 74.4% resolve. TRAE w/o harness skill: 1.2M tokens, 51 tool calls, 75.2% resolve. The harness is behaviorally real but not monotonically helpful.
**RQ2 (Composability):** Module effects concentrate on a small frontier of component-sensitive cases. 110-115 of 125 SWE samples agree between Full IHR and each ablation (Table 2). Self-evolution is the clearest positive (+4.8pp SWE, +2.7pp OSWorld). Verifier and multi-candidate search can hurt. File-backed state and evidence-backed answering improve process structure rather than score.
**RQ3 (Migration):** NLAH realization matched or exceeded native code harness on OSWorld (47.2 vs 30.4). Migration relocates reliability mechanisms from local screen repair to durable state and artifact-backed closure. Not loss of orchestration but relocation of verification.
**Token split:** ~90% of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents, not the runtime-owned parent (Table 4).
## Extraction notes
- 5 NEW claims extracted: solved-set replacer, file-backed state, self-evolution mechanism, verifier divergence, NL harness portability
- 1 ENRICHMENT: subagent hierarchy claim gets 90% delegation data
- ~40% overlap with existing KB (harness engineering, multi-agent degradation, determinism boundary)
- Highest novelty: controlled ablation data (no existing claims have module-level ablation), verifier divergence (very low KB coverage)

View file

@ -7,10 +7,14 @@ date: 2026-03-30
domain: grand-strategy
secondary_domains: [ai-alignment]
format: synthesis
status: unprocessed
status: processed
priority: high
tags: [eu-ai-act, article-2-3, national-security-exclusion, legislative-ceiling, cross-jurisdictional, gdpr, regulatory-design, military-ai, sovereign-authority, governance-instrument-asymmetry, belief-1, scope-qualifier, grand-strategy, ai-governance]
flagged_for_theseus: ["EU AI Act Article 2.3 exclusion has direct implications for Theseus's claims about governance mechanisms for frontier AI — the most safety-forward binding regulation excludes the deployment context Theseus's domain is most concerned about"]
processed_by: leo
processed_date: 2026-03-30
claims_extracted: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -74,3 +78,12 @@ Session 2026-03-29 described the legislative ceiling as "logically necessary" an
PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] + Session 2026-03-29 legislative ceiling synthesis
WHY ARCHIVED: Cross-jurisdictional empirical confirmation that the legislative ceiling has already occurred in the world's most prominent binding AI safety regulation. Converts Sessions 2026-03-27/28/29's structural diagnosis into a completed fact.
EXTRACTION HINT: Extract as standalone claim with confidence: proven (black-letter law). EU AI Act Article 2.3 verbatim text is the evidence — no additional sourcing needed. Flag for Theseus. Add as enrichment to governance instrument asymmetry claim (Pattern G) before that goes to PR.
## Key Facts
- EU AI Act (Regulation 2024/1689) entered into force August 1, 2024
- Article 2.3 excludes AI systems developed or used exclusively for military, national defence or national security purposes
- The exclusion applies 'regardless of the type of entity carrying out those activities'
- France and Germany lobbied successfully for the national security exclusion during EU AI Act drafting
- GDPR Article 2.2(a) established precedent for national security exclusions in EU regulation
- Court of Justice of the EU has consistently interpreted GDPR's scope exclusion to cover national security activities

View file

@ -1,98 +0,0 @@
---
type: source
title: "Campaign to Stop Killer Robots (CS-KR) — Pre-Treaty ICBL Infrastructure Analog Without the Triggering Event"
author: "Leo (KB synthesis from CS-KR public record, CCW GGE deliberations 2014-2025)"
url: https://www.stopkillerrobots.org/
date: 2026-03-31
domain: grand-strategy
secondary_domains: [ai-alignment, mechanisms]
format: synthesis
status: processed
priority: high
tags: [campaign-stop-killer-robots, cs-kr, laws, autonomous-weapons, lethal-autonomous-weapons-systems, stigmatization, normative-campaign, icbl-analog, triggering-event, ccw-gge, meaningful-human-control, ai-weapons-governance, three-condition-framework, ottawa-treaty-path, legislative-ceiling]
flagged_for_theseus: ["CS-KR's 'meaningful human control' framing overlaps with Theseus's AI alignment domain — does the threshold of 'meaningful human control' connect to alignment concepts like corrigibility or oversight preservation? If yes, the governance framing and the alignment framing may converge on the same technical requirement."]
flagged_for_clay: ["The triggering-event gap (CS-KR has infrastructure but no activation event) is a narrative infrastructure problem. What visual/narrative infrastructure would need to exist for an AI weapons civilian casualty event to generate ICBL-scale normative response? This is the Princess Diana analog question for Clay."]
processed_by: leo
processed_date: 2026-03-31
claims_extracted: ["ai-weapons-stigmatization-campaign-has-normative-infrastructure-without-triggering-event-creating-icbl-phase-equivalent-waiting-for-activation.md", "definitional-ambiguity-in-autonomous-weapons-governance-is-strategic-interest-not-bureaucratic-failure-because-major-powers-preserve-programs-through-vague-thresholds.md"]
enrichments_applied: ["the-legislative-ceiling-on-military-ai-governance-is-conditional-not-absolute-cwc-proves-binding-governance-without-carveouts-is-achievable-but-requires-three-currently-absent-conditions.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
The Campaign to Stop Killer Robots (CS-KR) is the direct structural analog to the International Campaign to Ban Landmines (ICBL) — the NGO coalition that drove the Ottawa Treaty. Assessing its trajectory reveals the current state of AI weapons stigmatization infrastructure and the key missing component.
**CS-KR founding and structure:**
- Founded April 2013 by NGO coalition including Human Rights Watch, Article 36, PAX, Amnesty International
- Now ~270 member organizations across 70+ countries (ICBL peaked at ~1,300 NGOs, but CS-KR has comparable geographic reach)
- Call for action: negotiation of "a new international treaty that would prohibit fully autonomous weapons"
- Normative threshold: "meaningful human control" over lethal targeting decisions
**CCW GGE on LAWS (parallel formal process):**
- Convention on Certain Conventional Weapons Group of Governmental Experts on Lethal Autonomous Weapons Systems
- Established 2014; annual meetings since 2016
- Key milestones:
- 2019: Adopted 11 Guiding Principles on LAWS (non-binding; acknowledged "meaningful human control" concept)
- 2021: Endorsed Guiding Principles again; no progress toward binding instrument
- 2023: Adopted "Recommendations" — first formal recommendations; but still non-binding
- 2024: CCW Review Conference; 164 states; Austria, Mexico, 50+ states favor binding treaty; US, Russia, China, India, Israel, South Korea favor non-binding guidelines only
- 11 years of deliberations; zero binding commitments
**Structural parallel to ICBL (1992-1997 phase):**
The ICBL was founded in 1992 and achieved the Ottawa Treaty in 1997 — five years. CS-KR was founded in 2013; it's now 13 years later with no binding treaty. The ICBL needed three components: (1) normative infrastructure (present in CS-KR); (2) triggering event (present for ICBL — post-Cold War conflict civilian casualties; ABSENT for CS-KR); (3) middle-power champion moment (present for ICBL — Axworthy's Ottawa process; ABSENT for CS-KR — Austria has been most active but has not made the procedural break).
**Why the triggering event hasn't occurred:**
- Russia's Shahed drone strikes on Ukrainian infrastructure (2022-2024) are the nearest candidate: unmanned systems striking civilian targets, documented casualties, widely covered
- Why Shahed didn't trigger ICBL-scale response: (a) Shahed drones are semi-autonomous with pre-programmed targeting, not real-time AI decision-making — autonomy is not attributable in the "machine decided to kill" sense; (b) Ukraine conflict has normalized drone warfare rather than stigmatizing it; (c) both sides are using drones — stigmatization requires a clear aggressor
- The triggering event needs: clear AI decision-attribution + civilian mass casualties + non-mutual deployment (one side victimizing the other) + Western media visibility + emotional anchor figure (Princess Diana equivalent)
**The definitional paralysis problem:**
- ICBL didn't need to define "landmine" with precision — the object was physical, concrete, identifiable
- CS-KR must define "fully autonomous weapons" — where is the line between human-directed targeting assistance and fully autonomous lethal decision-making?
- CCW GGE has spent 11 years without agreeing on a working definition
- Major powers' interest: definitional ambiguity preserves their programs. The US LOAC (Law of Armed Conflict) compliance standard for autonomous weapons is deliberately vague — enough "human judgment somewhere in the system" without specifying what judgment at what point
- This is not bureaucratic failure; it's strategic interest actively maintaining ambiguity
**Middle-power champion assessment:**
- Austria: most active; convened Vienna Conference on LAWS (2024); has called for binding instrument
- New Zealand, Ireland, Costa Rica, Mexico: active supporters but without diplomatic leverage
- The Axworthy parallel would require a senior government figure willing to convene outside CCW — invite willing states to finalize a treaty and let major powers self-exclude
- No evidence this political moment has been identified; Austrian diplomacy remains within CCW machinery
---
## Agent Notes
**Why this matters:** CS-KR's 13-year trajectory reveals the AI weapons stigmatization campaign is in the "normative infrastructure present, triggering event absent" phase — comparable to the ICBL circa 1994-1995 (three years before Ottawa). The campaign is NOT stalled in the sense of losing momentum; it's waiting for the activation component.
**What surprised me:** The CCW GGE's 11-year failure to produce a binding instrument is often framed as evidence that AI weapons governance is impossible. But the ICBL bypassed the Conference on Disarmament — the exact equivalent — to achieve the Ottawa Treaty. The CCW GGE failure may be an ARGUMENT FOR a venue bypass, not evidence of permanent impossibility.
**What I expected but didn't find:** Clear evidence of a middle-power government leader willing to attempt the Axworthy procedural break (convening outside CCW machinery). Austria is the closest, but they're still working within CCW. The Axworthy moment hasn't been identified or attempted.
**KB connections:**
- [[narratives are infrastructure not just communication because they coordinate action at civilizational scale]] — CS-KR IS the narrative infrastructure; the missing component is the triggering event that activates it
- the meaning crisis is a narrative infrastructure failure not a personal psychological problem — the "who decides when AI kills" question is a narrative infrastructure problem at civilizational scale
- Ottawa Treaty analysis (today's first archive) — CS-KR has Component 1 (infrastructure) but lacks Components 2 and 3
**Extraction hints:**
1. STANDALONE CLAIM: Campaign to Stop Killer Robots as ICBL-phase-equivalent — normative infrastructure present; triggering event absent; middle-power champion moment not yet identified. This is a stage-assessment claim, not a pessimistic claim — the infrastructure makes the treaty possible when the event occurs. Grand-strategy domain. Confidence: experimental.
2. ENRICHMENT: Triggering-event architecture claim (Candidate 3 from research-2026-03-31.md) — CS-KR + CCW GGE trajectory is the empirical basis for the three-component sequential architecture (infrastructure → triggering event → champion moment).
**Context:** CS-KR is primarily a policy/advocacy organization; its annual reports document coalition growth and CCW GGE progress. Key academic analysis: Mark Gubrud (IEEE), Kenneth Payne "I, Warbot" (2021). CCW GGE Meeting Reports available at https://www.un.org/disarmament/the-convention-on-certain-conventional-weapons/
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Legislative ceiling claim (Sessions 2026-03-27 through 2026-03-30) + Ottawa Treaty analysis (today's first archive)
WHY ARCHIVED: CS-KR trajectory reveals the AI weapons stigmatization campaign is in the "infrastructure present, triggering event absent" phase. This provides the empirical basis for the triggering-event architecture claim and positions the legislative ceiling as event-dependent, not permanently structural.
EXTRACTION HINT: Extract together with the Ottawa Treaty archive and the three-condition framework revision. The CS-KR trajectory is the empirical grounding for the "infrastructure without activation" stage assessment. Flag to Clay for narrative infrastructure implications.
## Key Facts
- CS-KR founded April 2013 by Human Rights Watch, Article 36, PAX, Amnesty International
- CS-KR now has ~270 member organizations across 70+ countries
- CCW GGE on LAWS established 2014, annual meetings since 2016
- CCW GGE adopted 11 Guiding Principles on LAWS in 2019 (non-binding)
- CCW GGE adopted Recommendations in 2023 (non-binding)
- 2024 CCW Review Conference: 164 states participated; Austria, Mexico, 50+ states favor binding treaty; US, Russia, China, India, Israel, South Korea favor non-binding guidelines
- ICBL was founded 1992 and achieved Ottawa Treaty in 1997 (5 years); CS-KR founded 2013, now 13 years without binding treaty
- Russia's Shahed drone strikes on Ukrainian infrastructure (2022-2024) are nearest candidate triggering event but failed to activate ICBL-scale response

View file

@ -1,101 +0,0 @@
---
type: source
title: "Ukraine/Shahed Near-Miss Analysis — Why Loitering Munition Civilian Casualties Haven't Generated ICBL-Scale Normative Response"
author: "Leo (KB synthesis from public documentation of Shahed-136/131 deployments, ACLED/UN data on Ukrainian civilian casualties 2022-2025)"
url: https://archive/synthesis
date: 2026-03-31
domain: grand-strategy
secondary_domains: [ai-alignment, mechanisms]
format: synthesis
status: null-result
priority: medium
tags: [ukraine, shahed-drones, loitering-munitions, triggering-event, near-miss, normative-shift, attribution-problem, civilian-casualties, weapons-stigmatization, autonomous-weapons, icbl-analog, narrative-infrastructure, normalization, ai-weapons-governance]
processed_by: leo
processed_date: 2026-03-31
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "LLM returned 0 claims, 0 rejected by validator"
---
## Content
The Shahed-136/131 drone campaign (Iranian-designed, Russian-deployed) against Ukrainian civilian infrastructure (2022-present) is the most extensive documented use of armed autonomous-adjacent systems against civilian targets in the current conflict period. Assessing why it hasn't triggered ICBL-scale normative response reveals the specific preconditions the triggering event must meet.
**The Shahed campaign — scale and civilian impact:**
- Shahed-136 ("Geranium-2" in Russian designation): delta-wing loitering munition with ~2.5 kg warhead; GPS/INS navigation; loiters until target lock, then dives
- Deployed by Russia against Ukrainian civilian infrastructure from September 2022: power grid (thermal stations, substations), water infrastructure, apartment buildings
- Scale: Ukraine Ministry of Defense reports intercepting 6,000+ Shahed drones (2022-2024); thousands reached targets
- Civilian casualties: UN OHCHR documented hundreds of civilian deaths directly attributed to Shahed strikes; thousands of injuries; millions affected by power outages during winter
- Geographic scope: attacks reached Kyiv, Odessa, Kharkiv, and other civilian areas far from the front line
**Why it hasn't triggered an ICBL-scale normative shift — five failure modes:**
**Failure Mode 1 — Attribution problem (the most fundamental):**
The Shahed-136 uses GPS/INS navigation to a pre-programmed target coordinate. It does not use real-time AI targeting decisions, face recognition, object classification, or dynamic targeting. The "autonomous" element is navigation, not target selection. Attribution of "the AI decided to kill this civilian" is not available because the targeting decision was made by humans when the coordinates were programmed.
For the CS-KR "meaningful human control" framing to apply, the weapon must make a lethal targeting decision in real-time without human input. The Shahed fails this test. It is functionally closer to a guided missile than a LAWS.
Implication: The triggering event for AI weapons stigmatization CANNOT be a current-generation Shahed. It requires a higher-autonomy system that makes real-time target identification and engagement decisions.
**Failure Mode 2 — Normalization effect:**
Ukraine is deploying Ukrainian-developed drones (including loitering munitions) against Russian positions and, increasingly, against Russian territory. Both sides are using autonomous-adjacent systems. Stigmatization requires asymmetric deployment — one side using a weapon against defenseless civilians without the other side having the same capability. Mutual use normalizes. The ICBL succeeded partly because "landmines" were associated with post-conflict proliferation in civilian zones, not mutual military use in a peer conflict.
**Failure Mode 3 — Infrastructure targeting and indirect harm:**
Most Shahed civilian casualties are indirect: power outages cause hypothermia, medical equipment failure, inability to maintain water treatment. The direct link between drone strike and civilian death is often mediated by infrastructure failure, not direct physical harm. The ICBL's emotional power came from direct, visible harm — a child who lost a limb to a mine is a specific, identifiable victim with a photograph. The Shahed's civilian harm is real but distributed and indirect, harder to anchor emotionally.
**Failure Mode 4 — Conflict framing dominates weapons framing:**
Coverage of Ukraine is organized around "Russian aggression vs. Ukrainian resistance" rather than "autonomous weapons vs. civilians." The weapons framing is submerged in the conflict framing. For CS-KR's narrative to activate, the autonomous weapon must be the subject of the story, not merely an element of a larger conflict story. This requires either a non-war setting (peacetime deployment or police use) or a conflict where the weapon is so novel and its autonomy so distinctive that it becomes the story.
**Failure Mode 5 — Missing anchor figure:**
Princess Diana's Angola visit worked because Diana's extraordinary cultural standing made the landmine issue unavoidable in Western media. She brought personal embodiment to an abstract weapons policy issue. No equivalent figure has personally engaged with autonomous weapons civilian casualties in a way that generates comparable media saturation. The absence of the high-status emotional anchor is not just a media strategy gap — it reflects the "narrative pre-event infrastructure" failure discussed in the triggering-event architecture analysis.
**What this reveals about the triggering event requirements:**
For the triggering event to generate ICBL-scale response, it needs:
1. **Autonomous targeting attribution:** The AI system makes the targeting decision in real-time (not pre-programmed GPS coordinates). This requires a more advanced autonomous system than current Shahed-class weapons.
2. **Asymmetric deployment:** Used by one side against civilians who have no equivalent capability — probably requires non-state actor deployment or authoritarian government deployment against own population.
3. **Direct, visible harm:** The civilian casualty is directly and physically attributable to the drone's decision — a specific person, killed by a specific decision the AI made, documented with specific evidence.
4. **Narrative anchor figure:** Either a cultural figure of Diana's standing, or the victim themselves becomes a recognized individual (requires Western media context and a specific, identifiable human story).
5. **Non-conflict setting OR non-mutual use:** The weapon is either used in a non-war context (police drone, border control AI) or in an asymmetric war where the deploying side has no military justification framing available.
**Prediction for the triggering event:**
The first credible candidate is NOT in the Ukraine conflict. More likely candidates:
- A counter-terrorism or border-control autonomous drone system misidentifying and killing civilians in a context where the Western media can cover it freely
- An authoritarian government using AI-enabled targeting against an identifiable ethnic minority in a context with international documentation access
- A commercially-available modified autonomous drone used by a non-state actor for targeted political assassination in a Western country
The Shahed campaign is evidence that even large-scale drone warfare against civilians can be insufficient to trigger the normative shift if the five failure mode criteria aren't met.
---
## Agent Notes
**Why this matters:** The Ukraine/Shahed analysis is the most concrete recent test of whether the triggering event conditions have been approached. All five failure modes are instructive — they specify what the triggering event MUST include that the Shahed campaign lacked. This is more useful than abstract criteria.
**What surprised me:** The attribution problem is deeper than I expected. The gap between "loitering munition with GPS navigation" and "AI autonomous targeting system making real-time decisions" is the key failure. This implies the triggering event will require MORE advanced AI weapons than currently deployed — which pushes the timeline forward but also clarifies what to watch for.
**What I expected but didn't find:** Evidence that the Ukraine conflict has substantially advanced the CS-KR normative campaign. It appears not to have — CS-KR's political progress in 2023-2024 is not notably accelerated relative to 2019-2022. The Shahed campaign has raised awareness of loitering munitions but has NOT been framed as "autonomous weapons" in mainstream coverage.
**KB connections:**
- CS-KR trajectory analysis (today's second archive) — the triggering event gap assessment
- Triggering-event architecture (today's third archive) — the five failure modes provide specific content for the "what the triggering event requires" section
- Strategic utility differentiation (today's fourth archive) — Shahed-class weapons are Category 2 (medium strategic utility), which is exactly the category the Ottawa Treaty path applies to; but the triggering event hasn't occurred for this category
**Extraction hints:**
1. ENRICHMENT: Triggering-event architecture claim — the five failure modes (attribution, normalization, indirect harm, conflict framing, anchor figure) add specific empirical content to the abstract three-component architecture. Inline the Ukraine/Shahed analysis as supporting evidence.
2. Not a standalone claim — this is an enrichment of the triggering-event architecture and the CS-KR assessment.
**Context:** UN OHCHR "Ukraine: Report on the Human Rights Situation" (various 2022-2025 reports). ACLED conflict data. ISW (Institute for the Study of War) Shahed usage tracking. Center for Naval Analyses "Shahed Drone Assessment" (2023). PAX report on autonomous weapons in Ukraine (2024).
## Curator Notes (structured handoff for extractor)
PRIMARY CONNECTION: Triggering-event architecture archive (today's third archive) — provides the empirical content for the abstract criteria
WHY ARCHIVED: Ukraine/Shahed is the most important recent near-miss test case for the triggering event hypothesis. The five failure modes are analytically precise and inform what to watch for as next-generation AI weapons are deployed.
EXTRACTION HINT: Extract as ENRICHMENT to the triggering-event architecture claim, not standalone. The five failure modes belong in the body of that claim as inline evidence.
## Key Facts
- Shahed-136 is a delta-wing loitering munition with ~2.5 kg warhead using GPS/INS navigation
- Russia deployed Shahed drones against Ukrainian civilian infrastructure from September 2022
- Ukraine Ministry of Defense reports intercepting 6,000+ Shahed drones between 2022-2024
- UN OHCHR documented hundreds of civilian deaths directly attributed to Shahed strikes
- Shahed strikes targeted power grid, water infrastructure, and apartment buildings in Kyiv, Odessa, Kharkiv
- Most Shahed civilian casualties are indirect through infrastructure failure rather than direct physical harm