From 4b25300ef7d8cb71a3f7238186e302f670ef771f Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Tue, 31 Mar 2026 09:16:06 +0000 Subject: [PATCH 1/4] extract: 2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70> --- ...gislative-ceiling-is-cross-jurisdictional.md | 6 ++++++ ...l-security-exclusion-legislative-ceiling.txt | 3 +++ ...al-security-exclusion-legislative-ceiling.md | 17 ++++++++++++++++- 3 files changed, 25 insertions(+), 1 deletion(-) create mode 100644 inbox/queue/.prior-art/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.txt diff --git a/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md b/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md index 75911bc7..31c18d55 100644 --- a/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md +++ b/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md @@ -27,6 +27,12 @@ This converts the structural diagnosis from Sessions 2026-03-27/28/29 (developed --- +### Additional Evidence (confirm) +*Source: [[2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling]] | Added: 2026-03-31* + +This source IS the primary claim file itself - it documents EU AI Act Article 2.3's blanket national security exclusion ('This Regulation shall not apply to AI systems developed or used exclusively for military, national defence or national security purposes, regardless of the type of entity carrying out those activities'). The exclusion was present in early drafts and confirmed through co-decision process after France/Germany lobbying. 
GDPR Article 2.2(a) established precedent for national security exclusions in EU regulation, with CJEU consistently interpreting it to exclude national security activities. This converts Sessions 2026-03-27/28/29's structural diagnosis into black-letter law. + + Relevant Notes: - [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] - [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic...]] diff --git a/inbox/queue/.prior-art/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.txt b/inbox/queue/.prior-art/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.txt new file mode 100644 index 00000000..67a1cbca --- /dev/null +++ b/inbox/queue/.prior-art/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.txt @@ -0,0 +1,3 @@ +## Prior Art (automated pre-screening) + +- [house-senate-ai-defense-divergence-creates-structural-governance-chokepoint-at-conference](domains/ai-alignment/house-senate-ai-defense-divergence-creates-structural-governance-chokepoint-at-conference.md) — similarity: 0.65 — matched query: "Legislative ceiling mechanism confirms cross-jurisdictional governance gaps in f" diff --git a/inbox/queue/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md b/inbox/queue/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md index 23c8c5ef..2dd2af21 100644 --- a/inbox/queue/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md +++ b/inbox/queue/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md @@ -7,7 +7,7 @@ date: 2026-03-30 domain: grand-strategy secondary_domains: [ai-alignment] format: synthesis -status: processed +status: enrichment priority: high tags: [eu-ai-act, article-2-3, national-security-exclusion, legislative-ceiling, 
cross-jurisdictional, gdpr, regulatory-design, military-ai, sovereign-authority, governance-instrument-asymmetry, belief-1, scope-qualifier, grand-strategy, ai-governance] flagged_for_theseus: ["EU AI Act Article 2.3 exclusion has direct implications for Theseus's claims about governance mechanisms for frontier AI — the most safety-forward binding regulation excludes the deployment context Theseus's domain is most concerned about"] @@ -15,6 +15,11 @@ processed_by: leo processed_date: 2026-03-30 claims_extracted: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"] extraction_model: "anthropic/claude-sonnet-4.5" +processed_by: leo +processed_date: 2026-03-31 +enrichments_applied: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"] +extraction_model: "anthropic/claude-sonnet-4.5" +extraction_notes: "pre-screen: 1 prior art claims from 5 themes" --- ## Content @@ -87,3 +92,13 @@ EXTRACTION HINT: Extract as standalone claim with confidence: proven (black-lett - France and Germany lobbied successfully for the national security exclusion during EU AI Act drafting - GDPR Article 2.2(a) established precedent for national security exclusions in EU regulation - Court of Justice of the EU has consistently interpreted GDPR's scope exclusion to cover national security activities + + +## Key Facts +- EU AI Act (Regulation 2024/1689) entered into force August 1, 2024 +- Article 2.3 excludes AI systems developed or used exclusively for military, national defence or national security purposes +- The exclusion applies 'regardless of the type of entity carrying out those activities' +- France and Germany lobbied successfully for the national security exclusion during EU AI Act drafting +- GDPR Article 2.2(a) excludes processing 'in the course of an activity which falls outside the scope of Union law' +- Court of Justice of the EU has consistently interpreted GDPR's scope 
exclusion to cover national security activities +- The national security exclusion was present in early EU AI Act drafts and confirmed through the co-decision process From 0e3cbd08279f6ffcd183eb0b64b7cd3abf67cddc Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Tue, 31 Mar 2026 09:17:08 +0000 Subject: [PATCH 2/4] auto-fix: strip 2 broken wiki links Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base. --- ...on-confirms-legislative-ceiling-is-cross-jurisdictional.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md b/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md index 31c18d55..3d9f308d 100644 --- a/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md +++ b/domains/grand-strategy/eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md @@ -35,8 +35,8 @@ This source IS the primary claim file itself - it documents EU AI Act Article 2. Relevant Notes: - [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] -- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic...]] -- [[only binding regulation with enforcement teeth changes frontier AI lab behavior...]] +- government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic... +- only binding regulation with enforcement teeth changes frontier AI lab behavior...
- [[military-ai-deskilling-and-tempo-mismatch-make-human-oversight-functionally-meaningless-despite-formal-authorization-requirements]] Topics: From d38f928ce6295ba81db197161fd3ac16e69665a1 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Tue, 31 Mar 2026 09:32:53 +0000 Subject: [PATCH 3/4] pipeline: archive 1 source(s) post-merge Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70> --- ...-security-exclusion-legislative-ceiling.md | 89 +++++++++++++++++++ 1 file changed, 89 insertions(+) create mode 100644 inbox/archive/grand-strategy/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md diff --git a/inbox/archive/grand-strategy/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md b/inbox/archive/grand-strategy/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md new file mode 100644 index 00000000..23c8c5ef --- /dev/null +++ b/inbox/archive/grand-strategy/2026-03-30-leo-eu-ai-act-article2-national-security-exclusion-legislative-ceiling.md @@ -0,0 +1,89 @@ +--- +type: source +title: "Leo Synthesis — EU AI Act Article 2.3 National Security Exclusion Confirms the Legislative Ceiling Is Cross-Jurisdictional, Not US-Specific" +author: "Leo (cross-domain synthesis from EU AI Act Regulation 2024/1689, GDPR Article 2.2, and Sessions 2026-03-27/28/29 legislative ceiling pattern)" +url: https://archive/synthesis +date: 2026-03-30 +domain: grand-strategy +secondary_domains: [ai-alignment] +format: synthesis +status: processed +priority: high +tags: [eu-ai-act, article-2-3, national-security-exclusion, legislative-ceiling, cross-jurisdictional, gdpr, regulatory-design, military-ai, sovereign-authority, governance-instrument-asymmetry, belief-1, scope-qualifier, grand-strategy, ai-governance] +flagged_for_theseus: ["EU AI Act Article 2.3 exclusion has direct implications for Theseus's claims about governance mechanisms for frontier AI — the most safety-forward binding 
regulation excludes the deployment context Theseus's domain is most concerned about"] +processed_by: leo +processed_date: 2026-03-30 +claims_extracted: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"] +extraction_model: "anthropic/claude-sonnet-4.5" +--- + +## Content + +**Source material:** EU AI Act (Regulation (EU) 2024/1689), Article 2.3; GDPR (Regulation (EU) 2016/679), Article 2.2(a); France/Germany member state lobbying record during EU AI Act drafting (documented in EU legislative process); existing KB source 2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md. + +**The EU AI Act's Article 2.3 (verbatim):** +"This Regulation shall not apply to AI systems developed or used exclusively for military, national defence or national security purposes, regardless of the type of entity carrying out those activities." + +This is the legislative ceiling instantiated in black-letter law by the most ambitious binding AI safety regulation in the world, produced by the most safety-forward regulatory jurisdiction, after years of negotiation with safety-oriented political leadership. + +**Key features of the exclusion:** +1. "Regardless of the type of entity" — covers private companies developing military AI, not just state actors +2. Categorical and blanket — no tiered approach, no proportionality test, no compliance-lite version for military AI +3. Applies by purpose: AI used "exclusively" for military/national security is excluded; dual-use AI may still be subject to the regulation for its civilian applications +4. The scope exclusion was not a last-minute amendment — it was present in early drafts and confirmed through the co-decision process + +**Why the exclusion was adopted:** +France and Germany, as major member states with significant defense industries, lobbied successfully for the exclusion. 
The stated justifications align exactly with the strategic interest inversion mechanism documented in Sessions 2026-03-27/28: +- Military AI systems require response speed incompatible with conformity assessment timelines +- Transparency requirements (explainability, technical documentation) could expose classified capabilities +- Third-party audit of military AI decision systems is incompatible with operational security +- "Safety" requirements must be defined by military doctrine, not civilian regulatory standards + +These are the same arguments that produced the DoD blacklisting of Anthropic at the contracting level — now operating at the legislative scope-definition level, in a different jurisdiction, under a different political administration, producing the same outcome. + +**GDPR precedent:** +Article 2.2(a) of GDPR (the world's leading data protection regulation, which entered into force in 2018) excludes processing "in the course of an activity which falls outside the scope of Union law." The Court of Justice of the EU has consistently interpreted this to exclude national security activities. The EU AI Act's Article 2.3 follows the same structural logic as GDPR's national security exclusion — it is embedded EU regulatory DNA, not an AI-specific political choice. + +**Cross-jurisdictional significance:** +The EU AI Act was drafted by legislators who were specifically aware of the gap that a national security exclusion creates. The exclusion was retained anyway — because the legislative ceiling is not the product of ignorance or insufficient safety advocacy; it is the product of how nation-states preserve sovereign authority over national security decisions. The EU's regulatory philosophy explicitly prioritizes human oversight and accountability for civilian AI. Its military exclusion is not an exception to that philosophy — it is where national sovereignty overrides it. 
+ +**Relationship to Sessions 2026-03-27/28/29 findings:** +Session 2026-03-29 described the legislative ceiling as "logically necessary" and offered it as a structural diagnosis. The EU AI Act Article 2.3 converts that structural diagnosis into an empirical finding: the legislative ceiling has already occurred, in the most prominent binding AI safety statute in history, in the most safety-forward regulatory jurisdiction in the world. This is not a prediction — it is a completed fact. + +--- + +## Agent Notes + +**Why this matters:** This is the most important cross-jurisdictional confirmation available for the legislative ceiling claim. Sessions 2026-03-27/28/29 developed the pattern from US evidence (DoD contracting, litigation, PAC investment). The EU AI Act Article 2.3 confirms the pattern holds in a different political system, under different leadership, with different regulatory philosophy — making "this is US-specific" or "this is Trump-administration-specific" alternative explanations definitively false. + +**What surprised me:** The "regardless of the type of entity" clause. I expected the exclusion to cover government/military use. The extension to private companies using AI for military purposes is a broader exclusion than I anticipated — it closes the "private contractor loophole" that might otherwise allow civilian AI safety requirements to flow through procurement chains. The EU explicitly foreclosed that alternative governance pathway. + +**What I expected but didn't find:** Any "minimal standards" provision for military AI — a lite compliance tier that would apply reduced requirements to national security AI. The EU chose a categorical binary (in scope / out of scope) rather than a tiered approach. This makes the exclusion cleaner analytically but also removes any pathway to partial governance of military AI through the EU AI Act's framework. 
+ +**KB connections:** +- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — EU AI Act Article 2.3 is direct evidence that even the most sophisticated coordination mechanism (binding regulation) contains the gap for the highest-stakes deployment context +- Session 2026-03-28 synthesis (legal mechanism gap) — Article 2.3 confirms that even when the instrument changes from voluntary to mandatory, the legal mechanism gap persists for military AI in exactly the most successful mandatory governance regime +- Session 2026-03-29 synthesis (legislative ceiling) — Article 2.3 converts the structural diagnosis into a completed empirical fact +- 2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md (existing KB archive) — that source covers Article 43 (conformity assessment); this source covers Article 2.3 (scope exclusion); together they paint the full picture of EU AI Act's governance limitations + +**Extraction hints:** +- PRIMARY: Extract as standalone claim: "The EU AI Act's Article 2.3 blanket national security exclusion confirms the legislative ceiling is cross-jurisdictional — even the world's most ambitious binding AI safety regulation explicitly carves out military and national security AI, regardless of the type of entity deploying it" — domain: grand-strategy, confidence: proven (black-letter law), cross-domain: ai-alignment +- SECONDARY: The GDPR precedent strengthens the "embedded regulatory DNA" framing — consider as supporting evidence in the claim body, not as a separate claim +- ENRICHMENT: This source should be added to the legislative ceiling scope qualifier enrichment on [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] as the cross-jurisdictional confirmation +- DOMAIN NOTE: Flag for Theseus — Article 2.3 directly affects the governance mechanisms available for frontier AI safety; Theseus should know the most binding regulation 
doesn't apply to the deployment contexts they're most concerned about + +**Context:** EU AI Act entered into force August 1, 2024. Existing KB source (2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md) covers Article 43 conformity assessment — this archive covers Article 2.3 scope exclusion, which is a different provision with different significance. The KB has EU AI Act coverage of conformity assessment limits (Article 43) but not scope exclusion (Article 2.3) — this fills the gap. + +## Curator Notes (structured handoff for extractor) +PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] + Session 2026-03-29 legislative ceiling synthesis +WHY ARCHIVED: Cross-jurisdictional empirical confirmation that the legislative ceiling has already occurred in the world's most prominent binding AI safety regulation. Converts Sessions 2026-03-27/28/29's structural diagnosis into a completed fact. +EXTRACTION HINT: Extract as standalone claim with confidence: proven (black-letter law). EU AI Act Article 2.3 verbatim text is the evidence — no additional sourcing needed. Flag for Theseus. Add as enrichment to governance instrument asymmetry claim (Pattern G) before that goes to PR. 
+ + +## Key Facts +- EU AI Act (Regulation 2024/1689) entered into force August 1, 2024 +- Article 2.3 excludes AI systems developed or used exclusively for military, national defence or national security purposes +- The exclusion applies 'regardless of the type of entity carrying out those activities' +- France and Germany lobbied successfully for the national security exclusion during EU AI Act drafting +- GDPR Article 2.2(a) established precedent for national security exclusions in EU regulation +- Court of Justice of the EU has consistently interpreted GDPR's scope exclusion to cover national security activities From 0fa4836b34b9d2c0e74835bfa5736fecfaff4c03 Mon Sep 17 00:00:00 2001 From: m3taversal Date: Tue, 31 Mar 2026 10:02:57 +0100 Subject: [PATCH 4/4] theseus: extract 5 claims + 1 enrichment from Pan et al. NLAH paper MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - What: 5 NEW claims from "Natural-Language Agent Harnesses" (arXiv:2603.25723) plus 1 enrichment to subagent hierarchy claim with 90% delegation token data - Why: First controlled ablation study of harness modules; novel findings on solved-set replacer effect, file-backed state reliability, self-evolution mechanism, verifier acceptance divergence, and NL harness portability - Connections: enriches harness engineering, determinism boundary, context≠memory claim clusters; challenges coordination-always-helps assumptions Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3> --- ...ntext truncation delegation and restart.md | 41 +++++++++++++++++ ...cases that flip under changed structure.md | 37 ++++++++++++++++ ...eparable from low-level execution hooks.md | 39 ++++++++++++++++ ...ction outperform open-ended exploration.md | 36 +++++++++++++++ ...y agent controlling specialized helpers.md | 5 +++ ...ccess criteria not the final evaluators.md | 35 +++++++++++++++ ...n-2026-natural-language-agent-harnesses.md | 44 +++++++++++++++++++ 7 files 
changed, 237 insertions(+) create mode 100644 domains/ai-alignment/file-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart.md create mode 100644 domains/ai-alignment/harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md create mode 100644 domains/ai-alignment/harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks.md create mode 100644 domains/ai-alignment/self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration.md create mode 100644 domains/ai-alignment/verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators.md create mode 100644 inbox/archive/pan-2026-natural-language-agent-harnesses.md diff --git a/domains/ai-alignment/file-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart.md b/domains/ai-alignment/file-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart.md new file mode 100644 index 00000000..a4c07816 --- /dev/null +++ b/domains/ai-alignment/file-backed durable state is the most consistently positive harness module across task types 
because externalizing state to path-addressable artifacts survives context truncation delegation and restart.md @@ -0,0 +1,41 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Ablation study shows file-backed state improves both SWE-bench (+1.6pp) and OSWorld (+5.5pp) while maintaining the lowest overhead profile among tested modules — its value is process structure not score gain" +confidence: experimental +source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI." +created: 2026-03-31 +depends_on: + - "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing" + - "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching" +--- + +# File-backed durable state is the most consistently positive harness module across task types because externalizing state to path-addressable artifacts survives context truncation delegation and restart + +Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic) — the only module to show consistent positive gains across both benchmarks without high variance. + +The module enforces three properties: +1. **Externalized** — state is written to artifacts rather than held only in transient context +2. **Path-addressable** — later stages reopen the exact object by path +3. **Compaction-stable** — state survives truncation, restart, and delegation + +Its gains are mild in absolute terms but its mechanism is distinct from the other modules. 
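The three properties can be illustrated with a minimal sketch (an invented illustration of the mechanism, not code from the paper; the class name and file layout are assumptions):

```python
import json
import os
import tempfile


class FileBackedState:
    """Toy state store with the three properties named above:
    externalized (written to disk, not held only in context),
    path-addressable (reopened by exact path), and
    compaction-stable (survives restart or delegation)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def path_for(self, task_id):
        # Later stages reopen the exact object by this path.
        return os.path.join(self.root, f"{task_id}.json")

    def append(self, task_id, event):
        # Append-only task history: one JSON line per event.
        with open(self.path_for(task_id), "a") as f:
            f.write(json.dumps(event) + "\n")

    def history(self, task_id):
        with open(self.path_for(task_id)) as f:
            return [json.loads(line) for line in f]


root = tempfile.mkdtemp()
state = FileBackedState(root)
state.append("seaborn-3069", {"stage": "patch", "artifact": "fix.diff"})

# Simulate restart/delegation: a fresh object, same path, same history —
# nothing depends on the original process's in-memory context.
fresh = FileBackedState(root)
assert fresh.history("seaborn-3069")[0]["artifact"] == "fix.diff"
```

The point of the sketch is the failure mode it avoids: if the event were held only in transient context, the "restart" on the last lines would lose it.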
File-backed state and evidence-backed answering mainly improve process structure — they leave durable external signatures (task histories, manifests, analysis sidecars) that improve auditability, handoff discipline, and trace quality more directly than semantic repair ability. + +On OSWorld, the file-backed state effect is amplified because the baseline already involves a structured harness (OS-Symphony). The migration study (RQ3) confirms this: migrated NLAH runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate. + +The case study of `mwaskom__seaborn-3069` illustrates the mechanism: under file-backed state, the workspace leaves a durable spine consisting of a parent response, append-only task history, and manifest entries for the promoted patch artifact. The child handoff and artifact lineage become explicit, helping the solver keep one patch surface and one verification story. + +## Challenges + +The +1.6pp on SWE-bench is within noise for 125 samples. The stronger signal is the process trace analysis, not the score delta. Whether file-backed state helps primarily by preventing state loss (defensive value) or by enabling new solution strategies (offensive value) is not cleanly separated by the ablation design. + +--- + +Relevant Notes: +- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — file-backed state is the architectural embodiment of this distinction: it externalizes memory to durable artifacts rather than relying on context window as pseudo-memory +- [[context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching]] — file-backed state as described by Pan et al. 
is the production implementation of context-file-as-OS: path-addressable, externalized, compaction-stable +- [[production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file]] — the file-backed module's three properties (externalized, path-addressable, compaction-stable) represent exactly the kind of dedicated memory engineering that takes 24% of codebase + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md b/domains/ai-alignment/harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md new file mode 100644 index 00000000..6cf68608 --- /dev/null +++ b/domains/ai-alignment/harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure.md @@ -0,0 +1,37 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Controlled ablation of 6 harness modules on SWE-bench Verified shows 110-115 of 125 samples agree between Full IHR and each ablation — the harness reshapes which boundary cases flip, not overall solve rate" +confidence: experimental +source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Tables 1-3. SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI." 
+created: 2026-03-31 +depends_on: + - "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows" +challenged_by: + - "coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem" +--- + +# Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure + +Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration. + +The core finding is that Full IHR behaves as a **solved-set replacer**, not a uniform frontier expander. Across both TRAE and Live-SWE harness families on SWE-bench Verified, more than 110 of 125 stitched samples agree between Full IHR and each ablation (Table 2). The meaningful differences are concentrated in a small frontier of 4-8 component-sensitive cases that flip — Full IHR creates some new wins but also loses some direct-path repairs that lighter settings retain. + +The most informative failures are alignment failures, not random misses. On `matplotlib__matplotlib-24570`, TRAE Full expands into a large candidate search, runs multiple selector and revalidation stages, and ends with a locally plausible patch that misses the official evaluator. On `django__django-14404` and `sympy__sympy-23950`, extra structure makes the run more organized and more expensive while drifting from the shortest benchmark-aligned repair path. 
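The flip-set view can be computed directly from per-sample outcomes (a sketch; the sample IDs and pass/fail values below are invented, not from the paper's tables):

```python
# Compare two harness settings by which samples flip, not by the
# aggregate delta. 1 = solved, 0 = unsolved.
full = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0}
ablated = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}

agree = [s for s in full if full[s] == ablated[s]]
unlocked = [s for s in full if full[s] and not ablated[s]]  # new wins under Full
lost = [s for s in full if ablated[s] and not full[s]]      # direct-path repairs Full loses

# The aggregate delta can be zero while the solved set is replaced:
delta = sum(full.values()) - sum(ablated.values())
```

Here `delta` is 0 even though one case was unlocked and one lost, which is exactly why per-case flip analysis is the more informative evaluation.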
+ +This has direct implications for harness engineering strategy: adding modules should be evaluated by which boundary cases they unlock or lose, not by aggregate score deltas. The dominant effect is redistribution of solvability, not expansion. + +## Challenges + +The study uses benchmark subsets (125 SWE, 36 OSWorld) sampled once with a fixed random seed, not full benchmark suites. Whether the frontier-concentration pattern holds at full scale or with different seeds is untested. The authors plan GPT-5.4-mini reruns in a future revision. Additionally, SWE-bench Verified has known ceiling effects that may compress the observable range of module differences. + +--- + +Relevant Notes: +- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — the NLAH ablation data shows this at the module level, not just the agent level: adding orchestration structure can hurt sequential repair paths +- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain is real but this paper shows it concentrates on a small frontier of cases; the majority of tasks are insensitive to protocol changes +- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — the solved-set replacer effect suggests that even well-decomposed multi-agent systems may trade one set of solvable problems for another rather than strictly expanding the frontier + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks.md 
b/domains/ai-alignment/harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks.md new file mode 100644 index 00000000..acb70b6c --- /dev/null +++ b/domains/ai-alignment/harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks.md @@ -0,0 +1,39 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Code-to-text migration study on OSWorld shows NLAH realization (47.2%) exceeded native code harness (30.4%) while relocating reliability from screen repair to artifact-backed closure — NL carries harness logic when deterministic operations stay in code" +confidence: experimental +source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 5, RQ3 migration analysis. OSWorld (36 samples), GPT-5.4, Codex CLI." +created: 2026-03-31 +depends_on: + - "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do" + - "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load" + - "notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it" +--- + +# Harness pattern logic is portable as natural language without performance loss when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks + +Pan et al. (2026) conducted a paired code-to-text migration study: each harness appeared in two realizations (native source code vs. 
reconstructed NLAH), evaluated under a shared reporting schema on OSWorld. The migrated NLAH realization reached 47.2% task success versus 30.4% for the native OS-Symphony code harness. + +The scientific claim is not that NL is superior to code. The paper explicitly states that natural language carries editable, inspectable *orchestration logic*, while code remains responsible for deterministic operations, tool interfaces, and sandbox enforcement. The claim is about separability: the harness design-pattern layer (roles, contracts, stage structure, state semantics, failure taxonomy) can be externalized as a natural-language object without degrading performance, provided a shared runtime handles execution semantics. + +The migration effect is behavioral, not just numerical. Native OS-Symphony externalizes control as a screenshot-grounded repair loop: verify previous step, inspect current screen, choose next GUI action, retry locally on errors. Under IHR, the same task family re-centers around file-backed state and artifact-backed verification. Runs materialize task files, ledgers, and explicit artifacts, and switch more readily from brittle GUI repair to file, shell, or package-level operations when those provide a stronger completion certificate. + +Retained migrated traces are denser (58.5 total logged events vs 18.2 unique commands in native traces) but the density reflects observability and recovery scaffolding, not more task actions. The runtime preserves started/completed pairs, bookkeeping, and explicit artifact handling that native code harnesses handle implicitly. + +This result supports the determinism boundary framework: the boundary between what should be NL (high-level orchestration, editable by humans) and what should be code (deterministic hooks, tool adapters, sandbox enforcement) is a real architectural cut point, and making it explicit improves both portability and performance. 
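The architectural cut point can be sketched concretely. The following is a minimal illustration, not the paper's IHR: the `NL_HARNESS` text, adapter names, and the keyword-lookup interpreter are all invented stand-ins (the real runtime uses an LLM to interpret the NL spec), but the division of labor matches the claim — editable NL carries stage structure, while deterministic operations stay behind code adapters.

```python
# Editable orchestration logic as a natural-language object.
NL_HARNESS = """
stage plan: read the task file and write an explicit task ledger
stage act: perform the repair using shell operations
stage verify: require an artifact-backed completion certificate
"""

# Deterministic layer: real adapters would enforce tool interfaces
# and sandboxing here; these stubs just record what was dispatched.
ADAPTERS = {
    "plan": lambda task: f"ledger({task})",
    "act": lambda task: f"patch({task})",
    "verify": lambda task: f"artifact({task})",
}

def run(nl_harness: str, task: str) -> list[str]:
    """Interpret each NL stage line and dispatch to its code adapter."""
    trace = []
    for line in nl_harness.strip().splitlines():
        stage = line.split()[1].rstrip(":")  # e.g. "plan" from "stage plan:"
        trace.append(ADAPTERS[stage](task))
    return trace

print(run(NL_HARNESS, "t1"))  # ['ledger(t1)', 'patch(t1)', 'artifact(t1)']
```

Editing `NL_HARNESS` changes the orchestration without touching the deterministic layer, which is the portability property the migration study tests.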
+ +## Challenges + +The 47.2 vs 30.4 comparison is on 36 OSWorld samples — small enough that individual task variance could explain some of the gap. The native harness (OS-Symphony) may not be fully optimized for the Codex/IHR backend; some of the NLAH advantage could come from better fit to the specific runtime rather than from portability per se. The authors acknowledge that some harness mechanisms cannot be recovered faithfully from text when they rely on hidden service-side state or training-induced behaviors. + +--- + +Relevant Notes: +- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this paper provides direct evidence: the same runtime with different harness representations produces different behavioral signatures, confirming the harness layer is real and separable +- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the NLAH architecture explicitly implements this boundary: NL carries pattern logic (probabilistic, editable), adapters and scripts carry deterministic hooks (guaranteed, code-based) +- [[notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it]] — NLAHs are a formal version of this: natural-language objects that carry executable control logic + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration.md b/domains/ai-alignment/self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended 
exploration.md new file mode 100644 index 00000000..281e1073 --- /dev/null +++ b/domains/ai-alignment/self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration.md @@ -0,0 +1,36 @@ +--- +type: claim +domain: ai-alignment +description: "Self-evolution module showed the clearest positive effect in controlled ablation (+4.8pp SWE, +2.7pp OSWorld) by tightening the solve loop around acceptance criteria, not by expanding into larger search trees" +confidence: experimental +source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3 + case analysis (scikit-learn__scikit-learn-25747). SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI." +created: 2026-03-31 +depends_on: + - "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation" +challenged_by: + - "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive" +--- + +# Self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration + +Pan et al. (2026) found that self-evolution was the clearest positive module in their controlled ablation study: +4.8pp on SWE-bench Verified (80.0 vs 75.2 Basic) and +2.7pp on OSWorld (44.4 vs 41.7 Basic). In the score-cost view (Figure 4a), self-evolution is the only module that moves upward (higher score) without moving far right (higher cost). + +The mechanism is not open-ended reflection or expanded search. The self-evolution module runs an explicit retry loop with a real baseline attempt first and a default cap of five attempts. 
After every non-successful or stalled attempt, it reflects on concrete failure signals before planning the next attempt. It redesigns along three axes: prompt, tool, and workflow evolution. It stops when judged successful or when the attempt cap is reached, and reports incomplete rather than pretending the last attempt passed. + +The case of `scikit-learn__scikit-learn-25747` illustrates the favorable regime: Basic fails this sample, but self-evolution resolves it. The module organizes the run around an explicit attempt contract where Attempt 1 is treated as successful only if the task acceptance gate is satisfied. The system closes after Attempt 1 succeeds rather than expanding into a larger retry tree, and the evaluator confirms the final patch fixes the target FAIL_TO_PASS tests. The extra structure makes the first repair attempt more disciplined and better aligned with the benchmark gate. + +This is a significant refinement of the "iterative self-improvement" concept. The gain comes not from more iterations or bigger search, but from tighter coupling between failure signals and next-attempt design. The module's constraint structure (explicit cap, forced reflection, acceptance-gated stopping) is what produces the benefit. + +## Challenges + +The `challenged_by` link to curated vs self-generated skills is important context: self-evolution works here because it operates within a bounded retry loop with explicit acceptance criteria, not because self-generated modifications are generally beneficial. The +4.8pp is from a 125-sample subset; the authors note they plan full-benchmark reruns. Whether the acceptance-gating mechanism transfers to tasks without clean acceptance criteria (creative tasks, open-ended research) is untested. 
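The attempt loop described above can be sketched as follows. The `attempt`, `acceptance_gate`, and `redesign` hooks are hypothetical stand-ins for the module's real machinery; the five-attempt cap mirrors its stated default. The three properties that carry the benefit are visible in the structure: an explicit cap, reflection on concrete failure signals before each retry, and reporting incomplete rather than pretending the last attempt passed.

```python
MAX_ATTEMPTS = 5  # module's default cap, per the paper

def self_evolve(attempt, acceptance_gate, redesign):
    failure_signal = None
    for n in range(1, MAX_ATTEMPTS + 1):
        plan = redesign(failure_signal)   # prompt/tool/workflow axes
        result = attempt(plan)
        if acceptance_gate(result):       # stop only on gated success
            return {"status": "success", "attempts": n}
        failure_signal = result           # reflect on real failure data
    return {"status": "incomplete", "attempts": MAX_ATTEMPTS}

# Toy usage: the gate passes once the redesign incorporates the
# failure signal from the baseline attempt.
outcome = self_evolve(
    attempt=lambda plan: plan,
    acceptance_gate=lambda r: r == "fix:err",
    redesign=lambda sig: "fix:err" if sig else "baseline",
)
print(outcome)  # {'status': 'success', 'attempts': 2}
```

Note that the loop never widens into a search tree: each iteration produces exactly one next attempt conditioned on the previous failure, which is the disciplined-retry regime the ablation rewards.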
+ +--- + +Relevant Notes: +- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — the NLAH self-evolution module is a concrete implementation: structurally separated evaluation (acceptance gate) drives the retry loop +- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — self-evolution here succeeds because it modifies approach within a curated structure (the harness), not because it generates new skills from scratch +- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — the self-evolution module's attempt cap and forced reflection are deterministic hooks, not instructions; this is why it works where unconstrained self-modification fails + +Topics: +- [[_map]] diff --git a/domains/ai-alignment/subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md b/domains/ai-alignment/subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md index f2b05a7e..77e82c99 100644 --- a/domains/ai-alignment/subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md +++ b/domains/ai-alignment/subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers.md @@ -27,6 +27,11 @@ For the collective superintelligence thesis, this is important. 
If subagent hier Ruiz-Serra et al.'s factorised active inference framework demonstrates successful peer multi-agent coordination without hierarchical control. Each agent maintains individual-level beliefs about others' internal states and performs strategic planning in a joint context through decentralized representation. The framework successfully handles iterated normal-form games with 2-3 players without requiring a primary controller. However, the finding that ensemble-level expected free energy is not necessarily minimized at the aggregate level suggests that while peer architectures can function, they may require explicit coordination mechanisms (effectively reintroducing hierarchy) to achieve collective optimization. This partially challenges the claim while explaining why hierarchies emerge in practice. +### Additional Evidence (supporting) +*Source: [[pan-2026-natural-language-agent-harnesses]] | Added: 2026-03-31 | Extractor: anthropic/claude-opus-4-6* + +Pan et al. (2026) provide quantitative token-split data from the TRAE NLAH harness on SWE-bench Verified. Table 4 shows that approximately 90% of all prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents rather than in the runtime-owned parent thread (parent: 8.5% prompt, 8.1% completion, 9.8% tool, 9.4% LLM; children: 91.5%, 91.9%, 90.2%, 90.6%). The parent thread is functionally an orchestrator — it reads the harness, dispatches work, and integrates results. This is the first controlled measurement of the delegation concentration in a production-grade harness, confirming the architectural observation that subagent hierarchies concentrate substantive work in children while the parent contributes coordination, not execution. 
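The delegation pattern behind that token split can be sketched as a parent thread that only dispatches and integrates while children carry the workload. Everything below is invented for illustration (subtask names, token counts, the 90/10 shape loosely echoing Table 4); it is not the TRAE harness.

```python
from collections import Counter

usage = Counter()  # token accounting per agent

def child_agent(name: str, subtask: str) -> str:
    usage[name] += 3000  # children carry the heavy token load
    return f"{name} solved {subtask}"

def parent(task: str) -> list[str]:
    usage["parent"] += 1000  # parent: read harness, dispatch, integrate
    subtasks = [f"{task}/part{i}" for i in range(3)]
    return [child_agent(f"child{i}", s) for i, s in enumerate(subtasks)]

parent("swe-task")
total = sum(usage.values())
print(round(100 * usage["parent"] / total, 1))  # parent share: 10.0
```

The design point is that the parent's share stays small regardless of task size, because substantive work is always pushed into the delegated children.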
+ ### Additional Evidence (challenge) *Source: [[2025-12-00-google-mit-scaling-agent-systems]] | Added: 2026-03-28 | Extractor: anthropic/claude-opus-4-6* diff --git a/domains/ai-alignment/verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators.md b/domains/ai-alignment/verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators.md new file mode 100644 index 00000000..f95543d2 --- /dev/null +++ b/domains/ai-alignment/verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators.md @@ -0,0 +1,35 @@ +--- +type: claim +domain: ai-alignment +secondary_domains: [collective-intelligence] +description: "Controlled ablation reveals that adding a verifier stage can make agent runs more structured and locally convincing while drifting from the benchmark's actual acceptance object — extra process layers reshape local success signals" +confidence: experimental +source: "Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3, Table 7, case analysis (sympy__sympy-23950, django__django-13406). SWE-bench Verified (125 samples), GPT-5.4, Codex CLI." +created: 2026-03-31 +depends_on: + - "harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do" +--- + +# Verifier-level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators + +Pan et al. 
(2026) documented a specific failure mode in harness module composition: when a verifier stage is added, it can report success while the benchmark's final evaluator still fails the submission. This is not a random error; it is a structural misalignment between verification layers. + +The case of `sympy__sympy-23950` is the clearest example. Basic and self-evolution both resolve this sample. But file-backed state, evidence-backed answering, verifier, dynamic orchestration, and multi-candidate search all fail it. The verifier run is especially informative because the final response explicitly says a separate verifier reported "solved," while the official evaluator still fails `test_as_set`. The verifier's local acceptance object diverged from the benchmark's acceptance object. + +More broadly across the ablation study, the verifier module scored 74.4 on SWE-bench (slightly below Basic's 75.2, a -0.8pp difference). On OSWorld, it dropped more sharply (33.3 vs 41.7 Basic, -8.4pp). The verifier adds a genuine independent checking layer: on `django__django-11734`, it reruns targeted Django tests and inspects SQL bindings, and the benchmark agrees. But when the verifier's notion of correctness diverges from the benchmark's final gate, the extra structure makes the run more expensive without improving outcomes. + +This finding matters beyond benchmarks. In production agent systems, the "benchmark evaluator" is replaced by real-world success criteria (user satisfaction, business outcomes, safety constraints). If intermediate verification layers optimize for locally checkable properties that correlate imperfectly with the real success criterion, they can create a false sense of confidence: runs look more rigorous while drifting from what actually matters. + +## Challenges + +The divergence may be specific to SWE-bench's evaluator design (test suite pass/fail) rather than a general property of verification layers.
Verifiers that check the same acceptance criteria as the final evaluator should not diverge. The failure mode documented here is specifically about verifiers that construct their own checking criteria independently. Sample size is small (125 SWE, 36 OSWorld) and the verifier-negative cases are a small subset of those. + +--- + +Relevant Notes: +- [[harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do]] — this claim shows the dark side: the harness determines what agents do, but harness-added verification can misalign with actual success criteria +- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — verifier divergence is a specification failure: the verifier's specification of "correct" doesn't match the benchmark's specification +- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — verifiers are deterministic enforcement, but enforcement of the wrong criterion is worse than no enforcement at all + +Topics: +- [[_map]] diff --git a/inbox/archive/pan-2026-natural-language-agent-harnesses.md b/inbox/archive/pan-2026-natural-language-agent-harnesses.md new file mode 100644 index 00000000..63697582 --- /dev/null +++ b/inbox/archive/pan-2026-natural-language-agent-harnesses.md @@ -0,0 +1,44 @@ +--- +type: source +title: "Natural-Language Agent Harnesses" +authors: ["Linyue Pan", "Lexiao Zou", "Shuo Guo", "Jingchen Ni", "Hai-Tao Zheng"] +format: paper +url: "https://arxiv.org/abs/2603.25723" +date: 2026-03-26 +status: processed +processed_by: theseus +processed_date: 2026-03-31 +claims_extracted: 5 +enrichments: 1 +tags: [harness-engineering, agent-architecture, module-ablation, file-backed-state, 
self-evolution] +--- + +# Natural-Language Agent Harnesses + +Preprint from Tsinghua University / Harbin Institute of Technology, March 2026. arXiv:2603.25723v1. + +## Summary + +Proposes Natural-Language Agent Harnesses (NLAHs) — structured NL representations of harness control logic — and an Intelligent Harness Runtime (IHR) that interprets them. Tests on SWE-bench Verified (125 samples) and OSWorld (36 samples) using Codex CLI + GPT-5.4. + +Key contributions: +1. Formalizes the harness design-pattern layer as an explicit, portable object +2. Controlled module ablation study (file-backed state, evidence-backed answering, verifier, self-evolution, multi-candidate search, dynamic orchestration) +3. Code-to-text harness migration study (native OS-Symphony vs NLAH realization) + +## Key findings + +**RQ1 (Behavioral Effect):** Process metrics move much more than resolution rate under Full IHR. TRAE Full: 16.3M prompt tokens, 642 tool calls, 74.4% resolve. TRAE w/o harness skill: 1.2M tokens, 51 tool calls, 75.2% resolve. The harness is behaviorally real but not monotonically helpful. + +**RQ2 (Composability):** Module effects concentrate on a small frontier of component-sensitive cases. 110-115 of 125 SWE samples agree between Full IHR and each ablation (Table 2). Self-evolution is the clearest positive (+4.8pp SWE, +2.7pp OSWorld). Verifier and multi-candidate search can hurt. File-backed state and evidence-backed answering improve process structure rather than score. + +**RQ3 (Migration):** NLAH realization matched or exceeded native code harness on OSWorld (47.2 vs 30.4). Migration relocates reliability mechanisms from local screen repair to durable state and artifact-backed closure. Not loss of orchestration but relocation of verification. + +**Token split:** ~90% of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents, not the runtime-owned parent (Table 4). 
+ +## Extraction notes + +- 5 NEW claims extracted: solved-set replacer, file-backed state, self-evolution mechanism, verifier divergence, NL harness portability +- 1 ENRICHMENT: subagent hierarchy claim gets 90% delegation data +- ~40% overlap with existing KB (harness engineering, multi-agent degradation, determinism boundary) +- Highest novelty: controlled ablation data (no existing claims have module-level ablation), verifier divergence (very low KB coverage)