Compare commits: f3bd2b396d ... a8a4849c0d

2 commits: a8a4849c0d, be8269da02

2 changed files with 181 additions and 104 deletions

@@ -1,104 +0,0 @@
---
type: source
title: "Leo Synthesis — EU AI Act Article 2.3 National Security Exclusion Confirms the Legislative Ceiling Is Cross-Jurisdictional, Not US-Specific"
author: "Leo (cross-domain synthesis from EU AI Act Regulation 2024/1689, GDPR Article 2.2, and Sessions 2026-03-27/28/29 legislative ceiling pattern)"
url: https://archive/synthesis
date: 2026-03-30
domain: grand-strategy
secondary_domains: [ai-alignment]
format: synthesis
status: enrichment
priority: high
tags: [eu-ai-act, article-2-3, national-security-exclusion, legislative-ceiling, cross-jurisdictional, gdpr, regulatory-design, military-ai, sovereign-authority, governance-instrument-asymmetry, belief-1, scope-qualifier, grand-strategy, ai-governance]
flagged_for_theseus: ["EU AI Act Article 2.3 exclusion has direct implications for Theseus's claims about governance mechanisms for frontier AI — the most safety-forward binding regulation excludes the deployment context Theseus's domain is most concerned about"]
processed_by: leo
processed_date: 2026-03-30
claims_extracted: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
processed_by: leo
processed_date: 2026-03-31
enrichments_applied: ["eu-ai-act-article-2-3-national-security-exclusion-confirms-legislative-ceiling-is-cross-jurisdictional.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "pre-screen: 1 prior-art claim from 5 themes"
---

## Content

**Source material:** EU AI Act (Regulation (EU) 2024/1689), Article 2.3; GDPR (Regulation (EU) 2016/679), Article 2.2(a); France/Germany member state lobbying record during EU AI Act drafting (documented in EU legislative process); existing KB source 2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md.

**The EU AI Act's Article 2.3 (verbatim):**

"This Regulation shall not apply to AI systems developed or used exclusively for military, national defence or national security purposes, regardless of the type of entity carrying out those activities."

This is the legislative ceiling instantiated in black-letter law by the most ambitious binding AI safety regulation in the world, produced by the most safety-forward regulatory jurisdiction, after years of negotiation with safety-oriented political leadership.

**Key features of the exclusion:**

1. "Regardless of the type of entity" — covers private companies developing military AI, not just state actors
2. Categorical and blanket — no tiered approach, no proportionality test, no compliance-lite version for military AI
3. Applies by purpose: AI used "exclusively" for military/national security is excluded; dual-use AI may still be subject to the regulation for its civilian applications
4. The scope exclusion was not a last-minute amendment — it was present in early drafts and confirmed through the co-decision process

**Why the exclusion was adopted:**

France and Germany, as major member states with significant defense industries, lobbied successfully for the exclusion. The stated justifications align exactly with the strategic interest inversion mechanism documented in Sessions 2026-03-27/28:

- Military AI systems require response speed incompatible with conformity assessment timelines
- Transparency requirements (explainability, technical documentation) could expose classified capabilities
- Third-party audit of military AI decision systems is incompatible with operational security
- "Safety" requirements must be defined by military doctrine, not civilian regulatory standards

These are the same arguments that produced the DoD blacklisting of Anthropic at the contracting level — now operating at the legislative scope-definition level, in a different jurisdiction, under a different political administration, producing the same outcome.

**GDPR precedent:**

Article 2.2(a) of GDPR (the world's leading data protection regulation, which became applicable in 2018) excludes processing "in the course of an activity which falls outside the scope of Union law." The Court of Justice of the EU has consistently interpreted this to exclude national security activities. The EU AI Act's Article 2.3 follows the same structural logic as GDPR's national security exclusion — it is embedded EU regulatory DNA, not an AI-specific political choice.

**Cross-jurisdictional significance:**

The EU AI Act was drafted by legislators who were specifically aware of the gap that a national security exclusion creates. The exclusion was retained anyway — because the legislative ceiling is not the product of ignorance or insufficient safety advocacy; it is the product of how nation-states preserve sovereign authority over national security decisions. The EU's regulatory philosophy explicitly prioritizes human oversight and accountability for civilian AI. Its military exclusion is not an exception to that philosophy — it is where national sovereignty overrides it.

**Relationship to Sessions 2026-03-27/28/29 findings:**

Session 2026-03-29 described the legislative ceiling as "logically necessary" and offered it as a structural diagnosis. The EU AI Act Article 2.3 converts that structural diagnosis into an empirical finding: the legislative ceiling has already occurred, in the most prominent binding AI safety statute in history, in the most safety-forward regulatory jurisdiction in the world. This is not a prediction — it is a completed fact.

---

## Agent Notes

**Why this matters:** This is the most important cross-jurisdictional confirmation available for the legislative ceiling claim. Sessions 2026-03-27/28/29 developed the pattern from US evidence (DoD contracting, litigation, PAC investment). The EU AI Act Article 2.3 confirms the pattern holds in a different political system, under different leadership, with a different regulatory philosophy — making "this is US-specific" or "this is Trump-administration-specific" alternative explanations definitively false.

**What surprised me:** The "regardless of the type of entity" clause. I expected the exclusion to cover government/military use. The extension to private companies using AI for military purposes is a broader exclusion than I anticipated — it closes the "private contractor loophole" that might otherwise allow civilian AI safety requirements to flow through procurement chains. The EU explicitly foreclosed that alternative governance pathway.

**What I expected but didn't find:** Any "minimal standards" provision for military AI — a lite compliance tier that would apply reduced requirements to national security AI. The EU chose a categorical binary (in scope / out of scope) rather than a tiered approach. This makes the exclusion cleaner analytically but also removes any pathway to partial governance of military AI through the EU AI Act's framework.

**KB connections:**

- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — EU AI Act Article 2.3 is direct evidence that even the most sophisticated coordination mechanism (binding regulation) contains the gap for the highest-stakes deployment context
- Session 2026-03-28 synthesis (legal mechanism gap) — Article 2.3 confirms that even when the instrument changes from voluntary to mandatory, the legal mechanism gap persists for military AI in exactly the most successful mandatory governance regime
- Session 2026-03-29 synthesis (legislative ceiling) — Article 2.3 converts the structural diagnosis into a completed empirical fact
- 2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md (existing KB archive) — that source covers Article 43 (conformity assessment); this source covers Article 2.3 (scope exclusion); together they paint the full picture of the EU AI Act's governance limitations

**Extraction hints:**

- PRIMARY: Extract as standalone claim: "The EU AI Act's Article 2.3 blanket national security exclusion confirms the legislative ceiling is cross-jurisdictional — even the world's most ambitious binding AI safety regulation explicitly carves out military and national security AI, regardless of the type of entity deploying it" — domain: grand-strategy, confidence: proven (black-letter law), cross-domain: ai-alignment
- SECONDARY: The GDPR precedent strengthens the "embedded regulatory DNA" framing — consider as supporting evidence in the claim body, not as a separate claim
- ENRICHMENT: This source should be added to the legislative ceiling scope qualifier enrichment on [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] as the cross-jurisdictional confirmation
- DOMAIN NOTE: Flag for Theseus — Article 2.3 directly affects the governance mechanisms available for frontier AI safety; Theseus should know the most binding regulation doesn't apply to the deployment contexts they're most concerned about

**Context:** EU AI Act entered into force August 1, 2024. The existing KB source (2026-03-20-eu-ai-act-article43-conformity-assessment-limits.md) covers Article 43 conformity assessment; this archive covers Article 2.3 scope exclusion, a different provision with different significance. The KB therefore had EU AI Act coverage of conformity assessment limits but not of the scope exclusion — this source fills that gap.

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] + Session 2026-03-29 legislative ceiling synthesis

WHY ARCHIVED: Cross-jurisdictional empirical confirmation that the legislative ceiling has already occurred in the world's most prominent binding AI safety regulation. Converts Sessions 2026-03-27/28/29's structural diagnosis into a completed fact.

EXTRACTION HINT: Extract as standalone claim with confidence: proven (black-letter law). EU AI Act Article 2.3 verbatim text is the evidence — no additional sourcing needed. Flag for Theseus. Add as enrichment to governance instrument asymmetry claim (Pattern G) before that goes to PR.

## Key Facts

- EU AI Act (Regulation 2024/1689) entered into force August 1, 2024
- Article 2.3 excludes AI systems developed or used exclusively for military, national defence or national security purposes
- The exclusion applies 'regardless of the type of entity carrying out those activities'
- The national security exclusion was present in early EU AI Act drafts and confirmed through the co-decision process
- France and Germany lobbied successfully for the national security exclusion during EU AI Act drafting
- GDPR Article 2.2(a) excludes processing 'in the course of an activity which falls outside the scope of Union law', establishing the precedent for national security exclusions in EU regulation
- Court of Justice of the EU has consistently interpreted GDPR's scope exclusion to cover national security activities

ops/multi-model-eval-architecture.md (new file, 181 lines)

@@ -0,0 +1,181 @@

# Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

## Problem

Kim et al. (ICML 2025): ~60% error agreement within same-model-family evaluations. Self-preference bias is linear with self-recognition. A single-model evaluator systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping — multi-model evaluation exploits that complementarity.

## Architecture

### Evaluation Sequence

1. **Leo evaluates first.** Verdict + reasoning stored as structured record.
2. **Second model evaluates independently** against the same rubric. Different model family required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
3. **System surfaces disagreements only.** Agreements are noise; disagreements are signal.
4. **Leo makes final call** on all disagreements.

Sequencing rationale: Leo sees the second model's assessment **after** his own eval, never before. Seeing it before anchors judgment. Seeing it after functions as a genuine blind-spot check.
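
A minimal sketch of this sequence, assuming in-memory stand-ins for the verdict store and disagreement queue; the `Verdict` shape and helper names are illustrative, not the implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    claim_id: str
    verdict: str              # "accept" or "reject"
    category: Optional[str]   # rejection category, if any
    reasoning: str

# In-memory stand-ins for the verdict store and disagreement queue.
verdict_log: list[Verdict] = []
disagreement_queue: list[tuple[Verdict, Verdict]] = []

def evaluate_claim(claim_id: str, rubric: str, leo_eval, second_eval) -> Optional[Verdict]:
    """leo_eval / second_eval: callables (claim_id, rubric) -> Verdict.
    The second model must be a different family, called with the SAME rubric."""
    # 1. Leo evaluates first; his verdict is recorded before the second pass
    #    runs, so his judgment cannot be anchored by the other model's output.
    leo_v = leo_eval(claim_id, rubric)
    verdict_log.append(leo_v)

    # 2. Second model evaluates independently against the same rubric.
    second_v = second_eval(claim_id, rubric)

    # 3. Agreements are noise; only disagreements surface for review.
    if (leo_v.verdict, leo_v.category) == (second_v.verdict, second_v.category):
        return leo_v  # 4a. agreement: Leo's verdict stands as final

    disagreement_queue.append((leo_v, second_v))
    return None  # 4b. disagreement: awaits Leo's final call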

### Second Model Selection

Requirements:

- Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
- Access via OpenRouter API (single integration point)
- Must receive the same rubric and claim content as Leo
- Must output structured verdict in the same format

### Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs reject, or different rejection categories).

Disagreements surface in a review queue Leo checks before finalizing. Each disagreement record includes:

- Leo's verdict + reasoning
- Second model's verdict + reasoning
- The specific claim and PR context
- Which evaluation criteria they diverge on

### Calibration Metrics

Track disagreement rate over time:

- **Below ~10%:** System is working. Evaluators are calibrated.
- **10-25%:** Normal operating range. Disagreements are productive signal.
- **Above ~25%:** Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger rubric review.

Disagreement rate itself becomes the primary calibration metric for evaluation quality.
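
A sketch of the band logic with the thresholds above; function names are illustrative:

```python
def disagreement_rate(total_evals: int, disagreements: int) -> float:
    return disagreements / total_evals if total_evals else 0.0

def calibration_band(rate: float) -> str:
    # Thresholds from the spec: <10% calibrated, 10-25% normal, >25% actionable.
    if rate < 0.10:
        return "calibrated"
    if rate <= 0.25:
        return "normal"
    return "review-rubric"  # ambiguous rubric or evaluator drift
```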

## Unified Rejection Record

Single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

```json
{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "<producer of the rejected content>",
  "pr": "<PR number>",
  "file": "<file path in PR>",
  "claim_path": "<claim file path if different from file>",
  "detail": "<free text explanation>",
  "timestamp": "<ISO 8601>"
}
```

Field notes:

- `source`: `ci` for automated gates, `evaluator` for Leo, `second_model` for the disagreement-check model
- `severity`: `hard` = merge blocker (schema_violation, wiki_link_broken), `soft` = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger immediate resubmission attempts. Soft rejections accumulate toward the 3-strikes upgrade threshold.
- `claim_path` separate from `file` handles multi-file enrichment PRs where only one file has the issue
- `category` taxonomy covers ~80% of rejection causes based on ~400 PR reviews
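
A minimal validator for this record, as a sketch; the enums mirror the field notes, and the helper name is illustrative:

```python
ALLOWED = {
    "source": {"ci", "evaluator", "second_model"},
    "category": {"schema_violation", "wiki_link_broken", "weak_evidence",
                 "scope_mismatch", "factual_error", "precision_failure",
                 "opsec_violation"},
    "severity": {"hard", "soft"},
}
REQUIRED = {"source", "category", "severity", "agent_id", "pr", "file",
            "detail", "timestamp"}  # claim_path is optional

def validate_rejection(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED - record.keys()]
    for field, allowed in ALLOWED.items():
        if field in record and record[field] not in allowed:
            problems.append(f"bad {field}: {record[field]!r}")
    return problems
```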

### Rejection Feedback Loop

1. Rejection records flow to the producing agent as structured feedback.
2. Agent receives the category, severity, and detail.
3. Hard rejections → agent attempts immediate fix and resubmission.
4. Soft rejections → agent accumulates feedback. **After 3 rejections of the same category from the same agent**, the system triggers a skill upgrade proposal.
5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.
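
A sketch of the accumulation logic, assuming validated records; resetting the counter after a trigger is an assumption, not spec'd above:

```python
from collections import Counter

STRIKE_THRESHOLD = 3
soft_strikes: Counter = Counter()  # keyed by (agent_id, category)

def on_rejection(record: dict) -> str:
    if record["severity"] == "hard":
        return "fix-and-resubmit"  # immediate, no accumulation
    key = (record["agent_id"], record["category"])
    soft_strikes[key] += 1
    if soft_strikes[key] >= STRIKE_THRESHOLD:
        soft_strikes[key] = 0  # reset after triggering (assumption)
        return "propose-skill-upgrade"  # routes to Leo for eval
    return "accumulate"
```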

## Automatable CI Rules

Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).

### Hard Gates

**1. YAML Schema Validation**

- `type` field exists and equals `claim`
- All required frontmatter fields present: type, domain, description, confidence, source, created
- Domain value is one of the 14 valid domains
- Confidence value is one of: proven, likely, experimental, speculative
- Date format is valid ISO 8601
- Pure syntax check — zero judgment needed (see the sketch after this list)
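
A sketch of the gate using PyYAML; the 14 valid domains are not enumerated in this spec, so the set is left as a placeholder:

```python
import datetime
import yaml  # PyYAML

VALID_DOMAINS: set[str] = set()  # populate with the 14 valid domains
VALID_CONFIDENCE = {"proven", "likely", "experimental", "speculative"}
REQUIRED_FIELDS = {"type", "domain", "description", "confidence", "source", "created"}

def check_frontmatter(text: str) -> list[str]:
    """Hard gate: returns merge-blocking errors for a claim file's frontmatter."""
    errors = []
    try:
        # Frontmatter = YAML between the leading '---' fences.
        _, fm, _ = text.split("---", 2)
        meta = yaml.safe_load(fm) or {}
    except (ValueError, yaml.YAMLError):
        return ["frontmatter missing or unparseable"]
    errors += [f"missing field: {f}" for f in REQUIRED_FIELDS - meta.keys()]
    if meta.get("type") != "claim":
        errors.append("type must be 'claim'")
    if VALID_DOMAINS and meta.get("domain") not in VALID_DOMAINS:
        errors.append(f"invalid domain: {meta.get('domain')!r}")
    if meta.get("confidence") not in VALID_CONFIDENCE:
        errors.append(f"invalid confidence: {meta.get('confidence')!r}")
    try:
        datetime.date.fromisoformat(str(meta.get("created", "")))
    except ValueError:
        errors.append("created is not a valid ISO 8601 date")
    return errors
```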

**2. Wiki Link Resolution**

- Every `[[link]]` in the body must resolve to an existing file at merge time
- Includes links in the `Relevant Notes` section
- Already policy, not yet enforced in CI (see the sketch after this list)
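
A sketch of the resolver; the assumption that `[[link]]` targets map to `*.md` file stems under a single KB root is about repo layout, not spec'd here:

```python
import re
from pathlib import Path

WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")  # target before any '|' alias or '#' anchor

def unresolved_links(body: str, kb_root: Path) -> list[str]:
    """Hard gate: every [[link]] must resolve to an existing file at merge time."""
    existing = {p.stem for p in kb_root.rglob("*.md")}
    return [t.strip() for t in WIKI_LINK.findall(body) if t.strip() not in existing]
```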

### Soft Flags

**3. Domain Validation**

- File path domain matches one of the 14 valid domains
- Claim content plausibly belongs in that domain
- Path check is automatable; content check needs light NLP or embedding similarity against domain centroids
- Flag for reviewer if domain assignment seems wrong

**4. OPSEC Scan**

- Regex for dollar amounts, percentage allocations, fund sizes, deal terms (see the sketch after this list)
- Flag for human review, never auto-reject (false positive risk on dollar-sign patterns in technical content)
- Standing directive from Cory: strict enforcement, but false positives on technical content create friction
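
A sketch of the flag-only scan; the patterns are illustrative starting points, not a tuned production set:

```python
import re

# Illustrative patterns: dollar amounts, percentage allocations, fund-size phrases.
OPSEC_PATTERNS = [
    re.compile(r"\$\s?\d[\d,.]*\s?(?:[kKmMbB]|million|billion)?"),  # $2.5M, $400,000
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s?%\s+(?:allocation|of the fund|carry)\b", re.I),
    re.compile(r"\b(?:fund size|deal terms?|term sheet)\b", re.I),
]

def opsec_flags(text: str) -> list[str]:
    """Soft flag only: surfaces matches for human review, never auto-rejects."""
    return [m.group(0) for pat in OPSEC_PATTERNS for m in pat.finditer(text)]
```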

**5. Duplicate Detection**

- Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
- **Threshold: 0.92 universal** — not per-domain tuning
- Flag includes **top-3 similar claims with scores** so the reviewer can judge in context
- The threshold is the attention trigger; reviewer judgment is the decision
- If a domain consistently generates >50% false positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim type files are hard-filtered from duplicate detection — they're navigation aids, not claims. A sketch of the check follows.
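
A sketch against the documented qdrant-client and OpenAI SDK surfaces; the collection name and the `path`/`domain` payload fields are assumptions:

```python
from openai import OpenAI
from qdrant_client import QdrantClient, models

DUP_THRESHOLD = 0.92  # universal attention trigger, not an auto-reject

def duplicate_check(claim_text: str, domain: str):
    """Soft flag: top-3 nearest claims in the same domain, with scores."""
    vec = OpenAI().embeddings.create(
        model="text-embedding-3-small", input=claim_text,
    ).data[0].embedding  # 1536-dim
    hits = QdrantClient(host="localhost").search(
        collection_name="claims",  # assumed collection name
        query_vector=vec,
        query_filter=models.Filter(must=[
            models.FieldCondition(key="domain", match=models.MatchValue(value=domain))
        ]),
        limit=3,  # reviewer sees top-3 with scores either way
        with_payload=True,
    )
    flagged = any(h.score >= DUP_THRESHOLD for h in hits)
    return [(h.payload.get("path"), h.score) for h in hits], flagged
```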

## Agent Self-Upgrade Criteria

When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:

1. **Scope compliance** — Does the upgrade stay within the agent's authorized domain? Extraction agent improving YAML parsing: yes. Same agent adding merge capability: no.
2. **Measurable improvement** — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
3. **Schema compliance preserved** — Upgrade cannot break existing quality gates. Full validation suite runs against output produced by the new skill.
4. **Reversibility** — Every skill change must be reversible. If not, the evidence bar goes up significantly.
5. **No scope creep** — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a **claim** needs sourced evidence. A **skill change** needs **demonstrated performance delta** — show the before, show the after, on real data not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.
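
A sketch of the criterion-2 evidence check, assuming a per-test-case metric where higher is better:

```python
def upgrade_evidence_passes(before: list[float], after: list[float]) -> bool:
    """Criterion 2: at least 3 test cases showing improvement, zero regressions."""
    if len(before) != len(after):
        return False  # before/after must cover the same test cases
    improved = sum(a > b for a, b in zip(after, before))
    regressed = sum(a < b for a, b in zip(after, before))
    return improved >= 3 and regressed == 0
```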

## Retrieval Quality (Two-Pass System)

Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.

### Two-Pass Architecture

- **Pass 1:** Top 5 claims, similarity-descending sort
- **Pass 2 (expand):** Top 10 claims, triggered when pass 1 is insufficient

### Calibration Findings

1. **5 first-pass claims is viable for all tested scenarios** — but only if the 5 are well-chosen. Similarity ranking alone won't produce optimal results.

2. **Counter-evidence must be explicitly surfaced.** Similarity-descending sort systematically buries opposing-valence claims. Counter-claims are semantically adjacent but have opposite valence. Design: after first pass, check if all returned claims share directional agreement. If yes, force-include the highest-similarity opposing claim.

3. **Synthesis claims suppress their source claims.** If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with source list in frontmatter, filter at retrieval time (see the sketch after this list). **Bidirectional:** if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (user query more specific than synthesis scope).

4. **Cross-domain claims earn inclusion only when causally load-bearing.** Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't because they're a parallel domain, not a causal input.

5. **Domain maps and topic indices hard-filtered from retrieval results.** Non-claim types (`type: "map"`, indices) should be the first filter in the pipeline, before similarity ranking runs.
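
A sketch of the finding-3 suppression filter; the `sources` frontmatter field name and the result-dict shape are assumptions:

```python
def suppress_synthesis_overlap(hits: list[dict]) -> list[dict]:
    """hits: retrieval results sorted score-descending, each with 'path',
    'score', and optional 'sources' (paths from a synthesis claim's frontmatter)."""
    keep: list[dict] = []
    kept_paths: set[str] = set()
    suppressed: set[str] = set()
    for hit in hits:  # highest score first
        if hit["path"] in suppressed:
            continue  # source claim outranked by an already-kept synthesis
        srcs = hit.get("sources", [])
        if any(s in kept_paths for s in srcs):
            continue  # bidirectional: a source already outranked this synthesis
        keep.append(hit)
        kept_paths.add(hit["path"])
        suppressed.update(srcs)  # synthesis suppresses lower-scoring sources
    return keep
```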

### Valence Tagging

Tag claims with `supports` / `challenges` / `neutral` relative to query thesis at ingestion time. Lightweight, one-time cost per claim. Enables the counter-evidence surfacing logic without runtime sentiment analysis.
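
A sketch of the finding-2 counter-evidence rule driven by these tags; treating neutral-only result sets as needing a `supports` claim is an assumption beyond the spec:

```python
def with_counter_evidence(top5: list[dict], pool: list[dict]) -> list[dict]:
    """top5: first-pass results; pool: all scored candidates, score-descending.
    Each dict carries 'valence' in {'supports', 'challenges', 'neutral'}."""
    valences = {c["valence"] for c in top5}
    if "supports" in valences and "challenges" in valences:
        return top5  # already mixed; nothing to force
    # All returned claims share directional agreement: force-include the
    # highest-similarity opposing claim in place of the weakest result.
    wanted = "challenges" if "supports" in valences else "supports"
    opposing = next((c for c in pool if c["valence"] == wanted), None)
    return top5[:-1] + [opposing] if opposing else top5
```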

## Verifier Divergence Implications

From NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., verifier reports "solved" while benchmark fails). Implication for multi-model eval: the second-model eval pass must check against the **same rubric** as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.

## Implementation Sequence

1. **Automatable CI rules** (hard gates first) — YAML schema validation + wiki link resolution. Foundation for everything else. References: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
2. **Automatable CI rules** (soft flags) — domain validation, OPSEC scan, duplicate detection via Qdrant.
3. **Unified rejection record** — data structure for both CI and human rejections, stored in pipeline.db.
4. **Rejection feedback loop** — structured feedback to agents with 3-strikes accumulation.
5. **Multi-model eval integration** — OpenRouter connection, rubric sharing, disagreement queue.
6. **Self-upgrade eval criteria** — codified in eval workflow, triggered by 3-strikes pattern.

## Design Principle

The constraint enforcement layer must be **outside** the agent being constrained. That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.

---

*Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus.*

*Created: 2026-03-31*