teleo-codex/ops/multi-model-eval-architecture.md
m3taversal f3bd2b396d theseus: add multi-model evaluation architecture spec
- What: Architecture spec for second-model eval pass, unified rejection format,
  automatable CI rules, retrieval calibration, agent self-upgrade criteria
- Why: Break correlated blind spots in single-model evaluation (Kim et al. ICML 2025:
  ~60% error agreement within same-family). Codifies agreements with Leo across
  4 design sessions. Implementation target for Epimetheus.
- Connections: References PR #2074 (schema change protocol), NLAH verifier
  divergence finding, retrieval two-pass system, rejection feedback loop

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
2026-03-31 10:43:32 +01:00


Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

Problem

Kim et al. (ICML 2025): ~60% error agreement within same-model-family evaluations. Self-preference bias is linear with self-recognition. A single-model evaluator systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping — multi-model evaluation captures this.

Architecture

Evaluation Sequence

  1. Leo evaluates first. Verdict + reasoning stored as structured record.
  2. Second model evaluates independently against the same rubric. Different model family required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
  3. System surfaces disagreements only. Agreements are noise; disagreements are signal.
  4. Leo makes final call on all disagreements.

Sequencing rationale: Leo sees the second model's assessment after his own eval, never before. Seeing it before anchors judgment. Seeing it after functions as a genuine blind-spot check.

Second Model Selection

Requirements:

  • Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
  • Access via OpenRouter API (single integration point)
  • Must receive the same rubric and claim content as Leo
  • Must output structured verdict in the same format

Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs reject, or different rejection categories).

Disagreements surface in a review queue Leo checks before finalizing. Each disagreement record includes:

  • Leo's verdict + reasoning
  • Second model's verdict + reasoning
  • The specific claim and PR context
  • Which evaluation criteria they diverge on
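A minimal sketch of the disagreement record, with hypothetical field names mirroring the bullets above:

```python
from dataclasses import dataclass, field

@dataclass
class DisagreementRecord:
    claim_path: str
    pr: int
    leo_verdict: str            # "accept" or a rejection category
    leo_reasoning: str
    second_verdict: str
    second_reasoning: str
    diverging_criteria: list[str] = field(default_factory=list)

    def is_disagreement(self) -> bool:
        # accept vs reject, or different rejection categories, both count
        return self.leo_verdict != self.second_verdict
```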

Calibration Metrics

Track disagreement rate over time:

  • Below ~10%: System is working. Evaluators are calibrated.
  • 10-25%: Normal operating range. Disagreements are productive signal.
  • Above ~25%: Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger rubric review.

Disagreement rate itself becomes the primary calibration metric for evaluation quality.
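The bands reduce to a simple classifier. The ~10% / ~25% cut-offs are the rough guides above, not tuned constants:

```python
def calibration_band(disagreements: int, total: int) -> str:
    """Classify the rolling disagreement rate into the spec's bands."""
    if total == 0:
        return "no_data"
    rate = disagreements / total
    if rate < 0.10:
        return "calibrated"       # system working, evaluators aligned
    if rate <= 0.25:
        return "normal"           # productive disagreement signal
    return "review_rubric"        # ambiguous rubric or evaluator drift
```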

Unified Rejection Record

Single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "<producer of the rejected content>",
  "pr": "<PR number>",
  "file": "<file path in PR>",
  "claim_path": "<claim file path if different from file>",
  "detail": "<free text explanation>",
  "timestamp": "<ISO 8601>"
}

Field notes:

  • source: ci for automated gates, evaluator for Leo, second_model for the disagreement-check model
  • severity: hard = merge blocker (schema_violation, wiki_link_broken), soft = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger immediate resubmission attempts. Soft rejections accumulate toward the 3-strikes upgrade threshold.
  • claim_path separate from file handles multi-file enrichment PRs where only one file has the issue
  • category taxonomy covers ~80% of rejection causes based on ~400 PR reviews
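A validator for the record might look like the following sketch. The required-field set mirrors the JSON above, with `claim_path` treated as optional:

```python
VALID_SOURCES = {"ci", "evaluator", "second_model"}
VALID_CATEGORIES = {
    "schema_violation", "wiki_link_broken", "weak_evidence", "scope_mismatch",
    "factual_error", "precision_failure", "opsec_violation",
}

def validate_rejection(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = []
    required = {"source", "category", "severity", "agent_id",
                "pr", "file", "detail", "timestamp"}
    for key in sorted(required - record.keys()):
        problems.append(f"missing field: {key}")
    if record.get("source") not in VALID_SOURCES:
        problems.append("unknown source")
    if record.get("category") not in VALID_CATEGORIES:
        problems.append("unknown category")
    if record.get("severity") not in {"hard", "soft"}:
        problems.append("severity must be hard or soft")
    return problems
```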

Rejection Feedback Loop

  1. Rejection records flow to the producing agent as structured feedback.
  2. Agent receives the category, severity, and detail.
  3. Hard rejections → agent attempts immediate fix and resubmission.
  4. Soft rejections → agent accumulates feedback. After 3 rejections of the same category from the same agent, the system triggers a skill upgrade proposal.
  5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.
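The accumulation logic is small enough to sketch; class and method names are illustrative, not the implemented API:

```python
from collections import defaultdict

STRIKE_THRESHOLD = 3  # the 3-strikes rule above

class FeedbackLoop:
    """Accumulate soft rejections per (agent, category); fire upgrade proposals."""

    def __init__(self):
        self.soft_counts = defaultdict(int)

    def record(self, agent_id: str, category: str, severity: str) -> str:
        if severity == "hard":
            return "resubmit"                  # immediate fix + resubmission
        self.soft_counts[(agent_id, category)] += 1
        if self.soft_counts[(agent_id, category)] >= STRIKE_THRESHOLD:
            return "propose_upgrade"           # routes back to Leo for eval
        return "accumulate"
```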

Automatable CI Rules

Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).

Hard Gates

1. YAML Schema Validation

  • type field exists and equals claim
  • All required frontmatter fields present: type, domain, description, confidence, source, created
  • Domain value is one of the 14 valid domains
  • Confidence value is one of: proven, likely, experimental, speculative
  • Date format is valid ISO 8601
  • Pure syntax check — zero judgment needed
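A sketch of the gate over already-parsed frontmatter. YAML parsing (e.g. via PyYAML) is assumed to happen upstream, and the 14-domain list is passed in rather than hard-coded:

```python
from datetime import date

VALID_CONFIDENCE = {"proven", "likely", "experimental", "speculative"}
REQUIRED = {"type", "domain", "description", "confidence", "source", "created"}

def check_frontmatter(fm: dict, valid_domains: set[str]) -> list[str]:
    """Hard gate 1: pure syntax checks, zero judgment."""
    errors = [f"missing: {k}" for k in sorted(REQUIRED - fm.keys())]
    if fm.get("type") != "claim":
        errors.append("type must be 'claim'")
    if fm.get("domain") not in valid_domains:
        errors.append("invalid domain")
    if fm.get("confidence") not in VALID_CONFIDENCE:
        errors.append("invalid confidence")
    try:
        date.fromisoformat(str(fm.get("created", "")))
    except ValueError:
        errors.append("created is not ISO 8601")
    return errors
```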

2. Wiki Link Resolution

  • Every [[link]] in the body must resolve to an existing file at merge time
  • Includes links in the Relevant Notes section
  • Already policy, not yet enforced in CI
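A sketch of the link check; how note names map to files on disk is left as an implementation assumption:

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")  # target before any | alias or # anchor

def unresolved_links(body: str, existing: set[str]) -> list[str]:
    """Hard gate 2: every [[link]] must name an existing note at merge time."""
    return [t.strip() for t in WIKI_LINK.findall(body)
            if t.strip() not in existing]
```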

Soft Flags

3. Domain Validation

  • File path domain matches one of the 14 valid domains
  • Claim content plausibly belongs in that domain
  • Path check is automatable; content check needs light NLP or embedding similarity against domain centroids
  • Flag for reviewer if domain assignment seems wrong
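The automatable half is the path check. A sketch, assuming claims live under a `<domain>/...` layout (the content-plausibility check via domain centroids is out of scope here):

```python
def domain_flags(file_path: str, valid_domains: set[str]) -> list[str]:
    """Soft flag 3, path half: the path's leading domain segment must be valid."""
    domain = file_path.split("/", 1)[0]
    if domain not in valid_domains:
        return [f"unknown domain in path: {domain}"]
    return []
```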

4. OPSEC Scan

  • Regex for dollar amounts, percentage allocations, fund sizes, deal terms
  • Flag for human review, never auto-reject (false positive risk on dollar-sign patterns in technical content)
  • Standing directive from Cory: strict enforcement, but false positives on technical content create friction
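Illustrative patterns only; the real rule set would come from the standing directive, and matches only flag for human review:

```python
import re

# Hypothetical patterns, not the production rule set.
OPSEC_PATTERNS = [
    re.compile(r"\$\s?\d[\d,.]*\s*(?:[kKmMbB]\b|million|billion)?"),  # dollar amounts
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s?%\s*(?:allocation|of the fund)", re.I),
]

def opsec_flags(text: str) -> list[str]:
    """Soft flag 4: surface matches for human review; never auto-reject."""
    return [m.group(0) for p in OPSEC_PATTERNS for m in p.finditer(text)]
```

Note the false-positive trade-off the spec warns about: `$PATH` in technical content does not match (no digit after the dollar sign), but a literal `$100` in a code snippet would, which is why this stays a flag rather than a gate.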

5. Duplicate Detection

  • Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
  • Threshold: 0.92 universal — not per-domain tuning
  • Flag includes top-3 similar claims with scores so the reviewer can judge in context
  • The threshold is the attention trigger; reviewer judgment is the decision
  • If a domain consistently generates >50% false positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim type files are hard-filtered from duplicate detection — they're navigation aids, not claims.
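A sketch of the flag logic; plain cosine similarity over an in-memory dict stands in for the production Qdrant search, and non-claim types are assumed filtered before this call:

```python
import math

DUP_THRESHOLD = 0.92  # universal threshold from the spec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def duplicate_flag(new_vec, existing):
    """Soft flag 5: if the best match clears the threshold, return the
    top-3 similar claims with scores so the reviewer can judge in context.

    `existing` maps claim path -> embedding (text-embedding-3-small, 1536d
    in production; toy vectors here).
    """
    scored = sorted(((cosine(new_vec, v), path) for path, v in existing.items()),
                    reverse=True)
    top3 = [(round(s, 3), p) for s, p in scored[:3]]
    return top3 if top3 and top3[0][0] >= DUP_THRESHOLD else []
```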

Agent Self-Upgrade Criteria

When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:

  1. Scope compliance — Does the upgrade stay within the agent's authorized domain? Extraction agent improving YAML parsing: yes. Same agent adding merge capability: no.
  2. Measurable improvement — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
  3. Schema compliance preserved — Upgrade cannot break existing quality gates. Full validation suite runs against output produced by the new skill.
  4. Reversibility — Every skill change must be revertable. If not, the evidence bar goes up significantly.
  5. No scope creep — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a claim needs sourced evidence. A skill change needs a demonstrated performance delta — show the before, show the after, on real data, not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.

Retrieval Quality (Two-Pass System)

Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.

Two-Pass Architecture

  • Pass 1: Top 5 claims, similarity-descending sort
  • Pass 2 (expand): Top 10 claims, triggered when pass 1 is insufficient

Calibration Findings

  1. 5 first-pass claims is viable for all tested scenarios — but only if the 5 are well-chosen. Similarity ranking alone won't produce optimal results.

  2. Counter-evidence must be explicitly surfaced. Similarity-descending sort systematically buries opposing-valence claims. Counter-claims are semantically adjacent but have opposite valence. Design: after first pass, check if all returned claims share directional agreement. If yes, force-include the highest-similarity opposing claim.

  3. Synthesis claims suppress their source claims. If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with source list in frontmatter, filter at retrieval time. Bidirectional: if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (user query more specific than synthesis scope).

  4. Cross-domain claims earn inclusion only when causally load-bearing. Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't because they're a parallel domain, not a causal input.

  5. Domain maps and topic indices hard-filtered from retrieval results. Non-claim types (type: "map", indices) should be the first filter in the pipeline, before similarity ranking runs.

Valence Tagging

Tag claims with supports / challenges / neutral relative to query thesis at ingestion time. Lightweight, one-time cost per claim. Enables the counter-evidence surfacing logic without runtime sentiment analysis.
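The counter-evidence and synthesis-suppression findings combine into a first-pass sketch. The claim dict keys (`path`, `score`, `valence`, `sources`) are hypothetical, valence tags come from ingestion time, and non-claim types are assumed filtered upstream:

```python
def first_pass(claims, k=5):
    """Pass 1 sketch: top-k by similarity, with counter-evidence forcing
    and synthesis-source suppression."""
    ranked = sorted(claims, key=lambda c: c["score"], reverse=True)

    # Finding 3: a synthesis claim suppresses its lower-scoring source claims.
    suppressed = set()
    for c in ranked:
        for src in c.get("sources", []):
            src_claim = next((x for x in ranked if x["path"] == src), None)
            if src_claim and src_claim["score"] <= c["score"]:
                suppressed.add(src)
    pool = [c for c in ranked if c["path"] not in suppressed]

    result = pool[:k]
    # Finding 2: if all returned claims share directional agreement,
    # force-include the highest-similarity opposing claim.
    valences = {c["valence"] for c in result if c["valence"] != "neutral"}
    if len(valences) == 1:
        opposing = next((c for c in pool[k:]
                         if c["valence"] not in valences
                         and c["valence"] != "neutral"), None)
        if opposing:
            result = result[:-1] + [opposing]
    return result
```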

Verifier Divergence Implications

From NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., verifier reports "solved" while benchmark fails). Implication for multi-model eval: the second-model eval pass must check against the same rubric as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.

Implementation Sequence

  1. Automatable CI rules (hard gates first) — YAML schema validation + wiki link resolution. Foundation for everything else. References: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
  2. Automatable CI rules (soft flags) — domain validation, OPSEC scan, duplicate detection via Qdrant.
  3. Unified rejection record — data structure for both CI and human rejections, stored in pipeline.db.
  4. Rejection feedback loop — structured feedback to agents with 3-strikes accumulation.
  5. Multi-model eval integration — OpenRouter connection, rubric sharing, disagreement queue.
  6. Self-upgrade eval criteria — codified in eval workflow, triggered by 3-strikes pattern.

Design Principle

The constraint enforcement layer must be outside the agent being constrained. That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.


Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus. Created: 2026-03-31