# Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

## Problem

Kim et al. (ICML 2025): ~60% error agreement within same-model-family evaluations. Self-preference bias is linear with self-recognition. A single-model evaluator systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping; a multi-model pass captures errors a single model misses.

## Architecture

### Evaluation Sequence

1. **Leo evaluates first.** Verdict + reasoning stored as structured record.
2. **Second model evaluates independently** against the same rubric. A different model family is required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
3. **System surfaces disagreements only.** Agreements are noise; disagreements are signal.
4. **Leo makes final call** on all disagreements.

Sequencing rationale: Leo sees the second model's assessment **after** his own eval, never before. Seeing it before anchors his judgment; seeing it after functions as a genuine blind-spot check.

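The four-step sequence above can be sketched as follows. This is an illustrative stand-in, not the implemented API: the `Verdict` type, function names, and toy evaluators are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    evaluator: str
    verdict: str       # "accept" or a rejection category
    reasoning: str

def run_eval(claim, leo_eval, second_eval):
    """Leo evaluates first; the second model runs independently (steps 1-2).
    Only disagreements are surfaced (step 3); Leo resolves them (step 4)."""
    leo = leo_eval(claim)        # step 1: stored before the second opinion exists
    other = second_eval(claim)   # step 2: same rubric, different model family
    if leo.verdict == other.verdict:
        return leo, None         # step 3: agreement is noise, no queue entry
    return leo, other            # disagreement goes to Leo's review queue

# Toy stand-ins for the two evaluators:
leo_fn = lambda c: Verdict("leo", "accept", "well sourced")
gpt_fn = lambda c: Verdict("second_model", "weak_evidence", "single source")
final, disagreement = run_eval("example claim", leo_fn, gpt_fn)
```

Note that `run_eval` returns Leo's verdict either way; the second opinion only ever adds a queue entry, never overrides.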
### Second Model Selection

Requirements:

- Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
- Access via the OpenRouter API (single integration point)
- Must receive the same rubric and claim content as Leo
- Must output a structured verdict in the same format

### Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs. reject, or different rejection categories).

Disagreements surface in a review queue that Leo checks before finalizing. Each disagreement record includes:

- Leo's verdict + reasoning
- Second model's verdict + reasoning
- The specific claim and PR context
- Which evaluation criteria they diverge on

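A minimal sketch of assembling that record; the field names mirror the bullet list above, but the function signature and dictionary shape are assumptions, not the stored schema.

```python
def disagreement_record(claim, pr, leo, second, diverging_criteria):
    """Build a review-queue entry with the four pieces listed above.
    `leo` and `second` are (verdict, reasoning) pairs (assumed shape)."""
    return {
        "claim": claim,                  # the specific claim text
        "pr": pr,                        # PR context
        "leo": {"verdict": leo[0], "reasoning": leo[1]},
        "second_model": {"verdict": second[0], "reasoning": second[1]},
        "diverging_criteria": diverging_criteria,
    }
```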
### Calibration Metrics

Track disagreement rate over time:

- **Below ~10%:** System is working. Evaluators are calibrated.
- **10-25%:** Normal operating range. Disagreements are productive signal.
- **Above ~25%:** Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger rubric review.

Disagreement rate itself becomes the primary calibration metric for evaluation quality.

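The three bands reduce to a small classifier. The thresholds are the ~10% and ~25% marks from the list above; the function and band names are ours.

```python
def calibration_band(disagreements: int, total: int) -> str:
    """Map a disagreement rate onto the three bands above."""
    rate = disagreements / total
    if rate < 0.10:
        return "calibrated"       # system working
    if rate <= 0.25:
        return "normal"           # productive signal
    return "review_rubric"        # ambiguous rubric or evaluator drift

calibration_band(3, 50)   # 6% → "calibrated"
```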
## Unified Rejection Record

Single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

```json
{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "<producer of the rejected content>",
  "pr": "<PR number>",
  "file": "<file path in PR>",
  "claim_path": "<claim file path if different from file>",
  "detail": "<free text explanation>",
  "timestamp": "<ISO 8601>"
}
```

Field notes:

- `source`: `ci` for automated gates, `evaluator` for Leo, `second_model` for the disagreement-check model
- `severity`: `hard` = merge blocker (schema_violation, wiki_link_broken); `soft` = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger an immediate fix-and-resubmit attempt; soft rejections accumulate toward the 3-strikes upgrade threshold.
- `claim_path`: kept separate from `file` to handle multi-file enrichment PRs where only one file has the issue
- `category`: taxonomy covers ~80% of rejection causes, based on ~400 PR reviews

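A minimal validator for the record, with the enum values taken from the schema above; the helper name and error strings are ours, and the real pipeline may enforce more (e.g. timestamp parsing).

```python
# Enum values come from the unified rejection record schema above.
REQUIRED = {"source", "category", "severity", "agent_id", "pr",
            "file", "detail", "timestamp"}          # claim_path is optional
SOURCES = {"ci", "evaluator", "second_model"}
SEVERITIES = {"hard", "soft"}
CATEGORIES = {"schema_violation", "wiki_link_broken", "weak_evidence",
              "scope_mismatch", "factual_error", "precision_failure",
              "opsec_violation"}

def validate_rejection(rec: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    errors = ["missing field: " + f for f in sorted(REQUIRED - rec.keys())]
    if rec.get("source") not in SOURCES:
        errors.append("bad source")
    if rec.get("category") not in CATEGORIES:
        errors.append("bad category")
    if rec.get("severity") not in SEVERITIES:
        errors.append("bad severity")
    return errors
```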
### Rejection Feedback Loop

1. Rejection records flow to the producing agent as structured feedback.
2. Agent receives the category, severity, and detail.
3. Hard rejections → agent attempts immediate fix and resubmission.
4. Soft rejections → agent accumulates feedback. **After 3 rejections of the same category from the same agent**, the system triggers a skill upgrade proposal.
5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.

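The 3-strikes accumulation logic is small enough to sketch directly. The class name is hypothetical; the per-`(agent, category)` counting and the soft/hard split follow the loop above.

```python
from collections import Counter

class StrikeTracker:
    """Count soft rejections per (agent, category) pair; hypothetical helper."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, agent_id: str, category: str, severity: str) -> bool:
        """Return True when a skill-upgrade proposal should be triggered."""
        if severity != "soft":
            return False   # hard rejections go to immediate resubmission instead
        self.counts[(agent_id, category)] += 1
        return self.counts[(agent_id, category)] >= self.threshold
```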
## Automatable CI Rules

Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).

### Hard Gates

**1. YAML Schema Validation**

- `type` field exists and equals `claim`
- All required frontmatter fields present: type, domain, description, confidence, source, created
- Domain value is one of the 14 valid domains
- Confidence value is one of: proven, likely, experimental, speculative
- Date format is valid ISO 8601
- Pure syntax check — zero judgment needed

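A sketch of the gate, assuming flat `key: value` frontmatter so the stdlib suffices (no PyYAML). The required fields and confidence values come from the checklist above; the two-entry `DOMAINS` set is a stand-in for the real list of 14.

```python
import re
from datetime import date

REQUIRED = ["type", "domain", "description", "confidence", "source", "created"]
CONFIDENCE = {"proven", "likely", "experimental", "speculative"}
DOMAINS = {"compute-governance", "power-infrastructure"}  # stand-in for the 14

def validate_claim(text: str) -> list:
    """Naive frontmatter check: flat `key: value` pairs between --- fences."""
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return ["no frontmatter block"]
    fields = dict(line.split(":", 1) for line in m.group(1).splitlines() if ":" in line)
    fields = {k.strip(): v.strip() for k, v in fields.items()}
    errors = ["missing: " + f for f in REQUIRED if f not in fields]
    if fields.get("type") != "claim":
        errors.append("type must be 'claim'")
    if "domain" in fields and fields["domain"] not in DOMAINS:
        errors.append("unknown domain")
    if "confidence" in fields and fields["confidence"] not in CONFIDENCE:
        errors.append("bad confidence value")
    if "created" in fields:
        try:
            date.fromisoformat(fields["created"])
        except ValueError:
            errors.append("created is not ISO 8601")
    return errors
```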
**2. Wiki Link Resolution**

- Every `[[link]]` in the body must resolve to an existing file at merge time
- Includes links in the `Relevant Notes` section
- Already policy, not yet enforced in CI

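A sketch of the link gate. The resolution rule is an assumption (a target resolves if it matches any `.md` filename in the tree); the real vault may use different rules for aliases or subpaths.

```python
import re
from pathlib import Path

# Capture the target of [[target]], [[target|label]], or [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(body: str, vault: Path) -> list:
    """Return wiki-link targets that resolve to no .md file under `vault`."""
    known = {p.stem for p in vault.rglob("*.md")}
    return [t.strip() for t in WIKI_LINK.findall(body) if t.strip() not in known]
```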
### Soft Flags

**3. Domain Validation**

- File path domain matches one of the 14 valid domains
- Claim content plausibly belongs in that domain
- Path check is automatable; content check needs light NLP or embedding similarity against domain centroids
- Flag for reviewer if domain assignment seems wrong

**4. OPSEC Scan**

- Regex for dollar amounts, percentage allocations, fund sizes, deal terms
- Flag for human review, never auto-reject (false-positive risk on dollar-sign patterns in technical content)
- Standing directive from Cory: strict enforcement, but false positives on technical content create friction

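Two illustrative patterns for the scan; the real rule set is broader (fund sizes, deal terms) and reviewer-tuned. These regexes are assumptions, shown only to make the flag-don't-reject behavior concrete.

```python
import re

OPSEC_PATTERNS = [
    re.compile(r"\$\s?\d[\d,.]*\s*(?:[mMbBkK]\b|million|billion)?"),  # dollar amounts
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s?%"),                           # percentage allocations
]

def opsec_flags(text: str) -> list:
    """Return matched spans for human review; never auto-reject on these,
    since dollar-sign patterns in technical content are false positives."""
    return [m.group(0) for p in OPSEC_PATTERNS for m in p.finditer(text)]
```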
**5. Duplicate Detection**

- Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
- **Threshold: 0.92 universal** — not per-domain tuning
- Flag includes **top-3 similar claims with scores** so the reviewer can judge in context
- The threshold is the attention trigger; reviewer judgment is the decision
- If a domain consistently generates >50% false-positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim type files are hard-filtered from duplicate detection — they're navigation aids, not claims.

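A pure-Python stand-in for the Qdrant query, showing the 0.92 threshold and top-3 context in one place. In production the scoring happens inside Qdrant over 1536-d vectors; the toy vectors and function names here are ours.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def duplicate_flags(new_vec, existing, threshold=0.92, top_k=3):
    """Score every same-domain claim, return the top-k neighbours with scores
    and whether the best one crosses the universal 0.92 threshold."""
    scored = sorted(((cosine(new_vec, v), cid) for cid, v in existing.items()),
                    reverse=True)[:top_k]
    return {"flag": bool(scored) and scored[0][0] >= threshold,
            "top": [(cid, round(s, 3)) for s, cid in scored]}
```

The flag is only the attention trigger; the `top` list gives the reviewer the context to make the actual call.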
## Agent Self-Upgrade Criteria

When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:

1. **Scope compliance** — Does the upgrade stay within the agent's authorized domain? An extraction agent improving YAML parsing: yes. The same agent adding merge capability: no.
2. **Measurable improvement** — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
3. **Schema compliance preserved** — The upgrade cannot break existing quality gates. The full validation suite runs against output produced by the new skill.
4. **Reversibility** — Every skill change must be revertible. If it is not, the evidence bar goes up significantly.
5. **No scope creep** — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a **claim** needs sourced evidence. A **skill change** needs a **demonstrated performance delta** — show the before, show the after, on real data, not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.

## Retrieval Quality (Two-Pass System)

Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.

### Two-Pass Architecture

- **Pass 1:** Top 5 claims, similarity-descending sort
- **Pass 2 (expand):** Top 10 claims, triggered when pass 1 is insufficient

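The two-pass shape can be sketched in a few lines. The insufficiency test is deliberately left to the caller here, since the spec doesn't pin down the trigger; the signature is an assumption.

```python
def retrieve(query_vec, claims, score_fn, insufficient):
    """Two-pass retrieval: top-5 first, expand to top-10 when the caller's
    `insufficient` predicate judges pass 1 inadequate."""
    ranked = sorted(claims, key=lambda c: score_fn(query_vec, c), reverse=True)
    first = ranked[:5]                        # pass 1
    return ranked[:10] if insufficient(first) else first   # pass 2 (expand)
```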
### Calibration Findings

1. **5 first-pass claims is viable for all tested scenarios** — but only if the 5 are well chosen. Similarity ranking alone won't produce optimal results.

2. **Counter-evidence must be explicitly surfaced.** Similarity-descending sort systematically buries opposing-valence claims: counter-claims are semantically adjacent but have opposite valence. Design: after the first pass, check whether all returned claims share directional agreement. If yes, force-include the highest-similarity opposing claim.

3. **Synthesis claims suppress their source claims.** If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with a source list in frontmatter, filter at retrieval time. **Bidirectional:** if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (the user query is more specific than the synthesis scope).

4. **Cross-domain claims earn inclusion only when causally load-bearing.** Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't, because they're a parallel domain, not a causal input.

5. **Domain maps and topic indices are hard-filtered from retrieval results.** Non-claim types (`type: "map"`, indices) should be the first filter in the pipeline, before similarity ranking runs.

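Finding 2 as code: if the top-n all share one valence, swap the lowest-ranked slot for the highest-similarity opposing claim. Claims here are dicts with a `valence` field in `{'supports', 'challenges', 'neutral'}`, pre-sorted by similarity descending; the field name and function shape are assumptions.

```python
def with_counter_evidence(ranked, top_n=5):
    """Force-include the highest-similarity opposing claim when the top-n
    share directional agreement (`ranked` is similarity-descending)."""
    top = ranked[:top_n]
    valences = {c["valence"] for c in top if c["valence"] != "neutral"}
    if len(valences) > 1:
        return top   # already mixed, nothing to do
    opposing = next((c for c in ranked[top_n:]
                     if c["valence"] not in valences and c["valence"] != "neutral"),
                    None)
    return top[:-1] + [opposing] if opposing else top
```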
### Valence Tagging

Tag claims with `supports` / `challenges` / `neutral` relative to the query thesis at ingestion time. Lightweight, one-time cost per claim. Enables the counter-evidence surfacing logic without runtime sentiment analysis.

## Verifier Divergence Implications

From the NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., the verifier reports "solved" while the benchmark fails). Implication for multi-model eval: the second-model pass must check against the **same rubric** as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.

## Implementation Sequence

1. **Automatable CI rules** (hard gates first) — YAML schema validation + wiki link resolution. Foundation for everything else. Reference: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
2. **Automatable CI rules** (soft flags) — domain validation, OPSEC scan, duplicate detection via Qdrant.
3. **Unified rejection record** — data structure for both CI and human rejections, stored in pipeline.db.
4. **Rejection feedback loop** — structured feedback to agents with 3-strikes accumulation.
5. **Multi-model eval integration** — OpenRouter connection, rubric sharing, disagreement queue.
6. **Self-upgrade eval criteria** — codified in the eval workflow, triggered by the 3-strikes pattern.

## Evaluator Self-Review Prevention

When Leo proposes claims (cross-domain synthesis, foundations-level):

- Leo cannot be the evaluator on his own proposals
- Minimum 2 domain agent reviews required
- Every domain touched must have a reviewer from that domain
- The second-model eval pass still runs (provides the external check)
- Cory has veto (rollback) authority as the final backstop

This closes the obvious gap: the spec defines the integrity layer but doesn't protect against the integrity layer's own blind spots. The constraint enforcement principle must apply to the constrainer too.

## Design Principle

The constraint enforcement layer must be **outside** the agent being constrained. That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.

---

*Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus.*

*Created: 2026-03-31*