diff --git a/ops/multi-model-eval-architecture.md b/ops/multi-model-eval-architecture.md
new file mode 100644
index 00000000..45d0c0c8
--- /dev/null
+++ b/ops/multi-model-eval-architecture.md
@@ -0,0 +1,192 @@

# Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

## Problem

Kim et al. (ICML 2025): ~60% error agreement within same-model-family evaluations. Self-preference bias scales linearly with self-recognition. A single-model evaluator systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping — multi-model evaluation captures this.

## Architecture

### Evaluation Sequence

1. **Leo evaluates first.** Verdict + reasoning stored as a structured record.
2. **Second model evaluates independently** against the same rubric. Different model family required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
3. **System surfaces disagreements only.** Agreements are noise; disagreements are signal.
4. **Leo makes the final call** on all disagreements.

Sequencing rationale: Leo sees the second model's assessment **after** his own eval, never before. Seeing it before anchors judgment. Seeing it after functions as a genuine blind-spot check.

### Second Model Selection

Requirements:
- Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
- Access via OpenRouter API (single integration point)
- Must receive the same rubric and claim content as Leo
- Must output a structured verdict in the same format

### Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs. reject, or different rejection categories).

Disagreements surface in a review queue Leo checks before finalizing.
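The surfacing check can be sketched as a straight comparison of the two structured verdicts. A minimal sketch, assuming plain dicts with illustrative field names (`verdict`, `category`, `reasoning` — not a prescribed schema):

```python
from typing import Optional

def find_disagreement(leo: dict, second: dict) -> Optional[dict]:
    """Return a disagreement record if the two verdicts diverge, else None.

    A disagreement is a different verdict (accept vs. reject) or a
    different rejection category. Field names are illustrative.
    """
    same_verdict = leo["verdict"] == second["verdict"]
    same_category = leo.get("category") == second.get("category")
    if same_verdict and same_category:
        return None  # agreement is noise; don't queue it
    return {
        "claim": leo["claim"],
        "leo": {"verdict": leo["verdict"], "reasoning": leo["reasoning"]},
        "second_model": {"verdict": second["verdict"], "reasoning": second["reasoning"]},
    }
```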
Each disagreement record includes:
- Leo's verdict + reasoning
- Second model's verdict + reasoning
- The specific claim and PR context
- Which evaluation criteria they diverge on

### Calibration Metrics

Track the disagreement rate over time:
- **Below ~10%:** System is working. Evaluators are calibrated.
- **10-25%:** Normal operating range. Disagreements are productive signal.
- **Above ~25%:** Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger a rubric review.

The disagreement rate itself becomes the primary calibration metric for evaluation quality.

## Unified Rejection Record

A single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

```json
{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "",
  "pr": "",
  "file": "",
  "claim_path": "",
  "detail": "",
  "timestamp": ""
}
```

Field notes:
- `source`: `ci` for automated gates, `evaluator` for Leo, `second_model` for the disagreement-check model
- `severity`: `hard` = merge blocker (schema_violation, wiki_link_broken); `soft` = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger an immediate resubmission attempt. Soft rejections accumulate toward the 3-strikes upgrade threshold.
- `claim_path` is kept separate from `file` to handle multi-file enrichment PRs where only one file has the issue
- The `category` taxonomy covers ~80% of rejection causes, based on ~400 PR reviews

### Rejection Feedback Loop

1. Rejection records flow to the producing agent as structured feedback.
2. The agent receives the category, severity, and detail.
3. Hard rejections → the agent attempts an immediate fix and resubmission.
4. Soft rejections → the agent accumulates feedback.
**After 3 rejections of the same category from the same agent**, the system triggers a skill upgrade proposal.
5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.

## Automatable CI Rules

Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).

### Hard Gates

**1. YAML Schema Validation**
- `type` field exists and equals `claim`
- All required frontmatter fields present: type, domain, description, confidence, source, created
- Domain value is one of the 14 valid domains
- Confidence value is one of: proven, likely, experimental, speculative
- Date format is valid ISO 8601
- Pure syntax check — zero judgment needed

**2. Wiki Link Resolution**
- Every `[[link]]` in the body must resolve to an existing file at merge time
- Includes links in the `Relevant Notes` section
- Already policy, but not yet enforced in CI

### Soft Flags

**3. Domain Validation**
- File path domain matches one of the 14 valid domains
- Claim content plausibly belongs in that domain
- The path check is automatable; the content check needs light NLP or embedding similarity against domain centroids
- Flag for reviewer if the domain assignment seems wrong

**4. OPSEC Scan**
- Regex for dollar amounts, percentage allocations, fund sizes, deal terms
- Flag for human review, never auto-reject (false-positive risk on dollar-sign patterns in technical content)
- Standing directive from Cory: strict enforcement, but false positives on technical content create friction

**5. Duplicate Detection**
- Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
- **Threshold: 0.92 universal** — not per-domain tuning
- Flag includes the **top-3 similar claims with scores** so the reviewer can judge in context
- The threshold is the attention trigger; reviewer judgment is the decision
- If a domain consistently generates >50% false-positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim-type files are hard-filtered from duplicate detection — they're navigation aids, not claims.

## Agent Self-Upgrade Criteria

When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:

1. **Scope compliance** — Does the upgrade stay within the agent's authorized domain? An extraction agent improving its YAML parsing: yes. The same agent adding merge capability: no.
2. **Measurable improvement** — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
3. **Schema compliance preserved** — The upgrade cannot break existing quality gates. The full validation suite runs against output produced by the new skill.
4. **Reversibility** — Every skill change must be revertible. If not, the evidence bar goes up significantly.
5. **No scope creep** — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a **claim** needs sourced evidence. A **skill change** needs a **demonstrated performance delta** — show the before, show the after, on real data, not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.
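The evidence bar in criterion 2 is mechanically checkable. A minimal sketch of that gate, with hypothetical names (the spec doesn't prescribe an implementation):

```python
from dataclasses import dataclass

@dataclass
class SkillTestCase:
    name: str
    before: float  # metric before the skill change (higher is better)
    after: float   # metric after the skill change

def meets_evidence_bar(cases: list[SkillTestCase], min_cases: int = 3) -> bool:
    """Criterion 2: at least 3 test cases showing improvement, 0 regressions."""
    if len(cases) < min_cases:
        return False
    improved = sum(1 for c in cases if c.after > c.before)
    regressed = any(c.after < c.before for c in cases)
    return improved >= min_cases and not regressed
```

A proposal failing this gate never reaches Leo's queue; a proposal passing it still goes through the full criteria list above.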
## Retrieval Quality (Two-Pass System)

Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.

### Two-Pass Architecture

- **Pass 1:** Top 5 claims, similarity-descending sort
- **Pass 2 (expand):** Top 10 claims, triggered when pass 1 is insufficient

### Calibration Findings

1. **5 first-pass claims is viable for all tested scenarios** — but only if the 5 are well-chosen. Similarity ranking alone won't produce optimal results.

2. **Counter-evidence must be explicitly surfaced.** A similarity-descending sort systematically buries opposing-valence claims: counter-claims are semantically adjacent but have opposite valence. Design: after the first pass, check whether all returned claims share directional agreement. If so, force-include the highest-similarity opposing claim.

3. **Synthesis claims suppress their source claims.** If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with a source list in frontmatter and filter at retrieval time. **Bidirectional:** if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (the user query is more specific than the synthesis scope).

4. **Cross-domain claims earn inclusion only when causally load-bearing.** Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't, because they're a parallel domain, not a causal input.

5. **Domain maps and topic indices are hard-filtered from retrieval results.** Non-claim types (`type: "map"`, indices) should be the first filter in the pipeline, before similarity ranking runs.

### Valence Tagging

Tag claims with `supports` / `challenges` / `neutral` relative to the query thesis at ingestion time. Lightweight, one-time cost per claim.
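With valence tags in place, the force-include rule from calibration finding 2 reduces to a set check over the first pass. A sketch under assumed structure (`first_pass` and the result dict shape are illustrative, not the real schema):

```python
def first_pass(results: list[dict], k: int = 5) -> list[dict]:
    """Pass 1: top-k by similarity, with forced counter-evidence inclusion.

    Each result: {"claim": str, "score": float,
                  "valence": "supports" | "challenges" | "neutral"},
    with valence assigned at ingestion time.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    top = ranked[:k]
    # Finding 2: if every non-neutral claim in the top-k points the same
    # way, swap the lowest slot for the best-scoring opposing claim.
    valences = {r["valence"] for r in top if r["valence"] != "neutral"}
    if len(valences) == 1:
        opposing = "challenges" if "supports" in valences else "supports"
        counter = next((r for r in ranked[k:] if r["valence"] == opposing), None)
        if counter:
            top[-1] = counter
    return top
```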
Valence tags enable the counter-evidence surfacing logic without runtime sentiment analysis.

## Verifier Divergence Implications

From the NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., the verifier reports "solved" while the benchmark fails). Implication for multi-model eval: the second-model pass must check against the **same rubric** as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.

## Implementation Sequence

1. **Automatable CI rules (hard gates first)** — YAML schema validation + wiki link resolution. Foundation for everything else. Reference: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
2. **Automatable CI rules (soft flags)** — domain validation, OPSEC scan, duplicate detection via Qdrant.
3. **Unified rejection record** — data structure for both CI and human rejections, stored in pipeline.db.
4. **Rejection feedback loop** — structured feedback to agents with 3-strikes accumulation.
5. **Multi-model eval integration** — OpenRouter connection, rubric sharing, disagreement queue.
6. **Self-upgrade eval criteria** — codified in the eval workflow, triggered by the 3-strikes pattern.

## Evaluator Self-Review Prevention

When Leo proposes claims (cross-domain synthesis, foundations-level):
- Leo cannot be the evaluator on his own proposals
- A minimum of 2 domain agent reviews is required
- Every domain touched must have a reviewer from that domain
- The second-model eval pass still runs (provides the external check)
- Cory has veto (rollback) authority as the final backstop

This closes the obvious gap: the spec defines the integrity layer but doesn't protect against the integrity layer's own blind spots. The constraint enforcement principle must apply to the constrainer too.

## Design Principle

The constraint enforcement layer must be **outside** the agent being constrained.
That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.

---

*Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus.*
*Created: 2026-03-31*