# Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

## Problem

Kim et al. (ICML 2025): ~60% error agreement within same-model-family evaluations. Self-preference bias is linear with self-recognition. A single-model evaluator systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping; a multi-model pass captures errors a single model misses.

## Architecture

### Evaluation Sequence

1. **Leo evaluates first.** Verdict + reasoning stored as structured record.
2. **Second model evaluates independently** against the same rubric. A different model family is required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
3. **System surfaces disagreements only.** Agreements are noise; disagreements are signal.
4. **Leo makes final call** on all disagreements.

Sequencing rationale: Leo sees the second model's assessment **after** his own eval, never before. Seeing it before anchors his judgment; seeing it after functions as a genuine blind-spot check.

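The four-step sequence above can be sketched as follows. This is an illustrative stand-in, not the implemented API: the `Verdict` type, function names, and toy evaluators are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    evaluator: str
    verdict: str       # "accept" or a rejection category
    reasoning: str

def run_eval(claim, leo_eval, second_eval):
    """Leo evaluates first; the second model runs independently (steps 1-2).
    Only disagreements are surfaced (step 3); Leo resolves them (step 4)."""
    leo = leo_eval(claim)        # step 1: stored before the second opinion exists
    other = second_eval(claim)   # step 2: same rubric, different model family
    if leo.verdict == other.verdict:
        return leo, None         # step 3: agreement is noise, no queue entry
    return leo, other            # disagreement goes to Leo's review queue

# Toy stand-ins for the two evaluators:
leo_fn = lambda c: Verdict("leo", "accept", "well sourced")
gpt_fn = lambda c: Verdict("second_model", "weak_evidence", "single source")
final, disagreement = run_eval("example claim", leo_fn, gpt_fn)
```

Note that `run_eval` returns Leo's verdict either way; the second opinion only ever adds a queue entry, never overrides.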
### Second Model Selection

Requirements:

- Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
- Access via the OpenRouter API (single integration point)
- Must receive the same rubric and claim content as Leo
- Must output a structured verdict in the same format

### Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs. reject, or different rejection categories).

Disagreements surface in a review queue that Leo checks before finalizing. Each disagreement record includes:

- Leo's verdict + reasoning
- Second model's verdict + reasoning
- The specific claim and PR context
- Which evaluation criteria they diverge on

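A minimal sketch of assembling that record; the field names mirror the bullet list above, but the function signature and dictionary shape are assumptions, not the stored schema.

```python
def disagreement_record(claim, pr, leo, second, diverging_criteria):
    """Build a review-queue entry with the four pieces listed above.
    `leo` and `second` are (verdict, reasoning) pairs (assumed shape)."""
    return {
        "claim": claim,                  # the specific claim text
        "pr": pr,                        # PR context
        "leo": {"verdict": leo[0], "reasoning": leo[1]},
        "second_model": {"verdict": second[0], "reasoning": second[1]},
        "diverging_criteria": diverging_criteria,
    }
```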
### Calibration Metrics

Track disagreement rate over time:

- **Below ~10%:** System is working. Evaluators are calibrated.
- **10-25%:** Normal operating range. Disagreements are productive signal.
- **Above ~25%:** Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger rubric review.

Disagreement rate itself becomes the primary calibration metric for evaluation quality.

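The three bands reduce to a small classifier. The thresholds are the ~10% and ~25% marks from the list above; the function and band names are ours.

```python
def calibration_band(disagreements: int, total: int) -> str:
    """Map a disagreement rate onto the three bands above."""
    rate = disagreements / total
    if rate < 0.10:
        return "calibrated"       # system working
    if rate <= 0.25:
        return "normal"           # productive signal
    return "review_rubric"        # ambiguous rubric or evaluator drift

calibration_band(3, 50)   # 6% → "calibrated"
```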
## Unified Rejection Record

Single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

```json
{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "<producer of the rejected content>",
  "pr": "<PR number>",
  "file": "<file path in PR>",
  "claim_path": "<claim file path if different from file>",
  "detail": "<free text explanation>",
  "timestamp": "<ISO 8601>"
}
```

Field notes:

- `source`: `ci` for automated gates, `evaluator` for Leo, `second_model` for the disagreement-check model
- `severity`: `hard` = merge blocker (schema_violation, wiki_link_broken); `soft` = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger an immediate fix-and-resubmit attempt; soft rejections accumulate toward the 3-strikes upgrade threshold.
- `claim_path`: kept separate from `file` to handle multi-file enrichment PRs where only one file has the issue
- `category`: taxonomy covers ~80% of rejection causes, based on ~400 PR reviews

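A minimal validator for the record, with the enum values taken from the schema above; the helper name and error strings are ours, and the real pipeline may enforce more (e.g. timestamp parsing).

```python
# Enum values come from the unified rejection record schema above.
REQUIRED = {"source", "category", "severity", "agent_id", "pr",
            "file", "detail", "timestamp"}          # claim_path is optional
SOURCES = {"ci", "evaluator", "second_model"}
SEVERITIES = {"hard", "soft"}
CATEGORIES = {"schema_violation", "wiki_link_broken", "weak_evidence",
              "scope_mismatch", "factual_error", "precision_failure",
              "opsec_violation"}

def validate_rejection(rec: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    errors = ["missing field: " + f for f in sorted(REQUIRED - rec.keys())]
    if rec.get("source") not in SOURCES:
        errors.append("bad source")
    if rec.get("category") not in CATEGORIES:
        errors.append("bad category")
    if rec.get("severity") not in SEVERITIES:
        errors.append("bad severity")
    return errors
```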
### Rejection Feedback Loop

1. Rejection records flow to the producing agent as structured feedback.
2. Agent receives the category, severity, and detail.
3. Hard rejections → agent attempts immediate fix and resubmission.
4. Soft rejections → agent accumulates feedback. **After 3 rejections of the same category from the same agent**, the system triggers a skill upgrade proposal.
5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.

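The 3-strikes accumulation logic is small enough to sketch directly. The class name is hypothetical; the per-`(agent, category)` counting and the soft/hard split follow the loop above.

```python
from collections import Counter

class StrikeTracker:
    """Count soft rejections per (agent, category) pair; hypothetical helper."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, agent_id: str, category: str, severity: str) -> bool:
        """Return True when a skill-upgrade proposal should be triggered."""
        if severity != "soft":
            return False   # hard rejections go to immediate resubmission instead
        self.counts[(agent_id, category)] += 1
        return self.counts[(agent_id, category)] >= self.threshold
```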
## Automatable CI Rules

Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).

### Hard Gates

**1. YAML Schema Validation**

- `type` field exists and equals `claim`
- All required frontmatter fields present: type, domain, description, confidence, source, created
- Domain value is one of the 14 valid domains
- Confidence value is one of: proven, likely, experimental, speculative
- Date format is valid ISO 8601
- Pure syntax check — zero judgment needed

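A sketch of the gate, assuming flat `key: value` frontmatter so the stdlib suffices (no PyYAML). The required fields and confidence values come from the checklist above; the two-entry `DOMAINS` set is a stand-in for the real list of 14.

```python
import re
from datetime import date

REQUIRED = ["type", "domain", "description", "confidence", "source", "created"]
CONFIDENCE = {"proven", "likely", "experimental", "speculative"}
DOMAINS = {"compute-governance", "power-infrastructure"}  # stand-in for the 14

def validate_claim(text: str) -> list:
    """Naive frontmatter check: flat `key: value` pairs between --- fences."""
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return ["no frontmatter block"]
    fields = dict(line.split(":", 1) for line in m.group(1).splitlines() if ":" in line)
    fields = {k.strip(): v.strip() for k, v in fields.items()}
    errors = ["missing: " + f for f in REQUIRED if f not in fields]
    if fields.get("type") != "claim":
        errors.append("type must be 'claim'")
    if "domain" in fields and fields["domain"] not in DOMAINS:
        errors.append("unknown domain")
    if "confidence" in fields and fields["confidence"] not in CONFIDENCE:
        errors.append("bad confidence value")
    if "created" in fields:
        try:
            date.fromisoformat(fields["created"])
        except ValueError:
            errors.append("created is not ISO 8601")
    return errors
```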
**2. Wiki Link Resolution**

- Every `[[link]]` in the body must resolve to an existing file at merge time
- Includes links in the `Relevant Notes` section
- Already policy, not yet enforced in CI

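A sketch of the link gate. The resolution rule is an assumption (a target resolves if it matches any `.md` filename in the tree); the real vault may use different rules for aliases or subpaths.

```python
import re
from pathlib import Path

# Capture the target of [[target]], [[target|label]], or [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(body: str, vault: Path) -> list:
    """Return wiki-link targets that resolve to no .md file under `vault`."""
    known = {p.stem for p in vault.rglob("*.md")}
    return [t.strip() for t in WIKI_LINK.findall(body) if t.strip() not in known]
```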
### Soft Flags

**3. Domain Validation**

- File path domain matches one of the 14 valid domains
- Claim content plausibly belongs in that domain
- Path check is automatable; content check needs light NLP or embedding similarity against domain centroids
- Flag for reviewer if domain assignment seems wrong

**4. OPSEC Scan**

- Regex for dollar amounts, percentage allocations, fund sizes, deal terms
- Flag for human review, never auto-reject (false-positive risk on dollar-sign patterns in technical content)
- Standing directive from Cory: strict enforcement, but false positives on technical content create friction

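Two illustrative patterns for the scan; the real rule set is broader (fund sizes, deal terms) and reviewer-tuned. These regexes are assumptions, shown only to make the flag-don't-reject behavior concrete.

```python
import re

OPSEC_PATTERNS = [
    re.compile(r"\$\s?\d[\d,.]*\s*(?:[mMbBkK]\b|million|billion)?"),  # dollar amounts
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s?%"),                           # percentage allocations
]

def opsec_flags(text: str) -> list:
    """Return matched spans for human review; never auto-reject on these,
    since dollar-sign patterns in technical content are false positives."""
    return [m.group(0) for p in OPSEC_PATTERNS for m in p.finditer(text)]
```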
**5. Duplicate Detection**

- Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
- **Threshold: 0.92 universal** — not per-domain tuning
- Flag includes **top-3 similar claims with scores** so the reviewer can judge in context
- The threshold is the attention trigger; reviewer judgment is the decision
- If a domain consistently generates >50% false-positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim type files are hard-filtered from duplicate detection — they're navigation aids, not claims.

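A pure-Python stand-in for the Qdrant query, showing the 0.92 threshold and top-3 context in one place. In production the scoring happens inside Qdrant over 1536-d vectors; the toy vectors and function names here are ours.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def duplicate_flags(new_vec, existing, threshold=0.92, top_k=3):
    """Score every same-domain claim, return the top-k neighbours with scores
    and whether the best one crosses the universal 0.92 threshold."""
    scored = sorted(((cosine(new_vec, v), cid) for cid, v in existing.items()),
                    reverse=True)[:top_k]
    return {"flag": bool(scored) and scored[0][0] >= threshold,
            "top": [(cid, round(s, 3)) for s, cid in scored]}
```

The flag is only the attention trigger; the `top` list gives the reviewer the context to make the actual call.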
## Agent Self-Upgrade Criteria

When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:

1. **Scope compliance** — Does the upgrade stay within the agent's authorized domain? An extraction agent improving YAML parsing: yes. The same agent adding merge capability: no.
2. **Measurable improvement** — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
3. **Schema compliance preserved** — The upgrade cannot break existing quality gates. The full validation suite runs against output produced by the new skill.
4. **Reversibility** — Every skill change must be revertible. If it is not, the evidence bar goes up significantly.
5. **No scope creep** — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a **claim** needs sourced evidence. A **skill change** needs a **demonstrated performance delta** — show the before, show the after, on real data, not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.

## Retrieval Quality (Two-Pass System)

Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.

### Two-Pass Architecture

- **Pass 1:** Top 5 claims, similarity-descending sort
- **Pass 2 (expand):** Top 10 claims, triggered when pass 1 is insufficient

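The two-pass shape can be sketched in a few lines. The insufficiency test is deliberately left to the caller here, since the spec doesn't pin down the trigger; the signature is an assumption.

```python
def retrieve(query_vec, claims, score_fn, insufficient):
    """Two-pass retrieval: top-5 first, expand to top-10 when the caller's
    `insufficient` predicate judges pass 1 inadequate."""
    ranked = sorted(claims, key=lambda c: score_fn(query_vec, c), reverse=True)
    first = ranked[:5]                        # pass 1
    return ranked[:10] if insufficient(first) else first   # pass 2 (expand)
```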
### Calibration Findings

1. **5 first-pass claims is viable for all tested scenarios** — but only if the 5 are well chosen. Similarity ranking alone won't produce optimal results.

2. **Counter-evidence must be explicitly surfaced.** Similarity-descending sort systematically buries opposing-valence claims: counter-claims are semantically adjacent but have opposite valence. Design: after the first pass, check whether all returned claims share directional agreement. If yes, force-include the highest-similarity opposing claim.

3. **Synthesis claims suppress their source claims.** If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with a source list in frontmatter, filter at retrieval time. **Bidirectional:** if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (the user query is more specific than the synthesis scope).

4. **Cross-domain claims earn inclusion only when causally load-bearing.** Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't, because they're a parallel domain, not a causal input.

5. **Domain maps and topic indices are hard-filtered from retrieval results.** Non-claim types (`type: "map"`, indices) should be the first filter in the pipeline, before similarity ranking runs.

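Finding 2 as code: if the top-n all share one valence, swap the lowest-ranked slot for the highest-similarity opposing claim. Claims here are dicts with a `valence` field in `{'supports', 'challenges', 'neutral'}`, pre-sorted by similarity descending; the field name and function shape are assumptions.

```python
def with_counter_evidence(ranked, top_n=5):
    """Force-include the highest-similarity opposing claim when the top-n
    share directional agreement (`ranked` is similarity-descending)."""
    top = ranked[:top_n]
    valences = {c["valence"] for c in top if c["valence"] != "neutral"}
    if len(valences) > 1:
        return top   # already mixed, nothing to do
    opposing = next((c for c in ranked[top_n:]
                     if c["valence"] not in valences and c["valence"] != "neutral"),
                    None)
    return top[:-1] + [opposing] if opposing else top
```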
### Valence Tagging

Tag claims with `supports` / `challenges` / `neutral` relative to the query thesis at ingestion time. Lightweight, one-time cost per claim. Enables the counter-evidence surfacing logic without runtime sentiment analysis.

## Verifier Divergence Implications

From the NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., the verifier reports "solved" while the benchmark fails). Implication for multi-model eval: the second-model pass must check against the **same rubric** as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.

## Implementation Sequence

1. **Automatable CI rules** (hard gates first) — YAML schema validation + wiki link resolution. Foundation for everything else. Reference: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
2. **Automatable CI rules** (soft flags) — domain validation, OPSEC scan, duplicate detection via Qdrant.
3. **Unified rejection record** — data structure for both CI and human rejections, stored in pipeline.db.
4. **Rejection feedback loop** — structured feedback to agents with 3-strikes accumulation.
5. **Multi-model eval integration** — OpenRouter connection, rubric sharing, disagreement queue.
6. **Self-upgrade eval criteria** — codified in the eval workflow, triggered by the 3-strikes pattern.

## Evaluator Self-Review Prevention

When Leo proposes claims (cross-domain synthesis, foundations-level):

- Leo cannot be the evaluator on his own proposals
- Minimum 2 domain agent reviews required
- Every domain touched must have a reviewer from that domain
- The second-model eval pass still runs (provides the external check)
- Cory has veto (rollback) authority as the final backstop

This closes the obvious gap: the spec defines the integrity layer but doesn't protect against the integrity layer's own blind spots. The constraint enforcement principle must apply to the constrainer too.

## Design Principle

The constraint enforcement layer must be **outside** the agent being constrained. That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.

---

*Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus.*

*Created: 2026-03-31*