# Multi-Model Evaluation Architecture

Spec for adding a second-model evaluation pass to break correlated blind spots in claim review. Designed with Leo (primary evaluator). Implementation by Epimetheus.

## Problem

Kim et al. (ICML 2025) report ~60% error agreement within same-model-family evaluations, with self-preference bias scaling linearly with self-recognition. A single-model evaluator therefore systematically misses the same class of errors every time. Human and LLM biases are complementary, not overlapping — multi-model evaluation captures that complementarity.

## Architecture

### Evaluation Sequence

1. **Leo evaluates first.** Verdict + reasoning stored as a structured record.
2. **Second model evaluates independently** against the same rubric. A different model family is required — GPT-4o via OpenRouter or Gemini. Never another Claude instance.
3. **System surfaces disagreements only.** Agreements are noise; disagreements are signal.
4. **Leo makes the final call** on all disagreements.

Sequencing rationale: Leo sees the second model's assessment **after** his own eval, never before. Seeing it before anchors judgment; seeing it after functions as a genuine blind-spot check.

### Second Model Selection

Requirements:

- Different model family from the evaluating agent (currently Claude → use GPT-4o or Gemini)
- Access via OpenRouter API (single integration point)
- Must receive the same rubric and claim content as Leo
- Must output a structured verdict in the same format

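As a sketch, the second-model call can go through OpenRouter's OpenAI-compatible chat completions endpoint. The function name, endpoint constant, and the `response_format` field below are illustrative assumptions, not part of this spec:

```python
# Sketch of the request body for the second-model pass. OpenRouter exposes an
# OpenAI-compatible chat completions endpoint; the endpoint constant and the
# response_format field are assumptions here, not spec.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_second_model_request(rubric: str, claim: str,
                               model: str = "openai/gpt-4o") -> dict:
    """Same rubric and claim content Leo received; structured verdict out."""
    assert "claude" not in model.lower(), "never another Claude instance"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": rubric},   # identical rubric to Leo's
            {"role": "user", "content": claim},      # identical claim content
        ],
        "response_format": {"type": "json_object"},  # structured verdict
    }
```

Posting this payload with an `Authorization: Bearer <key>` header is the only integration point the spec requires.
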
### Disagreement Handling

A disagreement occurs when the two evaluators reach different verdicts on the same claim (accept vs. reject, or different rejection categories).

Disagreements surface in a review queue Leo checks before finalizing. Each disagreement record includes:

- Leo's verdict + reasoning
- Second model's verdict + reasoning
- The specific claim and PR context
- Which evaluation criteria they diverge on

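A minimal sketch of the disagreement check, assuming each evaluator's structured record carries a verdict, reasoning, and per-criterion scores (field names are illustrative):

```python
# Sketch: build a review-queue entry only when verdicts diverge.
def disagreement_record(leo: dict, second: dict, claim: str, pr: int):
    """Return a queue entry if the two verdicts diverge, else None."""
    if leo["verdict"] == second["verdict"]:
        return None  # agreement is noise; nothing is queued
    return {
        "claim": claim,
        "pr": pr,
        "leo": {"verdict": leo["verdict"], "reasoning": leo["reasoning"]},
        "second_model": {"verdict": second["verdict"],
                         "reasoning": second["reasoning"]},
        # criteria on which the two rubric scores differ
        "diverging_criteria": sorted(
            k for k in leo["scores"]
            if leo["scores"][k] != second["scores"].get(k)
        ),
    }
```
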
### Calibration Metrics

Track the disagreement rate over time:

- **Below ~10%:** System is working; evaluators are calibrated.
- **10-25%:** Normal operating range. Disagreements are productive signal.
- **Above ~25%:** Either the rubric is ambiguous or one evaluator is drifting. Both are actionable — trigger a rubric review.

The disagreement rate itself becomes the primary calibration metric for evaluation quality.

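The bands above reduce to a small classifier (band names are illustrative):

```python
def calibration_band(disagreement_rate: float) -> str:
    """Map an observed disagreement rate onto the ~10% / ~25% bands above."""
    if disagreement_rate < 0.10:
        return "calibrated"       # system working
    if disagreement_rate <= 0.25:
        return "normal"           # productive signal
    return "rubric_review"        # ambiguous rubric or evaluator drift
```
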
## Unified Rejection Record

Single format used by both CI gates and human evaluators. The feedback loop to agents consumes this format without caring about the source.

```json
{
  "source": "ci | evaluator | second_model",
  "category": "schema_violation | wiki_link_broken | weak_evidence | scope_mismatch | factual_error | precision_failure | opsec_violation",
  "severity": "hard | soft",
  "agent_id": "<producer of the rejected content>",
  "pr": "<PR number>",
  "file": "<file path in PR>",
  "claim_path": "<claim file path if different from file>",
  "detail": "<free text explanation>",
  "timestamp": "<ISO 8601>"
}
```

Field notes:

- `source`: `ci` for automated gates, `evaluator` for Leo, `second_model` for the disagreement-check model
- `severity`: `hard` = merge blocker (schema_violation, wiki_link_broken); `soft` = reviewer judgment (weak_evidence, precision_failure). Hard rejections trigger immediate resubmission attempts; soft rejections accumulate toward the 3-strikes upgrade threshold.
- `claim_path` being separate from `file` handles multi-file enrichment PRs where only one file has the issue
- The `category` taxonomy covers ~80% of rejection causes, based on ~400 PR reviews

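A sketch of validating a record against the field and enum surface above (`claim_path` is optional per the field notes; the function name is illustrative):

```python
# Sketch: check a rejection record against the schema above.
REQUIRED = {"source", "category", "severity", "agent_id",
            "pr", "file", "detail", "timestamp"}
SOURCES = {"ci", "evaluator", "second_model"}
CATEGORIES = {"schema_violation", "wiki_link_broken", "weak_evidence",
              "scope_mismatch", "factual_error", "precision_failure",
              "opsec_violation"}
SEVERITIES = {"hard", "soft"}

def validate_rejection(record: dict) -> list:
    """Return a list of problems; an empty list means well-formed."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - record.keys())]
    if record.get("source") not in SOURCES:
        problems.append("bad source")
    if record.get("category") not in CATEGORIES:
        problems.append("bad category")
    if record.get("severity") not in SEVERITIES:
        problems.append("bad severity")
    return problems
```
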
### Rejection Feedback Loop

1. Rejection records flow to the producing agent as structured feedback.
2. The agent receives the category, severity, and detail.
3. Hard rejections → the agent attempts an immediate fix and resubmission.
4. Soft rejections → the agent accumulates feedback. **After 3 rejections of the same category from the same agent**, the system triggers a skill upgrade proposal.
5. Skill upgrade proposals route back to Leo for eval (see Agent Self-Upgrade Criteria below).

The 3-strikes rule prevents premature optimization while creating learning pressure. Learning from rejection is the agent's job — the system just tracks the pattern.

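The 3-strikes accumulator is a per-(agent, category) counter; the class below is a minimal sketch, with the upgrade-proposal routing left to the caller:

```python
from collections import Counter

# Sketch of the 3-strikes trigger; the upgrade-proposal hook is hypothetical.
class StrikeTracker:
    THRESHOLD = 3

    def __init__(self):
        self.counts = Counter()

    def record_soft_rejection(self, agent_id: str, category: str) -> bool:
        """Return True exactly when (agent, category) hits the threshold."""
        self.counts[(agent_id, category)] += 1
        # Caller routes a skill-upgrade proposal to Leo on True.
        return self.counts[(agent_id, category)] == self.THRESHOLD
```

Counting per category (not per agent overall) is what makes the trigger point at a specific skill gap rather than general sloppiness.
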
## Automatable CI Rules
|
|
|
|
Five rules that catch ~80% of current rejections. Rules 1-2 are hard gates (block merge). Rules 3-5 are soft flags (surface to reviewer).
|
|
|
|
### Hard Gates
|
|
|
|
**1. YAML Schema Validation**

- `type` field exists and equals `claim`
- All required frontmatter fields present: type, domain, description, confidence, source, created
- Domain value is one of the 14 valid domains
- Confidence value is one of: proven, likely, experimental, speculative
- Date format is valid ISO 8601
- Pure syntax check — zero judgment needed

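A sketch of the gate over an already-parsed frontmatter dict. The `VALID_DOMAINS` set is a hypothetical stand-in: the spec's 14 domains are defined elsewhere and not reproduced here.

```python
from datetime import date

# VALID_DOMAINS is a placeholder subset, NOT the real 14-domain list.
VALID_DOMAINS = {"compute_governance", "power_infrastructure"}
VALID_CONFIDENCE = {"proven", "likely", "experimental", "speculative"}
REQUIRED_FIELDS = {"type", "domain", "description", "confidence", "source", "created"}

def validate_frontmatter(fm: dict) -> list:
    """Pure syntax check; returns errors, empty list = pass the hard gate."""
    errors = [f"missing: {f}" for f in sorted(REQUIRED_FIELDS - fm.keys())]
    if fm.get("type") != "claim":
        errors.append("type must be 'claim'")
    if fm.get("domain") not in VALID_DOMAINS:
        errors.append("unknown domain")
    if fm.get("confidence") not in VALID_CONFIDENCE:
        errors.append("bad confidence")
    try:
        date.fromisoformat(str(fm.get("created", "")))
    except ValueError:
        errors.append("created is not ISO 8601")
    return errors
```
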
**2. Wiki Link Resolution**

- Every `[[link]]` in the body must resolve to an existing file at merge time
- Includes links in the `Relevant Notes` section
- Already policy, not yet enforced in CI
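
The check reduces to a regex pass over the body against the merged file set; the sketch below assumes the common `[[target|alias]]` wiki syntax, which may not match this vault's exact conventions:

```python
import re

# Captures the link target; the optional |alias part is an assumption.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def broken_links(body: str, existing_files: set) -> list:
    """Return wiki-link targets that don't resolve to an existing file."""
    return [t for t in WIKI_LINK.findall(body)
            if t.strip() not in existing_files]
```
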

### Soft Flags

**3. Domain Validation**

- File path domain matches one of the 14 valid domains
- Claim content plausibly belongs in that domain
- The path check is automatable; the content check needs light NLP or embedding similarity against domain centroids
- Flag for the reviewer if the domain assignment seems wrong

**4. OPSEC Scan**

- Regex for dollar amounts, percentage allocations, fund sizes, deal terms
- Flag for human review, never auto-reject (false-positive risk on dollar-sign patterns in technical content)
- Standing directive from Cory: strict enforcement, but false positives on technical content create friction

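A sketch of the scan. The patterns are deliberately coarse, since every hit goes to a human reviewer rather than auto-reject; the exact regexes here are assumptions:

```python
import re

# Coarse by design: over-matching is acceptable, auto-rejecting is not.
OPSEC_PATTERNS = [
    re.compile(r"\$\s?\d[\d,.]*\s*(?:[kKmMbB]\b|million|billion)?"),  # dollar amounts / fund sizes
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s?%"),                           # percentage allocations
]

def opsec_flags(text: str) -> list:
    """Return matched substrings to surface for human review."""
    return [m.group(0) for p in OPSEC_PATTERNS for m in p.finditer(text)]
```
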
**5. Duplicate Detection**

- Embedding similarity against existing claims in the same domain using Qdrant (text-embedding-3-small, 1536d)
- **Threshold: 0.92 universal** — not per-domain tuning
- Flag includes the **top-3 similar claims with scores** so the reviewer can judge in context
- The threshold is the attention trigger; reviewer judgment is the decision
- If a domain consistently generates >50% false-positive flags, tune that domain's threshold as a targeted fix (data-driven, not preemptive)

Domain maps, topic indices, and non-claim type files are hard-filtered from duplicate detection — they're navigation aids, not claims.

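The flag logic, independent of the Qdrant transport, is a top-3 cosine search with the 0.92 attention trigger. A minimal sketch over plain vectors (in production the search itself would run inside Qdrant):

```python
from math import sqrt

SIM_THRESHOLD = 0.92  # universal attention trigger, per above

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def duplicate_flag(new_vec, existing: dict) -> dict:
    """existing: {claim_id: vector}, same-domain claims only (non-claim
    types already hard-filtered). Returns top-3 matches plus a flag bit."""
    scored = sorted(((cosine(new_vec, v), cid) for cid, v in existing.items()),
                    reverse=True)[:3]
    top3 = [(cid, round(score, 4)) for score, cid in scored]
    return {"flag": bool(top3) and top3[0][1] >= SIM_THRESHOLD, "top3": top3}
```

The flag carries the top-3 scores so the reviewer judges in context; the threshold only decides whether their attention is requested.
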
## Agent Self-Upgrade Criteria
|
|
|
|
When agents propose changes to their own skills, tools, or extraction quality, these criteria apply in priority order:
|
|
|
|
1. **Scope compliance** — Does the upgrade stay within the agent's authorized domain? An extraction agent improving YAML parsing: yes. The same agent adding merge capability: no.
2. **Measurable improvement** — Before/after on a concrete metric. Minimum: 3 test cases showing improvement with 0 regressions. No "this feels better."
3. **Schema compliance preserved** — The upgrade cannot break existing quality gates. The full validation suite runs against output produced by the new skill.
4. **Reversibility** — Every skill change must be revertible. If not, the evidence bar goes up significantly.
5. **No scope creep** — The upgrade does what it claims, nothing more. Watch for "while I was in there I also..." additions.

Evidence bar difference: a **claim** needs sourced evidence. A **skill change** needs a **demonstrated performance delta** — show the before and the after, on real data, not synthetic examples.

For skill changes that affect other agents' outputs (e.g., shared extraction templates), the evidence bar requires testing against multiple agents' typical inputs, not just the proposing agent's.

## Retrieval Quality (Two-Pass System)
|
|
|
|
Design parameters calibrated against Leo's ground-truth rankings on 3 real query scenarios.
|
|
|
|
### Two-Pass Architecture
|
|
|
|
- **Pass 1:** Top 5 claims, similarity-descending sort
|
|
- **Pass 2 (expand):** Top 10 claims, triggered when pass 1 is insufficient
|
|
|
|
### Calibration Findings

1. **5 first-pass claims is viable for all tested scenarios** — but only if the 5 are well chosen. Similarity ranking alone won't produce optimal results.

2. **Counter-evidence must be explicitly surfaced.** Similarity-descending sort systematically buries opposing-valence claims: counter-claims are semantically adjacent but have opposite valence. Design: after the first pass, check whether all returned claims share directional agreement. If yes, force-include the highest-similarity opposing claim.

3. **Synthesis claims suppress their source claims.** If a synthesis claim is in the result set, its individual source claims are filtered out to prevent slot waste. Implementation: tag synthesis claims with a source list in frontmatter, filter at retrieval time. **Bidirectional:** if a source claim scores higher than its synthesis parent, keep the source and consider suppressing the synthesis (the user query is more specific than the synthesis scope).

4. **Cross-domain claims earn inclusion only when causally load-bearing.** Astra's power infrastructure claims earn a spot in compute governance queries because power constraints cause the governance window. Rio's blockchain claims don't, because they're a parallel domain, not a causal input.

5. **Domain maps and topic indices are hard-filtered from retrieval results.** Non-claim types (`type: "map"`, indices) should be the first filter in the pipeline, before similarity ranking runs.

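Finding 2's force-include step can be sketched as follows, assuming each retrieved claim carries the ingestion-time valence tag described under Valence Tagging (the tuple shape is illustrative):

```python
# Each ranked claim is (claim_id, similarity, valence), similarity-descending.
def first_pass(ranked, k=5):
    """Top-k by similarity, forcing in the best opposing claim when the
    top-k all point the same way (directional agreement)."""
    top = ranked[:k]
    valences = {v for _, _, v in top if v != "neutral"}
    if len(valences) == 1:  # counter-evidence is buried below the cutoff
        (want,) = {"supports", "challenges"} - valences
        opposing = next((c for c in ranked[k:] if c[2] == want), None)
        if opposing is not None:
            top = top[:-1] + [opposing]  # swap out the weakest slot
    return top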
### Valence Tagging
|
|
|
|
Tag claims with `supports` / `challenges` / `neutral` relative to query thesis at ingestion time. Lightweight, one-time cost per claim. Enables the counter-evidence surfacing logic without runtime sentiment analysis.
|
|
|
|
## Verifier Divergence Implications
|
|
|
|
From NLAH paper (Pan et al.): verification layers can optimize for locally checkable properties that diverge from actual acceptance criteria (e.g., verifier reports "solved" while benchmark fails). Implication for multi-model eval: the second-model eval pass must check against the **same rubric** as Leo, not construct its own notion of quality. Shared rubric enforcement is a hard requirement.
|
|
|
|
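One cheap way to enforce the shared-rubric requirement is to fingerprint the rubric actually sent to each evaluator and refuse to compare verdicts when the fingerprints diverge; this mechanism is a suggestion, not spec:

```python
import hashlib

def rubric_fingerprint(rubric_text: str) -> str:
    """Hash of the rubric text actually sent to an evaluator."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()

def assert_shared_rubric(leo_rubric: str, second_rubric: str) -> None:
    """Hard requirement: both passes ran against byte-identical rubrics."""
    if rubric_fingerprint(leo_rubric) != rubric_fingerprint(second_rubric):
        raise ValueError("second-model pass ran against a diverging rubric")
```
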
## Implementation Sequence
|
|
|
|
1. **Automatable CI rules** (hard gates first) — YAML schema validation + wiki link resolution. Foundation for everything else. References: PR #2074 (schema change protocol v2) defines the authoritative schema surface.
|
|
2. **Automatable CI rules** (soft flags) — domain validation, OPSEC scan, duplicate detection via Qdrant.
|
|
3. **Unified rejection record** — data structure for both CI and human rejections, stored in pipeline.db.
|
|
4. **Rejection feedback loop** — structured feedback to agents with 3-strikes accumulation.
|
|
5. **Multi-model eval integration** — OpenRouter connection, rubric sharing, disagreement queue.
|
|
6. **Self-upgrade eval criteria** — codified in eval workflow, triggered by 3-strikes pattern.
|
|
|
|
## Evaluator Self-Review Prevention

When Leo proposes claims (cross-domain synthesis, foundations-level):

- Leo cannot be the evaluator on his own proposals
- Minimum 2 domain agent reviews required
- Every domain touched must have a reviewer from that domain
- The second-model eval pass still runs (provides the external check)
- Cory has veto (rollback) authority as the final backstop

This closes the obvious gap: the spec defines the integrity layer but doesn't protect against the integrity layer's own blind spots. The constraint enforcement principle must apply to the constrainer too.

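The review requirements above reduce to a mechanical gate; the function below is a sketch with illustrative names (the second-model pass and Cory's veto sit outside this check):

```python
# Sketch of the self-review gate; reviews: [{"reviewer": str, "domain": str}].
def review_gate(author: str, domains_touched: set, reviews: list) -> list:
    """Return problems; an empty list means the proposal may proceed."""
    problems = []
    if any(r["reviewer"] == author for r in reviews):
        problems.append("author cannot review own proposal")
    non_author = [r for r in reviews if r["reviewer"] != author]
    if len({r["reviewer"] for r in non_author}) < 2:
        problems.append("minimum 2 domain agent reviews required")
    covered = {r["domain"] for r in non_author}
    for d in sorted(domains_touched - covered):
        problems.append(f"no reviewer from domain: {d}")
    return problems
```
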
## Design Principle

The constraint enforcement layer must sit **outside** the agent being constrained. That's why multi-model eval matters, why Leo shouldn't eval his own proposals, and why policy-as-code runs in CI, not in the agent's own process. As agents get more capable, the integrity layer gets more important, not less.

---

*Authored by Theseus. Reviewed by Leo (proposals integrated). Implementation: Epimetheus.*
*Created: 2026-03-31*