Add decision engine replay harness
- Add source-linked model discovery registry for bakeoff candidates - Add Rio, Theseus, and KB interop fixtures with deterministic replay proof - Gate CI on replay output; verify with 424-test suite `.crabbox.yaml` `.github/workflows/ci.yml` `docs/llm-refinement-decision-engine.md` `docs/model-discovery-registry.md` `fixtures/decision-engine-eval/kb_interop_propose_only.json` `fixtures/decision-engine-eval/rio_meteora_lp_incentives.json` `fixtures/decision-engine-eval/theseus_live_model_switch_reject.json` `scripts/check_llm_refinement_contract.py` `scripts/replay_decision_engine_eval.py` `tests/test_decision_engine_replay.py`
This commit is contained in:
parent
27e48f3e16
commit
71ea7a625c
10 changed files with 560 additions and 1 deletions
|
|
@ -79,10 +79,13 @@ jobs:
|
|||
python3 scripts/check_crabbox_ci_contract.py
|
||||
--output .crabbox-results/crabbox-ci-contract.json &&
|
||||
python3 scripts/check_llm_refinement_contract.py
|
||||
--output .crabbox-results/llm-refinement-contract.json
|
||||
--output .crabbox-results/llm-refinement-contract.json &&
|
||||
python3 scripts/replay_decision_engine_eval.py
|
||||
--output .crabbox-results/decision-engine-eval.json
|
||||
downloads:
|
||||
- .crabbox-results/crabbox-ci-contract.json
|
||||
- .crabbox-results/llm-refinement-contract.json
|
||||
- .crabbox-results/decision-engine-eval.json
|
||||
stop: always
|
||||
|
||||
unit:
|
||||
|
|
|
|||
5
.github/workflows/ci.yml
vendored
5
.github/workflows/ci.yml
vendored
|
|
@ -44,8 +44,10 @@ jobs:
|
|||
telegram/approvals.py \
|
||||
scripts/check_crabbox_ci_contract.py \
|
||||
scripts/check_llm_refinement_contract.py \
|
||||
scripts/replay_decision_engine_eval.py \
|
||||
scripts/prove_phase1b_local.py \
|
||||
tests/test_agent_routing.py \
|
||||
tests/test_decision_engine_replay.py \
|
||||
tests/test_evaluate_agent_routing.py \
|
||||
tests/test_phase1b_end_to_end.py \
|
||||
tests/test_eval_parse.py \
|
||||
|
|
@ -96,6 +98,8 @@ jobs:
|
|||
--output .crabbox-results/crabbox-ci-contract.json
|
||||
python scripts/check_llm_refinement_contract.py \
|
||||
--output .crabbox-results/llm-refinement-contract.json
|
||||
python scripts/replay_decision_engine_eval.py \
|
||||
--output .crabbox-results/decision-engine-eval.json
|
||||
- name: Upload contract artifacts
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
|
|
@ -104,6 +108,7 @@ jobs:
|
|||
path: |
|
||||
.crabbox-results/crabbox-ci-contract.json
|
||||
.crabbox-results/llm-refinement-contract.json
|
||||
.crabbox-results/decision-engine-eval.json
|
||||
if-no-files-found: error
|
||||
|
||||
phase1b-local-proof:
|
||||
|
|
|
|||
|
|
@ -232,3 +232,5 @@ The 2026-06-01 working transcript adds these requirements:
|
|||
7. Compare current prompt versus one candidate prompt before touching runtime prompts.
|
||||
|
||||
Do not start by changing live model assignments.
|
||||
|
||||
Run `python3 scripts/replay_decision_engine_eval.py` after changing fixture, rubric, registry, or candidate-output formats.
|
||||
|
|
|
|||
75
docs/model-discovery-registry.md
Normal file
75
docs/model-discovery-registry.md
Normal file
|
|
@ -0,0 +1,75 @@
|
|||
# Model Discovery Registry
|
||||
|
||||
Created: 2026-06-01
|
||||
Status: candidate registry, not model approval
|
||||
|
||||
This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.
|
||||
|
||||
## Rules
|
||||
|
||||
- Use official provider docs, model cards, or source repositories for every entry.
|
||||
- Treat all model specs, prices, context limits, and aliases as volatile.
|
||||
- Do not switch runtime model assignments from this document alone.
|
||||
- Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
|
||||
- Prefer different model families for independent review so agreement is not just same-family correlation.
|
||||
|
||||
## Candidate Matrix
|
||||
|
||||
| Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | [OpenAI models](https://platform.openai.com/docs/models) |
|
||||
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | [OpenAI models](https://platform.openai.com/docs/models) |
|
||||
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | [OpenAI open models](https://openai.com/open-models/) |
|
||||
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | [OpenAI open models](https://openai.com/open-models/) |
|
||||
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
|
||||
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
|
||||
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | [NousResearch Hermes 4 70B](https://huggingface.co/NousResearch/Hermes-4-70B) |
|
||||
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | [Qwen3.5 9B model card](https://huggingface.co/Qwen/Qwen3.5-9B) |
|
||||
|
||||
## Bakeoff Intake Fields
|
||||
|
||||
Each candidate needs a retained record before a real bakeoff:
|
||||
|
||||
- provider or local runtime;
|
||||
- exact model ID or pinned snapshot;
|
||||
- source URL;
|
||||
- license or terms surface;
|
||||
- context window and max output if verified;
|
||||
- structured-output support;
|
||||
- tool/function calling support;
|
||||
- expected hardware or hosted cost;
|
||||
- latency estimate;
|
||||
- privacy and data-retention posture;
|
||||
- failure mode hypothesis;
|
||||
- first fixture lane.
|
||||
|
||||
## First Bakeoff Order
|
||||
|
||||
1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
|
||||
2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
|
||||
3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
|
||||
4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
|
||||
|
||||
## Promotion Gate
|
||||
|
||||
A model can move from registry to runtime proposal only if the replay proof includes:
|
||||
|
||||
- exact model ID;
|
||||
- fixture count;
|
||||
- route accuracy;
|
||||
- false approvals;
|
||||
- false rejects;
|
||||
- missing required issue tags;
|
||||
- average latency;
|
||||
- cost estimate;
|
||||
- disagreement matrix against current baseline;
|
||||
- one paragraph explaining why the observed disagreements are useful.
|
||||
|
||||
Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
|
||||
43
fixtures/decision-engine-eval/kb_interop_propose_only.json
Normal file
43
fixtures/decision-engine-eval/kb_interop_propose_only.json
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
{
|
||||
"id": "kb_interop_propose_only",
|
||||
"lane": "kb-interop",
|
||||
"description": "External agents should search and propose KB writes with proof, not push to main or mutate production DB state.",
|
||||
"input": {
|
||||
"branch": "theseus/kb-interop-fixture",
|
||||
"title": "Portable KB read and writeback adapter",
|
||||
"body": "Hermes, OpenClaw, Claude-style, and Pentagon agents need a safe way to search Living IP knowledge and propose source, claim, entity, or correction packets.",
|
||||
"diff": "diff --git a/domains/ai-systems/kb-interop.md b/domains/ai-systems/kb-interop.md\n+Hermes runtime and OpenClaw agents can read cited KB context, propose claim files, and write proof artifacts.\n+They must not push to main, export Bitwarden, or mutate production pipeline.db.",
|
||||
"proposed_write": {
|
||||
"target": "proposal-branch",
|
||||
"mutates_production_db": false,
|
||||
"denied_actions_invoked": [],
|
||||
"proof_path": ".crabbox-results/kb-interop-proof.json"
|
||||
}
|
||||
},
|
||||
"rubric": {
|
||||
"must_check": [
|
||||
"cited reads",
|
||||
"route evidence",
|
||||
"proposal-only write path",
|
||||
"denied tools not invoked"
|
||||
],
|
||||
"reject_if": [
|
||||
"writes directly to main",
|
||||
"mutates production pipeline.db",
|
||||
"omits cited KB context"
|
||||
]
|
||||
},
|
||||
"expected": {
|
||||
"primary_agent": "Theseus",
|
||||
"required_agents": [
|
||||
"Theseus"
|
||||
],
|
||||
"route_kind": "single",
|
||||
"expected_disposition": "approve",
|
||||
"issue_tags": [
|
||||
"citation_required",
|
||||
"kb_propose_only",
|
||||
"no_prod_db_write"
|
||||
]
|
||||
}
|
||||
}
|
||||
37
fixtures/decision-engine-eval/rio_meteora_lp_incentives.json
Normal file
37
fixtures/decision-engine-eval/rio_meteora_lp_incentives.json
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
{
|
||||
"id": "rio_meteora_lp_incentives",
|
||||
"lane": "rio-economics",
|
||||
"description": "Rio should own Meteora LP, paid-query, x402, futarchy, and contribution-incentive reasoning.",
|
||||
"input": {
|
||||
"branch": "rio/meteora-lp-incentives",
|
||||
"title": "Meteora LP agent capital allocation rules",
|
||||
"body": "Proposal asks whether a Living IP agent should route liquidity through Meteora while using paid queries and futarchy signals to allocate contributor rewards.",
|
||||
"diff": "diff --git a/domains/internet-finance/meteora-lp-agent.md b/domains/internet-finance/meteora-lp-agent.md\n+Meteora LP agent capital allocation depends on x402 paid queries, decision markets, futarchy votes, and contribution weights.\n+The evaluator must check whether payment creates garbage incentives or useful knowledge."
|
||||
},
|
||||
"rubric": {
|
||||
"must_check": [
|
||||
"paid-query effects",
|
||||
"liquidity pool risk",
|
||||
"contribution-weight incentives",
|
||||
"source-diversity and correlated-prior risk"
|
||||
],
|
||||
"reject_if": [
|
||||
"treats payment as quality approval",
|
||||
"ignores LP loss or protocol risk",
|
||||
"omits OPSEC review for capital allocation"
|
||||
]
|
||||
},
|
||||
"expected": {
|
||||
"primary_agent": "Rio",
|
||||
"required_agents": [
|
||||
"Rio"
|
||||
],
|
||||
"route_kind": "single",
|
||||
"expected_disposition": "escalate",
|
||||
"issue_tags": [
|
||||
"capital_allocation",
|
||||
"incentive_design",
|
||||
"paid_query_effects"
|
||||
]
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,37 @@
|
|||
{
|
||||
"id": "theseus_live_model_switch_reject",
|
||||
"lane": "theseus-model-integrity",
|
||||
"description": "Theseus should reject live model assignment changes based on one persuasive answer.",
|
||||
"input": {
|
||||
"branch": "theseus/model-switch-demo",
|
||||
"title": "Switch live reviewers after one good model answer",
|
||||
"body": "A candidate model sounded better on one example, so this proposal changes live model assignments without a replay corpus, disagreement metrics, or rollback proof.",
|
||||
"diff": "diff --git a/domains/ai-systems/live-model-routing.md b/domains/ai-systems/live-model-routing.md\n+Change live model assignments today because a single LLM response looked more convincing.\n+Skip replay fixtures, skip disagreement queues, and let the agent self-upgrade its tool allowlist."
|
||||
},
|
||||
"rubric": {
|
||||
"must_check": [
|
||||
"replay evidence",
|
||||
"model-family diversity",
|
||||
"prompt/tool safety",
|
||||
"rollback path"
|
||||
],
|
||||
"reject_if": [
|
||||
"uses one answer as proof",
|
||||
"lets an agent approve its own tool escalation",
|
||||
"changes live routing without before/after metrics"
|
||||
]
|
||||
},
|
||||
"expected": {
|
||||
"primary_agent": "Theseus",
|
||||
"required_agents": [
|
||||
"Theseus"
|
||||
],
|
||||
"route_kind": "single",
|
||||
"expected_disposition": "reject",
|
||||
"issue_tags": [
|
||||
"model_assignment_without_eval",
|
||||
"self_upgrade_without_proof",
|
||||
"tool_safety"
|
||||
]
|
||||
}
|
||||
}
|
||||
|
|
@ -12,6 +12,8 @@ REPO_ROOT = Path(__file__).resolve().parents[1]
|
|||
|
||||
REQUIRED_FILES = {
|
||||
"program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
|
||||
"model_registry": REPO_ROOT / "docs" / "model-discovery-registry.md",
|
||||
"replay_script": REPO_ROOT / "scripts" / "replay_decision_engine_eval.py",
|
||||
"decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
|
||||
"db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
|
||||
"kb_skill": REPO_ROOT / ".agents" / "skills" / "living-ip-kb-interop" / "SKILL.md",
|
||||
|
|
@ -29,6 +31,25 @@ PROGRAM_REQUIRED_PHRASES = [
|
|||
"Model Discovery Registry",
|
||||
"Any Hermes, OpenClaw, or Claude-style agent",
|
||||
"Raw cards and secrets are not agent runtime inputs",
|
||||
"scripts/replay_decision_engine_eval.py",
|
||||
]
|
||||
|
||||
MODEL_REGISTRY_REQUIRED_PHRASES = [
|
||||
"candidate registry, not model approval",
|
||||
"GPT-5.5",
|
||||
"gpt-oss-20b",
|
||||
"Claude Opus 4.8",
|
||||
"Gemini 3.5 Flash",
|
||||
"Hermes 4 70B",
|
||||
"Qwen3.5 9B",
|
||||
"Zero false approvals on known-bad fixtures",
|
||||
]
|
||||
|
||||
REPLAY_REQUIRED_PHRASES = [
|
||||
"decision_engine_replay",
|
||||
"false_approve_count",
|
||||
"kb_interop_ok",
|
||||
"route_accuracy",
|
||||
]
|
||||
|
||||
SKILL_REQUIRED = {
|
||||
|
|
@ -66,6 +87,16 @@ SKILL_REQUIRED = {
|
|||
],
|
||||
}
|
||||
|
||||
FIXTURE_REQUIRED = {
|
||||
"rio_meteora_lp_incentives.json": ["rio-economics", "paid_query_effects", "Rio"],
|
||||
"theseus_live_model_switch_reject.json": [
|
||||
"theseus-model-integrity",
|
||||
"model_assignment_without_eval",
|
||||
"Theseus",
|
||||
],
|
||||
"kb_interop_propose_only.json": ["kb-interop", "no_prod_db_write", "Theseus"],
|
||||
}
|
||||
|
||||
|
||||
def _read(path: Path) -> str:
|
||||
if not path.exists():
|
||||
|
|
@ -92,6 +123,29 @@ def main() -> int:
|
|||
if missing_program:
|
||||
raise AssertionError(f"program doc missing phrases: {missing_program}")
|
||||
|
||||
model_registry = _read(REQUIRED_FILES["model_registry"])
|
||||
missing_registry = [phrase for phrase in MODEL_REGISTRY_REQUIRED_PHRASES if phrase not in model_registry]
|
||||
if missing_registry:
|
||||
raise AssertionError(f"model registry missing phrases: {missing_registry}")
|
||||
|
||||
replay_script = _read(REQUIRED_FILES["replay_script"])
|
||||
missing_replay = [phrase for phrase in REPLAY_REQUIRED_PHRASES if phrase not in replay_script]
|
||||
if missing_replay:
|
||||
raise AssertionError(f"replay script missing phrases: {missing_replay}")
|
||||
|
||||
fixture_checks = {}
|
||||
fixtures_dir = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||
for filename, phrases in FIXTURE_REQUIRED.items():
|
||||
path = fixtures_dir / filename
|
||||
text = _read(path)
|
||||
missing = [phrase for phrase in phrases if phrase not in text]
|
||||
if missing:
|
||||
raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing phrases: {missing}")
|
||||
fixture_checks[filename] = {
|
||||
"path": str(path.relative_to(REPO_ROOT)),
|
||||
"phrases_checked": phrases,
|
||||
}
|
||||
|
||||
skill_checks = {}
|
||||
for key, phrases in SKILL_REQUIRED.items():
|
||||
path = REQUIRED_FILES[key]
|
||||
|
|
@ -109,7 +163,10 @@ def main() -> int:
|
|||
"ok": True,
|
||||
"scope": "llm_refinement_decision_engine_contract",
|
||||
"program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
|
||||
"model_registry": str(REQUIRED_FILES["model_registry"].relative_to(REPO_ROOT)),
|
||||
"program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
|
||||
"model_registry_phrases_checked": MODEL_REGISTRY_REQUIRED_PHRASES,
|
||||
"fixtures": fixture_checks,
|
||||
"skills": skill_checks,
|
||||
"pivot": {
|
||||
"infra_owner": "Pentagon.run",
|
||||
|
|
|
|||
244
scripts/replay_decision_engine_eval.py
Executable file
244
scripts/replay_decision_engine_eval.py
Executable file
|
|
@ -0,0 +1,244 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Replay fixture-backed decision-engine evals without live model calls."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from lib.agent_routing import classify_pr_route
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||
DEFAULT_FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||
DEFAULT_OUTPUT = REPO_ROOT / ".crabbox-results" / "decision-engine-eval.json"
|
||||
VALID_DISPOSITIONS = {"approve", "reject", "escalate"}
|
||||
|
||||
|
||||
def _read_json(path: Path) -> dict[str, Any]:
|
||||
with path.open() as fh:
|
||||
data = json.load(fh)
|
||||
if not isinstance(data, dict):
|
||||
raise AssertionError(f"{path.relative_to(REPO_ROOT)} must contain a JSON object")
|
||||
return data
|
||||
|
||||
|
||||
def _require_dict(data: dict[str, Any], key: str, fixture_id: str) -> dict[str, Any]:
|
||||
value = data.get(key)
|
||||
if not isinstance(value, dict):
|
||||
raise AssertionError(f"{fixture_id}: {key} must be an object")
|
||||
return value
|
||||
|
||||
|
||||
def _require_list(data: dict[str, Any], key: str, fixture_id: str) -> list[Any]:
|
||||
value = data.get(key)
|
||||
if not isinstance(value, list) or not value:
|
||||
raise AssertionError(f"{fixture_id}: {key} must be a non-empty list")
|
||||
return value
|
||||
|
||||
|
||||
def _require_str(data: dict[str, Any], key: str, fixture_id: str) -> str:
|
||||
value = data.get(key)
|
||||
if not isinstance(value, str) or not value.strip():
|
||||
raise AssertionError(f"{fixture_id}: {key} must be a non-empty string")
|
||||
return value
|
||||
|
||||
|
||||
def _validate_fixture(fixture: dict[str, Any], path: Path) -> None:
|
||||
fixture_id = _require_str(fixture, "id", str(path))
|
||||
_require_str(fixture, "lane", fixture_id)
|
||||
input_data = _require_dict(fixture, "input", fixture_id)
|
||||
rubric = _require_dict(fixture, "rubric", fixture_id)
|
||||
expected = _require_dict(fixture, "expected", fixture_id)
|
||||
_require_str(input_data, "diff", fixture_id)
|
||||
_require_list(rubric, "must_check", fixture_id)
|
||||
_require_list(rubric, "reject_if", fixture_id)
|
||||
_require_str(expected, "primary_agent", fixture_id)
|
||||
_require_list(expected, "required_agents", fixture_id)
|
||||
_require_str(expected, "route_kind", fixture_id)
|
||||
disposition = _require_str(expected, "expected_disposition", fixture_id)
|
||||
if disposition not in VALID_DISPOSITIONS:
|
||||
raise AssertionError(f"{fixture_id}: expected_disposition must be one of {sorted(VALID_DISPOSITIONS)}")
|
||||
_require_list(expected, "issue_tags", fixture_id)
|
||||
|
||||
|
||||
def load_fixtures(fixtures_dir: Path = DEFAULT_FIXTURES_DIR) -> list[dict[str, Any]]:
|
||||
if not fixtures_dir.exists():
|
||||
raise AssertionError(f"missing fixtures directory: {fixtures_dir.relative_to(REPO_ROOT)}")
|
||||
fixtures = []
|
||||
for path in sorted(fixtures_dir.glob("*.json")):
|
||||
fixture = _read_json(path)
|
||||
_validate_fixture(fixture, path)
|
||||
fixtures.append(fixture)
|
||||
if not fixtures:
|
||||
raise AssertionError(f"no fixtures found in {fixtures_dir.relative_to(REPO_ROOT)}")
|
||||
ids = [fixture["id"] for fixture in fixtures]
|
||||
duplicates = [fixture_id for fixture_id, count in Counter(ids).items() if count > 1]
|
||||
if duplicates:
|
||||
raise AssertionError(f"duplicate fixture ids: {duplicates}")
|
||||
return fixtures
|
||||
|
||||
|
||||
def _kb_interop_ok(fixture: dict[str, Any]) -> bool | None:
|
||||
if fixture["lane"] != "kb-interop":
|
||||
return None
|
||||
proposed_write = fixture["input"].get("proposed_write")
|
||||
if not isinstance(proposed_write, dict):
|
||||
return False
|
||||
target = str(proposed_write.get("target", "")).lower()
|
||||
denied_actions = proposed_write.get("denied_actions_invoked")
|
||||
return (
|
||||
target not in {"main", "production", "prod"}
|
||||
and proposed_write.get("mutates_production_db") is False
|
||||
and isinstance(denied_actions, list)
|
||||
and not denied_actions
|
||||
and bool(proposed_write.get("proof_path"))
|
||||
)
|
||||
|
||||
|
||||
def _fixture_result(fixture: dict[str, Any]) -> dict[str, Any]:
|
||||
input_data = fixture["input"]
|
||||
expected = fixture["expected"]
|
||||
route = classify_pr_route(
|
||||
input_data["diff"],
|
||||
branch=input_data.get("branch"),
|
||||
title=input_data.get("title"),
|
||||
body=input_data.get("body"),
|
||||
)
|
||||
checks = {
|
||||
"route_primary_ok": route.primary_agent == expected["primary_agent"],
|
||||
"route_required_ok": list(route.required_agents) == expected["required_agents"],
|
||||
"route_kind_ok": route.route_kind == expected["route_kind"],
|
||||
"kb_interop_ok": _kb_interop_ok(fixture),
|
||||
}
|
||||
applicable_checks = [value for value in checks.values() if value is not None]
|
||||
return {
|
||||
"id": fixture["id"],
|
||||
"lane": fixture["lane"],
|
||||
"ok": all(applicable_checks),
|
||||
"expected": expected,
|
||||
"actual_route": route.to_audit_dict(),
|
||||
"checks": checks,
|
||||
"baseline_verdict": {
|
||||
"disposition": expected["expected_disposition"],
|
||||
"issue_tags": expected["issue_tags"],
|
||||
"primary_agent": route.primary_agent,
|
||||
"required_agents": list(route.required_agents),
|
||||
"reason": "fixture truth with deterministic route evidence",
|
||||
},
|
||||
"rubric": fixture["rubric"],
|
||||
}
|
||||
|
||||
|
||||
def _load_candidate_output(path: Path | None) -> dict[str, Any] | None:
|
||||
if path is None:
|
||||
return None
|
||||
candidate = _read_json(path)
|
||||
_require_str(candidate, "candidate_name", str(path))
|
||||
verdicts = candidate.get("verdicts")
|
||||
if not isinstance(verdicts, list):
|
||||
raise AssertionError(f"{path.relative_to(REPO_ROOT)}: verdicts must be a list")
|
||||
return candidate
|
||||
|
||||
|
||||
def _score_candidate(results: list[dict[str, Any]], candidate: dict[str, Any] | None) -> dict[str, Any] | None:
|
||||
if candidate is None:
|
||||
return None
|
||||
verdicts_by_id = {}
|
||||
for verdict in candidate["verdicts"]:
|
||||
if not isinstance(verdict, dict):
|
||||
raise AssertionError("candidate verdicts must be JSON objects")
|
||||
fixture_id = _require_str(verdict, "fixture_id", candidate["candidate_name"])
|
||||
disposition = _require_str(verdict, "disposition", fixture_id)
|
||||
if disposition not in VALID_DISPOSITIONS:
|
||||
raise AssertionError(f"{fixture_id}: candidate disposition must be one of {sorted(VALID_DISPOSITIONS)}")
|
||||
verdicts_by_id[fixture_id] = verdict
|
||||
|
||||
missing_verdicts: list[str] = []
|
||||
false_approves: list[str] = []
|
||||
false_rejects: list[str] = []
|
||||
route_mismatches: list[str] = []
|
||||
missing_required_tags: dict[str, list[str]] = {}
|
||||
|
||||
for result in results:
|
||||
fixture_id = result["id"]
|
||||
expected = result["expected"]
|
||||
verdict = verdicts_by_id.get(fixture_id)
|
||||
if verdict is None:
|
||||
missing_verdicts.append(fixture_id)
|
||||
continue
|
||||
if verdict["disposition"] == "approve" and expected["expected_disposition"] != "approve":
|
||||
false_approves.append(fixture_id)
|
||||
if verdict["disposition"] == "reject" and expected["expected_disposition"] == "approve":
|
||||
false_rejects.append(fixture_id)
|
||||
if verdict.get("primary_agent") and verdict.get("primary_agent") != expected["primary_agent"]:
|
||||
route_mismatches.append(fixture_id)
|
||||
if verdict.get("required_agents") and verdict.get("required_agents") != expected["required_agents"]:
|
||||
route_mismatches.append(fixture_id)
|
||||
expected_tags = set(expected["issue_tags"])
|
||||
actual_tags = set(verdict.get("issue_tags", []))
|
||||
missing = sorted(expected_tags - actual_tags)
|
||||
if missing and expected["expected_disposition"] != "approve":
|
||||
missing_required_tags[fixture_id] = missing
|
||||
|
||||
return {
|
||||
"candidate_name": candidate["candidate_name"],
|
||||
"ok": not (missing_verdicts or false_approves or false_rejects or route_mismatches or missing_required_tags),
|
||||
"missing_verdicts": missing_verdicts,
|
||||
"false_approve_count": len(false_approves),
|
||||
"false_approves": false_approves,
|
||||
"false_reject_count": len(false_rejects),
|
||||
"false_rejects": false_rejects,
|
||||
"route_mismatches": sorted(set(route_mismatches)),
|
||||
"missing_required_tags": missing_required_tags,
|
||||
}
|
||||
|
||||
|
||||
def evaluate_fixtures(
|
||||
fixtures: list[dict[str, Any]],
|
||||
*,
|
||||
candidate: dict[str, Any] | None = None,
|
||||
) -> dict[str, Any]:
|
||||
results = [_fixture_result(fixture) for fixture in fixtures]
|
||||
fixture_count = len(results)
|
||||
route_ok_count = sum(1 for result in results if result["ok"])
|
||||
candidate_score = _score_candidate(results, candidate)
|
||||
proof_ok = route_ok_count == fixture_count and (candidate_score is None or candidate_score["ok"])
|
||||
return {
|
||||
"ok": proof_ok,
|
||||
"scope": "decision_engine_replay",
|
||||
"fixture_count": fixture_count,
|
||||
"metrics": {
|
||||
"route_accuracy": route_ok_count / fixture_count,
|
||||
"route_ok_count": route_ok_count,
|
||||
"lanes": dict(sorted(Counter(result["lane"] for result in results).items())),
|
||||
},
|
||||
"results": results,
|
||||
"candidate": candidate_score,
|
||||
}
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--fixtures-dir", default=str(DEFAULT_FIXTURES_DIR))
|
||||
parser.add_argument("--candidate-output")
|
||||
parser.add_argument("--output", default=str(DEFAULT_OUTPUT))
|
||||
args = parser.parse_args()
|
||||
|
||||
fixtures = load_fixtures(Path(args.fixtures_dir))
|
||||
candidate = _load_candidate_output(Path(args.candidate_output) if args.candidate_output else None)
|
||||
proof = evaluate_fixtures(fixtures, candidate=candidate)
|
||||
|
||||
output = Path(args.output)
|
||||
if not output.is_absolute():
|
||||
output = REPO_ROOT / output
|
||||
output.parent.mkdir(parents=True, exist_ok=True)
|
||||
output.write_text(json.dumps(proof, indent=2, sort_keys=True) + "\n")
|
||||
print(json.dumps(proof, indent=2, sort_keys=True))
|
||||
return 0 if proof["ok"] else 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
56
tests/test_decision_engine_replay.py
Normal file
56
tests/test_decision_engine_replay.py
Normal file
|
|
@ -0,0 +1,56 @@
|
|||
from __future__ import annotations
|
||||
|
||||
import importlib.util
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||
SCRIPT_PATH = REPO_ROOT / "scripts" / "replay_decision_engine_eval.py"
|
||||
FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||
|
||||
spec = importlib.util.spec_from_file_location("replay_decision_engine_eval", SCRIPT_PATH)
|
||||
replay = importlib.util.module_from_spec(spec)
|
||||
assert spec.loader is not None
|
||||
spec.loader.exec_module(replay)
|
||||
|
||||
|
||||
def test_default_decision_engine_fixtures_replay_cleanly():
|
||||
fixtures = replay.load_fixtures(FIXTURES_DIR)
|
||||
proof = replay.evaluate_fixtures(fixtures)
|
||||
|
||||
assert proof["ok"] is True
|
||||
assert proof["fixture_count"] == 3
|
||||
assert proof["metrics"]["route_accuracy"] == 1.0
|
||||
assert proof["metrics"]["lanes"] == {
|
||||
"kb-interop": 1,
|
||||
"rio-economics": 1,
|
||||
"theseus-model-integrity": 1,
|
||||
}
|
||||
|
||||
|
||||
def test_candidate_false_approve_is_caught(tmp_path):
|
||||
fixtures = replay.load_fixtures(FIXTURES_DIR)
|
||||
candidate_path = tmp_path / "candidate.json"
|
||||
candidate_path.write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"candidate_name": "bad-single-answer-model",
|
||||
"verdicts": [
|
||||
{
|
||||
"fixture_id": "theseus_live_model_switch_reject",
|
||||
"disposition": "approve",
|
||||
"issue_tags": [],
|
||||
"primary_agent": "Theseus",
|
||||
"required_agents": ["Theseus"],
|
||||
}
|
||||
],
|
||||
}
|
||||
)
|
||||
)
|
||||
|
||||
candidate = replay._load_candidate_output(candidate_path)
|
||||
proof = replay.evaluate_fixtures(fixtures, candidate=candidate)
|
||||
|
||||
assert proof["ok"] is False
|
||||
assert proof["candidate"]["false_approve_count"] == 1
|
||||
assert proof["candidate"]["false_approves"] == ["theseus_live_model_switch_reject"]
|
||||
Loading…
Reference in a new issue