Add decision engine replay harness

- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`docs/model-discovery-registry.md`
`fixtures/decision-engine-eval/kb_interop_propose_only.json`
`fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
`fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
`scripts/check_llm_refinement_contract.py`
`scripts/replay_decision_engine_eval.py`
`tests/test_decision_engine_replay.py`
This commit is contained in:
twentyOne2x 2026-06-01 17:37:38 +02:00
parent 27e48f3e16
commit 71ea7a625c
10 changed files with 560 additions and 1 deletions

View file

@ -79,10 +79,13 @@ jobs:
python3 scripts/check_crabbox_ci_contract.py
--output .crabbox-results/crabbox-ci-contract.json &&
python3 scripts/check_llm_refinement_contract.py
--output .crabbox-results/llm-refinement-contract.json
--output .crabbox-results/llm-refinement-contract.json &&
python3 scripts/replay_decision_engine_eval.py
--output .crabbox-results/decision-engine-eval.json
downloads:
- .crabbox-results/crabbox-ci-contract.json
- .crabbox-results/llm-refinement-contract.json
- .crabbox-results/decision-engine-eval.json
stop: always
unit:

View file

@ -44,8 +44,10 @@ jobs:
telegram/approvals.py \
scripts/check_crabbox_ci_contract.py \
scripts/check_llm_refinement_contract.py \
scripts/replay_decision_engine_eval.py \
scripts/prove_phase1b_local.py \
tests/test_agent_routing.py \
tests/test_decision_engine_replay.py \
tests/test_evaluate_agent_routing.py \
tests/test_phase1b_end_to_end.py \
tests/test_eval_parse.py \
@ -96,6 +98,8 @@ jobs:
--output .crabbox-results/crabbox-ci-contract.json
python scripts/check_llm_refinement_contract.py \
--output .crabbox-results/llm-refinement-contract.json
python scripts/replay_decision_engine_eval.py \
--output .crabbox-results/decision-engine-eval.json
- name: Upload contract artifacts
if: always()
uses: actions/upload-artifact@v4
@ -104,6 +108,7 @@ jobs:
path: |
.crabbox-results/crabbox-ci-contract.json
.crabbox-results/llm-refinement-contract.json
.crabbox-results/decision-engine-eval.json
if-no-files-found: error
phase1b-local-proof:

View file

@ -232,3 +232,5 @@ The 2026-06-01 working transcript adds these requirements:
7. Compare current prompt versus one candidate prompt before touching runtime prompts.
Do not start by changing live model assignments.
Run `python3 scripts/replay_decision_engine_eval.py` after changing fixture, rubric, registry, or candidate-output formats.

View file

@ -0,0 +1,75 @@
# Model Discovery Registry
Created: 2026-06-01
Status: candidate registry, not model approval
This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.
## Rules
- Use official provider docs, model cards, or source repositories for every entry.
- Treat all model specs, prices, context limits, and aliases as volatile.
- Do not switch runtime model assignments from this document alone.
- Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
- Prefer different model families for independent review so agreement is not just same-family correlation.
## Candidate Matrix
| Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source |
| --- | --- | --- | --- | --- |
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | [OpenAI models](https://platform.openai.com/docs/models) |
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | [OpenAI models](https://platform.openai.com/docs/models) |
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | [OpenAI open models](https://openai.com/open-models/) |
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | [OpenAI open models](https://openai.com/open-models/) |
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | [NousResearch Hermes 4 70B](https://huggingface.co/NousResearch/Hermes-4-70B) |
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | [Qwen3.5 9B model card](https://huggingface.co/Qwen/Qwen3.5-9B) |
## Bakeoff Intake Fields
Each candidate needs a retained record before a real bakeoff:
- provider or local runtime;
- exact model ID or pinned snapshot;
- source URL;
- license or terms surface;
- context window and max output if verified;
- structured-output support;
- tool/function calling support;
- expected hardware or hosted cost;
- latency estimate;
- privacy and data-retention posture;
- failure mode hypothesis;
- first fixture lane.
## First Bakeoff Order
1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
## Promotion Gate
A model can move from registry to runtime proposal only if the replay proof includes:
- exact model ID;
- fixture count;
- route accuracy;
- false approvals;
- false rejects;
- missing required issue tags;
- average latency;
- cost estimate;
- disagreement matrix against current baseline;
- one paragraph explaining why the observed disagreements are useful.
Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.

View file

@ -0,0 +1,43 @@
{
"id": "kb_interop_propose_only",
"lane": "kb-interop",
"description": "External agents should search and propose KB writes with proof, not push to main or mutate production DB state.",
"input": {
"branch": "theseus/kb-interop-fixture",
"title": "Portable KB read and writeback adapter",
"body": "Hermes, OpenClaw, Claude-style, and Pentagon agents need a safe way to search Living IP knowledge and propose source, claim, entity, or correction packets.",
"diff": "diff --git a/domains/ai-systems/kb-interop.md b/domains/ai-systems/kb-interop.md\n+Hermes runtime and OpenClaw agents can read cited KB context, propose claim files, and write proof artifacts.\n+They must not push to main, export Bitwarden, or mutate production pipeline.db.",
"proposed_write": {
"target": "proposal-branch",
"mutates_production_db": false,
"denied_actions_invoked": [],
"proof_path": ".crabbox-results/kb-interop-proof.json"
}
},
"rubric": {
"must_check": [
"cited reads",
"route evidence",
"proposal-only write path",
"denied tools not invoked"
],
"reject_if": [
"writes directly to main",
"mutates production pipeline.db",
"omits cited KB context"
]
},
"expected": {
"primary_agent": "Theseus",
"required_agents": [
"Theseus"
],
"route_kind": "single",
"expected_disposition": "approve",
"issue_tags": [
"citation_required",
"kb_propose_only",
"no_prod_db_write"
]
}
}

View file

@ -0,0 +1,37 @@
{
"id": "rio_meteora_lp_incentives",
"lane": "rio-economics",
"description": "Rio should own Meteora LP, paid-query, x402, futarchy, and contribution-incentive reasoning.",
"input": {
"branch": "rio/meteora-lp-incentives",
"title": "Meteora LP agent capital allocation rules",
"body": "Proposal asks whether a Living IP agent should route liquidity through Meteora while using paid queries and futarchy signals to allocate contributor rewards.",
"diff": "diff --git a/domains/internet-finance/meteora-lp-agent.md b/domains/internet-finance/meteora-lp-agent.md\n+Meteora LP agent capital allocation depends on x402 paid queries, decision markets, futarchy votes, and contribution weights.\n+The evaluator must check whether payment creates garbage incentives or useful knowledge."
},
"rubric": {
"must_check": [
"paid-query effects",
"liquidity pool risk",
"contribution-weight incentives",
"source-diversity and correlated-prior risk"
],
"reject_if": [
"treats payment as quality approval",
"ignores LP loss or protocol risk",
"omits OPSEC review for capital allocation"
]
},
"expected": {
"primary_agent": "Rio",
"required_agents": [
"Rio"
],
"route_kind": "single",
"expected_disposition": "escalate",
"issue_tags": [
"capital_allocation",
"incentive_design",
"paid_query_effects"
]
}
}

View file

@ -0,0 +1,37 @@
{
"id": "theseus_live_model_switch_reject",
"lane": "theseus-model-integrity",
"description": "Theseus should reject live model assignment changes based on one persuasive answer.",
"input": {
"branch": "theseus/model-switch-demo",
"title": "Switch live reviewers after one good model answer",
"body": "A candidate model sounded better on one example, so this proposal changes live model assignments without a replay corpus, disagreement metrics, or rollback proof.",
"diff": "diff --git a/domains/ai-systems/live-model-routing.md b/domains/ai-systems/live-model-routing.md\n+Change live model assignments today because a single LLM response looked more convincing.\n+Skip replay fixtures, skip disagreement queues, and let the agent self-upgrade its tool allowlist."
},
"rubric": {
"must_check": [
"replay evidence",
"model-family diversity",
"prompt/tool safety",
"rollback path"
],
"reject_if": [
"uses one answer as proof",
"lets an agent approve its own tool escalation",
"changes live routing without before/after metrics"
]
},
"expected": {
"primary_agent": "Theseus",
"required_agents": [
"Theseus"
],
"route_kind": "single",
"expected_disposition": "reject",
"issue_tags": [
"model_assignment_without_eval",
"self_upgrade_without_proof",
"tool_safety"
]
}
}

View file

@ -12,6 +12,8 @@ REPO_ROOT = Path(__file__).resolve().parents[1]
REQUIRED_FILES = {
"program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
"model_registry": REPO_ROOT / "docs" / "model-discovery-registry.md",
"replay_script": REPO_ROOT / "scripts" / "replay_decision_engine_eval.py",
"decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
"db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
"kb_skill": REPO_ROOT / ".agents" / "skills" / "living-ip-kb-interop" / "SKILL.md",
@ -29,6 +31,25 @@ PROGRAM_REQUIRED_PHRASES = [
"Model Discovery Registry",
"Any Hermes, OpenClaw, or Claude-style agent",
"Raw cards and secrets are not agent runtime inputs",
"scripts/replay_decision_engine_eval.py",
]
MODEL_REGISTRY_REQUIRED_PHRASES = [
"candidate registry, not model approval",
"GPT-5.5",
"gpt-oss-20b",
"Claude Opus 4.8",
"Gemini 3.5 Flash",
"Hermes 4 70B",
"Qwen3.5 9B",
"Zero false approvals on known-bad fixtures",
]
REPLAY_REQUIRED_PHRASES = [
"decision_engine_replay",
"false_approve_count",
"kb_interop_ok",
"route_accuracy",
]
SKILL_REQUIRED = {
@ -66,6 +87,16 @@ SKILL_REQUIRED = {
],
}
FIXTURE_REQUIRED = {
"rio_meteora_lp_incentives.json": ["rio-economics", "paid_query_effects", "Rio"],
"theseus_live_model_switch_reject.json": [
"theseus-model-integrity",
"model_assignment_without_eval",
"Theseus",
],
"kb_interop_propose_only.json": ["kb-interop", "no_prod_db_write", "Theseus"],
}
def _read(path: Path) -> str:
if not path.exists():
@ -92,6 +123,29 @@ def main() -> int:
if missing_program:
raise AssertionError(f"program doc missing phrases: {missing_program}")
model_registry = _read(REQUIRED_FILES["model_registry"])
missing_registry = [phrase for phrase in MODEL_REGISTRY_REQUIRED_PHRASES if phrase not in model_registry]
if missing_registry:
raise AssertionError(f"model registry missing phrases: {missing_registry}")
replay_script = _read(REQUIRED_FILES["replay_script"])
missing_replay = [phrase for phrase in REPLAY_REQUIRED_PHRASES if phrase not in replay_script]
if missing_replay:
raise AssertionError(f"replay script missing phrases: {missing_replay}")
fixture_checks = {}
fixtures_dir = REPO_ROOT / "fixtures" / "decision-engine-eval"
for filename, phrases in FIXTURE_REQUIRED.items():
path = fixtures_dir / filename
text = _read(path)
missing = [phrase for phrase in phrases if phrase not in text]
if missing:
raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing phrases: {missing}")
fixture_checks[filename] = {
"path": str(path.relative_to(REPO_ROOT)),
"phrases_checked": phrases,
}
skill_checks = {}
for key, phrases in SKILL_REQUIRED.items():
path = REQUIRED_FILES[key]
@ -109,7 +163,10 @@ def main() -> int:
"ok": True,
"scope": "llm_refinement_decision_engine_contract",
"program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
"model_registry": str(REQUIRED_FILES["model_registry"].relative_to(REPO_ROOT)),
"program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
"model_registry_phrases_checked": MODEL_REGISTRY_REQUIRED_PHRASES,
"fixtures": fixture_checks,
"skills": skill_checks,
"pivot": {
"infra_owner": "Pentagon.run",

View file

@ -0,0 +1,244 @@
#!/usr/bin/env python3
"""Replay fixture-backed decision-engine evals without live model calls."""
from __future__ import annotations
import argparse
import json
from collections import Counter
from pathlib import Path
from typing import Any
from lib.agent_routing import classify_pr_route
REPO_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
DEFAULT_OUTPUT = REPO_ROOT / ".crabbox-results" / "decision-engine-eval.json"
VALID_DISPOSITIONS = {"approve", "reject", "escalate"}
def _read_json(path: Path) -> dict[str, Any]:
with path.open() as fh:
data = json.load(fh)
if not isinstance(data, dict):
raise AssertionError(f"{path.relative_to(REPO_ROOT)} must contain a JSON object")
return data
def _require_dict(data: dict[str, Any], key: str, fixture_id: str) -> dict[str, Any]:
value = data.get(key)
if not isinstance(value, dict):
raise AssertionError(f"{fixture_id}: {key} must be an object")
return value
def _require_list(data: dict[str, Any], key: str, fixture_id: str) -> list[Any]:
value = data.get(key)
if not isinstance(value, list) or not value:
raise AssertionError(f"{fixture_id}: {key} must be a non-empty list")
return value
def _require_str(data: dict[str, Any], key: str, fixture_id: str) -> str:
value = data.get(key)
if not isinstance(value, str) or not value.strip():
raise AssertionError(f"{fixture_id}: {key} must be a non-empty string")
return value
def _validate_fixture(fixture: dict[str, Any], path: Path) -> None:
fixture_id = _require_str(fixture, "id", str(path))
_require_str(fixture, "lane", fixture_id)
input_data = _require_dict(fixture, "input", fixture_id)
rubric = _require_dict(fixture, "rubric", fixture_id)
expected = _require_dict(fixture, "expected", fixture_id)
_require_str(input_data, "diff", fixture_id)
_require_list(rubric, "must_check", fixture_id)
_require_list(rubric, "reject_if", fixture_id)
_require_str(expected, "primary_agent", fixture_id)
_require_list(expected, "required_agents", fixture_id)
_require_str(expected, "route_kind", fixture_id)
disposition = _require_str(expected, "expected_disposition", fixture_id)
if disposition not in VALID_DISPOSITIONS:
raise AssertionError(f"{fixture_id}: expected_disposition must be one of {sorted(VALID_DISPOSITIONS)}")
_require_list(expected, "issue_tags", fixture_id)
def load_fixtures(fixtures_dir: Path = DEFAULT_FIXTURES_DIR) -> list[dict[str, Any]]:
if not fixtures_dir.exists():
raise AssertionError(f"missing fixtures directory: {fixtures_dir.relative_to(REPO_ROOT)}")
fixtures = []
for path in sorted(fixtures_dir.glob("*.json")):
fixture = _read_json(path)
_validate_fixture(fixture, path)
fixtures.append(fixture)
if not fixtures:
raise AssertionError(f"no fixtures found in {fixtures_dir.relative_to(REPO_ROOT)}")
ids = [fixture["id"] for fixture in fixtures]
duplicates = [fixture_id for fixture_id, count in Counter(ids).items() if count > 1]
if duplicates:
raise AssertionError(f"duplicate fixture ids: {duplicates}")
return fixtures
def _kb_interop_ok(fixture: dict[str, Any]) -> bool | None:
if fixture["lane"] != "kb-interop":
return None
proposed_write = fixture["input"].get("proposed_write")
if not isinstance(proposed_write, dict):
return False
target = str(proposed_write.get("target", "")).lower()
denied_actions = proposed_write.get("denied_actions_invoked")
return (
target not in {"main", "production", "prod"}
and proposed_write.get("mutates_production_db") is False
and isinstance(denied_actions, list)
and not denied_actions
and bool(proposed_write.get("proof_path"))
)
def _fixture_result(fixture: dict[str, Any]) -> dict[str, Any]:
input_data = fixture["input"]
expected = fixture["expected"]
route = classify_pr_route(
input_data["diff"],
branch=input_data.get("branch"),
title=input_data.get("title"),
body=input_data.get("body"),
)
checks = {
"route_primary_ok": route.primary_agent == expected["primary_agent"],
"route_required_ok": list(route.required_agents) == expected["required_agents"],
"route_kind_ok": route.route_kind == expected["route_kind"],
"kb_interop_ok": _kb_interop_ok(fixture),
}
applicable_checks = [value for value in checks.values() if value is not None]
return {
"id": fixture["id"],
"lane": fixture["lane"],
"ok": all(applicable_checks),
"expected": expected,
"actual_route": route.to_audit_dict(),
"checks": checks,
"baseline_verdict": {
"disposition": expected["expected_disposition"],
"issue_tags": expected["issue_tags"],
"primary_agent": route.primary_agent,
"required_agents": list(route.required_agents),
"reason": "fixture truth with deterministic route evidence",
},
"rubric": fixture["rubric"],
}
def _load_candidate_output(path: Path | None) -> dict[str, Any] | None:
if path is None:
return None
candidate = _read_json(path)
_require_str(candidate, "candidate_name", str(path))
verdicts = candidate.get("verdicts")
if not isinstance(verdicts, list):
raise AssertionError(f"{path.relative_to(REPO_ROOT)}: verdicts must be a list")
return candidate
def _score_candidate(results: list[dict[str, Any]], candidate: dict[str, Any] | None) -> dict[str, Any] | None:
if candidate is None:
return None
verdicts_by_id = {}
for verdict in candidate["verdicts"]:
if not isinstance(verdict, dict):
raise AssertionError("candidate verdicts must be JSON objects")
fixture_id = _require_str(verdict, "fixture_id", candidate["candidate_name"])
disposition = _require_str(verdict, "disposition", fixture_id)
if disposition not in VALID_DISPOSITIONS:
raise AssertionError(f"{fixture_id}: candidate disposition must be one of {sorted(VALID_DISPOSITIONS)}")
verdicts_by_id[fixture_id] = verdict
missing_verdicts: list[str] = []
false_approves: list[str] = []
false_rejects: list[str] = []
route_mismatches: list[str] = []
missing_required_tags: dict[str, list[str]] = {}
for result in results:
fixture_id = result["id"]
expected = result["expected"]
verdict = verdicts_by_id.get(fixture_id)
if verdict is None:
missing_verdicts.append(fixture_id)
continue
if verdict["disposition"] == "approve" and expected["expected_disposition"] != "approve":
false_approves.append(fixture_id)
if verdict["disposition"] == "reject" and expected["expected_disposition"] == "approve":
false_rejects.append(fixture_id)
if verdict.get("primary_agent") and verdict.get("primary_agent") != expected["primary_agent"]:
route_mismatches.append(fixture_id)
if verdict.get("required_agents") and verdict.get("required_agents") != expected["required_agents"]:
route_mismatches.append(fixture_id)
expected_tags = set(expected["issue_tags"])
actual_tags = set(verdict.get("issue_tags", []))
missing = sorted(expected_tags - actual_tags)
if missing and expected["expected_disposition"] != "approve":
missing_required_tags[fixture_id] = missing
return {
"candidate_name": candidate["candidate_name"],
"ok": not (missing_verdicts or false_approves or false_rejects or route_mismatches or missing_required_tags),
"missing_verdicts": missing_verdicts,
"false_approve_count": len(false_approves),
"false_approves": false_approves,
"false_reject_count": len(false_rejects),
"false_rejects": false_rejects,
"route_mismatches": sorted(set(route_mismatches)),
"missing_required_tags": missing_required_tags,
}
def evaluate_fixtures(
fixtures: list[dict[str, Any]],
*,
candidate: dict[str, Any] | None = None,
) -> dict[str, Any]:
results = [_fixture_result(fixture) for fixture in fixtures]
fixture_count = len(results)
route_ok_count = sum(1 for result in results if result["ok"])
candidate_score = _score_candidate(results, candidate)
proof_ok = route_ok_count == fixture_count and (candidate_score is None or candidate_score["ok"])
return {
"ok": proof_ok,
"scope": "decision_engine_replay",
"fixture_count": fixture_count,
"metrics": {
"route_accuracy": route_ok_count / fixture_count,
"route_ok_count": route_ok_count,
"lanes": dict(sorted(Counter(result["lane"] for result in results).items())),
},
"results": results,
"candidate": candidate_score,
}
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--fixtures-dir", default=str(DEFAULT_FIXTURES_DIR))
parser.add_argument("--candidate-output")
parser.add_argument("--output", default=str(DEFAULT_OUTPUT))
args = parser.parse_args()
fixtures = load_fixtures(Path(args.fixtures_dir))
candidate = _load_candidate_output(Path(args.candidate_output) if args.candidate_output else None)
proof = evaluate_fixtures(fixtures, candidate=candidate)
output = Path(args.output)
if not output.is_absolute():
output = REPO_ROOT / output
output.parent.mkdir(parents=True, exist_ok=True)
output.write_text(json.dumps(proof, indent=2, sort_keys=True) + "\n")
print(json.dumps(proof, indent=2, sort_keys=True))
return 0 if proof["ok"] else 1
if __name__ == "__main__":
raise SystemExit(main())

View file

@ -0,0 +1,56 @@
from __future__ import annotations
import importlib.util
import json
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parents[1]
SCRIPT_PATH = REPO_ROOT / "scripts" / "replay_decision_engine_eval.py"
FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
spec = importlib.util.spec_from_file_location("replay_decision_engine_eval", SCRIPT_PATH)
replay = importlib.util.module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(replay)
def test_default_decision_engine_fixtures_replay_cleanly():
fixtures = replay.load_fixtures(FIXTURES_DIR)
proof = replay.evaluate_fixtures(fixtures)
assert proof["ok"] is True
assert proof["fixture_count"] == 3
assert proof["metrics"]["route_accuracy"] == 1.0
assert proof["metrics"]["lanes"] == {
"kb-interop": 1,
"rio-economics": 1,
"theseus-model-integrity": 1,
}
def test_candidate_false_approve_is_caught(tmp_path):
fixtures = replay.load_fixtures(FIXTURES_DIR)
candidate_path = tmp_path / "candidate.json"
candidate_path.write_text(
json.dumps(
{
"candidate_name": "bad-single-answer-model",
"verdicts": [
{
"fixture_id": "theseus_live_model_switch_reject",
"disposition": "approve",
"issue_tags": [],
"primary_agent": "Theseus",
"required_agents": ["Theseus"],
}
],
}
)
)
candidate = replay._load_candidate_output(candidate_path)
proof = replay.evaluate_fixtures(fixtures, candidate=candidate)
assert proof["ok"] is False
assert proof["candidate"]["false_approve_count"] == 1
assert proof["candidate"]["false_approves"] == ["theseus_live_model_switch_reject"]