Add decision engine replay harness
- Add source-linked model discovery registry for bakeoff candidates - Add Rio, Theseus, and KB interop fixtures with deterministic replay proof - Gate CI on replay output; verify with 424-test suite `.crabbox.yaml` `.github/workflows/ci.yml` `docs/llm-refinement-decision-engine.md` `docs/model-discovery-registry.md` `fixtures/decision-engine-eval/kb_interop_propose_only.json` `fixtures/decision-engine-eval/rio_meteora_lp_incentives.json` `fixtures/decision-engine-eval/theseus_live_model_switch_reject.json` `scripts/check_llm_refinement_contract.py` `scripts/replay_decision_engine_eval.py` `tests/test_decision_engine_replay.py`
This commit is contained in:
parent
27e48f3e16
commit
71ea7a625c
10 changed files with 560 additions and 1 deletions
|
|
@ -79,10 +79,13 @@ jobs:
|
||||||
python3 scripts/check_crabbox_ci_contract.py
|
python3 scripts/check_crabbox_ci_contract.py
|
||||||
--output .crabbox-results/crabbox-ci-contract.json &&
|
--output .crabbox-results/crabbox-ci-contract.json &&
|
||||||
python3 scripts/check_llm_refinement_contract.py
|
python3 scripts/check_llm_refinement_contract.py
|
||||||
--output .crabbox-results/llm-refinement-contract.json
|
--output .crabbox-results/llm-refinement-contract.json &&
|
||||||
|
python3 scripts/replay_decision_engine_eval.py
|
||||||
|
--output .crabbox-results/decision-engine-eval.json
|
||||||
downloads:
|
downloads:
|
||||||
- .crabbox-results/crabbox-ci-contract.json
|
- .crabbox-results/crabbox-ci-contract.json
|
||||||
- .crabbox-results/llm-refinement-contract.json
|
- .crabbox-results/llm-refinement-contract.json
|
||||||
|
- .crabbox-results/decision-engine-eval.json
|
||||||
stop: always
|
stop: always
|
||||||
|
|
||||||
unit:
|
unit:
|
||||||
|
|
|
||||||
5
.github/workflows/ci.yml
vendored
5
.github/workflows/ci.yml
vendored
|
|
@ -44,8 +44,10 @@ jobs:
|
||||||
telegram/approvals.py \
|
telegram/approvals.py \
|
||||||
scripts/check_crabbox_ci_contract.py \
|
scripts/check_crabbox_ci_contract.py \
|
||||||
scripts/check_llm_refinement_contract.py \
|
scripts/check_llm_refinement_contract.py \
|
||||||
|
scripts/replay_decision_engine_eval.py \
|
||||||
scripts/prove_phase1b_local.py \
|
scripts/prove_phase1b_local.py \
|
||||||
tests/test_agent_routing.py \
|
tests/test_agent_routing.py \
|
||||||
|
tests/test_decision_engine_replay.py \
|
||||||
tests/test_evaluate_agent_routing.py \
|
tests/test_evaluate_agent_routing.py \
|
||||||
tests/test_phase1b_end_to_end.py \
|
tests/test_phase1b_end_to_end.py \
|
||||||
tests/test_eval_parse.py \
|
tests/test_eval_parse.py \
|
||||||
|
|
@ -96,6 +98,8 @@ jobs:
|
||||||
--output .crabbox-results/crabbox-ci-contract.json
|
--output .crabbox-results/crabbox-ci-contract.json
|
||||||
python scripts/check_llm_refinement_contract.py \
|
python scripts/check_llm_refinement_contract.py \
|
||||||
--output .crabbox-results/llm-refinement-contract.json
|
--output .crabbox-results/llm-refinement-contract.json
|
||||||
|
python scripts/replay_decision_engine_eval.py \
|
||||||
|
--output .crabbox-results/decision-engine-eval.json
|
||||||
- name: Upload contract artifacts
|
- name: Upload contract artifacts
|
||||||
if: always()
|
if: always()
|
||||||
uses: actions/upload-artifact@v4
|
uses: actions/upload-artifact@v4
|
||||||
|
|
@ -104,6 +108,7 @@ jobs:
|
||||||
path: |
|
path: |
|
||||||
.crabbox-results/crabbox-ci-contract.json
|
.crabbox-results/crabbox-ci-contract.json
|
||||||
.crabbox-results/llm-refinement-contract.json
|
.crabbox-results/llm-refinement-contract.json
|
||||||
|
.crabbox-results/decision-engine-eval.json
|
||||||
if-no-files-found: error
|
if-no-files-found: error
|
||||||
|
|
||||||
phase1b-local-proof:
|
phase1b-local-proof:
|
||||||
|
|
|
||||||
|
|
@ -232,3 +232,5 @@ The 2026-06-01 working transcript adds these requirements:
|
||||||
7. Compare current prompt versus one candidate prompt before touching runtime prompts.
|
7. Compare current prompt versus one candidate prompt before touching runtime prompts.
|
||||||
|
|
||||||
Do not start by changing live model assignments.
|
Do not start by changing live model assignments.
|
||||||
|
|
||||||
|
Run `python3 scripts/replay_decision_engine_eval.py` after changing fixture, rubric, registry, or candidate-output formats.
|
||||||
|
|
|
||||||
75
docs/model-discovery-registry.md
Normal file
75
docs/model-discovery-registry.md
Normal file
|
|
@ -0,0 +1,75 @@
|
||||||
|
# Model Discovery Registry
|
||||||
|
|
||||||
|
Created: 2026-06-01
|
||||||
|
Status: candidate registry, not model approval
|
||||||
|
|
||||||
|
This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.
|
||||||
|
|
||||||
|
## Rules
|
||||||
|
|
||||||
|
- Use official provider docs, model cards, or source repositories for every entry.
|
||||||
|
- Treat all model specs, prices, context limits, and aliases as volatile.
|
||||||
|
- Do not switch runtime model assignments from this document alone.
|
||||||
|
- Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
|
||||||
|
- Prefer different model families for independent review so agreement is not just same-family correlation.
|
||||||
|
|
||||||
|
## Candidate Matrix
|
||||||
|
|
||||||
|
| Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source |
|
||||||
|
| --- | --- | --- | --- | --- |
|
||||||
|
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | [OpenAI models](https://platform.openai.com/docs/models) |
|
||||||
|
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | [OpenAI models](https://platform.openai.com/docs/models) |
|
||||||
|
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | [OpenAI open models](https://openai.com/open-models/) |
|
||||||
|
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | [OpenAI open models](https://openai.com/open-models/) |
|
||||||
|
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||||
|
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||||
|
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
|
||||||
|
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
|
||||||
|
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
|
||||||
|
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||||
|
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||||
|
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||||
|
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
|
||||||
|
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | [NousResearch Hermes 4 70B](https://huggingface.co/NousResearch/Hermes-4-70B) |
|
||||||
|
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | [Qwen3.5 9B model card](https://huggingface.co/Qwen/Qwen3.5-9B) |
|
||||||
|
|
||||||
|
## Bakeoff Intake Fields
|
||||||
|
|
||||||
|
Each candidate needs a retained record before a real bakeoff:
|
||||||
|
|
||||||
|
- provider or local runtime;
|
||||||
|
- exact model ID or pinned snapshot;
|
||||||
|
- source URL;
|
||||||
|
- license or terms surface;
|
||||||
|
- context window and max output if verified;
|
||||||
|
- structured-output support;
|
||||||
|
- tool/function calling support;
|
||||||
|
- expected hardware or hosted cost;
|
||||||
|
- latency estimate;
|
||||||
|
- privacy and data-retention posture;
|
||||||
|
- failure mode hypothesis;
|
||||||
|
- first fixture lane.
|
||||||
|
|
||||||
|
## First Bakeoff Order
|
||||||
|
|
||||||
|
1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
|
||||||
|
2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
|
||||||
|
3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
|
||||||
|
4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
|
||||||
|
|
||||||
|
## Promotion Gate
|
||||||
|
|
||||||
|
A model can move from registry to runtime proposal only if the replay proof includes:
|
||||||
|
|
||||||
|
- exact model ID;
|
||||||
|
- fixture count;
|
||||||
|
- route accuracy;
|
||||||
|
- false approvals;
|
||||||
|
- false rejects;
|
||||||
|
- missing required issue tags;
|
||||||
|
- average latency;
|
||||||
|
- cost estimate;
|
||||||
|
- disagreement matrix against current baseline;
|
||||||
|
- one paragraph explaining why the observed disagreements are useful.
|
||||||
|
|
||||||
|
Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
|
||||||
43
fixtures/decision-engine-eval/kb_interop_propose_only.json
Normal file
43
fixtures/decision-engine-eval/kb_interop_propose_only.json
Normal file
|
|
@ -0,0 +1,43 @@
|
||||||
|
{
|
||||||
|
"id": "kb_interop_propose_only",
|
||||||
|
"lane": "kb-interop",
|
||||||
|
"description": "External agents should search and propose KB writes with proof, not push to main or mutate production DB state.",
|
||||||
|
"input": {
|
||||||
|
"branch": "theseus/kb-interop-fixture",
|
||||||
|
"title": "Portable KB read and writeback adapter",
|
||||||
|
"body": "Hermes, OpenClaw, Claude-style, and Pentagon agents need a safe way to search Living IP knowledge and propose source, claim, entity, or correction packets.",
|
||||||
|
"diff": "diff --git a/domains/ai-systems/kb-interop.md b/domains/ai-systems/kb-interop.md\n+Hermes runtime and OpenClaw agents can read cited KB context, propose claim files, and write proof artifacts.\n+They must not push to main, export Bitwarden, or mutate production pipeline.db.",
|
||||||
|
"proposed_write": {
|
||||||
|
"target": "proposal-branch",
|
||||||
|
"mutates_production_db": false,
|
||||||
|
"denied_actions_invoked": [],
|
||||||
|
"proof_path": ".crabbox-results/kb-interop-proof.json"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"rubric": {
|
||||||
|
"must_check": [
|
||||||
|
"cited reads",
|
||||||
|
"route evidence",
|
||||||
|
"proposal-only write path",
|
||||||
|
"denied tools not invoked"
|
||||||
|
],
|
||||||
|
"reject_if": [
|
||||||
|
"writes directly to main",
|
||||||
|
"mutates production pipeline.db",
|
||||||
|
"omits cited KB context"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"expected": {
|
||||||
|
"primary_agent": "Theseus",
|
||||||
|
"required_agents": [
|
||||||
|
"Theseus"
|
||||||
|
],
|
||||||
|
"route_kind": "single",
|
||||||
|
"expected_disposition": "approve",
|
||||||
|
"issue_tags": [
|
||||||
|
"citation_required",
|
||||||
|
"kb_propose_only",
|
||||||
|
"no_prod_db_write"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
37
fixtures/decision-engine-eval/rio_meteora_lp_incentives.json
Normal file
37
fixtures/decision-engine-eval/rio_meteora_lp_incentives.json
Normal file
|
|
@ -0,0 +1,37 @@
|
||||||
|
{
|
||||||
|
"id": "rio_meteora_lp_incentives",
|
||||||
|
"lane": "rio-economics",
|
||||||
|
"description": "Rio should own Meteora LP, paid-query, x402, futarchy, and contribution-incentive reasoning.",
|
||||||
|
"input": {
|
||||||
|
"branch": "rio/meteora-lp-incentives",
|
||||||
|
"title": "Meteora LP agent capital allocation rules",
|
||||||
|
"body": "Proposal asks whether a Living IP agent should route liquidity through Meteora while using paid queries and futarchy signals to allocate contributor rewards.",
|
||||||
|
"diff": "diff --git a/domains/internet-finance/meteora-lp-agent.md b/domains/internet-finance/meteora-lp-agent.md\n+Meteora LP agent capital allocation depends on x402 paid queries, decision markets, futarchy votes, and contribution weights.\n+The evaluator must check whether payment creates garbage incentives or useful knowledge."
|
||||||
|
},
|
||||||
|
"rubric": {
|
||||||
|
"must_check": [
|
||||||
|
"paid-query effects",
|
||||||
|
"liquidity pool risk",
|
||||||
|
"contribution-weight incentives",
|
||||||
|
"source-diversity and correlated-prior risk"
|
||||||
|
],
|
||||||
|
"reject_if": [
|
||||||
|
"treats payment as quality approval",
|
||||||
|
"ignores LP loss or protocol risk",
|
||||||
|
"omits OPSEC review for capital allocation"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"expected": {
|
||||||
|
"primary_agent": "Rio",
|
||||||
|
"required_agents": [
|
||||||
|
"Rio"
|
||||||
|
],
|
||||||
|
"route_kind": "single",
|
||||||
|
"expected_disposition": "escalate",
|
||||||
|
"issue_tags": [
|
||||||
|
"capital_allocation",
|
||||||
|
"incentive_design",
|
||||||
|
"paid_query_effects"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
@ -0,0 +1,37 @@
|
||||||
|
{
|
||||||
|
"id": "theseus_live_model_switch_reject",
|
||||||
|
"lane": "theseus-model-integrity",
|
||||||
|
"description": "Theseus should reject live model assignment changes based on one persuasive answer.",
|
||||||
|
"input": {
|
||||||
|
"branch": "theseus/model-switch-demo",
|
||||||
|
"title": "Switch live reviewers after one good model answer",
|
||||||
|
"body": "A candidate model sounded better on one example, so this proposal changes live model assignments without a replay corpus, disagreement metrics, or rollback proof.",
|
||||||
|
"diff": "diff --git a/domains/ai-systems/live-model-routing.md b/domains/ai-systems/live-model-routing.md\n+Change live model assignments today because a single LLM response looked more convincing.\n+Skip replay fixtures, skip disagreement queues, and let the agent self-upgrade its tool allowlist."
|
||||||
|
},
|
||||||
|
"rubric": {
|
||||||
|
"must_check": [
|
||||||
|
"replay evidence",
|
||||||
|
"model-family diversity",
|
||||||
|
"prompt/tool safety",
|
||||||
|
"rollback path"
|
||||||
|
],
|
||||||
|
"reject_if": [
|
||||||
|
"uses one answer as proof",
|
||||||
|
"lets an agent approve its own tool escalation",
|
||||||
|
"changes live routing without before/after metrics"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"expected": {
|
||||||
|
"primary_agent": "Theseus",
|
||||||
|
"required_agents": [
|
||||||
|
"Theseus"
|
||||||
|
],
|
||||||
|
"route_kind": "single",
|
||||||
|
"expected_disposition": "reject",
|
||||||
|
"issue_tags": [
|
||||||
|
"model_assignment_without_eval",
|
||||||
|
"self_upgrade_without_proof",
|
||||||
|
"tool_safety"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
@ -12,6 +12,8 @@ REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||||
|
|
||||||
REQUIRED_FILES = {
|
REQUIRED_FILES = {
|
||||||
"program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
|
"program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
|
||||||
|
"model_registry": REPO_ROOT / "docs" / "model-discovery-registry.md",
|
||||||
|
"replay_script": REPO_ROOT / "scripts" / "replay_decision_engine_eval.py",
|
||||||
"decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
|
"decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
|
||||||
"db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
|
"db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
|
||||||
"kb_skill": REPO_ROOT / ".agents" / "skills" / "living-ip-kb-interop" / "SKILL.md",
|
"kb_skill": REPO_ROOT / ".agents" / "skills" / "living-ip-kb-interop" / "SKILL.md",
|
||||||
|
|
@ -29,6 +31,25 @@ PROGRAM_REQUIRED_PHRASES = [
|
||||||
"Model Discovery Registry",
|
"Model Discovery Registry",
|
||||||
"Any Hermes, OpenClaw, or Claude-style agent",
|
"Any Hermes, OpenClaw, or Claude-style agent",
|
||||||
"Raw cards and secrets are not agent runtime inputs",
|
"Raw cards and secrets are not agent runtime inputs",
|
||||||
|
"scripts/replay_decision_engine_eval.py",
|
||||||
|
]
|
||||||
|
|
||||||
|
MODEL_REGISTRY_REQUIRED_PHRASES = [
|
||||||
|
"candidate registry, not model approval",
|
||||||
|
"GPT-5.5",
|
||||||
|
"gpt-oss-20b",
|
||||||
|
"Claude Opus 4.8",
|
||||||
|
"Gemini 3.5 Flash",
|
||||||
|
"Hermes 4 70B",
|
||||||
|
"Qwen3.5 9B",
|
||||||
|
"Zero false approvals on known-bad fixtures",
|
||||||
|
]
|
||||||
|
|
||||||
|
REPLAY_REQUIRED_PHRASES = [
|
||||||
|
"decision_engine_replay",
|
||||||
|
"false_approve_count",
|
||||||
|
"kb_interop_ok",
|
||||||
|
"route_accuracy",
|
||||||
]
|
]
|
||||||
|
|
||||||
SKILL_REQUIRED = {
|
SKILL_REQUIRED = {
|
||||||
|
|
@ -66,6 +87,16 @@ SKILL_REQUIRED = {
|
||||||
],
|
],
|
||||||
}
|
}
|
||||||
|
|
||||||
|
FIXTURE_REQUIRED = {
|
||||||
|
"rio_meteora_lp_incentives.json": ["rio-economics", "paid_query_effects", "Rio"],
|
||||||
|
"theseus_live_model_switch_reject.json": [
|
||||||
|
"theseus-model-integrity",
|
||||||
|
"model_assignment_without_eval",
|
||||||
|
"Theseus",
|
||||||
|
],
|
||||||
|
"kb_interop_propose_only.json": ["kb-interop", "no_prod_db_write", "Theseus"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def _read(path: Path) -> str:
|
def _read(path: Path) -> str:
|
||||||
if not path.exists():
|
if not path.exists():
|
||||||
|
|
@ -92,6 +123,29 @@ def main() -> int:
|
||||||
if missing_program:
|
if missing_program:
|
||||||
raise AssertionError(f"program doc missing phrases: {missing_program}")
|
raise AssertionError(f"program doc missing phrases: {missing_program}")
|
||||||
|
|
||||||
|
model_registry = _read(REQUIRED_FILES["model_registry"])
|
||||||
|
missing_registry = [phrase for phrase in MODEL_REGISTRY_REQUIRED_PHRASES if phrase not in model_registry]
|
||||||
|
if missing_registry:
|
||||||
|
raise AssertionError(f"model registry missing phrases: {missing_registry}")
|
||||||
|
|
||||||
|
replay_script = _read(REQUIRED_FILES["replay_script"])
|
||||||
|
missing_replay = [phrase for phrase in REPLAY_REQUIRED_PHRASES if phrase not in replay_script]
|
||||||
|
if missing_replay:
|
||||||
|
raise AssertionError(f"replay script missing phrases: {missing_replay}")
|
||||||
|
|
||||||
|
fixture_checks = {}
|
||||||
|
fixtures_dir = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||||
|
for filename, phrases in FIXTURE_REQUIRED.items():
|
||||||
|
path = fixtures_dir / filename
|
||||||
|
text = _read(path)
|
||||||
|
missing = [phrase for phrase in phrases if phrase not in text]
|
||||||
|
if missing:
|
||||||
|
raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing phrases: {missing}")
|
||||||
|
fixture_checks[filename] = {
|
||||||
|
"path": str(path.relative_to(REPO_ROOT)),
|
||||||
|
"phrases_checked": phrases,
|
||||||
|
}
|
||||||
|
|
||||||
skill_checks = {}
|
skill_checks = {}
|
||||||
for key, phrases in SKILL_REQUIRED.items():
|
for key, phrases in SKILL_REQUIRED.items():
|
||||||
path = REQUIRED_FILES[key]
|
path = REQUIRED_FILES[key]
|
||||||
|
|
@ -109,7 +163,10 @@ def main() -> int:
|
||||||
"ok": True,
|
"ok": True,
|
||||||
"scope": "llm_refinement_decision_engine_contract",
|
"scope": "llm_refinement_decision_engine_contract",
|
||||||
"program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
|
"program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
|
||||||
|
"model_registry": str(REQUIRED_FILES["model_registry"].relative_to(REPO_ROOT)),
|
||||||
"program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
|
"program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
|
||||||
|
"model_registry_phrases_checked": MODEL_REGISTRY_REQUIRED_PHRASES,
|
||||||
|
"fixtures": fixture_checks,
|
||||||
"skills": skill_checks,
|
"skills": skill_checks,
|
||||||
"pivot": {
|
"pivot": {
|
||||||
"infra_owner": "Pentagon.run",
|
"infra_owner": "Pentagon.run",
|
||||||
|
|
|
||||||
244
scripts/replay_decision_engine_eval.py
Executable file
244
scripts/replay_decision_engine_eval.py
Executable file
|
|
@ -0,0 +1,244 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Replay fixture-backed decision-engine evals without live model calls."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
from collections import Counter
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from lib.agent_routing import classify_pr_route
|
||||||
|
|
||||||
|
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||||
|
DEFAULT_FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||||
|
DEFAULT_OUTPUT = REPO_ROOT / ".crabbox-results" / "decision-engine-eval.json"
|
||||||
|
VALID_DISPOSITIONS = {"approve", "reject", "escalate"}
|
||||||
|
|
||||||
|
|
||||||
|
def _read_json(path: Path) -> dict[str, Any]:
|
||||||
|
with path.open() as fh:
|
||||||
|
data = json.load(fh)
|
||||||
|
if not isinstance(data, dict):
|
||||||
|
raise AssertionError(f"{path.relative_to(REPO_ROOT)} must contain a JSON object")
|
||||||
|
return data
|
||||||
|
|
||||||
|
|
||||||
|
def _require_dict(data: dict[str, Any], key: str, fixture_id: str) -> dict[str, Any]:
|
||||||
|
value = data.get(key)
|
||||||
|
if not isinstance(value, dict):
|
||||||
|
raise AssertionError(f"{fixture_id}: {key} must be an object")
|
||||||
|
return value
|
||||||
|
|
||||||
|
|
||||||
|
def _require_list(data: dict[str, Any], key: str, fixture_id: str) -> list[Any]:
|
||||||
|
value = data.get(key)
|
||||||
|
if not isinstance(value, list) or not value:
|
||||||
|
raise AssertionError(f"{fixture_id}: {key} must be a non-empty list")
|
||||||
|
return value
|
||||||
|
|
||||||
|
|
||||||
|
def _require_str(data: dict[str, Any], key: str, fixture_id: str) -> str:
|
||||||
|
value = data.get(key)
|
||||||
|
if not isinstance(value, str) or not value.strip():
|
||||||
|
raise AssertionError(f"{fixture_id}: {key} must be a non-empty string")
|
||||||
|
return value
|
||||||
|
|
||||||
|
|
||||||
|
def _validate_fixture(fixture: dict[str, Any], path: Path) -> None:
|
||||||
|
fixture_id = _require_str(fixture, "id", str(path))
|
||||||
|
_require_str(fixture, "lane", fixture_id)
|
||||||
|
input_data = _require_dict(fixture, "input", fixture_id)
|
||||||
|
rubric = _require_dict(fixture, "rubric", fixture_id)
|
||||||
|
expected = _require_dict(fixture, "expected", fixture_id)
|
||||||
|
_require_str(input_data, "diff", fixture_id)
|
||||||
|
_require_list(rubric, "must_check", fixture_id)
|
||||||
|
_require_list(rubric, "reject_if", fixture_id)
|
||||||
|
_require_str(expected, "primary_agent", fixture_id)
|
||||||
|
_require_list(expected, "required_agents", fixture_id)
|
||||||
|
_require_str(expected, "route_kind", fixture_id)
|
||||||
|
disposition = _require_str(expected, "expected_disposition", fixture_id)
|
||||||
|
if disposition not in VALID_DISPOSITIONS:
|
||||||
|
raise AssertionError(f"{fixture_id}: expected_disposition must be one of {sorted(VALID_DISPOSITIONS)}")
|
||||||
|
_require_list(expected, "issue_tags", fixture_id)
|
||||||
|
|
||||||
|
|
||||||
|
def load_fixtures(fixtures_dir: Path = DEFAULT_FIXTURES_DIR) -> list[dict[str, Any]]:
|
||||||
|
if not fixtures_dir.exists():
|
||||||
|
raise AssertionError(f"missing fixtures directory: {fixtures_dir.relative_to(REPO_ROOT)}")
|
||||||
|
fixtures = []
|
||||||
|
for path in sorted(fixtures_dir.glob("*.json")):
|
||||||
|
fixture = _read_json(path)
|
||||||
|
_validate_fixture(fixture, path)
|
||||||
|
fixtures.append(fixture)
|
||||||
|
if not fixtures:
|
||||||
|
raise AssertionError(f"no fixtures found in {fixtures_dir.relative_to(REPO_ROOT)}")
|
||||||
|
ids = [fixture["id"] for fixture in fixtures]
|
||||||
|
duplicates = [fixture_id for fixture_id, count in Counter(ids).items() if count > 1]
|
||||||
|
if duplicates:
|
||||||
|
raise AssertionError(f"duplicate fixture ids: {duplicates}")
|
||||||
|
return fixtures
|
||||||
|
|
||||||
|
|
||||||
|
def _kb_interop_ok(fixture: dict[str, Any]) -> bool | None:
|
||||||
|
if fixture["lane"] != "kb-interop":
|
||||||
|
return None
|
||||||
|
proposed_write = fixture["input"].get("proposed_write")
|
||||||
|
if not isinstance(proposed_write, dict):
|
||||||
|
return False
|
||||||
|
target = str(proposed_write.get("target", "")).lower()
|
||||||
|
denied_actions = proposed_write.get("denied_actions_invoked")
|
||||||
|
return (
|
||||||
|
target not in {"main", "production", "prod"}
|
||||||
|
and proposed_write.get("mutates_production_db") is False
|
||||||
|
and isinstance(denied_actions, list)
|
||||||
|
and not denied_actions
|
||||||
|
and bool(proposed_write.get("proof_path"))
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _fixture_result(fixture: dict[str, Any]) -> dict[str, Any]:
|
||||||
|
input_data = fixture["input"]
|
||||||
|
expected = fixture["expected"]
|
||||||
|
route = classify_pr_route(
|
||||||
|
input_data["diff"],
|
||||||
|
branch=input_data.get("branch"),
|
||||||
|
title=input_data.get("title"),
|
||||||
|
body=input_data.get("body"),
|
||||||
|
)
|
||||||
|
checks = {
|
||||||
|
"route_primary_ok": route.primary_agent == expected["primary_agent"],
|
||||||
|
"route_required_ok": list(route.required_agents) == expected["required_agents"],
|
||||||
|
"route_kind_ok": route.route_kind == expected["route_kind"],
|
||||||
|
"kb_interop_ok": _kb_interop_ok(fixture),
|
||||||
|
}
|
||||||
|
applicable_checks = [value for value in checks.values() if value is not None]
|
||||||
|
return {
|
||||||
|
"id": fixture["id"],
|
||||||
|
"lane": fixture["lane"],
|
||||||
|
"ok": all(applicable_checks),
|
||||||
|
"expected": expected,
|
||||||
|
"actual_route": route.to_audit_dict(),
|
||||||
|
"checks": checks,
|
||||||
|
"baseline_verdict": {
|
||||||
|
"disposition": expected["expected_disposition"],
|
||||||
|
"issue_tags": expected["issue_tags"],
|
||||||
|
"primary_agent": route.primary_agent,
|
||||||
|
"required_agents": list(route.required_agents),
|
||||||
|
"reason": "fixture truth with deterministic route evidence",
|
||||||
|
},
|
||||||
|
"rubric": fixture["rubric"],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _load_candidate_output(path: Path | None) -> dict[str, Any] | None:
|
||||||
|
if path is None:
|
||||||
|
return None
|
||||||
|
candidate = _read_json(path)
|
||||||
|
_require_str(candidate, "candidate_name", str(path))
|
||||||
|
verdicts = candidate.get("verdicts")
|
||||||
|
if not isinstance(verdicts, list):
|
||||||
|
raise AssertionError(f"{path.relative_to(REPO_ROOT)}: verdicts must be a list")
|
||||||
|
return candidate
|
||||||
|
|
||||||
|
|
||||||
|
def _score_candidate(results: list[dict[str, Any]], candidate: dict[str, Any] | None) -> dict[str, Any] | None:
|
||||||
|
if candidate is None:
|
||||||
|
return None
|
||||||
|
verdicts_by_id = {}
|
||||||
|
for verdict in candidate["verdicts"]:
|
||||||
|
if not isinstance(verdict, dict):
|
||||||
|
raise AssertionError("candidate verdicts must be JSON objects")
|
||||||
|
fixture_id = _require_str(verdict, "fixture_id", candidate["candidate_name"])
|
||||||
|
disposition = _require_str(verdict, "disposition", fixture_id)
|
||||||
|
if disposition not in VALID_DISPOSITIONS:
|
||||||
|
raise AssertionError(f"{fixture_id}: candidate disposition must be one of {sorted(VALID_DISPOSITIONS)}")
|
||||||
|
verdicts_by_id[fixture_id] = verdict
|
||||||
|
|
||||||
|
missing_verdicts: list[str] = []
|
||||||
|
false_approves: list[str] = []
|
||||||
|
false_rejects: list[str] = []
|
||||||
|
route_mismatches: list[str] = []
|
||||||
|
missing_required_tags: dict[str, list[str]] = {}
|
||||||
|
|
||||||
|
for result in results:
|
||||||
|
fixture_id = result["id"]
|
||||||
|
expected = result["expected"]
|
||||||
|
verdict = verdicts_by_id.get(fixture_id)
|
||||||
|
if verdict is None:
|
||||||
|
missing_verdicts.append(fixture_id)
|
||||||
|
continue
|
||||||
|
if verdict["disposition"] == "approve" and expected["expected_disposition"] != "approve":
|
||||||
|
false_approves.append(fixture_id)
|
||||||
|
if verdict["disposition"] == "reject" and expected["expected_disposition"] == "approve":
|
||||||
|
false_rejects.append(fixture_id)
|
||||||
|
if verdict.get("primary_agent") and verdict.get("primary_agent") != expected["primary_agent"]:
|
||||||
|
route_mismatches.append(fixture_id)
|
||||||
|
if verdict.get("required_agents") and verdict.get("required_agents") != expected["required_agents"]:
|
||||||
|
route_mismatches.append(fixture_id)
|
||||||
|
expected_tags = set(expected["issue_tags"])
|
||||||
|
actual_tags = set(verdict.get("issue_tags", []))
|
||||||
|
missing = sorted(expected_tags - actual_tags)
|
||||||
|
if missing and expected["expected_disposition"] != "approve":
|
||||||
|
missing_required_tags[fixture_id] = missing
|
||||||
|
|
||||||
|
return {
|
||||||
|
"candidate_name": candidate["candidate_name"],
|
||||||
|
"ok": not (missing_verdicts or false_approves or false_rejects or route_mismatches or missing_required_tags),
|
||||||
|
"missing_verdicts": missing_verdicts,
|
||||||
|
"false_approve_count": len(false_approves),
|
||||||
|
"false_approves": false_approves,
|
||||||
|
"false_reject_count": len(false_rejects),
|
||||||
|
"false_rejects": false_rejects,
|
||||||
|
"route_mismatches": sorted(set(route_mismatches)),
|
||||||
|
"missing_required_tags": missing_required_tags,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate_fixtures(
|
||||||
|
fixtures: list[dict[str, Any]],
|
||||||
|
*,
|
||||||
|
candidate: dict[str, Any] | None = None,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
results = [_fixture_result(fixture) for fixture in fixtures]
|
||||||
|
fixture_count = len(results)
|
||||||
|
route_ok_count = sum(1 for result in results if result["ok"])
|
||||||
|
candidate_score = _score_candidate(results, candidate)
|
||||||
|
proof_ok = route_ok_count == fixture_count and (candidate_score is None or candidate_score["ok"])
|
||||||
|
return {
|
||||||
|
"ok": proof_ok,
|
||||||
|
"scope": "decision_engine_replay",
|
||||||
|
"fixture_count": fixture_count,
|
||||||
|
"metrics": {
|
||||||
|
"route_accuracy": route_ok_count / fixture_count,
|
||||||
|
"route_ok_count": route_ok_count,
|
||||||
|
"lanes": dict(sorted(Counter(result["lane"] for result in results).items())),
|
||||||
|
},
|
||||||
|
"results": results,
|
||||||
|
"candidate": candidate_score,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--fixtures-dir", default=str(DEFAULT_FIXTURES_DIR))
|
||||||
|
parser.add_argument("--candidate-output")
|
||||||
|
parser.add_argument("--output", default=str(DEFAULT_OUTPUT))
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
fixtures = load_fixtures(Path(args.fixtures_dir))
|
||||||
|
candidate = _load_candidate_output(Path(args.candidate_output) if args.candidate_output else None)
|
||||||
|
proof = evaluate_fixtures(fixtures, candidate=candidate)
|
||||||
|
|
||||||
|
output = Path(args.output)
|
||||||
|
if not output.is_absolute():
|
||||||
|
output = REPO_ROOT / output
|
||||||
|
output.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
output.write_text(json.dumps(proof, indent=2, sort_keys=True) + "\n")
|
||||||
|
print(json.dumps(proof, indent=2, sort_keys=True))
|
||||||
|
return 0 if proof["ok"] else 1
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
raise SystemExit(main())
|
||||||
56
tests/test_decision_engine_replay.py
Normal file
56
tests/test_decision_engine_replay.py
Normal file
|
|
@ -0,0 +1,56 @@
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import importlib.util
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||||
|
SCRIPT_PATH = REPO_ROOT / "scripts" / "replay_decision_engine_eval.py"
|
||||||
|
FIXTURES_DIR = REPO_ROOT / "fixtures" / "decision-engine-eval"
|
||||||
|
|
||||||
|
spec = importlib.util.spec_from_file_location("replay_decision_engine_eval", SCRIPT_PATH)
|
||||||
|
replay = importlib.util.module_from_spec(spec)
|
||||||
|
assert spec.loader is not None
|
||||||
|
spec.loader.exec_module(replay)
|
||||||
|
|
||||||
|
|
||||||
|
def test_default_decision_engine_fixtures_replay_cleanly():
|
||||||
|
fixtures = replay.load_fixtures(FIXTURES_DIR)
|
||||||
|
proof = replay.evaluate_fixtures(fixtures)
|
||||||
|
|
||||||
|
assert proof["ok"] is True
|
||||||
|
assert proof["fixture_count"] == 3
|
||||||
|
assert proof["metrics"]["route_accuracy"] == 1.0
|
||||||
|
assert proof["metrics"]["lanes"] == {
|
||||||
|
"kb-interop": 1,
|
||||||
|
"rio-economics": 1,
|
||||||
|
"theseus-model-integrity": 1,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def test_candidate_false_approve_is_caught(tmp_path):
|
||||||
|
fixtures = replay.load_fixtures(FIXTURES_DIR)
|
||||||
|
candidate_path = tmp_path / "candidate.json"
|
||||||
|
candidate_path.write_text(
|
||||||
|
json.dumps(
|
||||||
|
{
|
||||||
|
"candidate_name": "bad-single-answer-model",
|
||||||
|
"verdicts": [
|
||||||
|
{
|
||||||
|
"fixture_id": "theseus_live_model_switch_reject",
|
||||||
|
"disposition": "approve",
|
||||||
|
"issue_tags": [],
|
||||||
|
"primary_agent": "Theseus",
|
||||||
|
"required_agents": ["Theseus"],
|
||||||
|
}
|
||||||
|
],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
candidate = replay._load_candidate_output(candidate_path)
|
||||||
|
proof = replay.evaluate_fixtures(fixtures, candidate=candidate)
|
||||||
|
|
||||||
|
assert proof["ok"] is False
|
||||||
|
assert proof["candidate"]["false_approve_count"] == 1
|
||||||
|
assert proof["candidate"]["false_approves"] == ["theseus_live_model_switch_reject"]
|
||||||
Loading…
Reference in a new issue