twentyOne2x 71ea7a625c Add decision engine replay harness

- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`docs/model-discovery-registry.md`
`fixtures/decision-engine-eval/kb_interop_propose_only.json`
`fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
`fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
`scripts/check_llm_refinement_contract.py`
`scripts/replay_decision_engine_eval.py`
`tests/test_decision_engine_replay.py`

2026-06-01 17:37:38 +02:00

5.2 KiB


Model Discovery Registry

Created: 2026-06-01
Status: candidate registry, not model approval

This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.

Rules

Use official provider docs, model cards, or source repositories for every entry.
Treat all model specs, prices, context limits, and aliases as volatile.
Do not switch runtime model assignments from this document alone.
Promote a model only after scripts/replay_decision_engine_eval.py shows no critical regression on the same fixture set.
Prefer different model families for independent review so agreement is not just same-family correlation.
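The last rule can be checked mechanically before reviewers are assigned. The following is a minimal sketch, assuming a hypothetical `FAMILY` mapping from candidate name to model family; the registry itself does not define a family field, and the labels below are illustrative groupings only.

```python
from itertools import combinations

# Hypothetical family labels for a few candidates from the matrix.
# These labels are an assumption for illustration, not registry data.
FAMILY = {
    "Claude Opus 4.8": "anthropic",
    "Claude Sonnet 4.6": "anthropic",
    "Gemini 3.1 Pro": "google",
    "Mistral Large 3": "mistral",
}


def same_family_pairs(reviewers):
    """Return every pair of reviewers that shares a model family."""
    return [
        (a, b)
        for a, b in combinations(reviewers, 2)
        if FAMILY[a] == FAMILY[b]
    ]
```

An empty result means no two reviewers in the panel share a family, so their agreement is less likely to be same-family correlation.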

Candidate Matrix

Candidate	Surface	Why It Is Worth Testing	First Living IP Lane	Source
GPT-5.5 / GPT-5.4 family	Hosted API	Strong general reasoning and agentic task baseline; useful as a frontier comparison point.	deep review, Leo arbitration	OpenAI models
GPT-5 lower-latency variants	Hosted API	Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run.	fast triage	OpenAI models
gpt-oss-120b	Open-weight	Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof.	Theseus model integrity	OpenAI open models
gpt-oss-20b	Open-weight	Smaller local/edge candidate for cheap first-pass triage and portable demos.	fast triage, local harness	OpenAI open models
Claude Opus 4.8	Hosted API	Complex-reasoning candidate for highest-stakes arbitration.	Leo arbitration, deep review	Anthropic models overview
Claude Sonnet 4.6	Hosted API	Speed/intelligence tradeoff candidate for domain review.	domain review	Anthropic models overview
Claude Haiku 4.5	Hosted API	Low-latency candidate for cheap reviewer pre-checks.	fast triage	Anthropic models overview
Gemini 3.5 Flash	Hosted API	Agentic/coding-oriented candidate from a different model family.	independent second review	Gemini API models
Gemini 3.1 Pro	Hosted API	Complex problem-solving candidate from a non-primary model family.	deep review	Gemini API models
Mistral Medium 3.5	Hosted or open surface per provider docs	Agentic/coding candidate from a model family that is not primarily US-based.	independent second review	Mistral models overview
Mistral Small 4	Hosted or open surface per provider docs	Efficient hybrid instruct/reasoning/coding candidate.	fast triage, domain review	Mistral models overview
Mistral Large 3	Open-weight	Large open-weight comparison point for self-hosted evaluation.	deep review	Mistral models overview
Devstral 2	Hosted or open surface per provider docs	Code-agent candidate for tools, repository work, and adapter tasks.	Theseus tool integrity	Mistral models overview
Hermes 4 70B	Open-weight / provider-hosted	Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging.	Hermes adapter, Theseus	NousResearch Hermes 4 70B
Qwen3.5 9B	Open-weight	Small multimodal/open-weight candidate for local and edge experiments.	fast triage, local harness	Qwen3.5 9B model card

Bakeoff Intake Fields

Each candidate needs a retained record before a real bakeoff:

provider or local runtime;
exact model ID or pinned snapshot;
source URL;
license or terms surface;
context window and max output if verified;
structured-output support;
tool/function calling support;
expected hardware or hosted cost;
latency estimate;
privacy and data-retention posture;
failure mode hypothesis;
first fixture lane.
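One way to retain these fields is a small typed record with a completeness check. This is a sketch under stated assumptions: the class name, field names, and the choice to treat context window and max output as optional (because the list says "if verified") are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntakeRecord:
    """Hypothetical container for the intake fields listed above."""
    provider_or_runtime: str
    model_id: str                      # exact model ID or pinned snapshot
    source_url: str
    license_or_terms: str
    structured_output: Optional[bool]  # None means "not yet checked"
    tool_calling: Optional[bool]       # None means "not yet checked"
    expected_cost: str                 # hardware or hosted cost
    latency_estimate: str
    privacy_posture: str
    failure_mode_hypothesis: str
    first_fixture_lane: str
    context_window: Optional[int] = None  # only "if verified"
    max_output: Optional[int] = None      # only "if verified"


# Fields that may stay empty until they have been verified.
OPTIONAL_FIELDS = {"context_window", "max_output"}


def missing_fields(record):
    """List required fields that are still None or an empty string."""
    missing = []
    for f in fields(record):
        if f.name in OPTIONAL_FIELDS:
            continue
        value = getattr(record, f.name)
        if value is None or value == "":
            missing.append(f.name)
    return missing
```

A record whose `missing_fields` result is non-empty is not ready for a real bakeoff under the rule above.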

First Bakeoff Order

Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
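The order above can be kept as data so a harness runs the lanes in sequence and can see which candidates recur across lanes. This is a minimal sketch; `BAKEOFF_ORDER` and `lanes_for` are hypothetical names, and the entries are copied from the list above as plain strings rather than verified model IDs.

```python
# Lanes in the order they should be run; dicts preserve insertion order.
# Strings mirror the list above and are NOT exact, verified model IDs.
BAKEOFF_ORDER = {
    "Cheap triage": [
        "GPT-5 lower-latency variant (exact ID to be verified)",
        "Claude Haiku 4.5",
        "Mistral Small 4",
        "Qwen3.5 9B",
        "gpt-oss-20b",
    ],
    "Theseus integrity": [
        "Gemini 3.5 Flash", "Hermes 4 70B", "Devstral 2", "gpt-oss-120b",
    ],
    "Rio economics": [
        "GPT-5.5/5.4", "Claude Sonnet 4.6", "Gemini 3.1 Pro",
        "Mistral Medium 3.5",
    ],
    "Deep arbitration": [
        "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro", "Mistral Large 3",
    ],
}


def lanes_for(candidate):
    """Return the lanes, in run order, that include the given candidate."""
    return [lane for lane, models in BAKEOFF_ORDER.items() if candidate in models]
```

For example, `lanes_for("Gemini 3.1 Pro")` returns both "Rio economics" and "Deep arbitration", so one intake record for that candidate serves two lanes.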

Promotion Gate

A model can move from registry to runtime proposal only if the replay proof includes:

exact model ID;
fixture count;
route accuracy;
false approvals;
false rejects;
missing required issue tags;
average latency;
cost estimate;
disagreement matrix against current baseline;
one paragraph explaining why the observed disagreements are useful.

Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
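The gate can be expressed as a check over a replay proof. The sketch below assumes the proof is a plain dictionary; the key names are hypothetical and do not describe the actual output of scripts/replay_decision_engine_eval.py. It also assumes `false_approvals` counts approvals of known-bad fixtures.

```python
# Hypothetical key names for the ten required proof items listed above.
REQUIRED_KEYS = (
    "model_id",
    "fixture_count",
    "route_accuracy",
    "false_approvals",
    "false_rejects",
    "missing_required_issue_tags",
    "average_latency",
    "cost_estimate",
    "disagreement_matrix",
    "disagreement_rationale",
)


def promotion_blockers(proof, evaluator_role=True):
    """Return reasons the proof cannot support a runtime proposal."""
    blockers = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in proof]
    # Hard gate for evaluator roles: zero false approvals on known-bad fixtures.
    if evaluator_role and proof.get("false_approvals", 0) != 0:
        blockers.append("false approvals on known-bad fixtures must be zero")
    return blockers
```

An empty list means the proof is complete and passes the hard gate; it is still only grounds for a runtime proposal, not a runtime switch.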
