- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`docs/model-discovery-registry.md`
`fixtures/decision-engine-eval/kb_interop_propose_only.json`
`fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
`fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
`scripts/check_llm_refinement_contract.py`
`scripts/replay_decision_engine_eval.py`
`tests/test_decision_engine_replay.py`
Model Discovery Registry
Created: 2026-06-01
Status: candidate registry, not model approval
This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.
Rules
- Use official provider docs, model cards, or source repositories for every entry.
- Treat all model specs, prices, context limits, and aliases as volatile.
- Do not switch runtime model assignments from this document alone.
- Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
- Prefer different model families for independent review so agreement is not just same-family correlation.
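The family-diversity rule can be checked mechanically before a reviewer pair is assigned. The sketch below is only an illustration: the `MODEL_FAMILY` mapping, its family labels, and the `is_independent_pair` helper are hypothetical and are not part of the registry or the replay script.

```python
# Hypothetical mapping from candidate name to model family.
# The family labels are illustrative assumptions, not registry fields.
MODEL_FAMILY = {
    "Claude Opus 4.8": "anthropic",
    "Claude Sonnet 4.6": "anthropic",
    "Gemini 3.5 Flash": "gemini",
    "Gemini 3.1 Pro": "gemini",
    "Mistral Large 3": "mistral",
}


def is_independent_pair(primary: str, reviewer: str) -> bool:
    """Return True when the two candidates belong to different families.

    Raises KeyError for a candidate that has no family entry, so an
    unknown model is never silently treated as independent.
    """
    return MODEL_FAMILY[primary] != MODEL_FAMILY[reviewer]
```

Raising on an unknown candidate rather than returning `True` keeps an unregistered model from passing as an independent reviewer by accident.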
Candidate Matrix
| Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source |
|---|---|---|---|---|
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | OpenAI models |
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | OpenAI models |
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | OpenAI open models |
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | OpenAI open models |
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | Anthropic models overview |
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | Anthropic models overview |
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | Anthropic models overview |
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | Gemini API models |
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | Gemini API models |
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | Mistral models overview |
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | Mistral models overview |
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | Mistral models overview |
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | Mistral models overview |
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | NousResearch Hermes 4 70B |
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | Qwen3.5 9B model card |
Bakeoff Intake Fields
Each candidate needs a retained record before a real bakeoff:
- provider or local runtime;
- exact model ID or pinned snapshot;
- source URL;
- license or terms surface;
- context window and max output if verified;
- structured-output support;
- tool/function calling support;
- expected hardware or hosted cost;
- latency estimate;
- privacy and data-retention posture;
- failure mode hypothesis;
- first fixture lane.
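One way to keep these intake fields as a retained record is a small dataclass with a completeness check. This is a minimal sketch under stated assumptions: the `IntakeRecord` class, its field names and types, and the `missing_fields` helper are hypothetical, and the context window and max output are treated as optional because the list only asks for them "if verified".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IntakeRecord:
    """Hypothetical container for the intake fields listed above."""
    provider_or_runtime: str = ""
    model_id: str = ""                      # exact model ID or pinned snapshot
    source_url: str = ""
    license_or_terms: str = ""
    context_window: Optional[int] = None    # recorded only if verified
    max_output: Optional[int] = None        # recorded only if verified
    structured_output: Optional[bool] = None
    tool_calling: Optional[bool] = None
    expected_cost: str = ""                 # hardware or hosted cost
    latency_estimate: str = ""
    privacy_posture: str = ""               # privacy and data retention
    failure_mode_hypothesis: str = ""
    first_fixture_lane: str = ""


# Fields that may stay empty when the value has not been verified.
OPTIONAL_IF_UNVERIFIED = {"context_window", "max_output"}


def missing_fields(record: IntakeRecord) -> List[str]:
    """List required fields that are still unset (None or empty string)."""
    missing = []
    for f in fields(record):
        if f.name in OPTIONAL_IF_UNVERIFIED:
            continue
        value = getattr(record, f.name)
        if value is None or value == "":
            missing.append(f.name)
    return missing
```

An explicit `False` for `structured_output` or `tool_calling` counts as filled in, since "verified as unsupported" is still a recorded answer.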
First Bakeoff Order
- Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
- Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
- Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
- Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
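The bakeoff order above can also be held as ordered data so a harness can walk the lanes in sequence. The sketch below copies the lane and candidate names from the list; the `BAKEOFF_ORDER` layout and the `lanes_for` helper are assumptions made for illustration, not part of the repository.

```python
# Lane order and candidate names are copied from the list above.
# The dict layout is an illustrative assumption; Python dicts preserve
# insertion order, so iteration follows the bakeoff order.
BAKEOFF_ORDER = {
    "Cheap triage": [
        "GPT-5 lower-latency variant (exact ID to be verified)",
        "Claude Haiku 4.5",
        "Mistral Small 4",
        "Qwen3.5 9B",
        "gpt-oss-20b",
    ],
    "Theseus integrity": [
        "Gemini 3.5 Flash", "Hermes 4 70B", "Devstral 2", "gpt-oss-120b",
    ],
    "Rio economics": [
        "GPT-5.5/5.4", "Claude Sonnet 4.6", "Gemini 3.1 Pro",
        "Mistral Medium 3.5",
    ],
    "Deep arbitration": [
        "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro", "Mistral Large 3",
    ],
}


def lanes_for(candidate: str) -> list:
    """Return the lanes, in bakeoff order, that include the candidate."""
    return [lane for lane, names in BAKEOFF_ORDER.items() if candidate in names]
```

For example, `lanes_for("Gemini 3.1 Pro")` returns both the Rio economics and Deep arbitration lanes, which makes it easy to spot candidates that need fixtures in more than one lane.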
Promotion Gate
A model can move from registry to runtime proposal only if the replay proof includes:
- exact model ID;
- fixture count;
- route accuracy;
- false approvals;
- false rejects;
- missing required issue tags;
- average latency;
- cost estimate;
- disagreement matrix against current baseline;
- one paragraph explaining why the observed disagreements are useful.
Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
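The gate can be expressed as a check over a replay-proof record. The following is a sketch under stated assumptions, not the logic of `scripts/replay_decision_engine_eval.py`: the `ReplayProof` class, its field names and units, and `gate_failures` are hypothetical, and `false_approvals` is assumed to count approvals of known-bad fixtures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplayProof:
    """Hypothetical record of the replay-proof items listed above."""
    model_id: str
    fixture_count: int
    route_accuracy: float
    false_approvals: int              # assumed: approvals of known-bad fixtures
    false_rejects: int
    missing_required_issue_tags: int
    average_latency_s: float          # unit is an assumption
    cost_estimate_usd: float          # unit is an assumption
    disagreement_matrix: dict         # against the current baseline
    disagreement_rationale: str       # the one-paragraph explanation


def gate_failures(proof: ReplayProof, evaluator_role: bool = True) -> List[str]:
    """Return the reasons a proof fails the gate; an empty list means pass."""
    failures = []
    if not proof.model_id:
        failures.append("exact model ID missing")
    if proof.fixture_count <= 0:
        failures.append("no fixtures replayed")
    if not proof.disagreement_rationale.strip():
        failures.append("disagreement rationale missing")
    # Hard gate: evaluator roles must have zero false approvals.
    if evaluator_role and proof.false_approvals > 0:
        failures.append("false approvals on known-bad fixtures")
    return failures
```

Returning a list of reasons instead of a single boolean lets a CI step report every failed condition at once. Thresholds for route accuracy, latency, and cost are deliberately left out because the registry does not state them.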