- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

Files:

- `.crabbox.yaml`
- `.github/workflows/ci.yml`
- `docs/llm-refinement-decision-engine.md`
- `docs/model-discovery-registry.md`
- `fixtures/decision-engine-eval/kb_interop_propose_only.json`
- `fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
- `fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
- `scripts/check_llm_refinement_contract.py`
- `scripts/replay_decision_engine_eval.py`
- `tests/test_decision_engine_replay.py`

# Model Discovery Registry

Created: 2026-06-01
Status: candidate registry, not model approval

This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.

## Rules

- Use official provider docs, model cards, or source repositories for every entry.
- Treat all model specs, prices, context limits, and aliases as volatile.
- Do not switch runtime model assignments from this document alone.
- Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
- Prefer different model families for independent review so agreement is not just same-family correlation.

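
The promotion rule above can be read as a per-fixture comparison between a baseline run and a candidate run. The following is a minimal sketch, assuming each run is reduced to a mapping from fixture ID to pass/fail and that any baseline pass that turns into a candidate fail counts as critical. The function name and the result shape are hypothetical; they are not the output format of `scripts/replay_decision_engine_eval.py`.

```python
from typing import Mapping


def critical_regressions(
    baseline: Mapping[str, bool], candidate: Mapping[str, bool]
) -> list[str]:
    """Return fixture IDs that pass on the baseline but fail on the candidate."""
    # The rule requires the same fixture set, so refuse to compare otherwise.
    if set(baseline) != set(candidate):
        raise ValueError("baseline and candidate ran different fixture sets")
    return sorted(
        fixture_id
        for fixture_id, passed in baseline.items()
        if passed and not candidate[fixture_id]
    )


# Illustrative pass/fail values only; these are not measured results.
baseline = {
    "kb_interop_propose_only": True,
    "rio_meteora_lp_incentives": True,
    "theseus_live_model_switch_reject": True,
}
candidate = dict(baseline, theseus_live_model_switch_reject=False)

# Under this reading of the rule, a non-empty list blocks promotion.
regressions = critical_regressions(baseline, candidate)
```

Raising on mismatched fixture sets keeps a candidate from looking clean simply because it skipped a fixture.
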
## Candidate Matrix

| Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source |
| --- | --- | --- | --- | --- |
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | [OpenAI models](https://platform.openai.com/docs/models) |
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | [OpenAI models](https://platform.openai.com/docs/models) |
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | [OpenAI open models](https://openai.com/open-models/) |
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | [OpenAI open models](https://openai.com/open-models/) |
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) |
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) |
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) |
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | [NousResearch Hermes 4 70B](https://huggingface.co/NousResearch/Hermes-4-70B) |
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | [Qwen3.5 9B model card](https://huggingface.co/Qwen/Qwen3.5-9B) |

## Bakeoff Intake Fields

Each candidate needs a retained record before a real bakeoff:

- provider or local runtime;
- exact model ID or pinned snapshot;
- source URL;
- license or terms surface;
- context window and max output if verified;
- structured-output support;
- tool/function calling support;
- expected hardware or hosted cost;
- latency estimate;
- privacy and data-retention posture;
- failure mode hypothesis;
- first fixture lane.

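
The intake fields above map naturally onto a typed record with a completeness check. This is a sketch only: the class name, field names, and the choice to treat context window and max output as optional (because the list says "if verified") are assumptions, and the values in the example are placeholders rather than verified facts about any model.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntakeRecord:
    """Hypothetical record shape; one attribute per intake field."""

    runtime: str = ""                         # provider or local runtime
    model_id: str = ""                        # exact model ID or pinned snapshot
    source_url: str = ""
    license_or_terms: str = ""
    context_window: Optional[int] = None      # only if verified
    max_output: Optional[int] = None          # only if verified
    structured_output: Optional[bool] = None  # None means "not checked yet"
    tool_calling: Optional[bool] = None       # None means "not checked yet"
    expected_cost: str = ""                   # hardware or hosted cost
    latency_estimate: str = ""
    privacy_posture: str = ""                 # privacy and data retention
    failure_mode_hypothesis: str = ""
    first_fixture_lane: str = ""


# Assumption: these two may stay empty until they have been verified.
OPTIONAL_FIELDS = {"context_window", "max_output"}


def missing_fields(record: IntakeRecord) -> list[str]:
    """List required fields that are still empty, in declaration order."""
    return [
        f.name
        for f in fields(record)
        if f.name not in OPTIONAL_FIELDS and getattr(record, f.name) in ("", None)
    ]


# Placeholder draft: the model ID would still need exact-ID verification.
draft = IntakeRecord(
    runtime="local harness",
    model_id="gpt-oss-20b",
    source_url="https://openai.com/open-models/",
    first_fixture_lane="fast triage",
)
todo = missing_fields(draft)
```

A record with a non-empty `todo` list would not yet qualify for a real bakeoff under the requirement above.
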
## First Bakeoff Order

1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.

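
The bakeoff order can be kept as data and checked against the rule that independent review should span different model families. The sketch below assumes the publisher named in each candidate's Source column is an acceptable proxy for "family"; the prefix table, function names, and the minimum of two families per lane are illustrative choices, not part of the registry.

```python
# Lanes in the order given in this section; candidate names copied from the list.
BAKEOFF_ORDER = {
    "cheap triage": [
        "GPT-5 lower-latency variant", "Claude Haiku 4.5", "Mistral Small 4",
        "Qwen3.5 9B", "gpt-oss-20b",
    ],
    "Theseus integrity": [
        "Gemini 3.5 Flash", "Hermes 4 70B", "Devstral 2", "gpt-oss-120b",
    ],
    "Rio economics": [
        "GPT-5.5/5.4", "Claude Sonnet 4.6", "Gemini 3.1 Pro", "Mistral Medium 3.5",
    ],
    "deep arbitration": [
        "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro", "Mistral Large 3",
    ],
}

# Assumption: name prefix -> publisher from the Source column, used as the family.
FAMILY_BY_PREFIX = {
    "GPT": "OpenAI",
    "gpt-oss": "OpenAI",
    "Claude": "Anthropic",
    "Gemini": "Google",
    "Mistral": "Mistral",
    "Devstral": "Mistral",
    "Hermes": "NousResearch",
    "Qwen": "Qwen",
}


def family_of(candidate: str) -> str:
    """Resolve a candidate name to its family proxy, or fail loudly."""
    for prefix, family in FAMILY_BY_PREFIX.items():
        if candidate.startswith(prefix):
            return family
    raise KeyError(f"no family mapping for {candidate!r}")


def lanes_lacking_diversity(order: dict[str, list[str]], minimum: int = 2) -> list[str]:
    """Return lanes whose candidates span fewer than `minimum` families."""
    return [
        lane
        for lane, candidates in order.items()
        if len({family_of(c) for c in candidates}) < minimum
    ]


undiversified = lanes_lacking_diversity(BAKEOFF_ORDER)
```

An unmapped candidate raises instead of being silently grouped, so a new row in the matrix forces an explicit family decision.
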
## Promotion Gate

A model can move from registry to runtime proposal only if the replay proof includes:

- exact model ID;
- fixture count;
- route accuracy;
- false approvals;
- false rejects;
- missing required issue tags;
- average latency;
- cost estimate;
- disagreement matrix against current baseline;
- one paragraph explaining why the observed disagreements are useful.

Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
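
The gate above can be sketched as a record plus a check that returns the reasons a proof is not yet acceptable. Everything here is an assumption made for illustration: the class and function names, the units (seconds, USD), the nested-dict shape of the disagreement matrix, and the example numbers, none of which are measured results.

```python
from dataclasses import dataclass


@dataclass
class ReplayProof:
    """Hypothetical container for the items a replay proof must include."""

    model_id: str
    fixture_count: int
    route_accuracy: float            # fraction of fixtures routed correctly
    false_approvals: int             # counted on known-bad fixtures
    false_rejects: int
    missing_required_issue_tags: int
    average_latency_s: float         # unit is an assumption
    cost_estimate_usd: float         # unit is an assumption
    disagreement_matrix: dict[str, dict[str, int]]  # baseline -> candidate -> count
    disagreement_rationale: str      # the one-paragraph explanation


def gate_failures(proof: ReplayProof, evaluator_role: bool = True) -> list[str]:
    """Return human-readable reasons the proof fails the gate (empty if none)."""
    failures = []
    if not proof.model_id.strip():
        failures.append("exact model ID is missing")
    if proof.fixture_count <= 0:
        failures.append("fixture count must be positive")
    if not proof.disagreement_rationale.strip():
        failures.append("disagreement rationale paragraph is missing")
    # Hard gate from this section: evaluators may not approve known-bad fixtures.
    if evaluator_role and proof.false_approvals > 0:
        failures.append("false approvals on known-bad fixtures must be zero")
    return failures


# Illustrative values only.
proof = ReplayProof(
    model_id="example-model-id",
    fixture_count=3,
    route_accuracy=2 / 3,
    false_approvals=1,
    false_rejects=0,
    missing_required_issue_tags=0,
    average_latency_s=1.5,
    cost_estimate_usd=0.01,
    disagreement_matrix={"reject": {"approve": 1}},
    disagreement_rationale="Placeholder paragraph describing the disagreement.",
)
reasons = gate_failures(proof)
```

Returning a list of reasons rather than a boolean makes the rejection auditable alongside the retained replay output.
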