# Model Discovery Registry Created: 2026-06-01 Status: candidate registry, not model approval This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results. ## Rules - Use official provider docs, model cards, or source repositories for every entry. - Treat all model specs, prices, context limits, and aliases as volatile. - Do not switch runtime model assignments from this document alone. - Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set. - Prefer different model families for independent review so agreement is not just same-family correlation. ## Candidate Matrix | Candidate | Surface | Why It Is Worth Testing | First Living IP Lane | Source | | --- | --- | --- | --- | --- | | GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | [OpenAI models](https://platform.openai.com/docs/models) | | GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | [OpenAI models](https://platform.openai.com/docs/models) | | gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | [OpenAI open models](https://openai.com/open-models/) | | gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | [OpenAI open models](https://openai.com/open-models/) | | Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) | | Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) | | Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | [Anthropic models overview](https://docs.anthropic.com/en/docs/about-claude/models) | | Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) | | Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | [Gemini API models](https://ai.google.dev/gemini-api/docs/models) | | Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) | | Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) | | Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) | | Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | [Mistral models overview](https://docs.mistral.ai/getting-started/models/) | | Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | [NousResearch Hermes 4 70B](https://huggingface.co/NousResearch/Hermes-4-70B) | | Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | [Qwen3.5 9B model card](https://huggingface.co/Qwen/Qwen3.5-9B) | ## Bakeoff Intake Fields Each candidate needs a retained record before a real bakeoff: - provider or local runtime; - exact model ID or pinned snapshot; - source URL; - license or terms surface; - context window and max output if verified; - structured-output support; - tool/function calling support; - expected hardware or hosted cost; - latency estimate; - privacy and data-retention posture; - failure mode hypothesis; - first fixture lane. ## First Bakeoff Order 1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b. 2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b. 3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5. 4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3. ## Promotion Gate A model can move from registry to runtime proposal only if the replay proof includes: - exact model ID; - fixture count; - route accuracy; - false approvals; - false rejects; - missing required issue tags; - average latency; - cost estimate; - disagreement matrix against current baseline; - one paragraph explaining why the observed disagreements are useful. Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.