teleo-infrastructure/docs/model-discovery-registry.md
twentyOne2x 71ea7a625c Add decision engine replay harness
- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`docs/model-discovery-registry.md`
`fixtures/decision-engine-eval/kb_interop_propose_only.json`
`fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
`fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
`scripts/check_llm_refinement_contract.py`
`scripts/replay_decision_engine_eval.py`
`tests/test_decision_engine_replay.py`
2026-06-01 17:37:38 +02:00


Model Discovery Registry

Created: 2026-06-01
Status: candidate registry, not model approval

This registry exists to decide which models deserve a Living IP bakeoff fixture. It does not choose production models and it does not replace measured replay results.

Rules

  • Use official provider docs, model cards, or source repositories for every entry.
  • Treat all model specs, prices, context limits, and aliases as volatile.
  • Do not switch runtime model assignments from this document alone.
  • Promote a model only after `scripts/replay_decision_engine_eval.py` shows no critical regression on the same fixture set.
  • Prefer different model families for independent review so agreement is not just same-family correlation.
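
The family-diversity rule above can be checked mechanically before a review panel is assembled. The following is a minimal sketch, assuming a hand-maintained mapping from candidate name to family; the `FAMILY` table, its labels, and the `same_family_pairs` helper are hypothetical and not part of the repository.

```python
from itertools import combinations

# Hypothetical mapping from candidate name to model family. The family
# labels are illustrative and are not defined by this registry.
FAMILY = {
    "Claude Opus 4.8": "anthropic",
    "Claude Sonnet 4.6": "anthropic",
    "Gemini 3.1 Pro": "google",
    "Mistral Large 3": "mistral",
}


def same_family_pairs(reviewers):
    """Return reviewer pairs that share a family and so are not independent."""
    return [
        (a, b)
        for a, b in combinations(reviewers, 2)
        if FAMILY[a] == FAMILY[b]
    ]
```

An empty result means every pair of reviewers comes from a different family; any returned pair is a candidate for replacement before the review is treated as independent.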

Candidate Matrix

| Candidate | Surface | Why It Is Worth Testing First | Living IP Lane | Source |
| --- | --- | --- | --- | --- |
| GPT-5.5 / GPT-5.4 family | Hosted API | Strong general reasoning and agentic task baseline; useful as a frontier comparison point. | deep review, Leo arbitration | OpenAI models |
| GPT-5 lower-latency variants | Hosted API | Possible cheap triage candidates; exact model IDs must be re-verified before a bakeoff run. | fast triage | OpenAI models |
| gpt-oss-120b | Open-weight | Open-weight reasoning candidate for on-prem or Pentagon-managed inference; needs hardware/cost proof. | Theseus model integrity | OpenAI open models |
| gpt-oss-20b | Open-weight | Smaller local/edge candidate for cheap first-pass triage and portable demos. | fast triage, local harness | OpenAI open models |
| Claude Opus 4.8 | Hosted API | Complex-reasoning candidate for highest-stakes arbitration. | Leo arbitration, deep review | Anthropic models overview |
| Claude Sonnet 4.6 | Hosted API | Speed/intelligence tradeoff candidate for domain review. | domain review | Anthropic models overview |
| Claude Haiku 4.5 | Hosted API | Low-latency candidate for cheap reviewer pre-checks. | fast triage | Anthropic models overview |
| Gemini 3.5 Flash | Hosted API | Agentic/coding-oriented candidate from a different model family. | independent second review | Gemini API models |
| Gemini 3.1 Pro | Hosted API | Complex problem-solving candidate from a non-primary model family. | deep review | Gemini API models |
| Mistral Medium 3.5 | Hosted or open surface per provider docs | Agentic/coding candidate with a non-US-primary model family. | independent second review | Mistral models overview |
| Mistral Small 4 | Hosted or open surface per provider docs | Efficient hybrid instruct/reasoning/coding candidate. | fast triage, domain review | Mistral models overview |
| Mistral Large 3 | Open-weight | Large open-weight comparison point for self-hosted evaluation. | deep review | Mistral models overview |
| Devstral 2 | Hosted or open surface per provider docs | Code-agent candidate for tools, repository work, and adapter tasks. | Theseus tool integrity | Mistral models overview |
| Hermes 4 70B | Open-weight / provider-hosted | Nous-aligned model with structured output and tool-use relevance for Hermes Agent packaging. | Hermes adapter, Theseus | NousResearch Hermes 4 70B |
| Qwen3.5 9B | Open-weight | Small multimodal/open-weight candidate for local and edge experiments. | fast triage, local harness | Qwen3.5 9B model card |

Bakeoff Intake Fields

Each candidate needs a retained record before a real bakeoff:

  • provider or local runtime;
  • exact model ID or pinned snapshot;
  • source URL;
  • license or terms surface;
  • context window and max output if verified;
  • structured-output support;
  • tool/function calling support;
  • expected hardware or hosted cost;
  • latency estimate;
  • privacy and data-retention posture;
  • failure mode hypothesis;
  • first fixture lane.
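
One way to keep the intake record honest is to model the fields above as a typed structure and refuse a bakeoff while any required field is empty. This is a sketch under stated assumptions: the `IntakeRecord` class, its field names, and `missing_fields` are hypothetical, and treating the "if verified" fields as optional is one reading of the list, not a rule stated by the registry.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntakeRecord:
    """Hypothetical intake record; field names mirror the list above."""
    runtime: Optional[str] = None             # provider or local runtime
    model_id: Optional[str] = None            # exact model ID or pinned snapshot
    source_url: Optional[str] = None
    license_terms: Optional[str] = None       # license or terms surface
    context_window: Optional[int] = None      # recorded only if verified
    max_output: Optional[int] = None          # recorded only if verified
    structured_output: Optional[bool] = None
    tool_calling: Optional[bool] = None
    cost_estimate: Optional[str] = None       # expected hardware or hosted cost
    latency_estimate: Optional[str] = None
    privacy_posture: Optional[str] = None     # privacy and data retention
    failure_hypothesis: Optional[str] = None
    first_lane: Optional[str] = None          # first fixture lane


# Assumption: fields marked "if verified" may stay empty until checked.
OPTIONAL_UNTIL_VERIFIED = {"context_window", "max_output"}


def missing_fields(record):
    """List required fields that are still unset on the record."""
    return [
        f.name
        for f in fields(record)
        if f.name not in OPTIONAL_UNTIL_VERIFIED
        and getattr(record, f.name) is None
    ]
```

A record is ready for a bakeoff in this sketch when `missing_fields(record)` returns an empty list.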

First Bakeoff Order

  1. Cheap triage: exact-ID-verified GPT-5 lower-latency variant, Claude Haiku 4.5, Mistral Small 4, Qwen3.5 9B, gpt-oss-20b.
  2. Theseus integrity: Gemini 3.5 Flash, Hermes 4 70B, Devstral 2, gpt-oss-120b.
  3. Rio economics: GPT-5.5/5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Medium 3.5.
  4. Deep arbitration: Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Mistral Large 3.
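
The order above can also be kept as data so a harness can iterate rounds and spot candidates that appear more than once. A minimal sketch follows; the `BAKEOFF_ROUNDS` structure and `rounds_for` helper are hypothetical, and the candidate strings are free-text labels copied from the list, not verified model IDs.

```python
# Round order copied from the list above. Names are free-text labels,
# not verified model IDs.
BAKEOFF_ROUNDS = [
    ("Cheap triage", [
        "GPT-5 lower-latency variant", "Claude Haiku 4.5",
        "Mistral Small 4", "Qwen3.5 9B", "gpt-oss-20b",
    ]),
    ("Theseus integrity", [
        "Gemini 3.5 Flash", "Hermes 4 70B", "Devstral 2", "gpt-oss-120b",
    ]),
    ("Rio economics", [
        "GPT-5.5/5.4", "Claude Sonnet 4.6", "Gemini 3.1 Pro",
        "Mistral Medium 3.5",
    ]),
    ("Deep arbitration", [
        "Claude Opus 4.8", "GPT-5.5", "Gemini 3.1 Pro", "Mistral Large 3",
    ]),
]


def rounds_for(candidate):
    """Return the 1-based round numbers in which a candidate appears."""
    return [
        number
        for number, (_, names) in enumerate(BAKEOFF_ROUNDS, start=1)
        if candidate in names
    ]
```

For example, `rounds_for("Gemini 3.1 Pro")` returns `[3, 4]`, which flags that its intake record is needed before round 3 rather than round 4.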

Promotion Gate

A model can move from registry to runtime proposal only if the replay proof includes:

  • exact model ID;
  • fixture count;
  • route accuracy;
  • false approvals;
  • false rejects;
  • missing required issue tags;
  • average latency;
  • cost estimate;
  • disagreement matrix against current baseline;
  • one paragraph explaining why the observed disagreements are useful.
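
The disagreement matrix in the list above can be computed as a count of (baseline route, candidate route) pairs over the same fixtures. The sketch below assumes each run is a list of route labels in fixture order; the `disagreement_matrix` function and the label strings are hypothetical and do not describe the output format of the replay script.

```python
from collections import Counter


def disagreement_matrix(baseline, candidate):
    """Count (baseline_route, candidate_route) pairs over the same fixtures."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same fixture set")
    return Counter(zip(baseline, candidate))


def disagreement_count(matrix):
    """Total number of fixtures where the two runs chose different routes."""
    return sum(n for (b, c), n in matrix.items() if b != c)
```

Off-diagonal cells, where the two labels differ, are the disagreements the explanatory paragraph is expected to address.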

Zero false approvals on known-bad fixtures is a hard gate for evaluator roles.
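
The count-based items in the promotion gate, together with the hard gate on known-bad fixtures, can be sketched as a small summary function. Everything here is an assumption for illustration: the `FixtureResult` type, the `"approve"` and `"reject"` route labels, and the `summarize` function are hypothetical and are not the interface of `scripts/replay_decision_engine_eval.py`.

```python
from dataclasses import dataclass


@dataclass
class FixtureResult:
    expected: str            # hypothetical route label, e.g. "approve" or "reject"
    actual: str              # route the candidate model produced
    known_bad: bool = False  # fixture that must never be approved


def summarize(results):
    """Compute promotion-gate counts from a list of fixture results."""
    total = len(results)
    correct = sum(r.expected == r.actual for r in results)
    false_approvals = sum(
        r.actual == "approve" and r.expected != "approve" for r in results
    )
    false_rejects = sum(
        r.actual == "reject" and r.expected != "reject" for r in results
    )
    known_bad_approvals = sum(
        r.known_bad and r.actual == "approve" for r in results
    )
    return {
        "fixture_count": total,
        "route_accuracy": correct / total if total else 0.0,
        "false_approvals": false_approvals,
        "false_rejects": false_rejects,
        # Hard gate: no known-bad fixture may be approved. An empty run fails.
        "passes_hard_gate": total > 0 and known_bad_approvals == 0,
    }
```

In this sketch a single approved known-bad fixture fails the gate regardless of overall route accuracy, which matches the intent of treating false approvals as a hard stop rather than a metric to trade off.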