- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`, `.github/workflows/ci.yml`, `docs/llm-refinement-decision-engine.md`, `docs/model-discovery-registry.md`, `fixtures/decision-engine-eval/kb_interop_propose_only.json`, `fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`, `fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`, `scripts/check_llm_refinement_contract.py`, `scripts/replay_decision_engine_eval.py`, `tests/test_decision_engine_replay.py`
LLM Refinement And Decision Engine Program
Created: 2026-06-01 Status: active direction
Product Outcome
The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.
Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.
What Rio And Theseus Become
Rio
Rio becomes the economic and incentive-quality evaluator.
Rio owns:
- contribution weights and role economics;
- paid-query effects and anti-pay-to-pollute rules;
- market, mechanism, futarchy, x402, token, and capital-formation reasoning;
- source-diversity and correlated-prior warnings;
- OPSEC for finance, deal terms, token economics, and internal allocations;
- model tests that expose weak economic reasoning.
Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives produce useful knowledge or garbage.
Theseus
Theseus becomes the model-integrity and agent-refinement evaluator.
Theseus owns:
- model diversity and correlated-blind-spot measurement;
- adversarial eval rubrics;
- prompt/tool safety and self-upgrade criteria;
- disagreement queues and verifier-divergence analysis;
- LLM capability evidence and agent-system architecture;
- tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.
Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.
Decision Engine Loop
flowchart TD
PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
Route --> Reviewers["Required agent reviewers"]
Reviewers --> Rubric["Shared rubric"]
Rubric --> ModelA["Primary model"]
Rubric --> ModelB["Independent model family"]
ModelA --> Verdicts["Structured verdicts"]
ModelB --> Verdicts
Verdicts --> Disagree{"Disagreement?"}
Disagree -->|yes| Queue["Disagreement queue"]
Disagree -->|no| Metrics["Calibration metrics"]
Queue --> HumanOrLeo["Leo or human arbitration"]
HumanOrLeo --> Metrics
Metrics --> DB["SQLite feedback state"]
DB --> Refine["Prompt, tool, or model proposal"]
Refine --> Delta["Before/after eval harness"]
Delta -->|passes| Update["Commit refinement"]
Delta -->|fails| Archive["Archive failed refinement"]
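The disagreement branch of the loop can be sketched in a few lines. This is a minimal illustration, assuming a verdict carries only a disposition and a set of issue tags. The `Verdict` class, its field names, and the returned node labels are hypothetical and not taken from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical structured verdict; field names are assumptions."""
    model: str
    disposition: str  # e.g. "approve" or "reject"
    issue_tags: frozenset = frozenset()


def next_step(primary: Verdict, independent: Verdict) -> str:
    """Pick the next node in the loop for a pair of verdicts."""
    # Assumption: any difference in disposition counts as a disagreement.
    if primary.disposition != independent.disposition:
        return "disagreement_queue"
    return "calibration_metrics"
```

A stricter variant could also treat differing issue tags as disagreement; that choice belongs in the rubric, not in the router.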
Model Portfolio
The goal is not to pick one favorite model. The goal is to assign models to failure modes.
| Lane | Primary evaluator | Independent check | Why |
|---|---|---|---|
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |
Candidate models should enter only through a harness:
- fixed input set;
- fixed rubric;
- structured verdict JSON;
- cost and latency recorded;
- disagreement categories stored;
- before/after comparison against current baseline.
No model switch is accepted because it "sounds better" on one example.
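One way to hold a candidate to the harness is to refuse any output that does not parse into a fixed record. The sketch below assumes a verdict JSON with `disposition` and `issue_tags` keys; those keys, the `HarnessRecord` name, and the disposition vocabulary are illustrative assumptions, not the repo's schema.

```python
import json
from dataclasses import dataclass, field

# Assumed verdict vocabulary; the real rubric packets define their own.
ALLOWED_DISPOSITIONS = {"approve", "reject", "needs_review"}


@dataclass
class HarnessRecord:
    fixture_id: str
    model: str
    rubric: str
    disposition: str
    issue_tags: list = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def record_verdict(raw_json, fixture_id, model, rubric, latency_ms, cost_usd):
    """Parse a model's JSON verdict and attach cost and latency."""
    data = json.loads(raw_json)
    disposition = data.get("disposition")
    if disposition not in ALLOWED_DISPOSITIONS:
        raise ValueError(f"unknown disposition: {disposition!r}")
    return HarnessRecord(
        fixture_id=fixture_id,
        model=model,
        rubric=rubric,
        disposition=disposition,
        issue_tags=sorted(data.get("issue_tags", [])),
        latency_ms=latency_ms,
        cost_usd=cost_usd,
    )
```

Free-text answers fail at `json.loads` or at the disposition check, so "sounds better" never reaches the comparison step.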
Refinement Workstreams
R0: Model Discovery Registry
Create a registry before arguing about model preference. The registry should track:
- hosted frontier models;
- open-weight Hugging Face candidates;
- local or edge candidates;
- small, cheap triage models;
- larger reasoning models, including future in-house or 27B-class candidates;
- license, hardware, context, latency, cost, tool support, and known failure modes.
The registry does not bless a model. It decides which model deserves a bakeoff fixture.
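A registry entry could be modeled as a plain record plus a completeness check. This sketch assumes the tracked attributes map to the fields below and that an entry qualifies for a bakeoff fixture once its evidence fields are filled in; both the field names and that rule are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RegistryEntry:
    name: str
    category: str  # e.g. "hosted-frontier", "open-weight", "local-edge"
    source_url: str = ""
    license: str = ""
    hardware: str = ""
    context_tokens: Optional[int] = None
    latency_ms: Optional[float] = None
    cost_per_mtok_usd: Optional[float] = None
    tool_support: bool = False
    known_failure_modes: list = field(default_factory=list)


# Assumed minimum evidence before a candidate earns a fixture.
REQUIRED = ("source_url", "license", "hardware", "context_tokens")


def missing_fields(entry: RegistryEntry) -> list:
    """List required registry fields that are still blank."""
    return [name for name in REQUIRED if getattr(entry, name) in ("", None)]


def deserves_fixture(entry: RegistryEntry) -> bool:
    """An entry with no missing evidence is eligible for a bakeoff fixture."""
    return not missing_fields(entry)
```

Eligibility here says nothing about quality; it only means the candidate is documented well enough to test.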
R1: Rubric Packets
Create a small rubric packet for each evaluator role:
- `rio-economics-rubric`
- `theseus-model-integrity-rubric`
- `leo-cross-domain-rubric`
- domain-specific factuality rubrics
Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.
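The four required parts of a packet can be checked mechanically. The sketch below assumes a packet is a dictionary with one key per required part; the key names and the sample values are hypothetical.

```python
# Assumed key names for the four required parts of a packet.
REQUIRED_KEYS = (
    "allowed_verdicts",
    "rejection_tags",
    "must_check",
    "false_positive_examples",
)


def packet_problems(packet: dict) -> list:
    """Return one message per required part that is missing or empty."""
    return [f"missing or empty: {key}" for key in REQUIRED_KEYS if not packet.get(key)]


def verdict_conforms(packet: dict, verdict: dict) -> bool:
    """Check that a verdict uses only the packet's verdicts and tags."""
    if verdict.get("disposition") not in packet["allowed_verdicts"]:
        return False
    return set(verdict.get("issue_tags", [])) <= set(packet["rejection_tags"])


# Illustrative packet; every value here is made up for the example.
EXAMPLE_PACKET = {
    "name": "rio-economics-rubric",
    "allowed_verdicts": ["approve", "reject"],
    "rejection_tags": ["pay_to_pollute", "correlated_priors"],
    "must_check": ["contribution weights are stated"],
    "false_positive_examples": ["rejecting a claim only because it mentions a token"],
}
```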
R2: Evaluation Corpus
Build a replayable corpus from existing PRs:
- approved clean PRs;
- rejected PRs by issue tag;
- Rio/Theseus cross-domain PRs;
- paid-query or contribution-weight examples;
- adversarial malformed claims;
- near-duplicate and OPSEC edge cases.
Use local fixture data first. Production DB sampling requires the DB operator skill.
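Replayability depends on stable ordering and stable serialization. The sketch below is not the contents of `scripts/replay_decision_engine_eval.py`; it is a minimal illustration that takes already-loaded fixture text keyed by file name, and it assumes a hypothetical `expected_disposition` key in each fixture.

```python
import hashlib
import json


def replay(fixtures: dict) -> dict:
    """Replay fixtures in a stable order and fingerprint the output.

    `fixtures` maps file name -> raw JSON text.
    """
    results = []
    for name in sorted(fixtures):  # sorted names keep the order stable
        fixture = json.loads(fixtures[name])
        results.append({
            "fixture": name,
            # "expected_disposition" is an assumed fixture key.
            "expected": fixture.get("expected_disposition"),
        })
    # Canonical JSON so the same results always hash to the same digest.
    payload = json.dumps(results, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"results": results, "sha256": digest}
```

Two runs over the same fixtures should produce the same digest, which gives CI a single value to compare.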
R3: Model Bakeoff
Run each candidate model against the same corpus and emit:
- accuracy against expected disposition;
- false-approve count;
- false-reject count;
- issue-tag precision;
- average latency;
- estimated cost;
- disagreement matrix by model pair.
The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
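Several of these metrics reduce to simple counts. The sketch below assumes each model's output is a mapping from fixture id to a disposition string of `"approve"` or `"reject"`; the function names and that data shape are assumptions. Issue-tag precision, latency, and cost are left out to keep it short.

```python
from itertools import combinations


def score_model(expected: dict, actual: dict) -> dict:
    """Compare one model's dispositions with the expected ones."""
    correct = sum(1 for k, want in expected.items() if actual.get(k) == want)
    false_approve = sum(
        1 for k, want in expected.items()
        if want == "reject" and actual.get(k) == "approve"
    )
    false_reject = sum(
        1 for k, want in expected.items()
        if want == "approve" and actual.get(k) == "reject"
    )
    return {
        "accuracy": correct / len(expected),
        "false_approve": false_approve,
        "false_reject": false_reject,
    }


def disagreement_matrix(outputs: dict) -> dict:
    """Count differing dispositions for every pair of models."""
    matrix = {}
    for a, b in combinations(sorted(outputs), 2):
        shared = outputs[a].keys() & outputs[b].keys()
        matrix[(a, b)] = sum(1 for k in shared if outputs[a][k] != outputs[b][k])
    return matrix
```

Keeping false approvals as a separate count, rather than folding them into accuracy, makes the most important failure visible on its own.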
R4: Feedback Loop
Use review_records, audit_log, costs, and PR state to find:
- recurring model failure categories;
- agents with repeated same-tag rejections;
- prompts that produce vague reviews;
- cost spikes without quality gain;
- routes that keep requiring manual override.
Every prompt/tool change should include a before/after proof over this loop.
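As an example of one such query, repeated same-tag rejections can be found with a single `GROUP BY`. The column names below (`agent`, `issue_tag`, `disposition`) are assumptions about `review_records`, not its actual schema, and the demo runs against an in-memory table rather than any real database.

```python
import sqlite3


def repeated_rejections(conn, min_count=2):
    """Return (agent, issue_tag, count) rows with repeated same-tag rejections."""
    query = """
        SELECT agent, issue_tag, COUNT(*) AS n
        FROM review_records
        WHERE disposition = 'reject'
        GROUP BY agent, issue_tag
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, agent, issue_tag
    """
    return conn.execute(query, (min_count,)).fetchall()


def demo_connection():
    """Build an in-memory table with the assumed columns and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE review_records (agent TEXT, issue_tag TEXT, disposition TEXT)"
    )
    conn.executemany(
        "INSERT INTO review_records VALUES (?, ?, ?)",
        [
            ("agent-a", "vague_claim", "reject"),
            ("agent-a", "vague_claim", "reject"),
            ("agent-a", "opsec", "reject"),
            ("agent-b", "vague_claim", "approve"),
        ],
    )
    return conn
```

The query only reads, which keeps it inside the default read-only boundary.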
R5: Agent Runtime Packages
Package the same decision-engine contract for:
- NousResearch Hermes Agent: skill/memory/model-switching oriented.
- OpenClaw: workspace skill plus `AGENTS.md`, `SOUL.md`, `TOOLS.md` oriented.
- Claude-style, Pentagon, or other persistent agents: skill-oriented knowledge-base read/write interop.
All packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.
R6: Knowledge-Base Interop
Any Hermes, OpenClaw, or Claude-style agent should be able to read information from the Living IP knowledge base and propose writes back into it.
The contract is:
- read through deterministic search, claim indexes, copied SQLite state, or cited repo files;
- propose source, claim, entity, correction, and route artifacts;
- never write directly to main;
- never mutate production `pipeline.db` from a model response;
- leave proof showing the exact query, cited reads, proposed write, and route evidence.
Use `.agents/skills/living-ip-kb-interop/SKILL.md` for runtime-neutral KB access, and `.agents/skills/teleo-db-operator/SKILL.md` for SQLite-specific work.
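The propose-only contract can be expressed as a function that builds a proof artifact and refuses forbidden targets. This is a sketch under assumptions: the artifact keys, the `target` field, and the forbidden-target labels are hypothetical and do not describe the skill files named above.

```python
import hashlib
import json

# Assumed labels for destinations a proposal may never name.
FORBIDDEN_TARGETS = {"main", "production:pipeline.db"}


def build_proposal(query, cited_reads, proposed_write, route_evidence):
    """Assemble a propose-only proof artifact; nothing is written anywhere."""
    if proposed_write.get("target") in FORBIDDEN_TARGETS:
        raise ValueError("proposals may not target main or production pipeline.db")
    if not cited_reads:
        raise ValueError("a proposal needs at least one cited read")
    proof = {
        "mode": "propose_only",
        "query": query,
        "cited_reads": list(cited_reads),
        "proposed_write": proposed_write,
        "route_evidence": route_evidence,
    }
    # Fingerprint the artifact so reviewers can confirm it was not edited.
    body = json.dumps(proof, sort_keys=True)
    proof["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return proof
```

The function returns data only, so the decision to apply a proposal stays with the review route.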
DB Usage Boundary
Default is read-only.
Writes are allowed only when all are true:
- the target DB is local, staging, or explicitly authorized production;
- a backup or copy exists;
- the write is wrapped in a transaction;
- the exact query is retained in a proof artifact;
- the post-write readback is retained.
Never let an agent tune prompts by mutating production state directly.
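The five write conditions map onto a small wrapper around the standard `sqlite3` module. This is a minimal sketch, assuming the caller labels the target explicitly; the `guarded_write` name, the target labels, and the proof keys are hypothetical.

```python
import shutil
import sqlite3
from pathlib import Path

# Assumed labels for targets where writes may be allowed.
ALLOWED_TARGETS = {"local", "staging", "authorized-production"}


def guarded_write(db_path, target, sql, params, readback_sql):
    """Apply one write under the five conditions and return a proof record."""
    if target not in ALLOWED_TARGETS:
        raise PermissionError(f"target not allowed: {target}")
    db_path = Path(db_path)
    backup = db_path.with_name(db_path.name + ".bak")
    shutil.copy2(db_path, backup)  # a copy exists before any write
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(sql, params)
        readback = conn.execute(readback_sql).fetchall()
    finally:
        conn.close()
    return {
        "target": target,
        "backup": str(backup),
        "query": sql,
        "params": list(params),
        "readback": readback,
    }
```

A plain file copy is only a safe backup when no other connection is writing to the database; for a database in use, `sqlite3.Connection.backup` is the safer choice.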
Pentagon.run Boundary
Pentagon.run should own:
- disposable VPS setup;
- Crabbox or remote proof execution;
- Hetzner lifecycle;
- runner cleanup;
- infra receipts;
- persistent agent teammates, company-brain infrastructure, and agent-to-agent transport when that is their managed stack.
This repo should own:
- decision-engine quality;
- model and prompt experiments;
- agent skills and adapter handoffs;
- database feedback analysis;
- proof schemas for eval quality.
Raw cards and secrets are not agent runtime inputs. Human operators may decide vendor billing and spend policy, but repo artifacts should only name secret slots, scoped tokens, spend limits, receipts, and setup checklists.
Transcript-Derived Requirements
The 2026-06-01 working transcript adds these requirements:
- LLM/refinement work should focus on model discovery, compression, context strategy, and decision-engine quality while Pentagon handles cloud/persistent-agent infrastructure.
- Rio should be the first place to route Meteora, LP, x402, futarchy, paid-query, and contribution-incentive questions.
- Theseus should own the skill/MCP/refinement path that makes model judgment portable across Hermes, OpenClaw, Claude-style agents, and Pentagon-style company brains.
- The knowledge-writing path should turn large founder/source corpora into structured, reviewable knowledge packets, not shallow summaries.
- Slack, Linear, email, billing, and provider accounts are external collaboration setup. They should unblock people, but they are not prerequisites for local fixture, rubric, and proof work.
Next Implementation Slice
- Add `docs/model-discovery-registry.md`.
- Add `scripts/replay_decision_engine_eval.py` with local fixture mode.
- Add `fixtures/decision-engine-eval/*.json`.
- Store verdict outputs in `.crabbox-results/decision-engine-eval.json`.
- Add one Rio economics fixture and one Theseus model-integrity fixture.
- Add one KB interop fixture that searches existing context and proposes a write without touching main or production DB.
- Compare current prompt versus one candidate prompt before touching runtime prompts.
Do not start by changing live model assignments.
Run `python3 scripts/replay_decision_engine_eval.py` after changing fixture, rubric, registry, or candidate-output formats.
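The current-versus-candidate prompt comparison in this slice needs an explicit pass rule. The sketch below assumes each side is summarized as a dictionary with `accuracy` and `false_approve` keys; the function name, those keys, and the rule itself are illustrative assumptions, not the repo's gate.

```python
def refinement_passes(baseline: dict, candidate: dict, min_accuracy_gain: float = 0.0) -> bool:
    """Decide whether a candidate prompt beats the current baseline."""
    # Assumption: a refinement may never add false approvals.
    if candidate["false_approve"] > baseline["false_approve"]:
        return False
    # Require a strict accuracy gain above the configured margin.
    return candidate["accuracy"] - baseline["accuracy"] > min_accuracy_gain
```

Under this rule a candidate that merely ties the baseline fails, which matches the principle that changes land only when a measured delta proves them.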