- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

- `.agents/skills/decision-engine-refinement/SKILL.md`
- `.agents/skills/nousresearch-hermes-agent/SKILL.md`
- `.agents/skills/openclaw-agent/SKILL.md`
- `.agents/skills/teleo-db-operator/SKILL.md`
- `.crabbox.yaml`
- `.github/workflows/ci.yml`
- `docs/llm-refinement-decision-engine.md`
- `scripts/check_llm_refinement_contract.py`
2.1 KiB
| name | description |
|---|---|
| decision-engine-refinement | Use when improving Living IP decision-engine quality, LLM model selection, evaluator prompts, rubrics, replay evals, Rio or Theseus reviewer behavior, or model bakeoffs. |
Decision Engine Refinement
Use this skill for quality work, not infrastructure work. Pentagon.run or Crabbox can run remote jobs; this repo owns model judgment, rubric design, prompt/tool refinement, and proof artifacts.
Workflow
- Read `docs/llm-refinement-decision-engine.md`.
- Identify the lane: Rio economics, Theseus model integrity, Leo cross-domain, domain factuality, retrieval quality, or prompt/tool self-upgrade.
- Build or reuse a replayable fixture before changing prompts or model assignments.
- Compare baseline vs candidate with the same input, same rubric, and structured verdict format.
- Record false approves, false rejects, useful disagreements, cost, and latency.
- Change runtime prompts/models only after the candidate shows a measured improvement with no critical regression.
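The comparison and promotion steps above can be sketched as follows. This is a minimal illustration, not the repo's harness: the `Verdict` shape, the function names, and the choice to treat any increase in false approves as the "critical regression" are all assumptions made for the example, and the fixture values are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical structured verdict for one fixture item."""
    item_id: str
    approve: bool
    cost_usd: float
    latency_s: float


def replay_metrics(verdicts, expected):
    """Score one run against the fixture's expected approve/reject labels."""
    return {
        "false_approves": sum(1 for v in verdicts if v.approve and not expected[v.item_id]),
        "false_rejects": sum(1 for v in verdicts if not v.approve and expected[v.item_id]),
        "cost_usd": sum(v.cost_usd for v in verdicts),
        "mean_latency_s": sum(v.latency_s for v in verdicts) / len(verdicts),
    }


def disagreements(baseline, candidate):
    """Item ids where the two runs reached different verdicts on the same input."""
    candidate_by_id = {v.item_id: v.approve for v in candidate}
    return [v.item_id for v in baseline if candidate_by_id[v.item_id] != v.approve]


def should_promote(base, cand):
    """Assumed gate: fewer total errors and no increase in false approves."""
    no_critical_regression = cand["false_approves"] <= base["false_approves"]
    base_errors = base["false_approves"] + base["false_rejects"]
    cand_errors = cand["false_approves"] + cand["false_rejects"]
    return no_critical_regression and cand_errors < base_errors


# Placeholder fixture: same inputs and labels for both runs.
expected = {"a": True, "b": False, "c": True}
baseline = [Verdict("a", True, 0.01, 1.0), Verdict("b", True, 0.01, 1.0), Verdict("c", False, 0.01, 1.0)]
candidate = [Verdict("a", True, 0.02, 2.0), Verdict("b", False, 0.02, 2.0), Verdict("c", False, 0.02, 2.0)]

base_metrics = replay_metrics(baseline, expected)
cand_metrics = replay_metrics(candidate, expected)
changed_items = disagreements(baseline, candidate)
promote = should_promote(base_metrics, cand_metrics)
```

Cost and latency are recorded but deliberately left out of the gate here; a real gate would probably also bound them, and the disagreement list is meant for human review rather than automatic scoring.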
Hard Rules
- Do not change live model assignments because one answer sounds better.
- Do not use production DB writes to tune prompts.
- Do not collapse Rio and Theseus into generic "reviewers".
- Do not treat payment, popularity, or engagement as quality approval.
- Do not claim production decision-engine improvement without replay evidence and live/staging readback.
Agent Responsibilities
- Rio: incentive design, contribution weights, paid-query effects, market/mechanism reasoning, OPSEC, correlated-prior warnings.
- Theseus: model diversity, adversarial evals, disagreement queues, self-upgrade criteria, prompt/tool safety, verifier drift.
- Leo: cross-domain synthesis, fallback review, final arbitration where the route or rubric is ambiguous.
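One way to keep these responsibilities distinct in code is an explicit lane-to-reviewer map with Leo as the fallback. The sketch below is illustrative only: the lane keys and the `pick_reviewer` helper are hypothetical, and it maps just the three lanes whose owner is stated above, sending anything else to Leo.

```python
# Hypothetical lane keys; only lanes with an explicitly named owner are mapped.
REVIEWER_BY_LANE = {
    "economics": "Rio",
    "model-integrity": "Theseus",
    "cross-domain": "Leo",
}

# Leo handles fallback review when the route is ambiguous.
FALLBACK_REVIEWER = "Leo"


def pick_reviewer(lane):
    """Return the named reviewer for a lane, or the fallback if the lane is unmapped."""
    return REVIEWER_BY_LANE.get(lane, FALLBACK_REVIEWER)


economics_reviewer = pick_reviewer("economics")
ambiguous_reviewer = pick_reviewer("unknown-lane")
```

Keeping the map explicit, instead of a single generic reviewer role, makes it visible in a diff when a lane's owner changes.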
Expected Artifacts
- fixture file or DB query used for sampling;
- baseline verdict output;
- candidate verdict output;
- summary JSON with quality, cost, latency, and disagreement metrics;
- patch scoped to prompts, model config, rubric docs, or eval harness.
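A summary JSON covering the artifacts above might be assembled like this. The field names, file paths, and numbers are hypothetical placeholders chosen for the example; the source only specifies that the summary carries quality, cost, latency, and disagreement metrics.

```python
import json


def build_summary(fixture, baseline_output, candidate_output,
                  quality, cost_usd, latency_s, disagreement_ids):
    """Bundle pointers to the replay artifacts with the headline metrics."""
    return {
        "fixture": fixture,
        "baseline_output": baseline_output,
        "candidate_output": candidate_output,
        "quality": quality,
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "disagreements": {
            "count": len(disagreement_ids),
            "item_ids": sorted(disagreement_ids),
        },
    }


# Placeholder values only; none of these paths or numbers come from the repo.
summary = build_summary(
    fixture="fixtures/example.jsonl",
    baseline_output="out/baseline.jsonl",
    candidate_output="out/candidate.jsonl",
    quality={"false_approves": 0, "false_rejects": 1},
    cost_usd={"baseline": 0.03, "candidate": 0.06},
    latency_s={"baseline": 1.0, "candidate": 2.0},
    disagreement_ids=["b"],
)

# sort_keys keeps the serialized artifact stable across runs, which helps diffing.
summary_json = json.dumps(summary, indent=2, sort_keys=True)
```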
Run `python3 scripts/check_llm_refinement_contract.py` after editing this surface.