- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

Files:

- `.agents/skills/decision-engine-refinement/SKILL.md`
- `.agents/skills/nousresearch-hermes-agent/SKILL.md`
- `.agents/skills/openclaw-agent/SKILL.md`
- `.agents/skills/teleo-db-operator/SKILL.md`
- `.crabbox.yaml`
- `.github/workflows/ci.yml`
- `docs/llm-refinement-decision-engine.md`
- `scripts/check_llm_refinement_contract.py`
41 lines
2.1 KiB
Markdown
---
name: decision-engine-refinement
description: Use when improving Living IP decision-engine quality, LLM model selection, evaluator prompts, rubrics, replay evals, Rio or Theseus reviewer behavior, or model bakeoffs.
---

# Decision Engine Refinement

Use this skill for quality work, not infrastructure work. Pentagon.run or Crabbox can run remote jobs; this repo owns model judgment, rubric design, prompt/tool refinement, and proof artifacts.

## Workflow

1. Read `docs/llm-refinement-decision-engine.md`.
2. Identify the lane: Rio economics, Theseus model integrity, Leo cross-domain, domain factuality, retrieval quality, or prompt/tool self-upgrade.
3. Build or reuse a replayable fixture before changing prompts or model assignments.
4. Compare baseline vs candidate with the same input, the same rubric, and the same structured verdict format.
5. Record false approves, false rejects, useful disagreements, cost, and latency.
6. Change runtime prompts/models only after the candidate shows a measured improvement with no critical regression.
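
Steps 4 through 6 above can be sketched as a small replay comparison. This is a minimal illustration, not the repo's eval harness: the `Case` fields, the `approve`/`reject` labels, and the `should_promote` rule (fewer total errors, and no critical case newly wrong) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical verdict labels; the real structured verdict format may differ.
APPROVE = "approve"
REJECT = "reject"


@dataclass(frozen=True)
class Case:
    """One replayable fixture case with a labeled expected verdict."""
    case_id: str
    expected: str   # APPROVE or REJECT
    critical: bool  # True if a wrong verdict here counts as a critical regression


def score(cases, verdicts):
    """Collect false approves and false rejects for one run.

    `verdicts` maps case_id -> verdict string produced under the shared rubric.
    """
    false_approves = [c.case_id for c in cases
                      if c.expected == REJECT and verdicts[c.case_id] == APPROVE]
    false_rejects = [c.case_id for c in cases
                     if c.expected == APPROVE and verdicts[c.case_id] == REJECT]
    return {"false_approves": false_approves, "false_rejects": false_rejects}


def should_promote(cases, baseline, candidate):
    """Gate for step 6: measured improvement with no critical regression."""
    base = score(cases, baseline)
    cand = score(cases, candidate)
    base_errors = set(base["false_approves"]) | set(base["false_rejects"])
    cand_errors = set(cand["false_approves"]) | set(cand["false_rejects"])
    critical_ids = {c.case_id for c in cases if c.critical}
    # A critical regression is a critical case the baseline got right
    # and the candidate gets wrong.
    critical_regressions = (cand_errors - base_errors) & critical_ids
    improved = len(cand_errors) < len(base_errors)
    return improved and not critical_regressions
```

Both runs are scored against the same fixture cases, so a candidate that wins on raw error count but newly fails a critical case is still held back.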

## Hard Rules

- Do not change live model assignments because one answer sounds better.
- Do not use production DB writes to tune prompts.
- Do not collapse Rio and Theseus into generic "reviewers".
- Do not treat payment, popularity, or engagement as quality approval.
- Do not claim production decision-engine improvement without replay evidence and live/staging readback.

## Agent Responsibilities

- Rio: incentive design, contribution weights, paid-query effects, market/mechanism reasoning, OPSEC, correlated-prior warnings.
- Theseus: model diversity, adversarial evals, disagreement queues, self-upgrade criteria, prompt/tool safety, verifier drift.
- Leo: cross-domain synthesis, fallback review, final arbitration where the route or rubric is ambiguous.

## Expected Artifacts

- fixture file or DB query used for sampling;
- baseline verdict output;
- candidate verdict output;
- summary JSON with quality, cost, latency, and disagreement metrics;
- patch scoped to prompts, model config, rubric docs, or eval harness.

Run `python3 scripts/check_llm_refinement_contract.py` after editing this surface.
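
The summary JSON listed under Expected Artifacts might be assembled along the following lines. This is a sketch under assumptions: the per-run keys (`case_id`, `verdict`, `cost_usd`, `latency_s`) and the output field names are hypothetical and are not taken from the contract checker. Quality counts such as false approves and false rejects would sit alongside these fields.

```python
import json
from statistics import mean


def build_summary(baseline_runs, candidate_runs):
    """Aggregate per-case runs into one summary dict.

    Each run is a dict with hypothetical keys: case_id, verdict,
    cost_usd, latency_s. Both lists are assumed to cover the same case_ids.
    """
    base_by_id = {r["case_id"]: r for r in baseline_runs}
    # Cases where the candidate verdict differs from the baseline verdict.
    disagreements = sorted(
        r["case_id"] for r in candidate_runs
        if r["verdict"] != base_by_id[r["case_id"]]["verdict"]
    )

    def side(runs):
        return {
            "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 6),
            "mean_latency_s": round(mean(r["latency_s"] for r in runs), 3),
        }

    return {
        "cases": len(candidate_runs),
        "baseline": side(baseline_runs),
        "candidate": side(candidate_runs),
        "disagreement_rate": len(disagreements) / len(candidate_runs),
        "disagreement_case_ids": disagreements,
    }


def write_summary(path, summary):
    """Write the summary as stable, diff-friendly JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2, sort_keys=True)
```

Listing the disagreeing case ids, not just the rate, keeps the useful disagreements reviewable by hand.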