twentyOne2x aee534e686 Add decision engine refinement contracts

- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

`.agents/skills/decision-engine-refinement/SKILL.md`
`.agents/skills/nousresearch-hermes-agent/SKILL.md`
`.agents/skills/openclaw-agent/SKILL.md`
`.agents/skills/teleo-db-operator/SKILL.md`
`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`scripts/check_llm_refinement_contract.py`

2026-06-01 15:50:48 +02:00

2.1 KiB


name: decision-engine-refinement
description: Use when improving Living IP decision-engine quality, LLM model selection, evaluator prompts, rubrics, replay evals, Rio or Theseus reviewer behavior, or model bakeoffs.

Decision Engine Refinement

Use this skill for quality work, not infrastructure work. Pentagon.run or Crabbox can run remote jobs; this repo owns model judgment, rubric design, prompt/tool refinement, and proof artifacts.

Workflow

1. Read `docs/llm-refinement-decision-engine.md`.
2. Identify the lane: Rio economics, Theseus model integrity, Leo cross-domain, domain factuality, retrieval quality, or prompt/tool self-upgrade.
3. Build or reuse a replayable fixture before changing prompts or model assignments.
4. Compare baseline and candidate on the same input, the same rubric, and the same structured verdict format.
5. Record false approves, false rejects, useful disagreements, cost, and latency.
6. Change runtime prompts or models only after the candidate shows a measured improvement with no critical regression.
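
The comparison and recording steps above can be sketched as follows. This is a minimal illustration, not the repo's eval harness: the `Verdict` fields, the function names, and the promotion gate are all hypothetical, and treating a rise in false approves as the "critical regression" is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical structured verdict; field names are assumptions."""
    item_id: str
    approve: bool
    cost_usd: float
    latency_s: float


def score(verdicts, expected):
    """Score verdicts against expected labels (item_id -> should approve)."""
    n = len(verdicts)
    return {
        "false_approves": sum(1 for v in verdicts if v.approve and not expected[v.item_id]),
        "false_rejects": sum(1 for v in verdicts if not v.approve and expected[v.item_id]),
        "total_cost_usd": sum(v.cost_usd for v in verdicts),
        "mean_latency_s": sum(v.latency_s for v in verdicts) / n if n else 0.0,
    }


def disagreements(baseline, candidate):
    """Item ids where baseline and candidate reached different verdicts."""
    cand = {v.item_id: v for v in candidate}
    return [b.item_id for b in baseline
            if b.item_id in cand and cand[b.item_id].approve != b.approve]


def candidate_is_promotable(base_score, cand_score):
    """Assumed gate: fewer total errors and no increase in false approves."""
    base_err = base_score["false_approves"] + base_score["false_rejects"]
    cand_err = cand_score["false_approves"] + cand_score["false_rejects"]
    return cand_err < base_err and cand_score["false_approves"] <= base_score["false_approves"]


# Same fixture items, same expected labels for both runs.
expected = {"a": True, "b": False, "c": False}
baseline = [Verdict("a", True, 0.01, 1.0), Verdict("b", True, 0.01, 1.0),
            Verdict("c", False, 0.01, 1.0)]
candidate = [Verdict("a", True, 0.02, 2.0), Verdict("b", False, 0.02, 2.0),
             Verdict("c", False, 0.02, 2.0)]

base_score = score(baseline, expected)
cand_score = score(candidate, expected)
print(base_score, cand_score, disagreements(baseline, candidate))
print("promotable:", candidate_is_promotable(base_score, cand_score))
```

In this toy run the candidate removes one false approve at higher cost and latency; whether that trade is acceptable is a rubric decision the sketch does not make.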

Hard Rules

Do not change live model assignments because one answer sounds better.
Do not use production DB writes to tune prompts.
Do not collapse Rio and Theseus into generic "reviewers".
Do not treat payment, popularity, or engagement as quality approval.
Do not claim production decision-engine improvement without replay evidence and live/staging readback.

Agent Responsibilities

Rio: incentive design, contribution weights, paid-query effects, market/mechanism reasoning, OPSEC, correlated-prior warnings.
Theseus: model diversity, adversarial evals, disagreement queues, self-upgrade criteria, prompt/tool safety, verifier drift.
Leo: cross-domain synthesis, fallback review, final arbitration where the route or rubric is ambiguous.
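
One way to keep the reviewers distinct in code is an explicit lane-to-reviewer table with Leo as the fallback. The sketch below is illustrative only: the lane keys and `reviewer_for` are hypothetical names, and sending unlisted lanes (such as domain factuality or retrieval quality) to Leo is an assumption based on Leo's fallback role.

```python
# Hypothetical lane keys; the reviewer names come from the list above.
LANE_REVIEWERS = {
    "rio-economics": "Rio",
    "theseus-model-integrity": "Theseus",
    "prompt-tool-self-upgrade": "Theseus",  # assumed: Theseus owns self-upgrade criteria
    "leo-cross-domain": "Leo",
}

# Assumption: lanes without a dedicated reviewer fall back to Leo.
FALLBACK_REVIEWER = "Leo"


def reviewer_for(lane: str) -> str:
    """Return the named reviewer for a lane, never a generic 'reviewer'."""
    return LANE_REVIEWERS.get(lane, FALLBACK_REVIEWER)


print(reviewer_for("rio-economics"))
print(reviewer_for("retrieval-quality"))
```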

Expected Artifacts

fixture file or DB query used for sampling;
baseline verdict output;
candidate verdict output;
summary JSON with quality, cost, latency, and disagreement metrics;
patch scoped to prompts, model config, rubric docs, or eval harness.
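
The summary JSON might be assembled along these lines. The key names, the fixture path, and `build_summary` are hypothetical; the source only requires that the summary cover quality, cost, latency, and disagreement metrics.

```python
import json


def build_summary(fixture, baseline, candidate, disagreement_ids):
    """Combine per-run metrics into one summary dict.

    Assumes both metric dicts hold numeric values under shared keys.
    """
    return {
        "fixture": fixture,
        "baseline": baseline,
        "candidate": candidate,
        # Candidate minus baseline; negative is better for errors, cost, latency.
        "delta": {k: candidate[k] - baseline[k] for k in baseline if k in candidate},
        "disagreements": {
            "count": len(disagreement_ids),
            "item_ids": sorted(disagreement_ids),
        },
    }


summary = build_summary(
    fixture="fixtures/replay_sample.jsonl",  # hypothetical path
    baseline={"false_approves": 3, "false_rejects": 2, "cost_cents": 40, "latency_ms": 900},
    candidate={"false_approves": 1, "false_rejects": 2, "cost_cents": 55, "latency_ms": 1100},
    disagreement_ids=["item-7", "item-2"],
)
print(json.dumps(summary, indent=2, sort_keys=True))
```

Keeping the raw baseline and candidate metrics next to the deltas lets a reviewer check the arithmetic without rerunning the replay.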

Run `python3 scripts/check_llm_refinement_contract.py` after editing this surface.
