- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

- `.agents/skills/decision-engine-refinement/SKILL.md`
- `.agents/skills/nousresearch-hermes-agent/SKILL.md`
- `.agents/skills/openclaw-agent/SKILL.md`
- `.agents/skills/teleo-db-operator/SKILL.md`
- `.crabbox.yaml`
- `.github/workflows/ci.yml`
- `docs/llm-refinement-decision-engine.md`
- `scripts/check_llm_refinement_contract.py`
LLM Refinement And Decision Engine Program
Created: 2026-06-01
Status: active direction
Product Outcome
The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.
Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.
What Rio And Theseus Become
Rio
Rio becomes the economic and incentive-quality evaluator.
Rio owns:
- contribution weights and role economics;
- paid-query effects and anti-pay-to-pollute rules;
- market, mechanism, futarchy, x402, token, and capital-formation reasoning;
- source-diversity and correlated-prior warnings;
- OPSEC for finance, deal terms, token economics, and internal allocations;
- model tests that expose weak economic reasoning.
Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives create useful knowledge or garbage.
Theseus
Theseus becomes the model-integrity and agent-refinement evaluator.
Theseus owns:
- model diversity and correlated-blind-spot measurement;
- adversarial eval rubrics;
- prompt/tool safety and self-upgrade criteria;
- disagreement queues and verifier-divergence analysis;
- LLM capability evidence and agent-system architecture;
- tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.
Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.
Decision Engine Loop
flowchart TD
PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
Route --> Reviewers["Required agent reviewers"]
Reviewers --> Rubric["Shared rubric"]
Rubric --> ModelA["Primary model"]
Rubric --> ModelB["Independent model family"]
ModelA --> Verdicts["Structured verdicts"]
ModelB --> Verdicts
Verdicts --> Disagree{"Disagreement?"}
Disagree -->|yes| Queue["Disagreement queue"]
Disagree -->|no| Metrics["Calibration metrics"]
Queue --> HumanOrLeo["Leo or human arbitration"]
HumanOrLeo --> Metrics
Metrics --> DB["SQLite feedback state"]
DB --> Refine["Prompt, tool, or model proposal"]
Refine --> Delta["Before/after eval harness"]
Delta -->|passes| Update["Commit refinement"]
Delta -->|fails| Archive["Archive failed refinement"]
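The two branch points in the flowchart above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation: the `Verdict` class, the `"approve"`/`"reject"` values, and the stage names returned by the functions are all hypothetical labels chosen to mirror the diagram nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    model_family: str  # hypothetical label, e.g. "family-a"
    decision: str      # hypothetical values: "approve" or "reject"


def next_stage(primary: Verdict, independent: Verdict) -> str:
    """Route a verdict pair to the queue on disagreement, else to metrics."""
    if primary.model_family == independent.model_family:
        # The independent check only counts if it comes from another family.
        raise ValueError("independent check must use a different model family")
    if primary.decision != independent.decision:
        return "disagreement_queue"
    return "calibration_metrics"


def refinement_outcome(delta_passes: bool) -> str:
    """Mirror the final gate: commit only when the before/after harness passes."""
    return "commit_refinement" if delta_passes else "archive_failed_refinement"
```

Raising on a same-family pair is a design choice of this sketch: it makes a misconfigured reviewer pair fail loudly instead of silently producing agreement.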
Model Portfolio
The goal is not to pick one favorite model. The goal is to assign models to failure modes.
| Lane | Primary evaluator | Independent check | Why |
|---|---|---|---|
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |
Candidate models should enter only through a harness:
- fixed input set;
- fixed rubric;
- structured verdict JSON;
- cost and latency recorded;
- disagreement categories stored;
- before/after comparison against current baseline.
No model switch is accepted because it "sounds better" on one example.
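As a rough sketch of what one harness run could record, the function below replays a fixed input set through a candidate and emits one JSON-serializable row per fixture. Everything named here is an assumption for illustration: `run_candidate`, the fixture fields `id` and `claim`, the row keys, and the `stub_model` that stands in for a real model call.

```python
import json
import time
from typing import Callable


def run_candidate(
    model_name: str,
    evaluate: Callable[[str], dict],
    fixtures: list,
    baseline: dict,
) -> list:
    """Run one candidate over a fixed input set and record comparable rows."""
    rows = []
    for fixture in fixtures:
        started = time.perf_counter()
        # evaluate is assumed to return {"verdict": ..., "cost_usd": ...}
        result = evaluate(fixture["claim"])
        latency_s = time.perf_counter() - started
        baseline_verdict = baseline[fixture["id"]]
        agrees = result["verdict"] == baseline_verdict
        rows.append({
            "fixture_id": fixture["id"],
            "model": model_name,
            "verdict": result["verdict"],
            "cost_usd": result.get("cost_usd", 0.0),
            "latency_s": latency_s,
            "baseline_verdict": baseline_verdict,
            # Store the direction of disagreement, or None when verdicts match.
            "disagreement_category": (
                None if agrees else f"{baseline_verdict}->{result['verdict']}"
            ),
        })
    return rows


def stub_model(claim: str) -> dict:
    """Stub evaluator standing in for a real model call."""
    verdict = "reject" if "guaranteed" in claim else "approve"
    return {"verdict": verdict, "cost_usd": 0.001}


fixtures = [
    {"id": "f1", "claim": "Token X has guaranteed returns."},
    {"id": "f2", "claim": "Contribution weights sum to one."},
]
baseline = {"f1": "approve", "f2": "approve"}
rows = run_candidate("candidate-a", stub_model, fixtures, baseline)
print(json.dumps(rows, indent=2))
```

Because every candidate produces rows with the same keys, comparing two models against the baseline reduces to comparing two lists of rows over the same fixture IDs.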
Refinement Workstreams
R1: Rubric Packets
Create a small rubric packet for each evaluator role:
- `rio-economics-rubric`
- `theseus-model-integrity-rubric`
- `leo-cross-domain-rubric`
- domain-specific factuality rubrics
Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.
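One possible shape for a packet is a small frozen dataclass with a validation method. The field names follow the four requirements above; the concrete verdicts, tags, criteria, and the rule that a reject needs at least one tag are assumptions made for this sketch, not values taken from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricPacket:
    name: str
    allowed_verdicts: tuple
    rejection_tags: tuple
    must_check: tuple
    false_positive_examples: tuple

    def problems(self, verdict: str, tags: list) -> list:
        """Return contract violations for one structured verdict."""
        found = []
        if verdict not in self.allowed_verdicts:
            found.append(f"verdict not allowed: {verdict}")
        for tag in tags:
            if tag not in self.rejection_tags:
                found.append(f"unknown rejection tag: {tag}")
        # Assumed rule: a rejection must say why.
        if verdict == "reject" and not tags:
            found.append("reject verdict needs at least one rejection tag")
        return found


# Hypothetical packet contents, for illustration only.
rio_packet = RubricPacket(
    name="rio-economics-rubric",
    allowed_verdicts=("approve", "request_changes", "reject"),
    rejection_tags=("pay_to_pollute", "opsec_leak", "correlated_prior"),
    must_check=("contribution weights justified", "source diversity stated"),
    false_positive_examples=("flagging any mention of a token as OPSEC",),
)

print(rio_packet.problems("reject", ["pay_to_pollute"]))
print(rio_packet.problems("maybe", ["vibes"]))
```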
R2: Evaluation Corpus
Build a replayable corpus from existing PRs:
- approved clean PRs;
- rejected PRs by issue tag;
- Rio/Theseus cross-domain PRs;
- paid-query or contribution-weight examples;
- adversarial malformed claims;
- near-duplicate and OPSEC edge cases.
Use local fixture data first. Production DB sampling requires the DB operator skill.
R3: Model Bakeoff
Run each candidate model against the same corpus and emit:
- accuracy against expected disposition;
- false-approve count;
- false-reject count;
- issue-tag precision;
- average latency;
- estimated cost;
- disagreement matrix by model pair.
The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
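The counting behind these metrics is simple enough to sketch directly. The function names, the `"approve"`/`"reject"` verdict values, and the fixture IDs below are illustrative assumptions; latency, cost, and issue-tag precision are left out to keep the example short.

```python
from itertools import combinations


def bakeoff_metrics(expected: dict, verdicts: dict) -> dict:
    """Score one model's verdicts against the expected dispositions."""
    correct = sum(1 for k, exp in expected.items() if verdicts[k] == exp)
    false_approves = sum(
        1 for k, exp in expected.items()
        if exp == "reject" and verdicts[k] == "approve"
    )
    false_rejects = sum(
        1 for k, exp in expected.items()
        if exp == "approve" and verdicts[k] == "reject"
    )
    return {
        "accuracy": correct / len(expected),
        "false_approves": false_approves,
        "false_rejects": false_rejects,
    }


def disagreement_matrix(by_model: dict) -> dict:
    """Count fixtures on which each pair of models returned different verdicts."""
    matrix = {}
    for a, b in combinations(sorted(by_model), 2):
        matrix[(a, b)] = sum(
            1 for k in by_model[a] if by_model[a][k] != by_model[b][k]
        )
    return matrix


expected = {"f1": "approve", "f2": "reject", "f3": "reject"}
by_model = {
    "model-a": {"f1": "approve", "f2": "approve", "f3": "reject"},
    "model-b": {"f1": "reject", "f2": "reject", "f3": "reject"},
}
metrics_a = bakeoff_metrics(expected, by_model["model-a"])
metrics_b = bakeoff_metrics(expected, by_model["model-b"])
matrix = disagreement_matrix(by_model)
```

In this toy data both models reach the same accuracy, but `model-a` has one false approve and `model-b` has one false reject, which is why accuracy alone is not enough to rank them.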
R4: Feedback Loop
Use `review_records`, `audit_log`, `costs`, and PR state to find:
- recurring model failure categories;
- agents with repeated same-tag rejections;
- prompts that produce vague reviews;
- cost spikes without quality gain;
- routes that keep requiring manual override.
Every prompt/tool change should include a before/after proof over this loop.
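One of these lookups, agents with repeated same-tag rejections, could be expressed as a read-only aggregate query. The table name `review_records` comes from the text above, but its columns (`agent`, `verdict`, `issue_tag`) are assumed for this sketch, and the demo uses an in-memory SQLite database with made-up rows, not real review data.

```python
import sqlite3


def repeated_rejections(conn: sqlite3.Connection, min_count: int = 2) -> list:
    """Return (agent, issue_tag, count) for tags an agent keeps getting rejected on."""
    return conn.execute(
        """
        SELECT agent, issue_tag, COUNT(*) AS n
        FROM review_records
        WHERE verdict = 'reject'
        GROUP BY agent, issue_tag
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, agent, issue_tag
        """,
        (min_count,),
    ).fetchall()


# Demo on an in-memory database with a hypothetical schema and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review_records (agent TEXT, verdict TEXT, issue_tag TEXT)")
conn.executemany(
    "INSERT INTO review_records VALUES (?, ?, ?)",
    [
        ("rio", "reject", "near_duplicate"),
        ("rio", "reject", "near_duplicate"),
        ("rio", "approve", None),
        ("theseus", "reject", "weak_causal_claim"),
    ],
)
repeats = repeated_rejections(conn)
print(repeats)
```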
R5: Agent Runtime Packages
Package the same decision-engine contract for:
- NousResearch Hermes Agent: skill/memory/model-switching oriented.
- OpenClaw: workspace skill plus `AGENTS.md`, `SOUL.md`, `TOOLS.md` oriented.
Both packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.
DB Usage Boundary
Default is read-only.
Writes are allowed only when all are true:
- the target DB is local, staging, or explicitly authorized production;
- a backup or copy exists;
- the write is wrapped in a transaction;
- the exact query is retained in a proof artifact;
- the post-write readback is retained.
Never let an agent tune prompts by mutating production state directly.
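The five write conditions can be enforced by a single wrapper, sketched below with Python's standard `sqlite3` module. The function name `guarded_write`, its parameters, the target labels, and the proof-file layout are all assumptions for illustration; the DB operator skill may implement this boundary differently.

```python
import json
import shutil
import sqlite3
from pathlib import Path

ALLOWED_TARGETS = {"local", "staging"}  # hypothetical target labels


def guarded_write(
    db_path: Path,
    target: str,
    sql: str,
    params: tuple,
    readback_sql: str,
    proof_path: Path,
    production_authorized: bool = False,
) -> dict:
    """Apply one write only if every condition in the boundary holds."""
    if target not in ALLOWED_TARGETS and not (
        target == "production" and production_authorized
    ):
        raise PermissionError(f"writes to {target!r} are not authorized")

    # Condition: a backup or copy exists before the write.
    backup = db_path.with_suffix(db_path.suffix + ".bak")
    shutil.copy2(db_path, backup)

    conn = sqlite3.connect(db_path)
    try:
        # Condition: transaction. The context manager commits on success
        # and rolls back if the statement raises.
        with conn:
            conn.execute(sql, params)
        readback = [list(row) for row in conn.execute(readback_sql).fetchall()]
    finally:
        conn.close()

    # Conditions: the exact query and the post-write readback are retained.
    proof = {
        "target": target,
        "query": sql,
        "params": list(params),
        "backup": str(backup),
        "readback": readback,
    }
    proof_path.write_text(json.dumps(proof, indent=2))
    return proof
```

Copying the file with `shutil.copy2` assumes no other process is writing to the database at that moment; for a live database, `sqlite3.Connection.backup` is the safer way to take the copy.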
Pentagon.run Boundary
Pentagon.run should own:
- disposable VPS setup;
- Crabbox or remote proof execution;
- Hetzner lifecycle;
- runner cleanup;
- infra receipts.
This repo should own:
- decision-engine quality;
- model and prompt experiments;
- agent skills and adapter handoffs;
- database feedback analysis;
- proof schemas for eval quality.
Next Implementation Slice
- Add `scripts/replay_decision_engine_eval.py` with local fixture mode.
- Add `fixtures/decision-engine-eval/*.json`.
- Store verdict outputs in `.crabbox-results/decision-engine-eval.json`.
- Add one Rio economics fixture and one Theseus model-integrity fixture.
- Compare current prompt versus one candidate prompt before touching runtime prompts.
Do not start by changing live model assignments.
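A local fixture mode for the replay script might look roughly like the sketch below. The fixture schema (`id`, `claim`, `expected`), the function names, and the result keys are hypothetical, and the two evaluators are passed in as plain callables so that no live model or secret is needed.

```python
import json
from pathlib import Path
from typing import Callable


def load_fixtures(fixture_dir: Path) -> list:
    """Load every *.json fixture in a stable, sorted order."""
    return [json.loads(p.read_text()) for p in sorted(fixture_dir.glob("*.json"))]


def compare_prompts(
    fixtures: list,
    current: Callable[[str], str],
    candidate: Callable[[str], str],
) -> dict:
    """Count correct verdicts for the current and the candidate prompt."""
    def correct(evaluate: Callable[[str], str]) -> int:
        return sum(1 for f in fixtures if evaluate(f["claim"]) == f["expected"])

    return {
        "fixtures": len(fixtures),
        "current_correct": correct(current),
        "candidate_correct": correct(candidate),
    }


def write_results(results: dict, out_path: Path) -> None:
    """Write the comparison to disk, creating the results directory if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
```

A caller would point `load_fixtures` at `fixtures/decision-engine-eval/` and `write_results` at `.crabbox-results/decision-engine-eval.json`, then decide on the candidate prompt from the two counts without touching runtime prompts.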