# LLM Refinement And Decision Engine Program

Created: 2026-06-01
Status: active direction

## Product Outcome

The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.

Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.

## What Rio And Theseus Become

### Rio

Rio becomes the economic and incentive-quality evaluator.

Rio owns:

- contribution weights and role economics;
- paid-query effects and anti-pay-to-pollute rules;
- market, mechanism, futarchy, x402, token, and capital-formation reasoning;
- source-diversity and correlated-prior warnings;
- OPSEC for finance, deal terms, token economics, and internal allocations;
- model tests that expose weak economic reasoning.

Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives reward useful knowledge or reward garbage.

### Theseus

Theseus becomes the model-integrity and agent-refinement evaluator.

Theseus owns:

- model diversity and correlated-blind-spot measurement;
- adversarial eval rubrics;
- prompt/tool safety and self-upgrade criteria;
- disagreement queues and verifier-divergence analysis;
- LLM capability evidence and agent-system architecture;
- tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.

Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.

## Decision Engine Loop

```mermaid
flowchart TD
    PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
    Route --> Reviewers["Required agent reviewers"]
    Reviewers --> Rubric["Shared rubric"]
    Rubric --> ModelA["Primary model"]
    Rubric --> ModelB["Independent model family"]
    ModelA --> Verdicts["Structured verdicts"]
    ModelB --> Verdicts
    Verdicts --> Disagree{"Disagreement?"}
    Disagree -->|yes| Queue["Disagreement queue"]
    Disagree -->|no| Metrics["Calibration metrics"]
    Queue --> HumanOrLeo["Leo or human arbitration"]
    HumanOrLeo --> Metrics
    Metrics --> DB["SQLite feedback state"]
    DB --> Refine["Prompt, tool, or model proposal"]
    Refine --> Delta["Before/after eval harness"]
    Delta -->|passes| Update["Commit refinement"]
    Delta -->|fails| Archive["Archive failed refinement"]
```

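
The two decision points in the loop (disagreement routing and the before/after gate) can be sketched in Python. This is a minimal sketch, not the engine's implementation: `Verdict`, `EvalSummary`, `route_verdicts`, `gate_refinement`, the stage names, and the pass rule (no regression, plus a strict gain on at least one metric) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One structured verdict from one model (hypothetical fields)."""
    model: str
    disposition: str  # e.g. "approve" or "reject"; real values come from the rubric


@dataclass(frozen=True)
class EvalSummary:
    """Aggregate harness result for one prompt, tool, or model variant (hypothetical)."""
    accuracy: float
    false_approves: int


def route_verdicts(primary: Verdict, independent: Verdict) -> str:
    """Pick the next loop stage for two verdicts on the same item."""
    if primary.disposition != independent.disposition:
        return "disagreement_queue"  # held for Leo or human arbitration
    return "calibration_metrics"  # agreement feeds metrics directly


def gate_refinement(baseline: EvalSummary, candidate: EvalSummary) -> str:
    """Commit only when the measured delta shows a gain and no regression (assumed rule)."""
    no_regression = (
        candidate.false_approves <= baseline.false_approves
        and candidate.accuracy >= baseline.accuracy
    )
    improved = (
        candidate.false_approves < baseline.false_approves
        or candidate.accuracy > baseline.accuracy
    )
    if no_regression and improved:
        return "commit_refinement"
    return "archive_failed_refinement"
```

Under this assumed rule a candidate that merely ties the baseline is archived, which matches the requirement that the delta must prove the change.
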
## Model Portfolio

The goal is not to pick one favorite model. The goal is to assign models to failure modes.

| Lane | Primary evaluator | Independent check | Why |
| --- | --- | --- | --- |
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |

Candidate models should enter only through a harness:

1. fixed input set;
2. fixed rubric;
3. structured verdict JSON;
4. cost and latency recorded;
5. disagreement categories stored;
6. before/after comparison against current baseline.

No model switch is accepted because it "sounds better" on one example.

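
Harness items 3 to 5 imply one record per model run. The following is a hedged sketch of what such a record might look like; `HarnessRecord` and every field name are assumptions, since this document does not define the verdict schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class HarnessRecord:
    """One model run against one fixed input and rubric (hypothetical schema)."""
    model: str
    input_id: str          # identifies an item in the fixed input set
    rubric_id: str         # identifies the fixed rubric used
    disposition: str       # structured verdict, e.g. "approve" or "reject"
    rejection_tags: List[str] = field(default_factory=list)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    disagreement_category: Optional[str] = None  # set only when models diverge

    def to_json(self) -> str:
        """Serialize with sorted keys so before/after runs diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Sorted keys are a small design choice: they keep baseline and candidate outputs byte-comparable when the underlying values match.
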
## Refinement Workstreams

### R1: Rubric Packets

Create a small rubric packet for each evaluator role:

- `rio-economics-rubric`
- `theseus-model-integrity-rubric`
- `leo-cross-domain-rubric`
- domain-specific factuality rubrics

Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.

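
A rubric packet with those four parts could be modeled as below. This is a sketch under assumptions: `RubricPacket`, its field names, and the `validate` helper are hypothetical, and real packets may be stored as Markdown or JSON instead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RubricPacket:
    """The four required parts of a packet (hypothetical field names)."""
    name: str
    allowed_verdicts: Tuple[str, ...]
    rejection_tags: Tuple[str, ...]
    must_check: Tuple[str, ...]
    false_positive_examples: Tuple[str, ...]

    def validate(self, verdict: str, tags: List[str]) -> List[str]:
        """Return contract violations; an empty list means the verdict fits the packet."""
        problems = []
        if verdict not in self.allowed_verdicts:
            problems.append(f"verdict not allowed: {verdict}")
        for tag in tags:
            if tag not in self.rejection_tags:
                problems.append(f"unknown rejection tag: {tag}")
        return problems
```

Validating model output against the packet keeps a model from inventing verdicts or tags that the feedback loop cannot aggregate.
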
### R2: Evaluation Corpus

Build a replayable corpus from existing PRs:

- approved clean PRs;
- rejected PRs by issue tag;
- Rio/Theseus cross-domain PRs;
- paid-query or contribution-weight examples;
- adversarial malformed claims;
- near-duplicate and OPSEC edge cases.

Use local fixture data first. Production DB sampling requires the DB operator skill.

### R3: Model Bakeoff

Run each candidate model against the same corpus and emit:

- accuracy against expected disposition;
- false-approve count;
- false-reject count;
- issue-tag precision;
- average latency;
- estimated cost;
- disagreement matrix by model pair.

The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.

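
Three of the bakeoff outputs (accuracy, false-approve and false-reject counts, and the pairwise disagreement matrix) can be computed as sketched below. The function names and the two-value disposition vocabulary (`"approve"`, `"reject"`) are assumptions; issue-tag precision, latency, and cost are left out to keep the sketch short.

```python
from itertools import combinations
from typing import Dict, Tuple


def score_model(expected: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, float]:
    """Score one model's dispositions against the expected disposition per item."""
    total = len(expected)
    correct = sum(1 for k, want in expected.items() if predicted.get(k) == want)
    false_approve = sum(
        1 for k, want in expected.items()
        if want == "reject" and predicted.get(k) == "approve"
    )
    false_reject = sum(
        1 for k, want in expected.items()
        if want == "approve" and predicted.get(k) == "reject"
    )
    return {
        "accuracy": correct / total if total else 0.0,
        "false_approve": false_approve,
        "false_reject": false_reject,
    }


def disagreement_matrix(runs: Dict[str, Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count items on which each model pair returned different dispositions."""
    matrix = {}
    for a, b in combinations(sorted(runs), 2):
        shared = runs[a].keys() & runs[b].keys()  # compare only items both models saw
        matrix[(a, b)] = sum(1 for k in shared if runs[a][k] != runs[b][k])
    return matrix
```

Keeping false approvals as a separate count, rather than folding them into accuracy, reflects the point above: a model can score well overall while still approving the bad claims that matter most.
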
### R4: Feedback Loop

Use `review_records`, `audit_log`, `costs`, and PR state to find:

- recurring model failure categories;
- agents with repeated same-tag rejections;
- prompts that produce vague reviews;
- cost spikes without quality gain;
- routes that keep requiring manual override.

Every prompt/tool change should include a before/after proof over this loop.

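
One of these checks, agents with repeated same-tag rejections, might look like the read-only query below. Only the table name `review_records` comes from this document; the columns `agent`, `issue_tag`, and `disposition`, the helper name, and the fixture rows are assumptions, so treat this as a fixture-mode sketch rather than the real schema.

```python
import sqlite3


def repeated_rejections(conn: sqlite3.Connection, min_count: int = 2):
    """Return (agent, issue_tag, count) for same-tag rejections at or above min_count."""
    return conn.execute(
        """
        SELECT agent, issue_tag, COUNT(*) AS n
        FROM review_records
        WHERE disposition = 'reject'
        GROUP BY agent, issue_tag
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, agent, issue_tag
        """,
        (min_count,),
    ).fetchall()


# Fixture-only demo: the column layout is assumed, not taken from the production DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review_records (agent TEXT, issue_tag TEXT, disposition TEXT)")
conn.executemany(
    "INSERT INTO review_records VALUES (?, ?, ?)",
    [
        ("agent-a", "near-duplicate", "reject"),
        ("agent-a", "near-duplicate", "reject"),
        ("agent-a", "opsec", "reject"),
        ("agent-b", "near-duplicate", "approve"),
    ],
)
print(repeated_rejections(conn))  # [('agent-a', 'near-duplicate', 2)]
```
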
### R5: Agent Runtime Packages

Package the same decision-engine contract for:

- NousResearch Hermes Agent: skill/memory/model-switching oriented.
- OpenClaw: workspace skill plus `AGENTS.md`, `SOUL.md`, `TOOLS.md` oriented.

Both packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.

## DB Usage Boundary

Default is read-only.

Writes are allowed only when all of the following are true:

- the target DB is local, staging, or explicitly authorized production;
- a backup or copy exists;
- the write is wrapped in a transaction;
- the exact query is retained in a proof artifact;
- the post-write readback is retained.

Never let an agent tune prompts by mutating production state directly.

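
The write conditions can be enforced in one helper. The sketch below uses Python's standard `sqlite3` module; `guarded_write`, the `authorized` flag, and the proof file names are hypothetical, and how authorization is actually granted is outside this sketch.

```python
import json
import sqlite3
from pathlib import Path


def guarded_write(db_path, proof_dir, sql, params, readback_sql, authorized=False):
    """Apply one write under the boundary rules; all names here are hypothetical."""
    if not authorized:
        # Default is read-only: refuse unless the caller asserts an authorized target.
        raise PermissionError("write refused: target not explicitly authorized")
    db_path, proof_dir = Path(db_path), Path(proof_dir)
    proof_dir.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(db_path)
    try:
        # A backup or copy exists before anything changes.
        backup = sqlite3.connect(proof_dir / (db_path.name + ".bak"))
        try:
            conn.backup(backup)
        finally:
            backup.close()
        # The write is wrapped in a transaction: commit on success, rollback on error.
        with conn:
            conn.execute(sql, params)
        readback = conn.execute(readback_sql).fetchall()
    finally:
        conn.close()

    # The exact query and the post-write readback are retained as a proof artifact.
    proof = {"sql": sql, "params": list(params), "readback": readback}
    (proof_dir / "write-proof.json").write_text(json.dumps(proof, default=str, indent=2))
    return readback
```

Routing every write through a single function like this makes the read-only default the path of least resistance: code that never calls it cannot write.
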
## Pentagon.run Boundary

Pentagon.run should own:

- disposable VPS setup;
- Crabbox or remote proof execution;
- Hetzner lifecycle;
- runner cleanup;
- infra receipts.

This repo should own:

- decision-engine quality;
- model and prompt experiments;
- agent skills and adapter handoffs;
- database feedback analysis;
- proof schemas for eval quality.

## Next Implementation Slice

1. Add `scripts/replay_decision_engine_eval.py` with local fixture mode.
2. Add `fixtures/decision-engine-eval/*.json`.
3. Store verdict outputs in `.crabbox-results/decision-engine-eval.json`.
4. Add one Rio economics fixture and one Theseus model-integrity fixture.
5. Compare current prompt versus one candidate prompt before touching runtime prompts.

Do not start by changing live model assignments.