twentyOne2x aee534e686 Add decision engine refinement contracts

- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

`.agents/skills/decision-engine-refinement/SKILL.md`
`.agents/skills/nousresearch-hermes-agent/SKILL.md`
`.agents/skills/openclaw-agent/SKILL.md`
`.agents/skills/teleo-db-operator/SKILL.md`
`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`scripts/check_llm_refinement_contract.py`

2026-06-01 15:50:48 +02:00

6.7 KiB


LLM Refinement And Decision Engine Program

Created: 2026-06-01
Status: active direction

Product Outcome

The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and changes prompts or tools only when measured before/after deltas show the change is an improvement.

Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.

What Rio And Theseus Become

Rio

Rio becomes the economic and incentive-quality evaluator.

Rio owns:

contribution weights and role economics;
paid-query effects and anti-pay-to-pollute rules;
market, mechanism, futarchy, x402, token, and capital-formation reasoning;
source-diversity and correlated-prior warnings;
OPSEC for finance, deal terms, token economics, and internal allocations;
model tests that expose weak economic reasoning.

Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives produce useful knowledge or garbage.

Theseus

Theseus becomes the model-integrity and agent-refinement evaluator.

Theseus owns:

model diversity and correlated-blind-spot measurement;
adversarial eval rubrics;
prompt/tool safety and self-upgrade criteria;
disagreement queues and verifier-divergence analysis;
LLM capability evidence and agent-system architecture;
tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.

Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.

Decision Engine Loop

flowchart TD
  PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
  Route --> Reviewers["Required agent reviewers"]
  Reviewers --> Rubric["Shared rubric"]
  Rubric --> ModelA["Primary model"]
  Rubric --> ModelB["Independent model family"]
  ModelA --> Verdicts["Structured verdicts"]
  ModelB --> Verdicts
  Verdicts --> Disagree{"Disagreement?"}
  Disagree -->|yes| Queue["Disagreement queue"]
  Disagree -->|no| Metrics["Calibration metrics"]
  Queue --> HumanOrLeo["Leo or human arbitration"]
  HumanOrLeo --> Metrics
  Metrics --> DB["SQLite feedback state"]
  DB --> Refine["Prompt, tool, or model proposal"]
  Refine --> Delta["Before/after eval harness"]
  Delta -->|passes| Update["Commit refinement"]
  Delta -->|fails| Archive["Archive failed refinement"]
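
The disagreement branch of the loop above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the `Verdict` fields, the decision strings, and the stage names returned by `next_stage` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One structured verdict; field names are assumptions for this sketch."""
    model: str
    decision: str               # e.g. "approve" or "reject" (assumed values)
    tags: tuple[str, ...] = ()  # rejection tags, if any


def next_stage(primary: Verdict, independent: Verdict) -> str:
    """Route a verdict pair the way the flowchart does.

    Differing decisions go to the disagreement queue for Leo or human
    arbitration; agreement flows straight to calibration metrics.
    """
    if primary.decision != independent.decision:
        return "disagreement_queue"
    return "calibration_metrics"


a = Verdict(model="primary", decision="approve")
b = Verdict(model="independent", decision="reject", tags=("weak_causal_claim",))
print(next_stage(a, b))  # disagreement_queue
```

A stricter variant could also treat matching decisions with different rejection tags as a disagreement; whether that is useful depends on how noisy the tags turn out to be.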

Model Portfolio

The goal is not to pick one favorite model. The goal is to assign models to failure modes.

| Lane | Primary evaluator | Independent check | Why |
| --- | --- | --- | --- |
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |

Candidate models should enter only through a harness:

fixed input set;
fixed rubric;
structured verdict JSON;
cost and latency recorded;
disagreement categories stored;
before/after comparison against current baseline.
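
One way to enforce "structured verdict JSON" with cost and latency recorded is to parse every model response into a typed record and reject anything malformed. The sketch below assumes a verdict shape with `decision` and `rejection_tags` keys and a three-value decision vocabulary; none of these names come from the repo.

```python
import json
from dataclasses import dataclass

# Assumed verdict vocabulary; the real set would come from the rubric packet.
ALLOWED_DECISIONS = frozenset({"approve", "reject", "request_changes"})


@dataclass(frozen=True)
class HarnessRecord:
    """One harness row: the verdict plus the cost and latency of producing it."""
    fixture_id: str
    model: str
    decision: str
    rejection_tags: tuple[str, ...]
    latency_s: float
    cost_usd: float


def parse_verdict(raw: str, *, fixture_id: str, model: str,
                  latency_s: float, cost_usd: float) -> HarnessRecord:
    """Validate a model's JSON verdict and attach cost and latency."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")
    tags = data.get("rejection_tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("rejection_tags must be a list of strings")
    return HarnessRecord(fixture_id, model, decision, tuple(tags),
                         latency_s, cost_usd)


record = parse_verdict(
    '{"decision": "reject", "rejection_tags": ["weak_causal_claim"]}',
    fixture_id="fixture-001", model="candidate-a",
    latency_s=1.8, cost_usd=0.004,
)
print(record.decision, record.rejection_tags)
```

Rejecting free-text verdicts at parse time keeps "sounds better" judgments out of the comparison: a response that does not fit the schema is a harness failure, not a verdict.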

No model switch is accepted because it "sounds better" on one example.

Refinement Workstreams

R1: Rubric Packets

Create a small rubric packet for each evaluator role:

rio-economics-rubric
theseus-model-integrity-rubric
leo-cross-domain-rubric
domain-specific factuality rubrics

Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.
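
A completeness check for a rubric packet could look like the following. The field names mirror the four requirements above but are otherwise assumptions, and every value in `rio_packet` is made up for illustration.

```python
# Field names are assumptions that mirror the four required parts of a packet.
REQUIRED_FIELDS = (
    "allowed_verdicts",
    "rejection_tags",
    "must_check",
    "false_positive_examples",
)


def packet_problems(packet: dict) -> list[str]:
    """Return a list of problems; an empty list means the packet is complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = packet.get(field)
        if not isinstance(value, list) or not value:
            problems.append(f"{field}: expected a non-empty list")
    return problems


# Illustrative packet; every value below is invented for the example.
rio_packet = {
    "name": "rio-economics-rubric",
    "allowed_verdicts": ["approve", "reject", "request_changes"],
    "rejection_tags": ["pay_to_pollute", "correlated_priors"],
    "must_check": ["Does the change alter contribution weights?"],
    "false_positive_examples": [],
}
print(packet_problems(rio_packet))
# ['false_positive_examples: expected a non-empty list']
```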

R2: Evaluation Corpus

Build a replayable corpus from existing PRs:

approved clean PRs;
rejected PRs by issue tag;
Rio/Theseus cross-domain PRs;
paid-query or contribution-weight examples;
adversarial malformed claims;
near-duplicate and OPSEC edge cases.

Use local fixture data first. Production DB sampling requires the DB operator skill.

R3: Model Bakeoff

Run each candidate model against the same corpus and emit:

accuracy against expected disposition;
false-approve count;
false-reject count;
issue-tag precision;
average latency;
estimated cost;
disagreement matrix by model pair.

The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
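
The core bakeoff numbers can be computed from plain dictionaries. This is a sketch under stated assumptions: verdicts are reduced to the strings `"approve"` and `"reject"`, each model's output is a mapping from a hypothetical fixture id to its verdict, and every expected fixture has a prediction.

```python
from collections import Counter
from itertools import combinations


def score_model(expected: dict[str, str], predicted: dict[str, str]) -> dict:
    """Accuracy plus false-approve and false-reject counts for one model."""
    if not expected:
        raise ValueError("expected dispositions must not be empty")
    counts = Counter()
    for fixture_id, want in expected.items():
        got = predicted[fixture_id]  # KeyError here means a missing prediction
        if got == want:
            counts["correct"] += 1
        elif got == "approve" and want == "reject":
            counts["false_approve"] += 1
        elif got == "reject" and want == "approve":
            counts["false_reject"] += 1
    return {
        "accuracy": counts["correct"] / len(expected),
        "false_approve": counts["false_approve"],
        "false_reject": counts["false_reject"],
    }


def disagreement_matrix(runs: dict[str, dict[str, str]]) -> dict[tuple[str, str], int]:
    """Count fixtures on which each pair of models returned different verdicts."""
    matrix = {}
    for a, b in combinations(sorted(runs), 2):
        shared = runs[a].keys() & runs[b].keys()
        matrix[(a, b)] = sum(runs[a][f] != runs[b][f] for f in shared)
    return matrix


expected = {"f1": "approve", "f2": "reject", "f3": "reject"}
runs = {
    "model_a": {"f1": "approve", "f2": "approve", "f3": "reject"},
    "model_b": {"f1": "reject", "f2": "reject", "f3": "reject"},
}
print(score_model(expected, runs["model_a"])["false_approve"])  # 1
print(disagreement_matrix(runs))  # {('model_a', 'model_b'): 2}
```

Here both models score the same accuracy, but only `model_a` approves a bad claim, which is the difference the false-approve count is meant to surface.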

R4: Feedback Loop

Use review_records, audit_log, costs, and PR state to find:

recurring model failure categories;
agents with repeated same-tag rejections;
prompts that produce vague reviews;
cost spikes without quality gain;
routes that keep requiring manual override.

Every prompt/tool change should include a before/after proof over this loop.
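
As an example of one such query, the sketch below finds agents with repeated same-tag rejections. Only the table name `review_records` comes from this document; the columns `agent`, `issue_tag`, and `outcome` are an assumed schema, and the rows are local fixture data in an in-memory SQLite database.

```python
import sqlite3

# Assumed columns: agent, issue_tag, outcome. Adjust to the real schema.
REPEATED_REJECTIONS_SQL = """
    SELECT agent, issue_tag, COUNT(*) AS n
    FROM review_records
    WHERE outcome = 'rejected'
    GROUP BY agent, issue_tag
    HAVING COUNT(*) >= ?
    ORDER BY n DESC, agent, issue_tag
"""


def repeated_rejections(conn: sqlite3.Connection, min_count: int = 2) -> list[tuple]:
    """Return (agent, issue_tag, count) rows for tags rejected at least min_count times."""
    return conn.execute(REPEATED_REJECTIONS_SQL, (min_count,)).fetchall()


# Local fixture data in an in-memory database, using the assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review_records (agent TEXT, issue_tag TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO review_records VALUES (?, ?, ?)",
    [
        ("agent-a", "near_duplicate", "rejected"),
        ("agent-a", "near_duplicate", "rejected"),
        ("agent-a", "opsec", "rejected"),
        ("agent-b", "near_duplicate", "approved"),
    ],
)
print(repeated_rejections(conn))  # [('agent-a', 'near_duplicate', 2)]
```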

R5: Agent Runtime Packages

Package the same decision-engine contract for:

NousResearch Hermes Agent: skill/memory/model-switching oriented.
OpenClaw: workspace skill plus AGENTS.md, SOUL.md, TOOLS.md oriented.

Both packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.

DB Usage Boundary

Default is read-only.

Writes are allowed only when all of the following are true:

the target DB is local, staging, or explicitly authorized production;
a backup or copy exists;
the write is wrapped in a transaction;
the exact query is retained in a proof artifact;
the post-write readback is retained.

Never let an agent tune prompts by mutating production state directly.
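
The five write conditions can be enforced by a single wrapper. The following is a sketch, not the DB operator skill itself: `guarded_write`, its parameters, the `.bak` naming, and the proof file layout are all hypothetical. It relies only on the standard `sqlite3` module, whose connection context manager commits on success and rolls back on an exception.

```python
import json
import sqlite3
from contextlib import closing
from pathlib import Path


def guarded_write(db_path: Path, target: str, sql: str, params: tuple,
                  readback_sql: str, proof_path: Path,
                  production_authorized: bool = False) -> dict:
    """Run one write under the boundary rules and return the proof record."""
    # Condition 1: local, staging, or explicitly authorized production.
    allowed = target in {"local", "staging"} or (
        target == "production" and production_authorized)
    if not allowed:
        raise PermissionError(f"writes to {target!r} are not authorized")

    backup_path = db_path.with_name(db_path.name + ".bak")
    with closing(sqlite3.connect(db_path)) as conn:
        # Condition 2: take a copy first, via SQLite's online backup API.
        with closing(sqlite3.connect(backup_path)) as backup:
            conn.backup(backup)
        # Condition 3: the write runs inside a transaction.
        with conn:
            conn.execute(sql, params)
        # Condition 5: read the result back after the write.
        readback = conn.execute(readback_sql).fetchall()

    # Condition 4: retain the exact query (and the readback) as a proof artifact.
    proof = {
        "target": target,
        "backup": str(backup_path),
        "query": sql,
        "params": list(params),
        "readback": readback,
    }
    proof_path.write_text(json.dumps(proof, indent=2))
    return proof
```

Because the check on `target` runs before any connection is opened, an unauthorized call fails without creating a backup file or touching the database.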

Pentagon.run Boundary

Pentagon.run should own:

disposable VPS setup;
Crabbox or remote proof execution;
Hetzner lifecycle;
runner cleanup;
infra receipts.

This repo should own:

decision-engine quality;
model and prompt experiments;
agent skills and adapter handoffs;
database feedback analysis;
proof schemas for eval quality.

Next Implementation Slice

Add scripts/replay_decision_engine_eval.py with local fixture mode.
Add fixtures/decision-engine-eval/*.json.
Store verdict outputs in .crabbox-results/decision-engine-eval.json.
Add one Rio economics fixture and one Theseus model-integrity fixture.
Compare current prompt versus one candidate prompt before touching runtime prompts.

Do not start by changing live model assignments.
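
A first cut of the replay step might look like the sketch below. It is an illustration only: the fixture keys `id` and `expected`, the `replay` function, and the idea of passing each prompt variant as a plain callable are assumptions, and the stub evaluators stand in for real model calls.

```python
import json
from pathlib import Path
from typing import Callable

# An evaluator takes one fixture dict and returns a verdict string.
Evaluator = Callable[[dict], str]


def replay(fixture_dir: Path, evaluators: dict[str, Evaluator],
           out_path: Path) -> dict:
    """Run every evaluator over every fixture and write the results as JSON."""
    results = {name: {"correct": 0, "total": 0, "verdicts": {}}
               for name in evaluators}
    for path in sorted(fixture_dir.glob("*.json")):
        fixture = json.loads(path.read_text())
        for name, evaluate in evaluators.items():
            verdict = evaluate(fixture)
            entry = results[name]
            entry["verdicts"][fixture["id"]] = verdict
            entry["total"] += 1
            entry["correct"] += int(verdict == fixture["expected"])
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
    return results


# Stub evaluators standing in for the current and candidate prompts.
def current_prompt(fixture: dict) -> str:
    return "approve"


def candidate_prompt(fixture: dict) -> str:
    return "reject" if fixture.get("adversarial") else "approve"
```

A call such as `replay(Path("fixtures/decision-engine-eval"), {"current": current_prompt, "candidate": candidate_prompt}, Path(".crabbox-results/decision-engine-eval.json"))` would then produce the side-by-side counts needed before any runtime prompt is touched.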
