Add decision engine refinement contracts

- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

`.agents/skills/decision-engine-refinement/SKILL.md`
`.agents/skills/nousresearch-hermes-agent/SKILL.md`
`.agents/skills/openclaw-agent/SKILL.md`
`.agents/skills/teleo-db-operator/SKILL.md`
`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`scripts/check_llm_refinement_contract.py`
2026-06-01 15:50:48 +02:00 · 2026-06-01 15:50:48 +02:00 · aee534e686
commit aee534e686
parent a2620c1f19
8 changed files with 577 additions and 7 deletions
--- a/.agents/skills/decision-engine-refinement/SKILL.md
+++ b/.agents/skills/decision-engine-refinement/SKILL.md
@@ -0,0 +1,41 @@
 ---
 name: decision-engine-refinement
 description: Use when improving Living IP decision-engine quality, LLM model selection, evaluator prompts, rubrics, replay evals, Rio or Theseus reviewer behavior, or model bakeoffs.
 ---
 # Decision Engine Refinement
 Use this skill for quality work, not infrastructure work. Pentagon.run or Crabbox can run remote jobs; this repo owns model judgment, rubric design, prompt/tool refinement, and proof artifacts.
 ## Workflow
 1. Read `docs/llm-refinement-decision-engine.md`.
 2. Identify the lane: Rio economics, Theseus model integrity, Leo cross-domain, domain factuality, retrieval quality, or prompt/tool self-upgrade.
 3. Build or reuse a replayable fixture before changing prompts or model assignments.
 4. Compare baseline vs candidate with the same input, same rubric, and structured verdict format.
 5. Record false approves, false rejects, useful disagreements, cost, and latency.
 6. Change runtime prompts/models only after the candidate shows a measured improvement with no critical regression.
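Steps 4 to 6 amount to an acceptance gate over replayed verdicts. The following is a minimal sketch, assuming verdicts are plain `approve`/`reject` strings keyed by fixture id. The function names and the rule that "no critical regression" means "no additional false approves" are illustrative assumptions, not code from this repo.

```python
from __future__ import annotations


def count_errors(expected: dict[str, str], verdicts: dict[str, str]) -> dict[str, int]:
    """Count false approves and false rejects against expected dispositions."""
    false_approves = sum(
        1 for fid, want in expected.items()
        if want == "reject" and verdicts.get(fid) == "approve"
    )
    false_rejects = sum(
        1 for fid, want in expected.items()
        if want == "approve" and verdicts.get(fid) == "reject"
    )
    return {"false_approves": false_approves, "false_rejects": false_rejects}


def candidate_passes(
    expected: dict[str, str], baseline: dict[str, str], candidate: dict[str, str]
) -> bool:
    """Hypothetical gate: fewer total errors and no additional false approves."""
    base = count_errors(expected, baseline)
    cand = count_errors(expected, candidate)
    improved = sum(cand.values()) < sum(base.values())
    no_critical_regression = cand["false_approves"] <= base["false_approves"]
    return improved and no_critical_regression


# Same fixtures, same expected dispositions, two prompt variants (made-up data).
expected = {"pr-1": "approve", "pr-2": "reject", "pr-3": "reject"}
baseline = {"pr-1": "reject", "pr-2": "approve", "pr-3": "reject"}
candidate = {"pr-1": "approve", "pr-2": "approve", "pr-3": "reject"}
```

Here the candidate removes one false reject without adding a false approve, so `candidate_passes(expected, baseline, candidate)` returns `True`. Cost and latency would be recorded alongside these counts rather than folded into the gate.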
 ## Hard Rules
 - Do not change live model assignments because one answer sounds better.
 - Do not use production DB writes to tune prompts.
 - Do not collapse Rio and Theseus into generic "reviewers".
 - Do not treat payment, popularity, or engagement as quality approval.
 - Do not claim production decision-engine improvement without replay evidence and live/staging readback.
 ## Agent Responsibilities
 - Rio: incentive design, contribution weights, paid-query effects, market/mechanism reasoning, OPSEC, correlated-prior warnings.
 - Theseus: model diversity, adversarial evals, disagreement queues, self-upgrade criteria, prompt/tool safety, verifier drift.
 - Leo: cross-domain synthesis, fallback review, final arbitration where the route or rubric is ambiguous.
 ## Expected Artifacts
 - fixture file or DB query used for sampling;
 - baseline verdict output;
 - candidate verdict output;
 - summary JSON with quality, cost, latency, and disagreement metrics;
 - patch scoped to prompts, model config, rubric docs, or eval harness.
 Run `python3 scripts/check_llm_refinement_contract.py` after editing this surface.
--- a/.agents/skills/nousresearch-hermes-agent/SKILL.md
+++ b/.agents/skills/nousresearch-hermes-agent/SKILL.md
@@ -0,0 +1,69 @@
 ---
 name: nousresearch-hermes-agent
 description: Use when packaging Living IP agents, skills, prompts, memory, model routing, or decision-engine workflows for NousResearch Hermes Agent.
 ---
 # NousResearch Hermes Agent
 Use this skill to adapt Living IP decision-engine behavior to Hermes Agent. Keep the package fixture-first and no-secret by default.
 ## Current External Surface
 As of 2026-06-01, the upstream Hermes Agent README describes:
 - model switching via `hermes model`;
 - tools via `hermes tools`;
 - a messaging gateway for Telegram, Discord, Slack, WhatsApp, Signal, and CLI;
 - built-in skill creation and self-improvement;
 - cron scheduling;
 - terminal backends including local, Docker, SSH, Modal, and Daytona;
 - OpenClaw migration commands.
 Verify upstream docs before depending on a command in code.
 ## Living IP Package Shape
 Create a package that includes:
 - agent identity file for Rio or Theseus;
 - skill instructions copied from repo-owned `.agents/skills/*`;
 - no-secret tool allowlist;
 - fixture replay command;
 - model selection notes;
 - proof output path.
 Do not package production DBs, tokens, API keys, SSH keys, or Bitwarden exports.
 ## Rio Package
 Rio Hermes package should focus on:
 - internet finance and mechanism reasoning;
 - contribution weights and paid-query effects;
 - OPSEC finance filters;
 - source-diversity warnings;
 - fixture tests for false economic reasoning.
 ## Theseus Package
 Theseus Hermes package should focus on:
 - model-diversity evals;
 - disagreement queues;
 - self-upgrade criteria;
 - prompt/tool safety;
 - fixture tests for overconfident or poorly grounded model judgments.
 ## Handoff Contract
 Every Hermes handoff must include:
 1. install/config snippet;
 2. model/provider selection left configurable;
 3. tool allowlist;
 4. fixture-first demo;
 5. no-live-write default;
 6. proof artifact path;
 7. known blockers.
 Do not claim Hermes production integration until a Hermes runtime actually executes the fixture and writes proof.
--- a/.agents/skills/openclaw-agent/SKILL.md
+++ b/.agents/skills/openclaw-agent/SKILL.md
@@ -0,0 +1,69 @@
 ---
 name: openclaw-agent
 description: Use when adapting Living IP decision-engine agents, skills, tools, prompt files, or no-secret workflows to OpenClaw agent workspaces.
 ---
 # OpenClaw Agent
 Use this skill to package Living IP decision-engine behavior for OpenClaw workspaces. Treat OpenClaw as a distribution/runtime surface, not a new source of truth.
 ## Current External Surface
 As of 2026-06-01, the upstream OpenClaw README describes:
 - Node 24 or Node 22.19+ runtime;
 - `openclaw onboard --install-daemon`;
 - Gateway daemon usage;
 - agent prompt files `AGENTS.md`, `SOUL.md`, and `TOOLS.md`;
 - workspace skills at `~/.openclaw/workspace/skills/<skill>/SKILL.md`;
 - model configuration in OpenClaw config;
 - security guidance for DM pairing, allowlists, and sandboxing.
 Verify upstream docs before depending on a command in code.
 ## Living IP Workspace Shape
 Create or update:
 - `AGENTS.md`: scope, repo boundaries, proof requirements;
 - `SOUL.md`: Rio or Theseus identity;
 - `TOOLS.md`: bounded tools only;
 - `skills/decision-engine-refinement/SKILL.md`;
 - `skills/teleo-db-operator/SKILL.md` only for read-only local copies unless explicitly authorized.
 ## Tool Policy
 Default allow:
 - read files;
 - run local fixture tests;
 - write proof artifacts;
 - inspect git diffs;
 - query copied SQLite DBs read-only.
 Default deny:
 - production DB writes;
 - token reads;
 - Bitwarden vault export;
 - live GitHub PR comments;
 - public messaging sends;
 - broad shell automation against host services.
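The allow and deny lists above can be expressed as a default-deny check. This is a sketch only: the tool identifiers are hypothetical names chosen to mirror the two lists, and OpenClaw's real tool configuration format is not assumed here.

```python
from __future__ import annotations

# Hypothetical tool identifiers mirroring the allow and deny lists above.
DEFAULT_ALLOW = frozenset({
    "read_file",
    "run_fixture_test",
    "write_proof_artifact",
    "inspect_git_diff",
    "query_sqlite_copy_readonly",
})
DEFAULT_DENY = frozenset({
    "production_db_write",
    "token_read",
    "bitwarden_export",
    "github_pr_comment",
    "public_message_send",
    "host_shell_automation",
})


def is_allowed(tool: str, extra_allow: frozenset[str] = frozenset()) -> bool:
    """Default-deny: denied tools never run, unknown tools need an explicit allow."""
    if tool in DEFAULT_DENY:
        return False
    return tool in DEFAULT_ALLOW or tool in extra_allow


def audit(invoked: list[str]) -> dict:
    """Summarize a run so a proof artifact can show nothing outside the policy ran."""
    not_allowed = sorted(t for t in invoked if not is_allowed(t))
    return {"invoked": invoked, "not_allowed": not_allowed, "ok": not not_allowed}
```

The deny list wins over any extra allowlist entry, so an explicit authorization for a denied tool would have to change the policy itself rather than a call-site argument.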
 ## Rio And Theseus
 - Rio OpenClaw package: economic reasoning, contribution incentives, paid-query guardrails, OPSEC.
 - Theseus OpenClaw package: eval integrity, adversarial prompts, model bakeoffs, self-upgrade review.
 ## Proof Contract
 An OpenClaw adapter is useful only if it can run a fixture and produce:
 - prompt files used;
 - tool allowlist;
 - model selected;
 - fixture input;
 - structured verdict output;
 - proof that no denied tools were invoked.
 Do not claim OpenClaw production readiness until the package runs in an OpenClaw workspace and writes proof.
--- a/.agents/skills/teleo-db-operator/SKILL.md
+++ b/.agents/skills/teleo-db-operator/SKILL.md
@@ -0,0 +1,76 @@
 ---
 name: teleo-db-operator
 description: Use when reading, auditing, backing up, querying, or safely writing the Teleo pipeline SQLite database, including review_records, audit_log, costs, prs, sources, and contributor feedback loops.
 ---
 # Teleo DB Operator
 Default to read-only. The database is evidence for decision-engine refinement, not a scratchpad.
 ## Discover
 1. Read `lib/config.py` for `DB_PATH` and related paths.
 2. Prefer local or copied DBs over production DBs.
 3. If using production, record whether access is read-only or write-authorized.
 4. Never print secret values found near DB paths or shell history.
 ## Read Path
 Use `sqlite3` or Python `sqlite3`.
 Recommended read targets:
 - `review_records`: evaluator, model, outcome, rejection reason.
 - `audit_log`: route decisions, approve/reject events, failure details.
 - `costs`: model cost by date/stage.
 - `prs`: status, tier, route compatibility fields, verdicts.
 - `sources`: priority, feedback, extraction model.
 For refinement work, export aggregated JSON or CSV into `.crabbox-results/` or `proof/`, not raw private DB snapshots.
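A read-only export might look like the sketch below. It opens a copied DB through a SQLite URI with `mode=ro` and writes only aggregated counts. The function name and output path are hypothetical; the `reviewer` and `outcome` columns are taken from the queries in this skill and are assumed to exist in `review_records`.

```python
from __future__ import annotations

import json
import sqlite3
from pathlib import Path


def export_review_summary(db_path: Path, out_path: Path) -> list[dict]:
    """Open the DB read-only and write an aggregated JSON summary."""
    # mode=ro makes SQLite reject any write attempted on this connection.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT reviewer, outcome, count(*) AS n "
            "FROM review_records "
            "GROUP BY reviewer, outcome "
            "ORDER BY n DESC, reviewer, outcome"
        ).fetchall()
    finally:
        conn.close()
    # Only aggregated counts leave the DB, never raw rows.
    summary = [dict(row) for row in rows]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
    return summary
```

A call such as `export_review_summary(Path("copy.db"), Path("proof/review-summary.json"))` would keep the raw snapshot out of the proof directory while still retaining the evidence.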
 ## Write Path
 Writes require explicit authorization and a backup.
 Required sequence:
 1. Create a backup or operate on a copy.
 2. Write the exact SQL in a retained artifact.
 3. Use `BEGIN IMMEDIATE;`.
 4. Apply the minimal mutation.
 5. Read back the changed rows.
 6. Commit the transaction only after readback is correct.
 7. Write a blocker artifact instead of guessing if any precondition is missing.
 Never write production prompt/model state as part of an experiment. Experiments should replay fixtures and produce proof first.
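Steps 3 to 6 of the write sequence could be sketched as follows, using Python's `sqlite3` on an in-memory stand-in. The helper name, the `id` column, and the table layout are assumptions for illustration; the real `prs` schema is not shown in this commit.

```python
from __future__ import annotations

import sqlite3


def guarded_update(conn: sqlite3.Connection, pr_id: int, new_status: str) -> bool:
    """Apply one minimal mutation, read it back, and commit only if correct."""
    conn.execute("BEGIN IMMEDIATE;")  # take the write lock before mutating
    try:
        cur = conn.execute(
            "UPDATE prs SET status = ? WHERE id = ?", (new_status, pr_id)
        )
        readback = conn.execute(
            "SELECT status FROM prs WHERE id = ?", (pr_id,)
        ).fetchone()
        if cur.rowcount == 1 and readback == (new_status,):
            conn.execute("COMMIT;")
            return True
        # Readback did not match expectations: leave the DB untouched.
        conn.execute("ROLLBACK;")
        return False
    except Exception:
        conn.execute("ROLLBACK;")
        raise


# Demo on an in-memory DB. isolation_level=None turns off implicit
# transactions so the explicit BEGIN/COMMIT above are the only ones.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE prs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO prs (id, status) VALUES (1, 'open')")
```

With this setup `guarded_update(conn, 1, "merged")` commits and returns `True`, while a call for a missing row rolls back and returns `False`. The backup, the retained SQL artifact, and the blocker artifact from the sequence are outside the scope of this sketch.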
 ## Safety Boundaries
 - Do not attach, copy, or commit `pipeline.db`.
 - Do not run broad `UPDATE` or `DELETE` without a `WHERE` clause and a prior row count.
 - Do not mutate `prs`, `sources`, or contributor state from a model response alone.
 - Do not treat local copied DB proof as production proof.
 ## Useful Queries
 ```sql
 SELECT reviewer, reviewer_model, outcome, rejection_reason, count(*) AS n
 FROM review_records
 GROUP BY reviewer, reviewer_model, outcome, rejection_reason
 ORDER BY n DESC;
 ```
 ```sql
 SELECT event, count(*) AS n
 FROM audit_log
 WHERE stage = 'evaluate'
 GROUP BY event
 ORDER BY n DESC;
 ```
 ```sql
 SELECT date, model, stage, calls, input_tokens, output_tokens, cost_usd
 FROM costs
 ORDER BY date DESC, cost_usd DESC
 LIMIT 50;
 ```
--- a/.crabbox.yaml
+++ b/.crabbox.yaml
@@ -77,9 +77,12 @@ jobs:
      python3 -m pip install -e '.[dev]' &&
      mkdir -p .crabbox-results &&
      python3 scripts/check_crabbox_ci_contract.py
-      --output .crabbox-results/crabbox-ci-contract.json
+      --output .crabbox-results/crabbox-ci-contract.json &&
      python3 scripts/check_llm_refinement_contract.py
      --output .crabbox-results/llm-refinement-contract.json
    downloads:
      - .crabbox-results/crabbox-ci-contract.json
      - .crabbox-results/llm-refinement-contract.json
    stop: always
  unit:
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -43,6 +43,7 @@ jobs:
            lib/post_extract.py \
            telegram/approvals.py \
            scripts/check_crabbox_ci_contract.py \
            scripts/check_llm_refinement_contract.py \
            scripts/prove_phase1b_local.py \
            tests/test_agent_routing.py \
            tests/test_evaluate_agent_routing.py \
@@ -76,8 +77,8 @@ jobs:
          path: .crabbox-results/pytest.xml
          if-no-files-found: warn
-  crabbox-contract:
+  repo-contracts:
-    name: Crabbox and Leo contract
+    name: Repo contracts
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
@@ -93,12 +94,16 @@ jobs:
        run: |
          python scripts/check_crabbox_ci_contract.py \
            --output .crabbox-results/crabbox-ci-contract.json
-      - name: Upload contract artifact
+          python scripts/check_llm_refinement_contract.py \
            --output .crabbox-results/llm-refinement-contract.json
      - name: Upload contract artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
-          name: teleo-infrastructure-crabbox-contract
+          name: teleo-infrastructure-repo-contracts
-          path: .crabbox-results/crabbox-ci-contract.json
+          path: |
            .crabbox-results/crabbox-ci-contract.json
            .crabbox-results/llm-refinement-contract.json
          if-no-files-found: error
  phase1b-local-proof:
@@ -107,7 +112,7 @@ jobs:
    needs:
      - lint
      - test
-      - crabbox-contract
+      - repo-contracts
    timeout-minutes: 20
    env:
      PHASE1B_AGENT_ROUTING_ENABLED: "true"
--- a/docs/llm-refinement-decision-engine.md
+++ b/docs/llm-refinement-decision-engine.md
@@ -0,0 +1,191 @@
 # LLM Refinement And Decision Engine Program
 Created: 2026-06-01
 Status: active direction
 ## Product Outcome
 The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.
 Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.
 ## What Rio And Theseus Become
 ### Rio
 Rio becomes the economic and incentive-quality evaluator.
 Rio owns:
 - contribution weights and role economics;
 - paid-query effects and anti-pay-to-pollute rules;
 - market, mechanism, futarchy, x402, token, and capital-formation reasoning;
 - source-diversity and correlated-prior warnings;
 - OPSEC for finance, deal terms, token economics, and internal allocations;
 - model tests that expose weak economic reasoning.
 Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives reward useful knowledge or garbage.
 ### Theseus
 Theseus becomes the model-integrity and agent-refinement evaluator.
 Theseus owns:
 - model diversity and correlated-blind-spot measurement;
 - adversarial eval rubrics;
 - prompt/tool safety and self-upgrade criteria;
 - disagreement queues and verifier-divergence analysis;
 - LLM capability evidence and agent-system architecture;
 - tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.
 Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.
 ## Decision Engine Loop
 ```mermaid
 flowchart TD
  PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
  Route --> Reviewers["Required agent reviewers"]
  Reviewers --> Rubric["Shared rubric"]
  Rubric --> ModelA["Primary model"]
  Rubric --> ModelB["Independent model family"]
  ModelA --> Verdicts["Structured verdicts"]
  ModelB --> Verdicts
  Verdicts --> Disagree{"Disagreement?"}
  Disagree -->|yes| Queue["Disagreement queue"]
  Disagree -->|no| Metrics["Calibration metrics"]
  Queue --> HumanOrLeo["Leo or human arbitration"]
  HumanOrLeo --> Metrics
  Metrics --> DB["SQLite feedback state"]
  DB --> Refine["Prompt, tool, or model proposal"]
  Refine --> Delta["Before/after eval harness"]
  Delta -->|passes| Update["Commit refinement"]
  Delta -->|fails| Archive["Archive failed refinement"]
 ```
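The `Disagreement?` branch in the diagram can be sketched as a small split over two verdict sets. This assumes each model returns one verdict string per fixture id; the function name and queue entry shape are hypothetical.

```python
from __future__ import annotations


def split_by_agreement(
    primary: dict[str, str], independent: dict[str, str]
) -> tuple[dict[str, str], list[dict]]:
    """Return (agreed verdicts, disagreement queue entries)."""
    agreed: dict[str, str] = {}
    queue: list[dict] = []
    for fixture_id in sorted(primary.keys() | independent.keys()):
        a = primary.get(fixture_id)
        b = independent.get(fixture_id)
        if a is not None and a == b:
            agreed[fixture_id] = a
        else:
            # A missing verdict counts as disagreement so it is never
            # silently treated as agreement.
            queue.append({"fixture_id": fixture_id, "primary": a, "independent": b})
    return agreed, queue
```

Agreed verdicts would flow to calibration metrics, and queue entries would go to Leo or human arbitration as drawn above.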
 ## Model Portfolio
 The goal is not to pick one favorite model. The goal is to assign models to failure modes.
 | Lane | Primary evaluator | Independent check | Why |
 | --- | --- | --- | --- |
 | Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
 | Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
 | Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
 | Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
 | Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |
 Candidate models should enter only through a harness:
 1. fixed input set;
 2. fixed rubric;
 3. structured verdict JSON;
 4. cost and latency recorded;
 5. disagreement categories stored;
 6. before/after comparison against current baseline.
 No model switch is accepted because it "sounds better" on one example.
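One way to enforce the "structured verdict JSON" requirement is to parse every model response into a fixed schema and reject anything else. The field names and the allowed verdict set below are assumptions for illustration; the rubric packets are expected to define the real ones.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

# Hypothetical verdict vocabulary; each rubric packet would define its own.
ALLOWED_VERDICTS = {"approve", "reject", "needs_arbitration"}


@dataclass(frozen=True)
class Verdict:
    fixture_id: str
    model: str
    verdict: str
    rejection_tags: tuple[str, ...]
    latency_s: float
    cost_usd: float


def parse_verdict(raw: str) -> Verdict:
    """Parse one model response, rejecting anything outside the schema."""
    data = json.loads(raw)
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {data.get('verdict')!r}")
    return Verdict(
        fixture_id=str(data["fixture_id"]),
        model=str(data["model"]),
        verdict=data["verdict"],
        rejection_tags=tuple(data.get("rejection_tags", [])),
        latency_s=float(data["latency_s"]),
        cost_usd=float(data["cost_usd"]),
    )
```

A free-text answer that merely sounds convincing fails at `parse_verdict`, so it never reaches the before/after comparison.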
 ## Refinement Workstreams
 ### R1: Rubric Packets
 Create a small rubric packet for each evaluator role:
 - `rio-economics-rubric`
 - `theseus-model-integrity-rubric`
 - `leo-cross-domain-rubric`
 - domain-specific factuality rubrics
 Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.
 ### R2: Evaluation Corpus
 Build a replayable corpus from existing PRs:
 - approved clean PRs;
 - rejected PRs by issue tag;
 - Rio/Theseus cross-domain PRs;
 - paid-query or contribution-weight examples;
 - adversarial malformed claims;
 - near-duplicate and OPSEC edge cases.
 Use local fixture data first. Production DB sampling requires the DB operator skill.
 ### R3: Model Bakeoff
 Run each candidate model against the same corpus and emit:
 - accuracy against expected disposition;
 - false-approve count;
 - false-reject count;
 - issue-tag precision;
 - average latency;
 - estimated cost;
 - disagreement matrix by model pair.
 The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
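A rough sketch of the bakeoff output, assuming verdicts are `approve`/`reject` strings keyed by fixture id. The function name, the result layout, and the `a|b` pair key are illustrative choices; issue-tag precision, latency, and cost are left out for brevity.

```python
from __future__ import annotations

from itertools import combinations


def bakeoff_metrics(
    expected: dict[str, str], runs: dict[str, dict[str, str]]
) -> dict:
    """Per-model accuracy and error counts plus pairwise disagreement counts."""
    per_model = {}
    for model, verdicts in runs.items():
        correct = sum(1 for fid, want in expected.items() if verdicts.get(fid) == want)
        per_model[model] = {
            "accuracy": correct / len(expected),
            "false_approves": sum(
                1 for fid, want in expected.items()
                if want == "reject" and verdicts.get(fid) == "approve"
            ),
            "false_rejects": sum(
                1 for fid, want in expected.items()
                if want == "approve" and verdicts.get(fid) == "reject"
            ),
        }
    # One entry per unordered model pair: how many fixtures they split on.
    disagreements = {
        f"{a}|{b}": sum(1 for fid in expected if runs[a].get(fid) != runs[b].get(fid))
        for a, b in combinations(sorted(runs), 2)
    }
    return {"models": per_model, "disagreements": disagreements}
```

Two models can share the same accuracy while failing in opposite directions, which is why the false-approve count and the disagreement matrix are reported separately from accuracy.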
 ### R4: Feedback Loop
 Use `review_records`, `audit_log`, `costs`, and PR state to find:
 - recurring model failure categories;
 - agents with repeated same-tag rejections;
 - prompts that produce vague reviews;
 - cost spikes without quality gain;
 - routes that keep requiring manual override.
 Every prompt/tool change should include before/after proof drawn from these feedback signals.
 ### R5: Agent Runtime Packages
 Package the same decision-engine contract for:
 - NousResearch Hermes Agent: skill/memory/model-switching oriented.
 - OpenClaw: workspace skill plus `AGENTS.md`, `SOUL.md`, `TOOLS.md` oriented.
 Both packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.
 ## DB Usage Boundary
 Default is read-only.
 Writes are allowed only when all are true:
 - the target DB is local, staging, or explicitly authorized production;
 - a backup or copy exists;
 - the write is wrapped in a transaction;
 - the exact query is retained in a proof artifact;
 - the post-write readback is retained.
 Never let an agent tune prompts by mutating production state directly.
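The write preconditions can be checked up front and turned into a list of blockers. The sketch below is an assumption about how such a guard could look; the `WriteRequest` fields and blocker messages are hypothetical and simply mirror the five conditions listed above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    target: str  # "local", "staging", or "production"
    production_authorized: bool
    backup_exists: bool
    in_transaction: bool
    query_artifact: str | None  # path where the exact SQL is retained
    readback_artifact: str | None  # path where the readback will be retained


def write_blockers(req: WriteRequest) -> list[str]:
    """Return every unmet precondition; an empty list means the write may proceed."""
    blockers = []
    if req.target not in {"local", "staging", "production"}:
        blockers.append("unknown target")
    if req.target == "production" and not req.production_authorized:
        blockers.append("production write not explicitly authorized")
    if not req.backup_exists:
        blockers.append("no backup or copy")
    if not req.in_transaction:
        blockers.append("write not wrapped in a transaction")
    if not req.query_artifact:
        blockers.append("query artifact path not set")
    if not req.readback_artifact:
        blockers.append("readback artifact path not set")
    return blockers
```

Returning all blockers at once, rather than failing on the first, gives the operator a complete list to resolve before retrying.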
 ## Pentagon.run Boundary
 Pentagon.run should own:
 - disposable VPS setup;
 - Crabbox or remote proof execution;
 - Hetzner lifecycle;
 - runner cleanup;
 - infra receipts.
 This repo should own:
 - decision-engine quality;
 - model and prompt experiments;
 - agent skills and adapter handoffs;
 - database feedback analysis;
 - proof schemas for eval quality.
 ## Next Implementation Slice
 1. Add `scripts/replay_decision_engine_eval.py` with local fixture mode.
 2. Add `fixtures/decision-engine-eval/*.json`.
 3. Store verdict outputs in `.crabbox-results/decision-engine-eval.json`.
 4. Add one Rio economics fixture and one Theseus model-integrity fixture.
 5. Compare current prompt versus one candidate prompt before touching runtime prompts.
 Do not start by changing live model assignments.
--- a/scripts/check_llm_refinement_contract.py
+++ b/scripts/check_llm_refinement_contract.py
@@ -0,0 +1,116 @@
 #!/usr/bin/env python3
 """Validate the LLM refinement and decision-engine guidance surface."""
 from __future__ import annotations
 import argparse
 import json
 import re
 from pathlib import Path
 REPO_ROOT = Path(__file__).resolve().parents[1]
 REQUIRED_FILES = {
    "program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
    "decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
    "db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
    "hermes_skill": REPO_ROOT / ".agents" / "skills" / "nousresearch-hermes-agent" / "SKILL.md",
    "openclaw_skill": REPO_ROOT / ".agents" / "skills" / "openclaw-agent" / "SKILL.md",
 }
 PROGRAM_REQUIRED_PHRASES = [
    "Pentagon.run should own disposable infrastructure",
    "This repo should own decision quality",
    "Rio becomes the economic and incentive-quality evaluator",
    "Theseus becomes the model-integrity and agent-refinement evaluator",
    "No model switch is accepted because it",
    "Default is read-only",
 ]
 SKILL_REQUIRED = {
    "decision_skill": [
        "Rio economics",
        "Theseus model integrity",
        "Do not change live model assignments",
        "baseline verdict output",
    ],
    "db_skill": [
        "Default to read-only",
        "BEGIN IMMEDIATE",
        "Do not attach, copy, or commit `pipeline.db`",
        "review_records",
    ],
    "hermes_skill": [
        "model switching",
        "fixture-first",
        "Rio Hermes package",
        "Theseus Hermes package",
    ],
    "openclaw_skill": [
        "AGENTS.md",
        "SOUL.md",
        "TOOLS.md",
        "Default deny",
    ],
 }
 def _read(path: Path) -> str:
    if not path.exists():
        raise AssertionError(f"missing file: {path.relative_to(REPO_ROOT)}")
    # Read as UTF-8 explicitly so the check does not depend on the runner's locale.
    return path.read_text(encoding="utf-8")
 def _assert_frontmatter(path: Path, text: str) -> None:
    match = re.match(r"^---\n(?P<body>.*?)\n---\n", text, flags=re.DOTALL)
    if not match:
        raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing YAML frontmatter")
    body = match.group("body")
    if "name:" not in body or "description:" not in body:
        raise AssertionError(f"{path.relative_to(REPO_ROOT)} frontmatter needs name and description")
 def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default=".crabbox-results/llm-refinement-contract.json")
    args = parser.parse_args()
    program = _read(REQUIRED_FILES["program_doc"])
    missing_program = [phrase for phrase in PROGRAM_REQUIRED_PHRASES if phrase not in program]
    if missing_program:
        raise AssertionError(f"program doc missing phrases: {missing_program}")
    skill_checks = {}
    for key, phrases in SKILL_REQUIRED.items():
        path = REQUIRED_FILES[key]
        text = _read(path)
        _assert_frontmatter(path, text)
        missing = [phrase for phrase in phrases if phrase not in text]
        if missing:
            raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing phrases: {missing}")
        skill_checks[key] = {
            "path": str(path.relative_to(REPO_ROOT)),
            "phrases_checked": phrases,
        }
    proof = {
        "ok": True,
        "scope": "llm_refinement_decision_engine_contract",
        "program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
        "program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
        "skills": skill_checks,
        "pivot": {
            "infra_owner": "Pentagon.run",
            "repo_owner": "decision quality, rubrics, model evals, prompt/tool refinement, DB feedback loops",
        },
    }
    output = REPO_ROOT / args.output
    output.parent.mkdir(parents=True, exist_ok=True)
    # Write as UTF-8 explicitly so the artifact is identical across runners.
    output.write_text(json.dumps(proof, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    print(json.dumps(proof, indent=2, sort_keys=True))
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
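A downstream consumer of the uploaded artifact could sanity-check the proof JSON before trusting it. The following is a sketch under the assumption that the artifact has exactly the shape built in `main()` above; `proof_problems` is a hypothetical helper, not part of this commit.

```python
from __future__ import annotations

# Skill keys as they appear in REQUIRED_FILES and SKILL_REQUIRED above.
EXPECTED_SKILLS = {"decision_skill", "db_skill", "hermes_skill", "openclaw_skill"}


def proof_problems(proof: dict) -> list[str]:
    """Return a list of problems found in a contract proof; empty means it looks sound."""
    problems = []
    if proof.get("ok") is not True:
        problems.append("ok is not true")
    if proof.get("scope") != "llm_refinement_decision_engine_contract":
        problems.append("unexpected scope")
    skills = proof.get("skills", {})
    missing = EXPECTED_SKILLS - set(skills)
    if missing:
        problems.append(f"missing skills: {sorted(missing)}")
    for key, entry in sorted(skills.items()):
        # A skill entry that checked no phrases proves nothing.
        if not entry.get("phrases_checked"):
            problems.append(f"{key} checked no phrases")
    return problems
```

It could be applied to the parsed contents of `.crabbox-results/llm-refinement-contract.json`, for example `proof_problems(json.loads(path.read_text()))` after importing `json`.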