Add decision engine refinement contracts
- Define Rio and Theseus as economics and model-integrity evaluators
- Add DB, Hermes, and OpenClaw skills with no-secret defaults
- Gate CI on LLM refinement contracts; verify with 422-test suite

`.agents/skills/decision-engine-refinement/SKILL.md`
`.agents/skills/nousresearch-hermes-agent/SKILL.md`
`.agents/skills/openclaw-agent/SKILL.md`
`.agents/skills/teleo-db-operator/SKILL.md`
`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`scripts/check_llm_refinement_contract.py`
parent a2620c1f19
commit aee534e686
8 changed files with 577 additions and 7 deletions
41
.agents/skills/decision-engine-refinement/SKILL.md
Normal file
@@ -0,0 +1,41 @@
---
name: decision-engine-refinement
description: Use when improving Living IP decision-engine quality, LLM model selection, evaluator prompts, rubrics, replay evals, Rio or Theseus reviewer behavior, or model bakeoffs.
---

# Decision Engine Refinement

Use this skill for quality work, not infrastructure work. Pentagon.run or Crabbox can run remote jobs; this repo owns model judgment, rubric design, prompt/tool refinement, and proof artifacts.

## Workflow

1. Read `docs/llm-refinement-decision-engine.md`.
2. Identify the lane: Rio economics, Theseus model integrity, Leo cross-domain, domain factuality, retrieval quality, or prompt/tool self-upgrade.
3. Build or reuse a replayable fixture before changing prompts or model assignments.
4. Compare baseline vs candidate with the same input, same rubric, and structured verdict format.
5. Record false approves, false rejects, useful disagreements, cost, and latency.
6. Change runtime prompts/models only after the candidate shows a measured improvement with no critical regression.
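
Steps 4 to 6 can be sketched as a small comparison gate. This is a minimal sketch under stated assumptions, not the repo's harness: `Verdict`, `error_counts`, `candidate_is_promotable`, and the `approve`/`reject` labels are hypothetical names, and treating any increase in false approves as the critical regression is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    fixture_id: str
    expected: str  # hypothetical labels: "approve" or "reject"
    actual: str


def error_counts(verdicts):
    """Count false approves and false rejects against expected dispositions."""
    false_approves = sum(
        1 for v in verdicts if v.expected == "reject" and v.actual == "approve"
    )
    false_rejects = sum(
        1 for v in verdicts if v.expected == "approve" and v.actual == "reject"
    )
    return {"false_approves": false_approves, "false_rejects": false_rejects}


def candidate_is_promotable(baseline, candidate):
    """Require fewer total errors and no increase in false approves."""
    b, c = error_counts(baseline), error_counts(candidate)
    no_critical_regression = c["false_approves"] <= b["false_approves"]
    improved = sum(c.values()) < sum(b.values())
    return no_critical_regression and improved
```

Both runs must cover the same fixtures and rubric for the counts to be comparable; cost and latency would be recorded alongside these counts rather than folded into the gate.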

## Hard Rules

- Do not change live model assignments because one answer sounds better.
- Do not use production DB writes to tune prompts.
- Do not collapse Rio and Theseus into generic "reviewers".
- Do not treat payment, popularity, or engagement as quality approval.
- Do not claim production decision-engine improvement without replay evidence and live/staging readback.

## Agent Responsibilities

- Rio: incentive design, contribution weights, paid-query effects, market/mechanism reasoning, OPSEC, correlated-prior warnings.
- Theseus: model diversity, adversarial evals, disagreement queues, self-upgrade criteria, prompt/tool safety, verifier drift.
- Leo: cross-domain synthesis, fallback review, final arbitration where the route or rubric is ambiguous.

## Expected Artifacts

- fixture file or DB query used for sampling;
- baseline verdict output;
- candidate verdict output;
- summary JSON with quality, cost, latency, and disagreement metrics;
- patch scoped to prompts, model config, rubric docs, or eval harness.

Run `python3 scripts/check_llm_refinement_contract.py` after editing this surface.
69
.agents/skills/nousresearch-hermes-agent/SKILL.md
Normal file
@@ -0,0 +1,69 @@
---
name: nousresearch-hermes-agent
description: Use when packaging Living IP agents, skills, prompts, memory, model routing, or decision-engine workflows for NousResearch Hermes Agent.
---

# NousResearch Hermes Agent

Use this skill to adapt Living IP decision-engine behavior to Hermes Agent. Keep the package fixture-first and no-secret by default.

## Current External Surface

As of 2026-06-01, the upstream Hermes Agent README describes:

- model switching via `hermes model`;
- tools via `hermes tools`;
- a messaging gateway for Telegram, Discord, Slack, WhatsApp, Signal, and CLI;
- built-in skill creation and self-improvement;
- cron scheduling;
- terminal backends including local, Docker, SSH, Modal, and Daytona;
- OpenClaw migration commands.

Verify upstream docs before depending on a command in code.

## Living IP Package Shape

Create a package that includes:

- agent identity file for Rio or Theseus;
- skill instructions copied from repo-owned `.agents/skills/*`;
- no-secret tool allowlist;
- fixture replay command;
- model selection notes;
- proof output path.

Do not package production DBs, tokens, API keys, SSH keys, or Bitwarden exports.

## Rio Package

Rio Hermes package should focus on:

- internet finance and mechanism reasoning;
- contribution weights and paid-query effects;
- OPSEC finance filters;
- source-diversity warnings;
- fixture tests for false economic reasoning.

## Theseus Package

Theseus Hermes package should focus on:

- model-diversity evals;
- disagreement queues;
- self-upgrade criteria;
- prompt/tool safety;
- fixture tests for overconfident or poorly grounded model judgments.

## Handoff Contract

Every Hermes handoff must include:

1. install/config snippet;
2. model/provider selection left configurable;
3. tool allowlist;
4. fixture-first demo;
5. no-live-write default;
6. proof artifact path;
7. known blockers.

Do not claim Hermes production integration until a Hermes runtime actually executes the fixture and writes proof.
69
.agents/skills/openclaw-agent/SKILL.md
Normal file
@@ -0,0 +1,69 @@
---
name: openclaw-agent
description: Use when adapting Living IP decision-engine agents, skills, tools, prompt files, or no-secret workflows to OpenClaw agent workspaces.
---

# OpenClaw Agent

Use this skill to package Living IP decision-engine behavior for OpenClaw workspaces. Treat OpenClaw as a distribution/runtime surface, not a new source of truth.

## Current External Surface

As of 2026-06-01, the upstream OpenClaw README describes:

- Node 24 or Node 22.19+ runtime;
- `openclaw onboard --install-daemon`;
- Gateway daemon usage;
- agent prompt files `AGENTS.md`, `SOUL.md`, and `TOOLS.md`;
- workspace skills at `~/.openclaw/workspace/skills/<skill>/SKILL.md`;
- model configuration in OpenClaw config;
- security guidance for DM pairing, allowlists, and sandboxing.

Verify upstream docs before depending on a command in code.

## Living IP Workspace Shape

Create or update:

- `AGENTS.md`: scope, repo boundaries, proof requirements;
- `SOUL.md`: Rio or Theseus identity;
- `TOOLS.md`: bounded tools only;
- `skills/decision-engine-refinement/SKILL.md`;
- `skills/teleo-db-operator/SKILL.md` only for read-only local copies unless explicitly authorized.

## Tool Policy

Default allow:

- read files;
- run local fixture tests;
- write proof artifacts;
- inspect git diffs;
- query copied SQLite DBs read-only.

Default deny:

- production DB writes;
- token reads;
- Bitwarden vault export;
- live GitHub PR comments;
- public messaging sends;
- broad shell automation against host services.
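
The allow and deny lists above amount to a deny-by-default policy, which could be checked roughly as follows. This is an illustrative sketch only: the tool identifiers and the helper names `is_allowed` and `denied_invocations` are hypothetical and are not OpenClaw configuration keys.

```python
# Hypothetical tool identifiers standing in for the "Default allow" list.
DEFAULT_ALLOW = frozenset(
    {"read_file", "run_fixture_test", "write_proof", "git_diff", "sqlite_read_copy"}
)


def is_allowed(tool: str) -> bool:
    """Deny by default: anything not explicitly allowlisted is refused."""
    return tool in DEFAULT_ALLOW


def denied_invocations(invoked: list[str]) -> list[str]:
    """Return every invoked tool that falls outside the allowlist, in call order."""
    return [tool for tool in invoked if not is_allowed(tool)]
```

An empty result from `denied_invocations` over a run's tool log is one possible form for the "no denied tools were invoked" proof item.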

## Rio And Theseus

- Rio OpenClaw package: economic reasoning, contribution incentives, paid-query guardrails, OPSEC.
- Theseus OpenClaw package: eval integrity, adversarial prompts, model bakeoffs, self-upgrade review.

## Proof Contract

An OpenClaw adapter is useful only if it can run a fixture and produce:

- prompt files used;
- tool allowlist;
- model selected;
- fixture input;
- structured verdict output;
- proof that no denied tools were invoked.

Do not claim OpenClaw production readiness until the package runs in an OpenClaw workspace and writes proof.
76
.agents/skills/teleo-db-operator/SKILL.md
Normal file
@@ -0,0 +1,76 @@
---
name: teleo-db-operator
description: Use when reading, auditing, backing up, querying, or safely writing the Teleo pipeline SQLite database, including review_records, audit_log, costs, prs, sources, and contributor feedback loops.
---

# Teleo DB Operator

Default to read-only. The database is evidence for decision-engine refinement, not a scratchpad.

## Discover

1. Read `lib/config.py` for `DB_PATH` and related paths.
2. Prefer local or copied DBs over production DBs.
3. If using production, record whether access is read-only or write-authorized.
4. Never print secret values found near DB paths or shell history.

## Read Path

Use `sqlite3` or Python `sqlite3`.

Recommended read targets:

- `review_records`: evaluator, model, outcome, rejection reason.
- `audit_log`: route decisions, approve/reject events, failure details.
- `costs`: model cost by date/stage.
- `prs`: status, tier, route compatibility fields, verdicts.
- `sources`: priority, feedback, extraction model.

For refinement work, export aggregated JSON or CSV into `.crabbox-results/` or `proof/`, not raw private DB snapshots.
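
A read-only aggregated export might look like the sketch below. It assumes a copied DB whose `review_records` table has the `reviewer` and `outcome` columns used in the queries in this skill; the function name `export_outcome_counts` and the output layout are hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def export_outcome_counts(db_path: str, out_path: str) -> list[dict]:
    """Aggregate review outcomes from a copied DB and write them as JSON."""
    # mode=ro asks SQLite to open the file read-only, so writes on this
    # connection fail instead of mutating the evidence.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT reviewer, outcome, count(*) AS n "
            "FROM review_records GROUP BY reviewer, outcome ORDER BY n DESC"
        ).fetchall()
    finally:
        conn.close()
    summary = [{"reviewer": r, "outcome": o, "n": n} for r, o, n in rows]
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summary, indent=2) + "\n")
    return summary
```

Only the aggregated counts leave the database; no raw rows are written to the proof directory.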

## Write Path

Writes require explicit authorization and a backup.

Required sequence:

1. Create a backup or operate on a copy.
2. Write the exact SQL in a retained artifact.
3. Use `BEGIN IMMEDIATE;`.
4. Apply the minimal mutation.
5. Read back the changed rows.
6. Commit the transaction only after readback is correct.
7. Write a blocker artifact instead of guessing if any precondition is missing.

Never write production prompt/model state as part of an experiment. Experiments should replay fixtures and produce proof first.
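
Steps 1 to 6 of the required sequence can be sketched with Python's `sqlite3` module as below. This is a hedged illustration for a local or copied DB, not a sanctioned write tool: `guarded_update`, its parameters, and the artifact format are hypothetical, and authorization checks (step 7) are assumed to happen before it is called.

```python
import sqlite3
from pathlib import Path


def guarded_update(db_path, backup_path, artifact_path, sql, params, readback_sql, expected):
    """Apply one mutation inside BEGIN IMMEDIATE; commit only if readback matches."""
    # Step 1: back up through SQLite's online backup API before any write.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    # Step 2: retain the exact SQL and parameters.
    Path(artifact_path).write_text(f"{sql}\n-- params: {params!r}\n")
    # isolation_level=None turns off implicit transactions, so BEGIN and
    # COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE;")  # step 3
        conn.execute(sql, params)  # step 4
        rows = conn.execute(readback_sql).fetchall()  # step 5
        if rows == expected:
            conn.execute("COMMIT;")  # step 6
            return True
        conn.execute("ROLLBACK;")
        return False
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK;")
        raise
    finally:
        conn.close()
```

A `False` return means the readback did not match and nothing was committed; the caller would then write a blocker artifact rather than retry blindly.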

## Safety Boundaries

- Do not attach, copy, or commit `pipeline.db`.
- Do not run broad `UPDATE` or `DELETE` without a `WHERE` clause and a prior row count.
- Do not mutate `prs`, `sources`, or contributor state from a model response alone.
- Do not treat local copied DB proof as production proof.

## Useful Queries

```sql
SELECT reviewer, reviewer_model, outcome, rejection_reason, count(*) AS n
FROM review_records
GROUP BY reviewer, reviewer_model, outcome, rejection_reason
ORDER BY n DESC;
```

```sql
SELECT event, count(*) AS n
FROM audit_log
WHERE stage = 'evaluate'
GROUP BY event
ORDER BY n DESC;
```

```sql
SELECT model, stage, calls, input_tokens, output_tokens, cost_usd
FROM costs
ORDER BY date DESC, cost_usd DESC
LIMIT 50;
```
.crabbox.yaml
@@ -77,9 +77,12 @@ jobs:
  python3 -m pip install -e '.[dev]' &&
  mkdir -p .crabbox-results &&
  python3 scripts/check_crabbox_ci_contract.py
- --output .crabbox-results/crabbox-ci-contract.json
+ --output .crabbox-results/crabbox-ci-contract.json &&
+ python3 scripts/check_llm_refinement_contract.py
+ --output .crabbox-results/llm-refinement-contract.json
  downloads:
  - .crabbox-results/crabbox-ci-contract.json
+ - .crabbox-results/llm-refinement-contract.json
  stop: always

  unit:
17
.github/workflows/ci.yml
vendored
@@ -43,6 +43,7 @@ jobs:
  lib/post_extract.py \
  telegram/approvals.py \
  scripts/check_crabbox_ci_contract.py \
+ scripts/check_llm_refinement_contract.py \
  scripts/prove_phase1b_local.py \
  tests/test_agent_routing.py \
  tests/test_evaluate_agent_routing.py \
@@ -76,8 +77,8 @@ jobs:
  path: .crabbox-results/pytest.xml
  if-no-files-found: warn

- crabbox-contract:
- name: Crabbox and Leo contract
+ repo-contracts:
+ name: Repo contracts
  runs-on: ubuntu-latest
  timeout-minutes: 10
  steps:
@@ -93,12 +94,16 @@ jobs:
  run: |
  python scripts/check_crabbox_ci_contract.py \
  --output .crabbox-results/crabbox-ci-contract.json
- - name: Upload contract artifact
+ python scripts/check_llm_refinement_contract.py \
+ --output .crabbox-results/llm-refinement-contract.json
+ - name: Upload contract artifacts
  if: always()
  uses: actions/upload-artifact@v4
  with:
- name: teleo-infrastructure-crabbox-contract
- path: .crabbox-results/crabbox-ci-contract.json
+ name: teleo-infrastructure-repo-contracts
+ path: |
+ .crabbox-results/crabbox-ci-contract.json
+ .crabbox-results/llm-refinement-contract.json
  if-no-files-found: error

  phase1b-local-proof:
@@ -107,7 +112,7 @@ jobs:
  needs:
  - lint
  - test
- - crabbox-contract
+ - repo-contracts
  timeout-minutes: 20
  env:
  PHASE1B_AGENT_ROUTING_ENABLED: "true"
191
docs/llm-refinement-decision-engine.md
Normal file
@@ -0,0 +1,191 @@
# LLM Refinement And Decision Engine Program

Created: 2026-06-01
Status: active direction

## Product Outcome

The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.

Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.

## What Rio And Theseus Become

### Rio

Rio becomes the economic and incentive-quality evaluator.

Rio owns:

- contribution weights and role economics;
- paid-query effects and anti-pay-to-pollute rules;
- market, mechanism, futarchy, x402, token, and capital-formation reasoning;
- source-diversity and correlated-prior warnings;
- OPSEC for finance, deal terms, token economics, and internal allocations;
- model tests that expose weak economic reasoning.

Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives create useful knowledge or garbage incentives.

### Theseus

Theseus becomes the model-integrity and agent-refinement evaluator.

Theseus owns:

- model diversity and correlated-blind-spot measurement;
- adversarial eval rubrics;
- prompt/tool safety and self-upgrade criteria;
- disagreement queues and verifier-divergence analysis;
- LLM capability evidence and agent-system architecture;
- tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.

Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.

## Decision Engine Loop

```mermaid
flowchart TD
PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
Route --> Reviewers["Required agent reviewers"]
Reviewers --> Rubric["Shared rubric"]
Rubric --> ModelA["Primary model"]
Rubric --> ModelB["Independent model family"]
ModelA --> Verdicts["Structured verdicts"]
ModelB --> Verdicts
Verdicts --> Disagree{"Disagreement?"}
Disagree -->|yes| Queue["Disagreement queue"]
Disagree -->|no| Metrics["Calibration metrics"]
Queue --> HumanOrLeo["Leo or human arbitration"]
HumanOrLeo --> Metrics
Metrics --> DB["SQLite feedback state"]
DB --> Refine["Prompt, tool, or model proposal"]
Refine --> Delta["Before/after eval harness"]
Delta -->|passes| Update["Commit refinement"]
Delta -->|fails| Archive["Archive failed refinement"]
```
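
The `Disagree` branch in the flowchart could be expressed as a small routing step. The sketch below is illustrative only; `route_batch` and the node labels `disagreement_queue` and `calibration_metrics` are hypothetical names chosen to mirror the diagram.

```python
from collections import defaultdict


def route_batch(verdict_pairs):
    """Split items by whether the primary and independent verdicts agree.

    verdict_pairs maps an item id to (primary_verdict, independent_verdict).
    """
    routed = defaultdict(list)
    for item_id, (primary, independent) in verdict_pairs.items():
        if primary != independent:
            routed["disagreement_queue"].append(item_id)
        else:
            routed["calibration_metrics"].append(item_id)
    return dict(routed)
```

Items in the disagreement queue would still reach the metrics stage after Leo or human arbitration, as the diagram shows.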

## Model Portfolio

The goal is not to pick one favorite model. The goal is to assign models to failure modes.

| Lane | Primary evaluator | Independent check | Why |
| --- | --- | --- | --- |
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |

Candidate models should enter only through a harness:

1. fixed input set;
2. fixed rubric;
3. structured verdict JSON;
4. cost and latency recorded;
5. disagreement categories stored;
6. before/after comparison against current baseline.

No model switch is accepted because it "sounds better" on one example.
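
One way to enforce the "structured verdict JSON" requirement is to validate each record before it enters the comparison. The schema below is an assumption for illustration: the field names, the `needs_review` verdict, and `parse_verdict` are hypothetical and not defined by this document.

```python
import json

# Hypothetical schema: field name -> accepted Python types after json.loads.
REQUIRED_FIELDS = {
    "fixture_id": (str,),
    "model": (str,),
    "rubric": (str,),
    "verdict": (str,),
    "cost_usd": (int, float),
    "latency_s": (int, float),
}
ALLOWED_VERDICTS = {"approve", "reject", "needs_review"}


def parse_verdict(raw: str) -> dict:
    """Parse one verdict record and reject it if the structure is incomplete."""
    record = json.loads(raw)
    for field, kinds in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), kinds):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    if record["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {record['verdict']!r}")
    return record
```

Rejecting malformed records up front keeps free-text answers that merely sound better out of the before/after comparison.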

## Refinement Workstreams

### R1: Rubric Packets

Create a small rubric packet for each evaluator role:

- `rio-economics-rubric`
- `theseus-model-integrity-rubric`
- `leo-cross-domain-rubric`
- domain-specific factuality rubrics

Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.

### R2: Evaluation Corpus

Build a replayable corpus from existing PRs:

- approved clean PRs;
- rejected PRs by issue tag;
- Rio/Theseus cross-domain PRs;
- paid-query or contribution-weight examples;
- adversarial malformed claims;
- near-duplicate and OPSEC edge cases.

Use local fixture data first. Production DB sampling requires the DB operator skill.

### R3: Model Bakeoff

Run each candidate model against the same corpus and emit:

- accuracy against expected disposition;
- false-approve count;
- false-reject count;
- issue-tag precision;
- average latency;
- estimated cost;
- disagreement matrix by model pair.

The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
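
Two of the bakeoff outputs, the false-approve count and the disagreement matrix by model pair, can be computed as in this rough sketch. The input shape (a mapping of model name to per-fixture verdicts) and the function names are assumptions, not an existing harness API.

```python
from itertools import combinations


def false_approves(verdicts, expected):
    """Count fixtures a model approved although the expected disposition is reject."""
    return sum(
        1 for fixture, verdict in verdicts.items()
        if verdict == "approve" and expected.get(fixture) == "reject"
    )


def disagreement_matrix(verdicts_by_model):
    """For each model pair, count shared fixtures where their verdicts differ."""
    matrix = {}
    for a, b in combinations(sorted(verdicts_by_model), 2):
        shared = verdicts_by_model[a].keys() & verdicts_by_model[b].keys()
        matrix[(a, b)] = sum(
            1 for fixture in shared
            if verdicts_by_model[a][fixture] != verdicts_by_model[b][fixture]
        )
    return matrix
```

Pairs are keyed in sorted order so each pair appears once; JSON output would need the tuple keys flattened to strings first.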

### R4: Feedback Loop

Use `review_records`, `audit_log`, `costs`, and PR state to find:

- recurring model failure categories;
- agents with repeated same-tag rejections;
- prompts that produce vague reviews;
- cost spikes without quality gain;
- routes that keep requiring manual override.

Every prompt/tool change should include a before/after proof over this loop.

### R5: Agent Runtime Packages

Package the same decision-engine contract for:

- NousResearch Hermes Agent: skill/memory/model-switching oriented.
- OpenClaw: workspace skill plus `AGENTS.md`, `SOUL.md`, `TOOLS.md` oriented.

Both packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.

## DB Usage Boundary

Default is read-only.

Writes are allowed only when all are true:

- the target DB is local, staging, or explicitly authorized production;
- a backup or copy exists;
- the write is wrapped in a transaction;
- the exact query is retained in a proof artifact;
- the post-write readback is retained.

Never let an agent tune prompts by mutating production state directly.
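
The five write conditions can be checked together before any mutation is attempted. This is a sketch under assumptions: `WriteRequest`, its fields, and `missing_preconditions` are hypothetical names, and how each flag gets established is left to the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    target: str  # "local", "staging", or "production"
    production_authorized: bool
    backup_exists: bool
    in_transaction: bool
    query_retained: bool
    readback_retained: bool


def missing_preconditions(req: WriteRequest) -> list[str]:
    """Return the names of unmet conditions; an empty list means the write may proceed."""
    target_ok = req.target in {"local", "staging"} or (
        req.target == "production" and req.production_authorized
    )
    checks = {
        "authorized target": target_ok,
        "backup or copy": req.backup_exists,
        "transaction": req.in_transaction,
        "retained query": req.query_retained,
        "retained readback": req.readback_retained,
    }
    return [name for name, ok in checks.items() if not ok]
```

A non-empty result is the kind of finding that belongs in a blocker artifact rather than a guess.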

## Pentagon.run Boundary

Pentagon.run should own:

- disposable VPS setup;
- Crabbox or remote proof execution;
- Hetzner lifecycle;
- runner cleanup;
- infra receipts.

This repo should own:

- decision-engine quality;
- model and prompt experiments;
- agent skills and adapter handoffs;
- database feedback analysis;
- proof schemas for eval quality.

## Next Implementation Slice

1. Add `scripts/replay_decision_engine_eval.py` with local fixture mode.
2. Add `fixtures/decision-engine-eval/*.json`.
3. Store verdict outputs in `.crabbox-results/decision-engine-eval.json`.
4. Add one Rio economics fixture and one Theseus model-integrity fixture.
5. Compare current prompt versus one candidate prompt before touching runtime prompts.

Do not start by changing live model assignments.
116
scripts/check_llm_refinement_contract.py
Executable file
@@ -0,0 +1,116 @@
#!/usr/bin/env python3
"""Validate the LLM refinement and decision-engine guidance surface."""

from __future__ import annotations

import argparse
import json
import re
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]

REQUIRED_FILES = {
    "program_doc": REPO_ROOT / "docs" / "llm-refinement-decision-engine.md",
    "decision_skill": REPO_ROOT / ".agents" / "skills" / "decision-engine-refinement" / "SKILL.md",
    "db_skill": REPO_ROOT / ".agents" / "skills" / "teleo-db-operator" / "SKILL.md",
    "hermes_skill": REPO_ROOT / ".agents" / "skills" / "nousresearch-hermes-agent" / "SKILL.md",
    "openclaw_skill": REPO_ROOT / ".agents" / "skills" / "openclaw-agent" / "SKILL.md",
}

PROGRAM_REQUIRED_PHRASES = [
    "Pentagon.run should own disposable infrastructure",
    "This repo should own decision quality",
    "Rio becomes the economic and incentive-quality evaluator",
    "Theseus becomes the model-integrity and agent-refinement evaluator",
    "No model switch is accepted because it",
    "Default is read-only",
]

SKILL_REQUIRED = {
    "decision_skill": [
        "Rio economics",
        "Theseus model integrity",
        "Do not change live model assignments",
        "baseline verdict output",
    ],
    "db_skill": [
        "Default to read-only",
        "BEGIN IMMEDIATE",
        "Do not attach, copy, or commit `pipeline.db`",
        "review_records",
    ],
    "hermes_skill": [
        "model switching",
        "fixture-first",
        "Rio Hermes package",
        "Theseus Hermes package",
    ],
    "openclaw_skill": [
        "AGENTS.md",
        "SOUL.md",
        "TOOLS.md",
        "Default deny",
    ],
}


def _read(path: Path) -> str:
    if not path.exists():
        raise AssertionError(f"missing file: {path.relative_to(REPO_ROOT)}")
    return path.read_text()


def _assert_frontmatter(path: Path, text: str) -> None:
    match = re.match(r"^---\n(?P<body>.*?)\n---\n", text, flags=re.DOTALL)
    if not match:
        raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing YAML frontmatter")
    body = match.group("body")
    if "name:" not in body or "description:" not in body:
        raise AssertionError(f"{path.relative_to(REPO_ROOT)} frontmatter needs name and description")


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default=".crabbox-results/llm-refinement-contract.json")
    args = parser.parse_args()

    program = _read(REQUIRED_FILES["program_doc"])
    missing_program = [phrase for phrase in PROGRAM_REQUIRED_PHRASES if phrase not in program]
    if missing_program:
        raise AssertionError(f"program doc missing phrases: {missing_program}")

    skill_checks = {}
    for key, phrases in SKILL_REQUIRED.items():
        path = REQUIRED_FILES[key]
        text = _read(path)
        _assert_frontmatter(path, text)
        missing = [phrase for phrase in phrases if phrase not in text]
        if missing:
            raise AssertionError(f"{path.relative_to(REPO_ROOT)} missing phrases: {missing}")
        skill_checks[key] = {
            "path": str(path.relative_to(REPO_ROOT)),
            "phrases_checked": phrases,
        }

    proof = {
        "ok": True,
        "scope": "llm_refinement_decision_engine_contract",
        "program_doc": str(REQUIRED_FILES["program_doc"].relative_to(REPO_ROOT)),
        "program_phrases_checked": PROGRAM_REQUIRED_PHRASES,
        "skills": skill_checks,
        "pivot": {
            "infra_owner": "Pentagon.run",
            "repo_owner": "decision quality, rubrics, model evals, prompt/tool refinement, DB feedback loops",
        },
    }

    output = REPO_ROOT / args.output
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(proof, indent=2, sort_keys=True) + "\n")
    print(json.dumps(proof, indent=2, sort_keys=True))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())