twentyOne2x 71ea7a625c Add decision engine replay harness

- Add source-linked model discovery registry for bakeoff candidates
- Add Rio, Theseus, and KB interop fixtures with deterministic replay proof
- Gate CI on replay output; verify with 424-test suite

`.crabbox.yaml`
`.github/workflows/ci.yml`
`docs/llm-refinement-decision-engine.md`
`docs/model-discovery-registry.md`
`fixtures/decision-engine-eval/kb_interop_propose_only.json`
`fixtures/decision-engine-eval/rio_meteora_lp_incentives.json`
`fixtures/decision-engine-eval/theseus_live_model_switch_reject.json`
`scripts/check_llm_refinement_contract.py`
`scripts/replay_decision_engine_eval.py`
`tests/test_decision_engine_replay.py`

2026-06-01 17:37:38 +02:00


LLM Refinement And Decision Engine Program

Created: 2026-06-01
Status: active direction

Product Outcome

The decision engine should become the best judgment layer for Living IP: it routes knowledge changes to the right agent identities, tests competing LLMs against the same rubric, learns from disagreement, and improves prompts/tools only when measured deltas prove the change.

Pentagon.run should own disposable infrastructure and remote execution. This repo should own decision quality: rubrics, prompts, model selection, route evidence, database feedback loops, and agent tool packages.

What Rio And Theseus Become

Rio

Rio becomes the economic and incentive-quality evaluator.

Rio owns:

contribution weights and role economics;
paid-query effects and anti-pay-to-pollute rules;
market, mechanism, futarchy, x402, token, and capital-formation reasoning;
source-diversity and correlated-prior warnings;
OPSEC for finance, deal terms, token economics, and internal allocations;
model tests that expose weak economic reasoning.

Rio should not be "the crypto agent". Rio should be the agent that asks whether the system's incentives produce useful knowledge or reward garbage.

Theseus

Theseus becomes the model-integrity and agent-refinement evaluator.

Theseus owns:

model diversity and correlated-blind-spot measurement;
adversarial eval rubrics;
prompt/tool safety and self-upgrade criteria;
disagreement queues and verifier-divergence analysis;
LLM capability evidence and agent-system architecture;
tests that expose hallucinated certainty, weak causal claims, and prompt-injection fragility.

Theseus should not be "the AI safety agent". Theseus should be the agent that asks whether the decision system can be trusted when the models are persuasive but wrong.

Decision Engine Loop

flowchart TD
  PR["Decision-engine PR or source record"] --> Route["Deterministic route evidence"]
  Route --> Reviewers["Required agent reviewers"]
  Reviewers --> Rubric["Shared rubric"]
  Rubric --> ModelA["Primary model"]
  Rubric --> ModelB["Independent model family"]
  ModelA --> Verdicts["Structured verdicts"]
  ModelB --> Verdicts
  Verdicts --> Disagree{"Disagreement?"}
  Disagree -->|yes| Queue["Disagreement queue"]
  Disagree -->|no| Metrics["Calibration metrics"]
  Queue --> HumanOrLeo["Leo or human arbitration"]
  HumanOrLeo --> Metrics
  Metrics --> DB["SQLite feedback state"]
  DB --> Refine["Prompt, tool, or model proposal"]
  Refine --> Delta["Before/after eval harness"]
  Delta -->|passes| Update["Commit refinement"]
  Delta -->|fails| Archive["Archive failed refinement"]
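
The branch at the "Disagreement?" node can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation: the `Verdict` fields, the `route_verdicts` name, and the stage labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical structured verdict; field names are illustrative only."""
    model: str
    disposition: str          # e.g. "approve" or "reject"
    issue_tags: tuple = ()    # rejection tags, if any


def route_verdicts(primary: Verdict, independent: Verdict) -> str:
    """Return the next stage for two verdicts on the same item."""
    same_disposition = primary.disposition == independent.disposition
    same_tags = set(primary.issue_tags) == set(independent.issue_tags)
    if same_disposition and same_tags:
        return "calibration_metrics"
    # Any divergence goes to the queue for Leo or human arbitration.
    return "disagreement_queue"
```

Treating a tag mismatch as a disagreement, even when both models reach the same disposition, is a choice made in this sketch; a looser rule would compare dispositions only.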

Model Portfolio

The goal is not to pick one favorite model. The goal is to assign models to failure modes.

| Lane | Primary evaluator | Independent check | Why |
| --- | --- | --- | --- |
| Fast triage | cheap small model | deterministic route evidence | triage should be cheap and overridable |
| Domain review | routed agent prompt | different model family | catch domain-specific errors without same-family agreement bias |
| Deep review | strongest available reasoning model | non-Claude or non-primary family | deep review is for structural claims and disagreement |
| Economic reasoning | Rio rubric | model with strong quantitative/mechanism reasoning | tests incentive design, paid-query effects, and contribution weights |
| Agent/refinement safety | Theseus rubric | model with strong adversarial critique | tests tool safety, self-upgrades, and evaluator drift |

Candidate models should enter only through a harness:

fixed input set;
fixed rubric;
structured verdict JSON;
cost and latency recorded;
disagreement categories stored;
before/after comparison against current baseline.

No model switch is accepted because it "sounds better" on one example.
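
One way to make the harness requirements concrete is a per-fixture record plus a check that a candidate run is comparable to the baseline. The sketch below is an assumption-laden illustration: `HarnessRecord`, its field names, and the `comparable` helper are hypothetical and do not describe the format that `scripts/replay_decision_engine_eval.py` emits.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HarnessRecord:
    """One model's result on one fixture; all field names are hypothetical."""
    fixture_id: str
    rubric_id: str
    model: str
    verdict: str
    disagreement_categories: list = field(default_factory=list)
    latency_seconds: float = 0.0
    cost_usd: float = 0.0


def to_verdict_json(record: HarnessRecord) -> str:
    """Serialize a record as stable, diff-friendly JSON."""
    return json.dumps(asdict(record), sort_keys=True)


def comparable(baseline: list, candidate: list) -> bool:
    """A before/after comparison only holds on identical fixtures and rubrics."""
    def keys(records):
        return {(r.fixture_id, r.rubric_id) for r in records}
    return keys(baseline) == keys(candidate)
```

Refusing to compare runs whose fixture or rubric sets differ is what keeps a single persuasive example from standing in for a measured delta.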

Refinement Workstreams

R0: Model Discovery Registry

Create a registry before arguing about model preference. The registry should track:

hosted frontier models;
open-weight Hugging Face candidates;
local or edge candidates;
small, cheap triage models;
larger reasoning models, including future in-house or 27B-class candidates;
license, hardware, context, latency, cost, tool support, and known failure modes.

The registry does not bless a model. It decides which model deserves a bakeoff fixture.

R1: Rubric Packets

Create a small rubric packet for each evaluator role:

rio-economics-rubric
theseus-model-integrity-rubric
leo-cross-domain-rubric
domain-specific factuality rubrics

Each packet must define allowed verdicts, rejection tags, must-check criteria, and examples of false positives.
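
A packet's completeness can be checked mechanically before any model sees it. The following is a minimal sketch that assumes a packet is a plain dictionary; the key names, the example verdicts, and the example tags are hypothetical stand-ins, not the repo's schema.

```python
# Hypothetical key names for the four things each packet must define.
REQUIRED_FIELDS = (
    "allowed_verdicts",
    "rejection_tags",
    "must_check",
    "false_positive_examples",
)


def packet_problems(packet: dict) -> list:
    """Return a list of problems; an empty list means the packet is complete."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = packet.get(name)
        if not isinstance(value, list) or not value:
            problems.append(f"{name}: expected a non-empty list")
    return problems


# Illustrative packet; verdicts and tags are invented for the example.
example_packet = {
    "name": "rio-economics-rubric",
    "allowed_verdicts": ["approve", "reject", "needs_revision"],
    "rejection_tags": ["pay_to_pollute", "correlated_priors"],
    "must_check": ["Does the change alter contribution weights?"],
    "false_positive_examples": [],  # left empty on purpose to show a failure
}
```

Calling `packet_problems(example_packet)` reports only the empty `false_positive_examples` entry, which is the kind of gap that is easy to miss in review.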

R2: Evaluation Corpus

Build a replayable corpus from existing PRs:

approved clean PRs;
rejected PRs by issue tag;
Rio/Theseus cross-domain PRs;
paid-query or contribution-weight examples;
adversarial malformed claims;
near-duplicate and OPSEC edge cases.

Use local fixture data first. Production DB sampling requires the DB operator skill.

R3: Model Bakeoff

Run each candidate model against the same corpus and emit:

accuracy against expected disposition;
false-approve count;
false-reject count;
issue-tag precision;
average latency;
estimated cost;
disagreement matrix by model pair.

The highest-signal metric is not raw approval rate. It is false approvals on bad claims plus useful disagreement on ambiguous claims.
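
The per-model counts and the pairwise disagreement matrix above are simple to compute once dispositions are keyed by fixture. This sketch assumes each run is a mapping from a fixture ID to `"approve"` or `"reject"`; the function names and that two-valued disposition set are assumptions for illustration.

```python
from itertools import combinations


def score_model(expected: dict, predicted: dict) -> dict:
    """Score one model; both arguments map fixture id -> disposition."""
    false_approve = sum(
        1 for fid, want in expected.items()
        if want == "reject" and predicted.get(fid) == "approve"
    )
    false_reject = sum(
        1 for fid, want in expected.items()
        if want == "approve" and predicted.get(fid) == "reject"
    )
    correct = sum(1 for fid, want in expected.items() if predicted.get(fid) == want)
    return {
        "accuracy": correct / len(expected),  # assumes a non-empty corpus
        "false_approve": false_approve,
        "false_reject": false_reject,
    }


def disagreement_matrix(runs: dict) -> dict:
    """Count differing dispositions per model pair; runs maps model -> run."""
    matrix = {}
    for a, b in combinations(sorted(runs), 2):
        shared = runs[a].keys() & runs[b].keys()
        matrix[(a, b)] = sum(1 for fid in shared if runs[a][fid] != runs[b][fid])
    return matrix
```

Keeping `false_approve` separate from accuracy matters here: two models with the same accuracy can differ sharply in how often they wave through a bad claim.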

R4: Feedback Loop

Use review_records, audit_log, costs, and PR state to find:

recurring model failure categories;
agents with repeated same-tag rejections;
prompts that produce vague reviews;
cost spikes without quality gain;
routes that keep requiring manual override.

Every prompt/tool change should include a before/after proof over this loop.
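
As an illustration of the "repeated same-tag rejections" query, the sketch below runs against an in-memory SQLite table. The column layout of `review_records` is not given here, so the `agent`, `verdict`, and `issue_tag` columns are assumptions, as are the sample rows.

```python
import sqlite3

# Hypothetical, simplified schema; the real review_records table may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review_records (agent TEXT, verdict TEXT, issue_tag TEXT)")
conn.executemany(
    "INSERT INTO review_records VALUES (?, ?, ?)",
    [
        ("alpha", "reject", "near_duplicate"),
        ("alpha", "reject", "near_duplicate"),
        ("alpha", "approve", None),
        ("beta", "reject", "weak_causal_claim"),
    ],
)


def repeated_rejections(connection, min_count=2):
    """Return (agent, issue_tag, count) rows for tags rejected min_count+ times."""
    query = """
        SELECT agent, issue_tag, COUNT(*) AS n
        FROM review_records
        WHERE verdict = 'reject'
        GROUP BY agent, issue_tag
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, agent, issue_tag
    """
    return connection.execute(query, (min_count,)).fetchall()
```

The query is read-only, so it can run against a copied database without touching production state.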

R5: Agent Runtime Packages

Package the same decision-engine contract for:

NousResearch Hermes Agent: skill/memory/model-switching oriented.
OpenClaw: workspace skill plus AGENTS.md, SOUL.md, TOOLS.md oriented.
Claude-style, Pentagon, or other persistent agents: skill-oriented knowledge-base read/write interop.

All three packages should be fixture-first and no-secret by default. They are distribution surfaces for the decision engine, not separate evaluators with their own truth.

R6: Knowledge-Base Interop

Any Hermes, OpenClaw, or Claude-style agent should be able to read information from the Living IP knowledge base and propose writes back into it.

The contract is:

read through deterministic search, claim indexes, copied SQLite state, or cited repo files;
propose source, claim, entity, correction, and route artifacts;
never write directly to main;
never mutate production pipeline.db from a model response;
leave proof showing the exact query, cited reads, proposed write, and route evidence.

Use .agents/skills/living-ip-kb-interop/SKILL.md for runtime-neutral KB access, and .agents/skills/teleo-db-operator/SKILL.md for SQLite-specific work.
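
The propose-only contract can be expressed as a function that assembles a proof record and refuses protected targets. This is a sketch under stated assumptions: `build_proposal`, the record keys, and the target labels are hypothetical, and requiring at least one cited read is a choice made for the example rather than a rule stated above.

```python
# Illustrative labels for targets an agent must never write to directly.
PROTECTED_TARGETS = {"main", "production:pipeline.db"}


def build_proposal(query, cited_reads, proposed_write, route_evidence, target):
    """Assemble a propose-only proof record; it never performs the write."""
    if target in PROTECTED_TARGETS:
        raise ValueError(f"direct writes to {target!r} are not allowed")
    if not cited_reads:
        # Assumption of this sketch: an uncited proposal is rejected outright.
        raise ValueError("a proposal must cite at least one read")
    return {
        "query": query,                    # the exact query that was run
        "cited_reads": list(cited_reads),  # files or rows the agent read
        "proposed_write": proposed_write,  # the artifact offered for review
        "route_evidence": route_evidence,  # why this reviewer was chosen
        "target": target,
        "applied": False,                  # applying is a separate, reviewed step
    }
```

Because the function only returns data, the same record can serve as the proof artifact whether the proposal is later accepted or archived.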

DB Usage Boundary

Default is read-only.

Writes are allowed only when all are true:

the target DB is local, staging, or explicitly authorized production;
a backup or copy exists;
the write is wrapped in a transaction;
the exact query is retained in a proof artifact;
the post-write readback is retained.

Never let an agent tune prompts by mutating production state directly.

Pentagon.run Boundary

Pentagon.run should own:

disposable VPS setup;
Crabbox or remote proof execution;
Hetzner lifecycle;
runner cleanup;
infra receipts;
persistent agent teammates, company-brain infrastructure, and agent-to-agent transport when that is their managed stack.

This repo should own:

decision-engine quality;
model and prompt experiments;
agent skills and adapter handoffs;
database feedback analysis;
proof schemas for eval quality.

Raw cards and secrets are not agent runtime inputs. Human operators may decide vendor billing and spend policy, but repo artifacts should only name secret slots, scoped tokens, spend limits, receipts, and setup checklists.

Transcript-Derived Requirements

The 2026-06-01 working transcript adds these requirements:

LLM/refinement work should focus on model discovery, compression, context strategy, and decision-engine quality while Pentagon handles cloud/persistent-agent infrastructure.
Rio should be the first place to route Meteora, LP, x402, futarchy, paid-query, and contribution-incentive questions.
Theseus should own the skill/MCP/refinement path that makes model judgment portable across Hermes, OpenClaw, Claude-style agents, and Pentagon-style company brains.
The knowledge-writing path should turn large founder/source corpora into structured, reviewable knowledge packets, not shallow summaries.
Slack, Linear, email, billing, and provider accounts are external collaboration setup. They should unblock people, but they are not prerequisites for local fixture, rubric, and proof work.

Next Implementation Slice

Add docs/model-discovery-registry.md.
Add scripts/replay_decision_engine_eval.py with local fixture mode.
Add fixtures/decision-engine-eval/*.json.
Store verdict outputs in .crabbox-results/decision-engine-eval.json.
Add one Rio economics fixture and one Theseus model-integrity fixture.
Add one KB interop fixture that searches existing context and proposes a write without touching main or production DB.
Compare current prompt versus one candidate prompt before touching runtime prompts.

Do not start by changing live model assignments.

Run python3 scripts/replay_decision_engine_eval.py after changing fixture, rubric, registry, or candidate-output formats.
