twentyOne2x ca96f5f8e3 Harden local phase 1b review path

2026-05-29 14:16:12 +02:00

Phase 1b Agent Routing Spec

Created: 2026-05-29
Status: active draft
Owner: Epimetheus pipeline implementation, with m3taversal as scope owner and Fwaz as VPS/runtime owner

Product Outcome Contract

Phase 1b makes the knowledge-base evaluation engine behave like a six-agent review system instead of a generic triage stack.

When a contribution changes the decision-engine KB, the pipeline must decide which Hermes agent identity is responsible for judging that change, run the required review or reviews, post agent-specific verdicts, and then let the existing merge or feedback machinery continue.

The user-visible outcome is not a new frontend. It is a PR review trail showing that the right agent or agents reviewed the right KB mutation.

Non-Goals

This spec does not implement:

Twitter/X posting.
x402, wallet, payment, or funding flows.
Decision markets, agent bidding, stake-weighted quorum, or prediction-market review.
Full general user-input routing outside the PR evaluation path.
Separate GitHub accounts for each agent.
A full Forgejo-to-GitHub daemon rewrite beyond what Phase 1b needs.
A dashboard redesign.
Production deployment without staging or VPS proof.

Program Decomposition

This is a medium-sized control-plane change with five execution lanes:

Agent identity routing.
Eval pipeline integration.
GitHub identity and bot comment posture.
Reporting and contributor compatibility.
Staging and production proof.

The implementation can remain in one PR only if lanes 1 through 4 are tightly tested and the staging proof remains a separate operator task. If the eval integration diff grows beyond the files named in this spec, split into:

PR 1: route contract and tests.
PR 2: eval integration and mocked state tests.
PR 3: GitHub/comment idempotency and reporting compatibility.
PR 4 or operator runbook: staging proof artifacts.

Child specs:

docs/phase1b/agent-identity-router-spec.md
docs/phase1b/eval-pipeline-integration-spec.md
docs/phase1b/github-identity-bot-posture-spec.md
docs/phase1b/reporting-contributor-compatibility-spec.md
docs/phase1b/staging-proof-spec.md

Priority Matrix

Rank	Workstream	Recurrence	Value	Readiness	Current state	Issue/spec mapping	Thread-claimed status	Verified implementation/proof status	Recommended next move
1	Canonical repo and eval target	Repeated confusion between `teleo-codex`, `teleo-kb`, and `decision-engine`.	Critical	Ready now	Confirmed by user: `decision-engine`. Some code still has Forgejo/teleo-codex defaults.	This spec, `handoff/phase1-step3-script-migration.md`	Clarified in chat.	Partially reflected in repo; not unified in daemon modules.	Make Phase 1b route/proof explicitly target `decision-engine`.
2	Agent identity routing	Repeated confusion between domain folders and agent ownership.	Critical	Ready now	Existing `lib/domains.py` is folder-first.	This spec	m3taversal clarified identity-first routing.	Initial local patch is insufficient.	Replace with identity-scored route contract.
3	Cross-domain review	Raised as scope expansion during clarification.	High	Ready now	Not implemented.	This spec	m3taversal confirmed cap at top 2.	No code proof.	Add top-2 required reviewer aggregation.
4	Single master bot account	GitHub bot/PAT issue was noted as blocker.	High	Ready now	Phase 1 handoff already documents single `livingIPbot` posture.	`handoff/phase1-step3-script-migration.md`	Separate identities ideal, likely too complex.	Handoff-only.	Use master bot comments with agent verdict tags.
5	Staging proof	User asked how to test without mutating prod VPS.	Critical for production	Draft gated	Needs VPS clone or Crabbox/staging access.	This spec	Proposed, not executed.	No proof.	Run after code PR passes local checks.

Goal

Implement Phase 1b for the decision-engine knowledge base: pipeline-v2 evaluates each incoming KB pull request by routing it to the Hermes agent identity that owns the relevant domain of judgment.

The implementation lives in teleo-infrastructure. The canonical KB repo for this phase is living-ip/decision-engine.

Phase 1b is complete only when single-domain and cross-domain PRs are routed to the expected required reviewer agents, verdicts are posted in the existing VERDICT:AGENT:* format, and the merge or feedback path continues from those verdicts.

User-Journey Contract

Contributor or agent flow:

A contributor or agent opens a PR against living-ip/decision-engine.
The PR changes one or more KB files.
Pipeline-v2 discovers the PR and fetches its diff.
The router scores Hermes agent identities from the diff, file paths, branch metadata, and eventually PR metadata.
The pipeline runs the required reviewer agents.
The master bot posts verdict comments that clearly name the agent identity in VERDICT:AGENT:* tags.
If all required reviewers approve, the existing approval and merge path continues.
If any required reviewer requests changes, the existing feedback/retry path continues.

Operator flow:

Operator can inspect a PR and see why each agent was selected.
Operator can inspect pipeline logs or audit rows and see route scores, required agents, verdicts, and aggregate result.
Operator can distinguish local proof, staging proof, and production proof.

Existing-Spec Inventory

Existing doc	Relevance	Decision	Reason
`handoff/phase1-step3-script-migration.md`	Establishes the Phase 1 move from Forgejo `teleo-codex` toward GitHub `living-ip/decision-engine`, and documents the single master bot account posture.	Reuse as context.	It owns migration history, not the Phase 1b routing implementation.
`handoff/deprecated/eval-scripts.md`	Confirms old eval dispatcher/worker scripts are dead and `lib/evaluate.py::evaluate_cycle` owns live eval behavior.	Reuse as context.	It prevents work from targeting retired scripts.
`docs/ARCHITECTURE.md`	Describes pipeline-v2 stages, SQLite state, Forgejo-era runtime topology, and existing evaluate/merge loops.	Reuse as context.	It is broader architecture; this spec is a Phase 1b delta spec.
`docs/multi-model-eval-architecture.md`	Documents the prior Leo-first plus second-model evaluation theory.	Supersede for Phase 1b eval routing only.	Phase 1b now routes to domain-owner agent identities, with capped top-2 cross-domain review. The old doc remains useful for later calibration.
`docs/queue.md`	Mentions domain evolution such as `ai-alignment` to `ai-systems`.	Reuse as signal.	It supports the identity-scored router rather than folder-only routing.

Current Implementation Audit

Current relevant implementation state:

teleo-pipeline.py runs pipeline-v2 as a single async daemon.
lib/evaluate.py::evaluate_cycle is the active eval loop.
lib/evaluate.py::evaluate_pr currently detects a domain, runs a domain review, then runs Leo review for non-LIGHT PRs.
lib/domains.py contains a folder-first DOMAIN_AGENT_MAP.
lib/llm.py contains prompt templates and run_domain_review, run_batch_domain_review, and run_leo_review.
lib/eval_parse.py::parse_verdict parses VERDICT:AGENT:APPROVE and VERDICT:AGENT:REQUEST_CHANGES.
pipeline-health-check.py is GitHub-oriented and points at living-ip/decision-engine.
lib/forgejo.py, lib/evaluate.py, and lib/merge.py still use Forgejo-named abstractions as the primary API surface.
Per-agent GitHub identity is deferred; Phase 1 uses one master bot account.

Fwaz clarification on 2026-05-29:

Separate GitHub identities are still ideal and blocked on GitHub/PAT setup; Phase 1b must not require them to land the routed-eval path.
Current production update behavior is pull -> services recognize pull -> edit on VPS -> PR to Leo; this is useful context, not the desired long-term control model.
New desired rule is no direct production self-upgrades: agents open PRs, and production deploys exact reviewed/tested SHAs approved and signed by Leo.
Crabbox is acceptable as the long-term disposable staging/test-box direction, while a production-like clone remains the highest-fidelity proof for systemd/VPS paths.

This branch implementation now includes:

lib/agent_routing.py with a pure identity-scored route contract.
PHASE1B_AGENT_ROUTING_ENABLED, defaulting off.
A Phase 1b eval path that runs routed required agents and disables stale domain batching under the flag.
Focused tests for six-agent routing, top-2 cross-domain routing, verdict parsing, and mocked eval aggregation.

Goal-Vs-Repo-Truth Diff

Desired Phase 1b behavior:

Route PRs against decision-engine, not teleo-codex.
Classify by agent identity ownership, not only by folder path.
Run exactly the required reviewer agents.
Use one master bot account if separate GitHub identities are too complex.
Preserve the existing verdict comment format.
Preserve existing merge and feedback behavior.
Support cross-domain PRs by requiring the top 2 routed agents.

Pre-implementation repo truth:

Pipeline eval still has a two-stage review shape: domain review plus Leo review.
Folder-domain mapping exists, but agent identity scoring does not.
Cross-domain review is not implemented as multiple required reviewer agents.
Batch eval can group rows before fetching diffs, which risks routing unclassified rows through general.
GitHub migration is partial: some scripts target GitHub decision-engine, but live daemon modules still have Forgejo-era names and assumptions.

Completion Percent And Remaining Delta

Estimated implementation progress on this branch:

B1 classifier foundation: 100 percent locally, pending staging calibration.
B2 routing layer: 75 percent locally behind a default-off feature flag.
Cross-domain top-2 review: 75 percent locally through mocked eval proof.
Local proof suite: 85 percent for router/eval/parser scope.
Staging or VPS proof: 0 percent.

Remaining delta:

Decide whether the production Phase 1b transport stays Forgejo-first for cutover or switches directly to GitHub decision-engine before staging.
Update reporting/health compatibility beyond review_records if staging shows false readiness.
Prove against staging before production.
Deploy only an exact reviewed/tested SHA after Leo signoff.

Closure, Endpoint, And Deployment Truth

Local closure means:

Focused tests pass in teleo-infrastructure.
A PR exists with the Phase 1b routing implementation and proof notes.

Staging closure means:

A cloned or disposable staging runtime is pointed at a sandbox decision-engine.
Six single-domain sandbox PRs and one cross-domain sandbox PR complete the expected eval path.
A machine-readable proof artifact captures routes, required agents, verdicts, status transitions, git SHAs, and logs.

Production closure means:

The exact reviewed SHA is deployed to the production VPS.
Production pipeline runs real decision-engine PRs through Phase 1b routing.
All six agents have completed at least one live review cycle.
Pipeline remains stable for at least 24 hours after cutover.

Without VPS or staging access, only local closure can be claimed.

Critical Assumptions And Invalidators

Assumptions:

decision-engine is the canonical KB repo for Phase 1b.
The active eval implementation is teleo-infrastructure/lib/evaluate.py, not retired shell scripts.
One master bot account is acceptable for Phase 1b verdict comments.
Required reviewer identity is encoded in the verdict tag, not necessarily in the GitHub account identity.
Agent state files in decision-engine/agents/{agent} are the right identity context source when present.

Invalidators:

Production pipeline is still wired to a different canonical repo.
The VPS runs code not represented by current teleo-infrastructure.
Branch protection requires separate GitHub identities before comments or reviews count.
Agent identity files are absent or materially different on the VPS.
Cross-domain review must include more than top 2 reviewers.

State And Truth Contract

The routing implementation must record or expose:

PR number.
Primary agent.
Required agents.
Route kind: single, multi, fallback, or escalated.
Route scores by agent.
Route evidence: path, branch, title, diff keyword, or fallback.
Verdict per required agent.
Aggregate result.
Failure reason for missing or unparseable verdicts.

This can be stored first in audit log details and test artifacts. A DB schema migration is optional for Phase 1b unless downstream dashboards require queryable route fields.

Route Decision Schema

The route decision should be serializable without importing Python classes. Use this JSON shape in audit rows and proof artifacts:

{
  "pr": 123,
  "repo": "living-ip/decision-engine",
  "route_version": "phase1b-v1",
  "route_kind": "single",
  "primary_agent": "Rio",
  "required_agents": ["Rio"],
  "scores": {
    "Leo": 0,
    "Theseus": 1,
    "Rio": 9,
    "Vida": 0,
    "Clay": 0,
    "Astra": 0
  },
  "evidence": [
    {
      "agent": "Rio",
      "signal": "path",
      "weight": 5,
      "value": "domains/internet-finance/example.md"
    }
  ],
  "fallback": false
}

route_kind values:

single: one required reviewer.
multi: two required reviewers from cross-domain scoring.
fallback: no confident route, Leo required.
escalated: route exceeded simple review bounds and was capped by policy.
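A minimal sketch of emitting the route decision as plain JSON, so audit rows and proof artifacts can be read without importing Python classes. The helper name `route_to_audit_json` is hypothetical, and deriving the `fallback` field from `route_kind` is an assumption; the field names follow the shape shown above.

```python
import json

ROUTE_KINDS = {"single", "multi", "fallback", "escalated"}


def route_to_audit_json(pr, repo, route_kind, primary_agent,
                        required_agents, scores, evidence):
    """Serialize a route decision using only built-in types (hypothetical helper)."""
    if route_kind not in ROUTE_KINDS:
        raise ValueError(f"unknown route_kind: {route_kind}")
    payload = {
        "pr": pr,
        "repo": repo,
        "route_version": "phase1b-v1",
        "route_kind": route_kind,
        "primary_agent": primary_agent,
        "required_agents": list(required_agents),
        "scores": dict(scores),
        "evidence": list(evidence),
        # Assumption: the boolean mirrors the fallback route kind.
        "fallback": route_kind == "fallback",
    }
    # sort_keys keeps the output stable across runs, which helps diffing proofs.
    return json.dumps(payload, sort_keys=True)
```

Rejecting unknown route kinds at serialization time is one way to keep audit rows aligned with the four values listed above.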

Verdict State Schema

Aggregate review state should be serializable as:

{
  "pr": 123,
  "required_agents": ["Theseus", "Rio"],
  "agent_verdicts": {
    "Theseus": "approve",
    "Rio": "request_changes"
  },
  "aggregate_verdict": "request_changes",
  "blocking_agents": ["Rio"],
  "missing_agents": [],
  "unparseable_agents": [],
  "transport_failed_agents": []
}

Aggregate states:

approve: all required agents approved.
request_changes: at least one required agent requested changes, or its verdict was missing or unparseable without an explicit transport failure.
retry: at least one required review failed for transport reasons and should not burn the PR as a substantive rejection.

Measurement Contract

Minimum metrics:

route_single_count
route_multi_count
route_escalated_count
review_required_agent_count
review_missing_verdict_count
review_request_changes_count
review_approve_count
route_fallback_count

Minimum proof matrix:

Case	Expected route
grand strategy PR	Leo
ai systems or ai alignment PR	Theseus
internet finance or x402 PR	Rio
health PR	Vida
entertainment PR	Clay
space, robotics, energy, or advanced manufacturing PR	Astra
ai plus x402 PR	Theseus and Rio
collective ai goals PR	Leo and Theseus, if both score in top 2

Score-To-100 Closure Plan

Preparedness score before implementation: 35/100.

Score band	Closure move	Evidence that moves score
35 -> 50	Route contract implemented and unit-tested.	`test_agent_routing.py` proves six single-agent routes, broadened identity ownership, top-2 cross-domain routes, and fallback behavior.
50 -> 65	Eval integration mocked locally.	Mocked eval tests prove required agents are invoked, default Leo review is removed, and aggregate verdicts drive approve/request-changes behavior.
65 -> 75	API/comment compatibility proven locally.	Tests prove all six verdict tags parse and master-bot comment bodies preserve existing parser expectations.
75 -> 85	Staging clone or disposable test box runs sandbox PR proof.	Six single-domain sandbox PRs plus one cross-domain sandbox PR produce expected comments and state transitions.
85 -> 95	Production deploy of exact reviewed SHA.	VPS deploy log, service restart readback, and route/proof artifact for first real PRs.
95 -> 100	24-hour production stability.	24-hour daemon readback with no duplicate comments, no stuck review rows, no production fallback spike, and all six agents represented in verdict history.

The implementation PR can be merged at 65-75 if reviewers accept staging as a deploy gate. It cannot claim Phase 1b complete below 100.

Backend Work Required

1. Agent identity router

Create lib/agent_routing.py, or refactor the routing logic into it, unless the existing lib/domains.py stays small enough to hold the route contract clearly.

Define:

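from dataclasses import dataclass

# frozen=True is a suggestion: a route decision should not change after scoring.
@dataclass(frozen=True)
class AgentRoute:
    primary_agent: str
    required_agents: tuple[str, ...]
    route_kind: str  # single, multi, fallback, or escalated
    scores: dict[str, int]
    evidence: list[dict]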

Router signals:

Path signals from domains/, entities/, core/, foundations/, and agents/.
Branch prefix signals such as rio/, theseus/, astra/, leo/.
Keyword signals from path, filename, branch, PR title/body when available, and capped diff text.
Agent identity ownership map.

Agent identity ownership map:

Agent	Owns
Leo	grand strategy, teleohumanity goals, collective AI self-understanding, meta strategy, nested collective intelligence concepts
Theseus	AI systems, AI alignment, AI governance, agent systems, safety, evaluation
Rio	internet finance, living capital, markets, crypto, futarchy, x402, payments, capital formation
Vida	health, healthcare, medicine, prevention, clinical systems, mental health, biohealth
Clay	entertainment, media, culture, IP, fandom, narrative, consumer attention
Astra	space development, robotics, energy, advanced manufacturing, physical frontier infrastructure

Routing rules:

If only one agent crosses the threshold, require that agent.
If more than one agent crosses the threshold, require the top 2 agents, subject to the top-2 selection limits below.
If no agent crosses the threshold, fall back to Leo with route kind fallback.
Rank agents by score; break ties with a deterministic configured order.

Implementation constraints:

The router must be deterministic.
The router must be pure and side-effect free.
Route scores must be explainable through evidence entries.
Folder paths should be strong evidence, not the whole classifier.
Keyword scoring must not require paid inference.
LLM classification may be added later only as shadow-mode evidence.

Recommended scoring starter:

Signal	Weight
Path directly under known primary ownership area	8
Path under broadened ownership area	6
Branch prefix matches agent	4
Filename keyword matches ownership	3
Diff keyword matches ownership	1 per capped hit
PR title/body keyword matches ownership, if available	2
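The starter weights can be exercised with a small pure scorer. This is a sketch under stated assumptions, not the contents of lib/agent_routing.py: the function name `score_agents`, the two-agent maps, the `domains/health/` prefix, the keyword lists, and the cap of 3 diff hits are all hypothetical. Only the weights 8, 4, and 1 come from the table.

```python
from collections import defaultdict

# Weights copied from the starter table; everything else is illustrative.
W_PRIMARY_PATH = 8
W_BRANCH_PREFIX = 4
W_DIFF_KEYWORD = 1
DIFF_HIT_CAP = 3  # assumed cap; the spec does not fix a number

PRIMARY_PATHS = {"Rio": "domains/internet-finance/", "Vida": "domains/health/"}
DIFF_KEYWORDS = {"Rio": ("x402", "futarchy"), "Vida": ("clinical", "healthcare")}


def score_agents(paths, branch, diff_text):
    """Return (scores, evidence) for one PR; pure and deterministic."""
    scores = defaultdict(int)
    evidence = []

    def add(agent, signal, weight, value):
        # Every score change records an evidence entry, so scores stay explainable.
        scores[agent] += weight
        evidence.append({"agent": agent, "signal": signal,
                         "weight": weight, "value": value})

    for path in paths:
        for agent, path_prefix in PRIMARY_PATHS.items():
            if path.startswith(path_prefix):
                add(agent, "path", W_PRIMARY_PATH, path)

    branch_prefix = branch.split("/", 1)[0].lower()
    for agent in PRIMARY_PATHS:
        if agent.lower() == branch_prefix:
            add(agent, "branch", W_BRANCH_PREFIX, branch)

    lowered = diff_text.lower()
    for agent, words in DIFF_KEYWORDS.items():
        hits = min(sum(lowered.count(word) for word in words), DIFF_HIT_CAP)
        if hits:
            add(agent, "diff keyword", W_DIFF_KEYWORD * hits, f"{hits} capped hits")

    return dict(scores), evidence
```

Because each increment goes through `add`, the sum of evidence weights per agent always equals that agent's score, which matches the constraint that route scores must be explainable through evidence entries.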

Top-2 selection:

Include the highest-scoring agent.
Include a second agent only if its score is at least 40 percent of the first score and at least the minimum threshold.
Minimum threshold starts at 4.
Never include more than two required agents in Phase 1b.
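The selection rules above can be sketched as a pure function over per-agent scores. The function name `select_required_agents` and the configured tie-break order are assumptions for illustration; the threshold of 4, the 40 percent ratio, the Leo fallback, and the cap of two agents come from this spec.

```python
AGENT_ORDER = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")  # assumed tie-break order
MIN_THRESHOLD = 4
SECOND_RATIO_PERCENT = 40


def select_required_agents(scores):
    """Return (route_kind, required_agents) from per-agent integer scores."""
    # Sort by descending score, then by configured order for deterministic ties.
    ranked = sorted(
        AGENT_ORDER,
        key=lambda agent: (-scores.get(agent, 0), AGENT_ORDER.index(agent)),
    )
    top, runner_up = ranked[0], ranked[1]
    top_score = scores.get(top, 0)
    if top_score < MIN_THRESHOLD:
        # No confident route: Leo reviews.
        return "fallback", ("Leo",)
    second_score = scores.get(runner_up, 0)
    # Integer arithmetic avoids float rounding on the 40 percent comparison.
    close_enough = second_score * 100 >= SECOND_RATIO_PERCENT * top_score
    if second_score >= MIN_THRESHOLD and close_enough:
        return "multi", (top, runner_up)
    return "single", (top,)
```

With the example scores from the route decision schema (Rio 9, Theseus 1), this returns a single route to Rio; with Theseus and Rio tied at 8, it returns both, ordered by the configured order.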

2. Eval layer integration

Modify lib/evaluate.py:

Fetch PR diff.
Build route from diff and branch.
Store or audit route decision.
Run required reviewer agents.
Aggregate verdicts.
Remove default Leo second-review for normal single-agent PRs.
Keep existing bypasses for musings and reweave unless m3taversal changes policy.
Revisit batch eval: disable batching for Phase 1b or classify before batching.

Implementation sequence:

Add pure route builder and tests.
Add review aggregation helper and tests.
Add run_agent_review while leaving existing run_domain_review and run_leo_review intact.
Switch individual evaluate_pr path to the new router behind a feature flag such as PHASE1B_AGENT_ROUTING_ENABLED.
Disable batch domain eval when the feature flag is enabled unless route-aware batching is implemented in the same PR.
Remove or bypass the default Leo second-review when the feature flag is enabled.
Preserve old behavior when the feature flag is disabled.

Feature flag requirement:

PHASE1B_AGENT_ROUTING_ENABLED=false by default until staging proof exists.

The PR may run tests against the enabled behavior without changing the production default.
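One way to keep the default off is to treat anything other than an explicit truthy value as disabled. The sketch below assumes the flag is read from the process environment and uses a hypothetical helper name, `phase1b_routing_enabled`; the real pipeline may read it through its own config module instead.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def phase1b_routing_enabled(environ=None):
    """Return True only when PHASE1B_AGENT_ROUTING_ENABLED is explicitly truthy."""
    # Accepting an explicit mapping lets tests enable the flag
    # without touching the real environment or the production default.
    env = os.environ if environ is None else environ
    raw = env.get("PHASE1B_AGENT_ROUTING_ENABLED", "false")
    return raw.strip().lower() in TRUTHY
```

Passing a mapping in tests, for example `phase1b_routing_enabled({"PHASE1B_AGENT_ROUTING_ENABLED": "true"})`, exercises the enabled path while an unset variable still resolves to off.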

3. Agent review runner

Modify or add in lib/llm.py:

async def run_agent_review(diff: str, files: str, agent: str, route: AgentRoute) -> tuple[str | None, dict]:
    ...

Prompt must include:

Agent identity context when available.
Route evidence.
Existing eval criteria.
Required verdict tag for that exact agent.

Continue using one master bot account for comments. The bot comment body must identify the routed agent via the verdict tag.

Agent context lookup order:

Runtime-configured KB worktree path, expected to point at decision-engine.
Existing config.MAIN_WORKTREE if production still uses that convention.
Explicit test fixture path in unit tests.

Context files:

agents/{agent}/identity.md
agents/{agent}/beliefs.md
agents/{agent}/reasoning.md
agents/{agent}/skills.md

Missing context files:

Log a warning.
Include an audit evidence entry.
Continue with the generic agent prompt.
Do not crash the eval cycle.
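The missing-file behavior can be sketched as a loader that logs, records evidence, and keeps going. The function name `load_agent_context`, the logger name, the `missing_context` signal label, and the lowercase agent directory name are assumptions; the four file names come from the context file list above.

```python
import logging
from pathlib import Path

logger = logging.getLogger("phase1b.context")  # hypothetical logger name
CONTEXT_FILES = ("identity.md", "beliefs.md", "reasoning.md", "skills.md")


def load_agent_context(kb_root, agent):
    """Return (context_text, evidence) for one agent; never raises on missing files."""
    # Assumption: agent directories use the lowercase agent name.
    agent_dir = Path(kb_root) / "agents" / agent.lower()
    sections, evidence = [], []
    for name in CONTEXT_FILES:
        path = agent_dir / name
        try:
            body = path.read_text(encoding="utf-8")
        except OSError:
            # Missing or unreadable file: warn, record audit evidence, continue.
            logger.warning("missing agent context file: %s", path)
            evidence.append({"agent": agent, "signal": "missing_context",
                             "value": str(path)})
            continue
        sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections), evidence
```

An empty context string signals the caller to use the generic agent prompt, so a KB checkout without agent files degrades the review rather than stopping the eval cycle.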

4. Verdict aggregation

Add helper:

aggregate_agent_verdicts(required_agents, reviews) -> AggregateVerdict

Rules:

All required agents approve: approve.
Any required agent requests changes: request_changes.
Transport failure: retry (reopen the PR for retry).
Missing or unparseable verdict: request_changes unless transport failure is explicit.
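A minimal sketch of the aggregation helper under these rules. The shape of `reviews` (agent name mapped to an outcome string), the `AggregateVerdict` fields, and the choice to let a transport failure take precedence over a request-changes verdict are assumptions; the field names mirror the verdict state schema.

```python
from dataclasses import dataclass, field


@dataclass
class AggregateVerdict:
    aggregate_verdict: str
    blocking_agents: list = field(default_factory=list)
    missing_agents: list = field(default_factory=list)
    unparseable_agents: list = field(default_factory=list)
    transport_failed_agents: list = field(default_factory=list)


def aggregate_agent_verdicts(required_agents, reviews):
    """Combine per-agent outcomes into one aggregate verdict.

    reviews maps agent -> "approve", "request_changes", or "transport_failed";
    an absent agent counts as missing and any other value as unparseable.
    """
    result = AggregateVerdict(aggregate_verdict="approve")
    for agent in required_agents:
        outcome = reviews.get(agent)
        if outcome == "approve":
            continue
        if outcome == "request_changes":
            result.blocking_agents.append(agent)
        elif outcome == "transport_failed":
            result.transport_failed_agents.append(agent)
        elif outcome is None:
            result.missing_agents.append(agent)
        else:
            result.unparseable_agents.append(agent)

    if result.transport_failed_agents:
        # Assumed precedence: retry first, so a transport error
        # is not recorded as a substantive rejection.
        result.aggregate_verdict = "retry"
    elif (result.blocking_agents or result.missing_agents
          or result.unparseable_agents):
        result.aggregate_verdict = "request_changes"
    return result
```

Only agents in `required_agents` are consulted, so a stray review from an unrouted agent cannot approve or block the PR.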

Comment format:

Preferred for one required agent:

<review text>

<!-- VERDICT:RIO:APPROVE -->

Preferred for two required agents:

## Theseus review

<review text>

<!-- VERDICT:THESEUS:APPROVE -->

## Rio review

<review text>

<!-- VERDICT:RIO:REQUEST_CHANGES -->

Two separate comments are acceptable if simpler and less risky for existing parsers.
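The two comment shapes can be produced by one builder and checked with a round-trip. This is a sketch: `build_review_comment` and `extract_verdicts` are hypothetical names, and the regex is an illustrative stand-in, not the behavior of lib/eval_parse.py::parse_verdict, which should remain the source of truth.

```python
import re

# Illustrative pattern for the HTML-comment verdict tag shown above.
VERDICT_RE = re.compile(r"<!--\s*VERDICT:([A-Z]+):(APPROVE|REQUEST_CHANGES)\s*-->")


def build_review_comment(reviews):
    """Build one comment body from (agent, review_text, verdict) tuples."""
    multi = len(reviews) > 1
    sections = []
    for agent, text, verdict in reviews:
        tag = f"<!-- VERDICT:{agent.upper()}:{verdict.upper()} -->"
        # Per-agent headings are only used when two reviews share a comment.
        heading = f"## {agent} review\n\n" if multi else ""
        sections.append(f"{heading}{text}\n\n{tag}")
    return "\n\n".join(sections)


def extract_verdicts(body):
    """Return {AGENT: VERDICT} for every verdict tag found in a comment body."""
    return {agent: verdict for agent, verdict in VERDICT_RE.findall(body)}
```

A round-trip test of this kind is a cheap guard that a structured two-agent comment still yields one verdict per required agent before choosing it over two separate comments.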

5. Contributor and dashboard compatibility

Audit and update:

lib/contributor.py assumptions that Leo reviews every PR.
pipeline-health-check.py verdict parsing if needed.
Any dashboard code assuming only leo_verdict plus domain_verdict.

Avoid broad dashboard redesign in Phase 1b. If dashboards need richer route state, add an audit artifact first and defer UI.

Frontend Work Required

No frontend work is required for Phase 1b.

livingip-web Phase 1c can later reuse the same router as pre-PR guidance, but Phase 1b acceptance is based on decision-engine PR evaluation.

Operator Work Required

Operator or infrastructure owner must provide before production proof:

Current production deployed SHA for teleo-infrastructure.
Current production KB target and worktree path.
Current systemd units and restart commands.
Staging clone or disposable test runner access.
Sandbox decision-engine target or clear permission to create one.
Staging token set with no production mutation authority.
Rollback SHA and rollback command.

If these are unavailable, implementation can continue locally but production proof must remain blocked.

Expected Runtime And User-Visible Behavior

Single-domain PR:

Pipeline detects route.
The required agents list has one name.
Master bot posts one review comment with VERDICT:AGENT:*.
Existing merge or feedback path continues.

Cross-domain PR:

Pipeline detects route.
The required agents list has two names.
Master bot posts one review comment per required agent, or one structured comment with separate verdict sections if that is simpler.
Merge requires both approvals.
Any request-changes verdict blocks the merge and triggers the feedback path.

The user-visible proof is PR comments and final PR disposition.

Staging Proof Contract

Staging must be production-like enough to test pipeline behavior but quarantined from production side effects.

Required staging safety controls:

Production services disabled before any daemon starts.
Production GitHub tokens removed or replaced.
Production OpenRouter/Claude/Hermes keys removed or replaced unless explicitly approved for staging spend.
Sandbox decision-engine repo configured.
Auto-merge either disabled or constrained to sandbox repo.
Hostname clearly changed to staging.

Required proof artifact:

{
  "phase": "1b",
  "environment": "staging",
  "teleo_infrastructure_sha": "...",
  "decision_engine_sha": "...",
  "pipeline_db_schema": 26,
  "feature_flags": {
    "PHASE1B_AGENT_ROUTING_ENABLED": "true"
  },
  "test_prs": [
    {
      "case": "internet-finance",
      "pr": 1,
      "required_agents": ["Rio"],
      "verdicts": {"Rio": "approve"},
      "final_state": "approved"
    }
  ],
  "cross_domain_pr": {
    "required_agents": ["Theseus", "Rio"],
    "final_state": "approved_or_feedback"
  },
  "prod_services_disabled": true,
  "proof_generated_at": "2026-05-29T00:00:00Z"
}

Staging proof does not satisfy the 24-hour production stability gate.
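Because the proof artifact is machine-readable, a reviewer can check it mechanically before reading logs. The sketch below uses a hypothetical function, `check_staging_proof`, and assumes the six single-domain PRs are all recorded under `test_prs`; the key names come from the artifact shape above.

```python
import json

REQUIRED_KEYS = {
    "phase", "environment", "teleo_infrastructure_sha", "decision_engine_sha",
    "feature_flags", "test_prs", "cross_domain_pr", "prod_services_disabled",
}


def check_staging_proof(raw):
    """Return a list of problems in a staging proof artifact; empty means it passed."""
    proof = json.loads(raw)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(proof))]
    if proof.get("environment") != "staging":
        problems.append("environment is not staging")
    if proof.get("prod_services_disabled") is not True:
        problems.append("production services were not confirmed disabled")
    # Assumption: each single-domain sandbox PR appears once in test_prs.
    if len(proof.get("test_prs", [])) < 6:
        problems.append("fewer than six single-domain PRs recorded")
    cross = proof.get("cross_domain_pr", {})
    if len(cross.get("required_agents", [])) != 2:
        problems.append("cross-domain PR does not list two required agents")
    return problems
```

Returning a list of problems, rather than raising on the first one, lets the operator see every gap in a single pass.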

Validation And Test Matrix

Unit tests:

test_agent_routing.py
- routes six primary ownership cases.
- routes broadened Astra cases: energy, robotics, advanced manufacturing.
- routes Leo meta cases: collective AI goals, teleohumanity strategy.
- routes Theseus AI systems cases.
- routes Rio x402 and internet finance cases.
- caps cross-domain to top 2 agents.
- has deterministic tie breaking.

Parser tests:

Existing test_eval_parse.py remains valid.
Add explicit verdict parse coverage for all six agent names.

Mocked eval integration tests:

One required agent calls one runner and posts one verdict.
Two required agents call two runners and post two verdicts.
One request-changes verdict blocks aggregate approval.
Transport failure reopens for retry.
Default Leo second-review does not run unless Leo is routed.

Batch tests:

If batching remains enabled, batch grouping must use route decisions, not stale DB domain.
If batching is disabled for Phase 1b, assert cross-domain and single-domain PRs still process individually.

Smoke commands:

python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml
python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py

If local pytest is unavailable, that is a tooling blocker for full local proof, not an implementation blocker.

CI/CD, Release, And Pre-Push Gate Contract

Pre-push required:

python3 -m pytest for the focused routing/eval test set.
python3 -m ruff check lib tests if dev deps are installed.
Manual scan that no secrets are printed or committed.

PR required:

Summary of routing rule.
Test output.
Known non-prod proof boundary.
Statement that production acceptance still requires staging or VPS proof.

Deploy required:

Exact reviewed SHA.
Staging proof bundle first.
Production service restart plan.
Rollback SHA.

Release phases:

Phase	Feature flag	Environment	Required proof
Local implementation	Enabled only in tests	Local	Unit and mocked eval tests.
Staging shadow	Enabled against sandbox repo	Staging clone or Crabbox-like box	Seven sandbox PR proof artifact.
Production shadow	Optional, no merge mutation if supported	Production	Route decisions logged without changing verdict path.
Production cutover	Enabled	Production	Real PR verdicts by required agents.
Production closure	Enabled	Production	24-hour stability plus all six agents represented.

Rollback:

Flip PHASE1B_AGENT_ROUTING_ENABLED=false.
Restart teleo-pipeline.service.
Confirm eval path returns to prior behavior.
If code rollback is required, deploy the previous exact SHA and restart service.
Keep proof artifact explaining why rollback occurred.

Pre-push commands:

python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
python3 -m ruff check lib tests
git diff --check

If dev dependencies are missing, install with:

python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml

Independent CLI Audit Contract

A reviewer should be able to run:

git diff --stat
git diff -- lib/agent_routing.py lib/domains.py lib/evaluate.py lib/llm.py tests/
python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py

The audit should confirm:

No direct production credentials are introduced.
decision-engine is the target in docs/config where Phase 1b needs it.
No old eval scripts are revived.
Default Leo second-review is not silently preserved for all PRs.
Multi-agent PRs require top 2 reviewer approvals.

Outside-The-Box Fix Paths

If identity-scored keyword routing is too noisy:

Use folder-first routing for strong path evidence and identity scoring only for ambiguous or cross-domain cases.
Add a cheap LLM classifier in shadow mode only, comparing against deterministic router decisions.
Require contributors/frontends to include an explicit domain or agent hint in PR metadata.

If live GitHub identity constraints block separate agent comments:

Keep one master bot account and agent-specific verdict tags.
Defer separate GitHub identities to Phase 2.

If staging VPS access is delayed:

Use a disposable Hetzner clone when available.
Use Crabbox or another remote test box for local dirty checkout proof.
Use a mocked local fake GitHub/Forgejo API server for the eval loop.

Maintenance Capture

Same-tranche maintenance that is justified now:

Extract route scoring into a dedicated module if lib/domains.py would become too broad.
Keep backward-compatible wrappers for existing agent_for_domain and detect_domain_from_diff until downstream callers are migrated.
Add tests around the existing bug-prone batch grouping surface.

Maintenance to avoid now:

Full Forgejo-to-GitHub daemon rewrite unless needed for the Phase 1b PR.
Dashboard redesign.
Contributor credit redesign beyond removing "Leo reviews every PR" assumptions.
Separate GitHub identities per agent.
Payment, wallet, Twitter, or decision-market work.

Parallelization And Fanout

Workstream	Classification	Owner	Notes
Agent identity router and tests	local_owner	Codex current turn	Core implementation surface. Do not fan out because it owns central route contract.
Eval layer integration and mocked tests	local_owner	Codex current turn	Needs tight coupling with router semantics.
Staging VPS clone proof	draft_gated	Fwaz or infrastructure owner	Requires VPS/provider access and secret quarantine.
GitHub identity model	draft_gated	Fwaz plus m3taversal	Deferred unless master bot account becomes unacceptable.
Dashboard/reporting polish	do_not_parallelize	Later	Avoid until route state contract is stable.

Workstream Sub-Spec: Agent Identity Router

Classification: local_owner

Owned files:

lib/agent_routing.py if created.
lib/domains.py compatibility wrappers.
tests/test_agent_routing.py.

Forbidden files:

lib/evaluate.py except imports needed for route type compatibility.
Any runtime secrets.
Any production config defaults outside route feature flags.

Binary done condition:

Pure route function returns expected required agents for every row in the proof matrix.
Tests prove deterministic top-2 behavior and fallback behavior.

Verification commands:

python3 -m pytest tests/test_agent_routing.py

Non-claims:

Does not prove PR comment posting.
Does not prove production target wiring.

Prompt-ready handoff:

implement phase 1b agent identity routing in teleo-infrastructure. own only route module and route tests. preserve compatibility wrappers. route decision must be pure, deterministic, evidence-bearing, and top-2 capped for cross-domain cases. do not touch production API or eval state transitions.

Workstream Sub-Spec: Eval Integration

Classification: local_owner

Owned files:

lib/evaluate.py
lib/llm.py
lib/eval_parse.py only if parser normalization is required.
tests/test_evaluate_agent_routing.py
tests/test_eval_parse.py

Forbidden files:

Old deprecated eval shell scripts.
Deploy scripts unless a feature flag must be exposed.
Dashboard UI except parser-compatible health checks.

Binary done condition:

With PHASE1B_AGENT_ROUTING_ENABLED=true, eval invokes only required reviewer agents.
With flag disabled, prior behavior remains available.
One request-changes verdict blocks aggregate approval.
All approve verdicts continue to existing approval path.

Verification commands:

python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py

Non-claims:

Does not prove live GitHub or VPS behavior.
Does not prove separate agent GitHub identities.

Prompt-ready handoff:

wire phase 1b routing into teleo-infrastructure eval path behind a feature flag. use required agents from the route result, run agent-specific reviews, aggregate verdicts, and preserve merge/feedback semantics. do not revive deprecated scripts or remove rollback path.

Workstream Sub-Spec: Staging Proof

Classification: draft_gated

Owned files and surfaces:

Staging VPS or disposable remote test box.
Sandbox decision-engine repo.
Staging secrets.
Machine-readable proof artifact.

Forbidden files and surfaces:

Production VPS services.
Production GitHub repo.
Production secrets.
Mainnet/payment/Twitter surfaces.

Binary done condition:

Six single-domain PRs and one cross-domain PR produce expected required-agent verdicts and final dispositions in staging.

Verification commands:

systemctl status teleo-pipeline
journalctl -u teleo-pipeline --since "1 hour ago"
sqlite3 /path/to/pipeline.db "select number, status, domain_agent, leo_verdict, domain_verdict from prs order by number desc limit 20;"
gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments
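For a machine-readable version of the sqlite readback above, the same query can be run from Python. This is a sketch under assumptions: it relies only on the column names that appear in the query, the helper names are hypothetical, and whether an empty verdict column really means "not yet reviewed" in the pipeline schema is not confirmed here.

```python
import sqlite3

# Same query as the verification command, without the trailing semicolon.
QUERY = (
    "select number, status, domain_agent, leo_verdict, domain_verdict "
    "from prs order by number desc limit 20"
)


def read_recent_prs(conn: sqlite3.Connection) -> list[dict]:
    """Run the readback query and return rows as plain dicts."""
    conn.row_factory = sqlite3.Row  # allows access by column name
    return [dict(row) for row in conn.execute(QUERY)]


def rows_missing_verdicts(rows: list[dict]) -> list[int]:
    """Return PR numbers where either verdict column is still empty."""
    return [
        r["number"]
        for r in rows
        if not r["leo_verdict"] or not r["domain_verdict"]
    ]
```

An operator could open the staging database with `sqlite3.connect(...)`, pass the connection to `read_recent_prs`, and store the resulting rows alongside the proof artifact.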

Non-claims:

Does not prove production 24-hour stability.

Prompt-ready handoff:

create a quarantined staging proof for phase 1b. clone or provision a disposable server, disable production services and secrets before starting pipeline, point to a sandbox decision-engine repo, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not mutate production.

Worker-ready ticket for later staging proof:

title: phase 1b staging proof on cloned vps
owned surfaces: staging vps, sandbox decision-engine repo, staging secrets, proof artifact
forbidden surfaces: production vps services, production github repo, production secrets
done condition: six single-domain prs plus one cross-domain pr produce expected required-agent verdicts and final dispositions
verification commands: systemd status readback, pipeline log scrape, sqlite route query, github pr comment readback
non-claims: does not prove 24h production stability
preferred executor: human/fwaz with codex support
handoff: create staging clone, disable prod services, inject sandbox config, run phase 1b proof script, save machine-readable proof

Acceptance Criteria

Local PR acceptance:

Focused tests pass.
Router returns correct single-agent routes.
Router returns top-2 required agents for cross-domain cases.
Eval layer invokes only required reviewer agents.
Verdict aggregation handles four cases: all approve, at least one request-changes, transport failure, and missing verdict.
Existing verdict format remains parseable.
No production readiness claim is made.
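The aggregation cases in the list above can be sketched as one pure function. The blocking rule for request-changes and the all-approve path follow this spec. The `Verdict` enum, the outcome strings, and the choice to treat transport failure and missing verdict as an `incomplete` outcome that never approves are assumptions, as is letting request-changes take precedence over incomplete.

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    TRANSPORT_FAILURE = "transport_failure"


def aggregate(required_agents, verdicts):
    """Combine per-agent verdicts into one outcome string.

    `verdicts` maps agent name to a Verdict. A required agent absent from
    the map counts as a missing verdict. Verdicts from agents that are not
    required are ignored.
    """
    if not required_agents:
        return "incomplete"  # nothing reviewed, so nothing can be approved
    results = [verdicts.get(agent) for agent in required_agents]
    # One request-changes verdict blocks aggregate approval.
    if Verdict.REQUEST_CHANGES in results:
        return "request_changes"
    # Assumed handling: failures and gaps never count as approval.
    if any(r is None or r is Verdict.TRANSPORT_FAILURE for r in results):
        return "incomplete"
    return "approve"
```

Keeping the function free of I/O makes each of the four cases testable with mocked verdict maps, which matches the local-only scope of this acceptance list.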

Staging acceptance:

Staging environment cannot mutate production.
Six single-domain sandbox PRs complete.
One cross-domain sandbox PR completes.
Required reviewer agents match proof matrix.
Proof artifact is retained.
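One possible shape for the retained proof artifact is a JSON document that compares expected and observed required agents per sandbox PR. The spec does not define the artifact format, so every field name, the PR labels, and the agent names in the test data are hypothetical.

```python
import json


def build_proof_artifact(sha, expected, observed):
    """Compare expected and observed required agents per sandbox PR.

    `expected` is the proof matrix and `observed` is what staging produced.
    Both map a PR label to a list of agent names.
    """
    cases = []
    for label in sorted(expected):
        want = sorted(expected[label])
        got = sorted(observed.get(label, []))
        cases.append(
            {"pr": label, "expected": want, "observed": got, "match": want == got}
        )
    return {
        "reviewed_sha": sha,
        "cases": cases,
        # An empty matrix proves nothing, so it does not count as a pass.
        "all_match": bool(cases) and all(c["match"] for c in cases),
    }


def dump_artifact(artifact) -> str:
    """Serialize with sorted keys so reruns produce stable output."""
    return json.dumps(artifact, sort_keys=True, indent=2)
```

Recording the reviewed SHA in the artifact ties the staging result to the exact commit that the production exit criteria later require.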

Production exit:

Exact reviewed SHA deployed.
All six agents produce at least one verdict in their domain.
At least one cross-domain PR proves top-2 review behavior.
Pipeline stable for 24 hours.

Readiness And Claim Boundaries

Allowed claims after local implementation:

"Route logic is implemented and locally tested."
"Mocked eval integration proves required-agent invocation and aggregation."
"The implementation PR is ready for staging proof."

Forbidden claims after local implementation:

"Phase 1b is complete."
"Production is ready."
"All six agents have demonstrated live review cycles."
"The VPS is safely updated."

Allowed claims after staging proof:

"Phase 1b passed sandbox staging proof."
"The exact SHA is eligible for production cutover review."

Forbidden claims after staging proof:

"Production is stable."
"Live decision-engine PRs are proven."

Allowed claims after production 24-hour proof:

"Phase 1b production exit criteria are met."

Spec Quality Self-Audit

Required execution-grade headings present:

Current Implementation Audit: present.
Goal-Vs-Repo-Truth Diff: present.
Completion Percent And Remaining Delta: present.
Closure, Endpoint, And Deployment Truth: present.
Critical Assumptions And Invalidators: present.
State And Truth Contract: present.
Measurement Contract: present.
Backend Work Required: present.
Frontend Work Required: present.
Expected Runtime And User-Visible Behavior: present.
Validation And Test Matrix: present.
CI/CD, Release, And Pre-Push Gate Contract: present.
Independent CLI Audit Contract: present.
Outside-The-Box Fix Paths: present.
Maintenance Capture: present.
Parallelization And Fanout: present.

Additional spec-of-spec coverage:

Product Outcome Contract: present.
Non-Goals: present.
Program Decomposition: present.
Priority Matrix: present.
Score-To-100 Closure Plan: present.
Workstream sub-specs: present.
Staging Proof Contract: present.
Rollback contract: present.

Known incompleteness:

This spec cannot name the exact production deploy command until Fwaz or a VPS readback confirms it.
This spec cannot name the exact sandbox repo until the operator creates or selects it.
This spec cannot prove whether production daemon code exactly matches local teleo-infrastructure until VPS readback exists.

Assistant-Added Caveats

This spec intentionally expands B1/B2 from folder-domain routing to identity-scored agent routing because m3taversal clarified that agent identities should route and folders are only signals. That is the right product interpretation, but it increases implementation scope versus the original simple path classifier.

This spec does not claim production readiness without staging or VPS proof.
