Phase 1b Agent Routing Spec
Created: 2026-05-29
Status: active draft
Owner: Epimetheus pipeline implementation, with m3taversal as scope owner and Fwaz as VPS/runtime owner
Product Outcome Contract
Phase 1b makes the knowledge-base evaluation engine behave like a six-agent review system instead of a generic triage stack.
When a contribution changes the decision-engine KB, the pipeline must decide which Hermes agent identity is responsible for judging that change, run the required review or reviews, post agent-specific verdicts, and then let the existing merge or feedback machinery continue.
The user-visible outcome is not a new frontend. It is a PR review trail showing that the right agent or agents reviewed the right KB mutation.
Non-Goals
This spec does not implement:
- Twitter/X posting.
- x402, wallet, payment, or funding flows.
- Decision markets, agent bidding, stake-weighted quorum, or prediction-market review.
- Full general user-input routing outside the PR evaluation path.
- Separate GitHub accounts for each agent.
- A full Forgejo-to-GitHub daemon rewrite beyond what Phase 1b needs.
- A dashboard redesign.
- Production deployment without staging or VPS proof.
Program Decomposition
This is a medium-sized control-plane change with five execution lanes:
1. Agent identity routing.
2. Eval pipeline integration.
3. GitHub identity and bot comment posture.
4. Reporting and contributor compatibility.
5. Staging and production proof.
The implementation can remain in one PR only if lanes 1 through 4 are tightly tested and the staging proof remains a separate operator task. If the eval integration diff grows beyond the files named in this spec, split into:
- PR 1: route contract and tests.
- PR 2: eval integration and mocked state tests.
- PR 3: GitHub/comment idempotency and reporting compatibility.
- PR 4 or operator runbook: staging proof artifacts.
Child specs:
- docs/phase1b/agent-identity-router-spec.md
- docs/phase1b/eval-pipeline-integration-spec.md
- docs/phase1b/github-identity-bot-posture-spec.md
- docs/phase1b/reporting-contributor-compatibility-spec.md
- docs/phase1b/staging-proof-spec.md
Priority Matrix
| Rank | Workstream | Recurrence | Value | Readiness | Current state | Issue/spec mapping | Thread-claimed status | Verified implementation/proof status | Recommended next move |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Canonical repo and eval target | Repeated confusion between teleo-codex, teleo-kb, and decision-engine. | Critical | Ready now | Confirmed by user: decision-engine. Some code still has Forgejo/teleo-codex defaults. | This spec, handoff/phase1-step3-script-migration.md | Clarified in chat. | Partially reflected in repo; not unified in daemon modules. | Make Phase 1b route/proof explicitly target decision-engine. |
| 2 | Agent identity routing | Repeated confusion between domain folders and agent ownership. | Critical | Ready now | Existing lib/domains.py is folder-first. | This spec | m3taversal clarified identity-first routing. | Initial local patch is insufficient. | Replace with identity-scored route contract. |
| 3 | Cross-domain review | Raised as scope expansion during clarification. | High | Ready now | Not implemented. | This spec | m3taversal confirmed cap at top 2. | No code proof. | Add top-2 required reviewer aggregation. |
| 4 | Single master bot account | GitHub bot/PAT issue was noted as blocker. | High | Ready now | Phase 1 handoff already documents single livingIPbot posture. | handoff/phase1-step3-script-migration.md | Separate identities ideal, likely too complex. | Handoff-only. | Use master bot comments with agent verdict tags. |
| 5 | Staging proof | User asked how to test without mutating prod VPS. | Critical for production | Draft gated | Needs VPS clone or Crabbox/staging access. | This spec | Proposed, not executed. | No proof. | Run after code PR passes local checks. |
Goal
Implement Phase 1b for the decision-engine knowledge base: pipeline-v2 evaluates each incoming KB pull request by routing it to the Hermes agent identity that owns the relevant domain of judgment.
The implementation lives in teleo-infrastructure. The canonical KB repo for this phase is living-ip/decision-engine.
Phase 1b is complete only when single-domain and cross-domain PRs are routed to the expected required reviewer agents, verdicts are posted in the existing VERDICT:AGENT:* format, and the merge or feedback path continues from those verdicts.
User-Journey Contract
Contributor or agent flow:
- A contributor or agent opens a PR against living-ip/decision-engine.
- The PR changes one or more KB files.
- Pipeline-v2 discovers the PR and fetches its diff.
- The router scores Hermes agent identities from the diff, file paths, branch metadata, and eventually PR metadata.
- The pipeline runs the required reviewer agents.
- The master bot posts verdict comments that clearly name the agent identity in VERDICT:AGENT:* tags.
- If all required reviewers approve, the existing approval and merge path continues.
- If any required reviewer requests changes, the existing feedback/retry path continues.
Operator flow:
- Operator can inspect a PR and see why each agent was selected.
- Operator can inspect pipeline logs or audit rows and see route scores, required agents, verdicts, and aggregate result.
- Operator can distinguish local proof, staging proof, and production proof.
Existing-Spec Inventory
| Existing doc | Relevance | Decision | Reason |
|---|---|---|---|
| handoff/phase1-step3-script-migration.md | Establishes the Phase 1 move from Forgejo teleo-codex toward GitHub living-ip/decision-engine, and documents the single master bot account posture. | Reuse as context. | It owns migration history, not the Phase 1b routing implementation. |
| handoff/deprecated/eval-scripts.md | Confirms old eval dispatcher/worker scripts are dead and lib/evaluate.py::evaluate_cycle owns live eval behavior. | Reuse as context. | It prevents work from targeting retired scripts. |
| docs/ARCHITECTURE.md | Describes pipeline-v2 stages, SQLite state, Forgejo-era runtime topology, and existing evaluate/merge loops. | Reuse as context. | It is broader architecture; this spec is a Phase 1b delta spec. |
| docs/multi-model-eval-architecture.md | Documents the prior Leo-first plus second-model evaluation theory. | Supersede for Phase 1b eval routing only. | Phase 1b now routes to domain-owner agent identities, with capped top-2 cross-domain review. The old doc remains useful for later calibration. |
| docs/queue.md | Mentions domain evolution such as ai-alignment to ai-systems. | Reuse as signal. | It supports the identity-scored router rather than folder-only routing. |
Current Implementation Audit
Current relevant implementation state:
- teleo-pipeline.py runs pipeline-v2 as a single async daemon.
- lib/evaluate.py::evaluate_cycle is the active eval loop.
- lib/evaluate.py::evaluate_pr currently detects a domain, runs a domain review, then runs Leo review for non-LIGHT PRs.
- lib/domains.py contains a folder-first DOMAIN_AGENT_MAP.
- lib/llm.py contains prompt templates and run_domain_review, run_batch_domain_review, and run_leo_review.
- lib/eval_parse.py::parse_verdict parses VERDICT:AGENT:APPROVE and VERDICT:AGENT:REQUEST_CHANGES.
- pipeline-health-check.py is GitHub-oriented and points at living-ip/decision-engine.
- lib/forgejo.py, lib/evaluate.py, and lib/merge.py still use Forgejo-named abstractions as the primary API surface.
- Per-agent GitHub identity is deferred; Phase 1 uses one master bot account.
Fwaz clarification on 2026-05-29:
- Separate GitHub identities are still ideal and blocked on GitHub/PAT setup; Phase 1b must not require them to land the routed-eval path.
- Current production update behavior is pull -> services recognize pull -> edit on VPS -> PR to Leo; this is useful context, not the desired long-term control model.
- New desired rule is no direct production self-upgrades: agents open PRs, and production deploys exact reviewed/tested SHAs approved and signed by Leo.
- Crabbox is acceptable as the long-term disposable staging/test-box direction, while a production-like clone remains the highest-fidelity proof for systemd/VPS paths.
This branch implementation now includes:
- lib/agent_routing.py with a pure identity-scored route contract.
- PHASE1B_AGENT_ROUTING_ENABLED, defaulting off.
- A Phase 1b eval path that runs routed required agents and disables stale domain batching under the flag.
- Focused tests for six-agent routing, top-2 cross-domain routing, verdict parsing, and mocked eval aggregation.
Goal-Vs-Repo-Truth Diff
Desired Phase 1b behavior:
- Route PRs against decision-engine, not teleo-codex.
- Classify by agent identity ownership, not only by folder path.
- Run exactly the required reviewer agents.
- Use one master bot account if separate GitHub identities are too complex.
- Preserve the existing verdict comment format.
- Preserve existing merge and feedback behavior.
- Support cross-domain PRs by requiring the top 2 routed agents.
Pre-implementation repo truth:
- Pipeline eval still has a two-stage review shape: domain review plus Leo review.
- Folder-domain mapping exists, but agent identity scoring does not.
- Cross-domain review is not implemented as multiple required reviewer agents.
- Batch eval can group rows before fetching diffs, which risks routing unclassified rows through general.
- GitHub migration is partial: some scripts target GitHub decision-engine, but live daemon modules still have Forgejo-era names and assumptions.
Completion Percent And Remaining Delta
Estimated implementation progress on this branch:
- B1 classifier foundation: 100 percent locally, pending staging calibration.
- B2 routing layer: 75 percent locally behind a default-off feature flag.
- Cross-domain top-2 review: 75 percent locally through mocked eval proof.
- Local proof suite: 85 percent for router/eval/parser scope.
- Staging or VPS proof: 0 percent.
Remaining delta:
- Decide whether the production Phase 1b transport stays Forgejo-first for cutover or switches directly to GitHub decision-engine before staging.
- Update reporting/health compatibility beyond review_records if staging shows false readiness.
- Prove against staging before production.
- Deploy only an exact reviewed/tested SHA after Leo signoff.
Closure, Endpoint, And Deployment Truth
Local closure means:
- Focused tests pass in teleo-infrastructure.
- A PR exists with the Phase 1b routing implementation and proof notes.
Staging closure means:
- A cloned or disposable staging runtime is pointed at a sandbox decision-engine.
- Six single-domain sandbox PRs and one cross-domain sandbox PR complete the expected eval path.
- A machine-readable proof artifact captures routes, required agents, verdicts, status transitions, git SHAs, and logs.
Production closure means:
- The exact reviewed SHA is deployed to the production VPS.
- Production pipeline runs real decision-engine PRs through Phase 1b routing.
- All six agents have completed at least one live review cycle.
- Pipeline remains stable for at least 24 hours after cutover.
Without VPS or staging access, only local closure can be claimed.
Critical Assumptions And Invalidators
Assumptions:
- decision-engine is the canonical KB repo for Phase 1b.
- The active eval implementation is teleo-infrastructure/lib/evaluate.py, not retired shell scripts.
- One master bot account is acceptable for Phase 1b verdict comments.
- Required reviewer identity is encoded in the verdict tag, not necessarily in the GitHub account identity.
- Agent state files in decision-engine/agents/{agent} are the right identity context source when present.
Invalidators:
- Production pipeline is still wired to a different canonical repo.
- The VPS runs code not represented by current teleo-infrastructure.
- Branch protection requires separate GitHub identities before comments or reviews count.
- Agent identity files are absent or materially different on the VPS.
- Cross-domain review must include more than top 2 reviewers.
State And Truth Contract
The routing implementation must record or expose:
- PR number.
- Primary agent.
- Required agents.
- Route kind: single, multi, fallback, or escalated.
- Route scores by agent.
- Route evidence: path, branch, title, diff keyword, or fallback.
- Verdict per required agent.
- Aggregate result.
- Failure reason for missing or unparseable verdicts.
This can be stored first in audit log details and test artifacts. A DB schema migration is optional for Phase 1b unless downstream dashboards require queryable route fields.
Route Decision Schema
The route decision should be serializable without importing Python classes. Use this JSON shape in audit rows and proof artifacts:
{
  "pr": 123,
  "repo": "living-ip/decision-engine",
  "route_version": "phase1b-v1",
  "route_kind": "single",
  "primary_agent": "Rio",
  "required_agents": ["Rio"],
  "scores": {
    "Leo": 0,
    "Theseus": 1,
    "Rio": 9,
    "Vida": 0,
    "Clay": 0,
    "Astra": 0
  },
  "evidence": [
    {
      "agent": "Rio",
      "signal": "path",
      "weight": 5,
      "value": "domains/internet-finance/example.md"
    }
  ],
  "fallback": false
}
route_kind values:
- single: one required reviewer.
- multi: two required reviewers from cross-domain scoring.
- fallback: no confident route, Leo required.
- escalated: route exceeded simple review bounds and was capped by policy.
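As a rough illustration of how an audit row in this shape could be checked before it is written, the sketch below validates a decoded route decision. The function name `validate_route_decision` and the specific rules (primary agent must be a required agent, scores must cover all six agents) are assumptions made for the example, not part of the spec.

```python
import json

AGENTS = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")
ROUTE_KINDS = ("single", "multi", "fallback", "escalated")


def validate_route_decision(decision: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    problems = []
    if decision.get("route_kind") not in ROUTE_KINDS:
        problems.append("unknown route_kind")
    required = decision.get("required_agents") or []
    if not 1 <= len(required) <= 2:
        problems.append("required_agents must name one or two agents")
    if any(agent not in AGENTS for agent in required):
        problems.append("required_agents contains an unknown agent")
    # Assumption: the primary agent is always one of the required agents.
    if decision.get("primary_agent") not in required:
        problems.append("primary_agent is not a required agent")
    # Assumption: every audit row carries a score for all six agents.
    if set(decision.get("scores", {})) != set(AGENTS):
        problems.append("scores must cover all six agents")
    return problems


example = json.loads("""
{"pr": 123, "repo": "living-ip/decision-engine", "route_version": "phase1b-v1",
 "route_kind": "single", "primary_agent": "Rio", "required_agents": ["Rio"],
 "scores": {"Leo": 0, "Theseus": 1, "Rio": 9, "Vida": 0, "Clay": 0, "Astra": 0},
 "evidence": [{"agent": "Rio", "signal": "path", "weight": 5,
               "value": "domains/internet-finance/example.md"}],
 "fallback": false}
""")
print(validate_route_decision(example))  # []
```

Returning a list of problems rather than raising lets the caller log every issue in one audit entry.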
Verdict State Schema
Aggregate review state should be serializable as:
{
  "pr": 123,
  "required_agents": ["Theseus", "Rio"],
  "agent_verdicts": {
    "Theseus": "approve",
    "Rio": "request_changes"
  },
  "aggregate_verdict": "request_changes",
  "blocking_agents": ["Rio"],
  "missing_agents": [],
  "unparseable_agents": [],
  "transport_failed_agents": []
}
Aggregate states:
- approve: all required agents approved.
- request_changes: at least one required agent requested changes or produced unparseable content.
- retry: at least one required review failed for transport reasons and should not burn the PR as a substantive rejection.
Measurement Contract
Minimum metrics:
- route_single_count
- route_multi_count
- route_escalated_count
- review_required_agent_count
- review_missing_verdict_count
- review_request_changes_count
- review_approve_count
- route_fallback_count
Minimum proof matrix:
| Case | Expected route |
|---|---|
| grand strategy PR | Leo |
| ai systems or ai alignment PR | Theseus |
| internet finance or x402 PR | Rio |
| health PR | Vida |
| entertainment PR | Clay |
| space, robotics, energy, or advanced manufacturing PR | Astra |
| ai plus x402 PR | Theseus and Rio |
| collective ai goals PR | Leo and Theseus, if both score in top 2 |
Score-To-100 Closure Plan
Preparedness score before implementation: 35/100.
| Score band | Closure move | Evidence that moves score |
|---|---|---|
| 35 -> 50 | Route contract implemented and unit-tested. | test_agent_routing.py proves six single-agent routes, broadened identity ownership, top-2 cross-domain routes, and fallback behavior. |
| 50 -> 65 | Eval integration mocked locally. | Mocked eval tests prove required agents are invoked, default Leo review is removed, and aggregate verdicts drive approve/request-changes behavior. |
| 65 -> 75 | API/comment compatibility proven locally. | Tests prove all six verdict tags parse and master-bot comment bodies preserve existing parser expectations. |
| 75 -> 85 | Staging clone or disposable test box runs sandbox PR proof. | Six single-domain sandbox PRs plus one cross-domain sandbox PR produce expected comments and state transitions. |
| 85 -> 95 | Production deploy of exact reviewed SHA. | VPS deploy log, service restart readback, and route/proof artifact for first real PRs. |
| 95 -> 100 | 24-hour production stability. | 24-hour daemon readback with no duplicate comments, no stuck review rows, no production fallback spike, and all six agents represented in verdict history. |
The implementation PR can be merged at 65-75 if reviewers accept staging as a deploy gate. It cannot claim Phase 1b complete below 100.
Backend Work Required
1. Agent identity router
Create or refactor into lib/agent_routing.py unless the existing lib/domains.py remains clearly small enough.
Define:
AgentRoute(
    primary_agent: str,
    required_agents: tuple[str, ...],
    route_kind: str,
    scores: dict[str, int],
    evidence: list[dict],
)
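The contract above could be expressed as a frozen dataclass, which keeps route values immutable once built. This is a minimal sketch, not the module's actual code: the `to_audit_dict` method and the rule that `fallback` is derived from `route_kind` are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRoute:
    primary_agent: str
    required_agents: tuple[str, ...]
    route_kind: str
    scores: dict[str, int] = field(default_factory=dict)
    evidence: list[dict] = field(default_factory=list)

    def to_audit_dict(self, pr: int, repo: str) -> dict:
        """Flatten the route into a JSON-friendly audit shape (hypothetical helper)."""
        return {
            "pr": pr,
            "repo": repo,
            "route_version": "phase1b-v1",
            "route_kind": self.route_kind,
            "primary_agent": self.primary_agent,
            "required_agents": list(self.required_agents),
            "scores": dict(self.scores),
            "evidence": list(self.evidence),
            # Assumption: the fallback field mirrors the route kind.
            "fallback": self.route_kind == "fallback",
        }


route = AgentRoute("Rio", ("Rio",), "single", {"Rio": 9, "Theseus": 1})
audit_row = route.to_audit_dict(123, "living-ip/decision-engine")
print(json.dumps(audit_row, indent=2))
```

Because the audit dict contains only built-in types, consumers can read it without importing the Python class.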
Router signals:
- Path signals from domains/, entities/, core/, foundations/, and agents/.
- Branch prefix signals such as rio/, theseus/, astra/, leo/.
- Keyword signals from path, filename, branch, PR title/body when available, and capped diff text.
- Agent identity ownership map.
Agent identity ownership map:
| Agent | Owns |
|---|---|
| Leo | grand strategy, teleohumanity goals, collective AI self-understanding, meta strategy, nested collective intelligence concepts |
| Theseus | AI systems, AI alignment, AI governance, agent systems, safety, evaluation |
| Rio | internet finance, living capital, markets, crypto, futarchy, x402, payments, capital formation |
| Vida | health, healthcare, medicine, prevention, clinical systems, mental health, biohealth |
| Clay | entertainment, media, culture, IP, fandom, narrative, consumer attention |
| Astra | space development, robotics, energy, advanced manufacturing, physical frontier infrastructure |
Routing rules:
- If only one agent crosses the threshold, require that agent.
- If more than one agent crosses the threshold, require the top 2 agents.
- If no agent crosses the threshold, fall back to Leo with route kind fallback.
- Rank agents by score, breaking ties by a deterministic configured order.
Implementation constraints:
- The router must be deterministic.
- The router must be pure and side-effect free.
- Route scores must be explainable through evidence entries.
- Folder paths should be strong evidence, not the whole classifier.
- Keyword scoring must not require paid inference.
- LLM classification may be added later only as shadow-mode evidence.
Recommended scoring starter:
| Signal | Weight |
|---|---|
| Path directly under known primary ownership area | 8 |
| Path under broadened ownership area | 6 |
| Branch prefix matches agent | 4 |
| Filename keyword matches ownership | 3 |
| Diff keyword matches ownership | 1 per capped hit |
| PR title/body keyword matches ownership, if available | 2 |
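To make the weights concrete, here is a small sketch that scores only two of the signals (primary path and branch prefix) and records one evidence entry per hit. The weights come from the starter table; the `PRIMARY_PATHS` lookup and the function name `score_route` are hypothetical, and a real router would likely cap repeated hits.

```python
AGENTS = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")

WEIGHT_PRIMARY_PATH = 8
WEIGHT_BRANCH_PREFIX = 4
# Hypothetical lookup tables, for illustration only.
PRIMARY_PATHS = {"domains/internet-finance/": "Rio", "domains/health/": "Vida"}
BRANCH_PREFIXES = {"rio/": "Rio", "theseus/": "Theseus", "astra/": "Astra", "leo/": "Leo"}


def score_route(paths: list[str], branch: str) -> tuple[dict[str, int], list[dict]]:
    """Pure scoring: same inputs always give the same scores and evidence."""
    scores = {agent: 0 for agent in AGENTS}
    evidence: list[dict] = []

    def add(agent: str, signal: str, weight: int, value: str) -> None:
        scores[agent] += weight
        evidence.append({"agent": agent, "signal": signal, "weight": weight, "value": value})

    for path in sorted(paths):  # sorted so evidence order is deterministic
        for prefix, agent in PRIMARY_PATHS.items():
            if path.startswith(prefix):
                add(agent, "path", WEIGHT_PRIMARY_PATH, path)
    for prefix, agent in BRANCH_PREFIXES.items():
        if branch.startswith(prefix):
            add(agent, "branch", WEIGHT_BRANCH_PREFIX, branch)
    return scores, evidence


scores, evidence = score_route(["domains/internet-finance/example.md"], "rio/x402-notes")
print(scores["Rio"], len(evidence))  # 12 2
```

Every point in a score is backed by an evidence entry, which is what keeps the route explainable.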
Top-2 selection:
- Include the highest-scoring agent.
- Include a second agent only if its score is at least 40 percent of the first score and at least the minimum threshold.
- Minimum threshold starts at 4.
- Never include more than two required agents in Phase 1b.
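The selection rules can be sketched as a pure function over the score map. The tie-break order in `AGENT_ORDER` and the reading of "crosses the threshold" as "greater than or equal to" are assumptions; the function name is hypothetical.

```python
AGENT_ORDER = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")  # assumed tie-break order
MIN_THRESHOLD = 4


def select_required_agents(scores: dict[str, int]) -> tuple[tuple[str, ...], str]:
    """Return (required_agents, route_kind) with at most two required agents."""
    ranked = sorted(AGENT_ORDER, key=lambda a: (-scores.get(a, 0), AGENT_ORDER.index(a)))
    first, second = ranked[0], ranked[1]
    first_score, second_score = scores.get(first, 0), scores.get(second, 0)
    if first_score < MIN_THRESHOLD:
        return ("Leo",), "fallback"
    # Integer form of "at least 40 percent of the first score".
    if second_score >= MIN_THRESHOLD and second_score * 10 >= first_score * 4:
        return (first, second), "multi"
    return (first,), "single"


print(select_required_agents({"Rio": 9, "Theseus": 1}))   # (('Rio',), 'single')
print(select_required_agents({"Theseus": 8, "Rio": 6}))   # (('Theseus', 'Rio'), 'multi')
print(select_required_agents({}))                         # (('Leo',), 'fallback')
```

Sorting on score and then configured order means a three-way tie still yields the same two reviewers on every run.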
2. Eval layer integration
Modify lib/evaluate.py:
- Fetch PR diff.
- Build route from diff and branch.
- Store or audit route decision.
- Run required reviewer agents.
- Aggregate verdicts.
- Remove default Leo second-review for normal single-agent PRs.
- Keep existing bypasses for musings and reweave unless m3taversal changes policy.
- Revisit batch eval: disable batching for Phase 1b or classify before batching.
Implementation sequence:
- Add pure route builder and tests.
- Add review aggregation helper and tests.
- Add run_agent_review while leaving existing run_domain_review and run_leo_review intact.
- Switch individual evaluate_pr path to the new router behind a feature flag such as PHASE1B_AGENT_ROUTING_ENABLED.
- Disable batch domain eval when the feature flag is enabled unless route-aware batching is implemented in the same PR.
- Remove or bypass the default Leo second-review when the feature flag is enabled.
- Preserve old behavior when the feature flag is disabled.
Feature flag requirement:
PHASE1B_AGENT_ROUTING_ENABLED=false by default until staging proof exists.
The PR may run tests against the enabled behavior without changing the production default.
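A default-off flag of this kind could be read as in the sketch below, assuming the flag arrives as an environment variable. The helper name and the accepted truthy spellings are assumptions; passing a mapping explicitly lets tests enable the flag without touching the process environment.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "PHASE1B_AGENT_ROUTING_ENABLED"
TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings


def phase1b_routing_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the flag is explicitly set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(FLAG_NAME, "false").strip().lower() in TRUTHY


print(phase1b_routing_enabled({}))                     # False
print(phase1b_routing_enabled({FLAG_NAME: "true"}))    # True
```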
3. Agent review runner
Modify or add in lib/llm.py:
async def run_agent_review(diff: str, files: str, agent: str, route: AgentRoute) -> tuple[str | None, dict]:
    ...
Prompt must include:
- Agent identity context when available.
- Route evidence.
- Existing eval criteria.
- Required verdict tag for that exact agent.
Continue using one master bot account for comments. The bot comment body must identify the routed agent via the verdict tag.
Agent context lookup order:
- Runtime-configured KB worktree path, expected to point at decision-engine.
- Existing config.MAIN_WORKTREE if production still uses that convention.
- Explicit test fixture path in unit tests.
Context files:
- agents/{agent}/identity.md
- agents/{agent}/beliefs.md
- agents/{agent}/reasoning.md
- agents/{agent}/skills.md
Missing context files:
- Log a warning.
- Include an audit evidence entry.
- Continue with the generic agent prompt.
- Do not crash the eval cycle.
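The missing-file behavior above might look like the following sketch, which reads whatever context files exist and turns each missing one into a warning plus an evidence entry. The function name, the logger name, the `missing_context` signal label, and the lowercase agent directory are all assumptions.

```python
import logging
from pathlib import Path

CONTEXT_FILES = ("identity.md", "beliefs.md", "reasoning.md", "skills.md")
logger = logging.getLogger("phase1b.context")  # hypothetical logger name


def load_agent_context(kb_root: Path, agent: str) -> tuple[str, list[dict]]:
    """Return (context_text, evidence); never raises for missing files."""
    agent_dir = kb_root / "agents" / agent.lower()  # assumption: lowercase directories
    sections: list[str] = []
    evidence: list[dict] = []
    for name in CONTEXT_FILES:
        path = agent_dir / name
        try:
            sections.append(path.read_text(encoding="utf-8"))
        except OSError:
            logger.warning("agent context file unavailable: %s", path)
            evidence.append({
                "agent": agent,
                "signal": "missing_context",
                "value": f"agents/{agent.lower()}/{name}",
            })
    # An empty string tells the caller to use the generic agent prompt.
    return "\n\n".join(sections), evidence
```

Catching `OSError` covers both a missing file and a missing agent directory, so the eval cycle continues either way.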
4. Verdict aggregation
Add helper:
aggregate_agent_verdicts(required_agents, reviews) -> AggregateVerdict
Rules:
- All required agents approve: approved.
- Any required agent requests changes: request changes.
- Transport failure: reopen for retry.
- Missing or unparseable verdict: request changes unless transport failure is explicit.
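A minimal sketch of these rules follows. It assumes `reviews` maps each agent name to an outcome string, and that an explicit transport failure takes precedence over a request-changes verdict from another agent; the spec does not settle that precedence, so treat it as an open choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregateVerdict:
    aggregate_verdict: str
    blocking_agents: list
    missing_agents: list
    unparseable_agents: list
    transport_failed_agents: list


def aggregate_agent_verdicts(required_agents, reviews) -> AggregateVerdict:
    """reviews maps agent -> 'approve', 'request_changes', or 'transport_failed'.
    Any other string counts as unparseable; an absent agent counts as missing."""
    if not required_agents:
        raise ValueError("at least one required agent is needed")
    blocking, missing, unparseable, failed = [], [], [], []
    for agent in required_agents:
        outcome = reviews.get(agent)
        if outcome is None:
            missing.append(agent)
        elif outcome == "transport_failed":
            failed.append(agent)
        elif outcome == "request_changes":
            blocking.append(agent)
        elif outcome != "approve":
            unparseable.append(agent)
    if failed:  # assumption: retry wins over a substantive rejection
        verdict = "retry"
    elif blocking or missing or unparseable:
        verdict = "request_changes"
    else:
        verdict = "approve"
    return AggregateVerdict(verdict, blocking, missing, unparseable, failed)


result = aggregate_agent_verdicts(
    ["Theseus", "Rio"], {"Theseus": "approve", "Rio": "request_changes"}
)
print(result.aggregate_verdict, result.blocking_agents)  # request_changes ['Rio']
```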
Comment format:
Preferred for one required agent:
<review text>
<!-- VERDICT:RIO:APPROVE -->
Preferred for two required agents:
## Theseus review
<review text>
<!-- VERDICT:THESEUS:APPROVE -->
## Rio review
<review text>
<!-- VERDICT:RIO:REQUEST_CHANGES -->
Two separate comments are acceptable if simpler and less risky for existing parsers.
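For illustration, the comment shapes above can be rendered and read back as in the sketch below. Both helpers and the regular expression are hypothetical stand-ins; the real parser is lib/eval_parse.py::parse_verdict, whose implementation is not shown in this spec.

```python
import re

# Hypothetical pattern for the HTML-comment verdict tag shown above.
VERDICT_TAG = re.compile(r"<!--\s*VERDICT:([A-Z]+):(APPROVE|REQUEST_CHANGES)\s*-->")


def render_review_comment(reviews: list[tuple[str, str, str]]) -> str:
    """reviews is a list of (agent, review_text, verdict) tuples."""
    parts = []
    for agent, text, verdict in reviews:
        if len(reviews) > 1:
            parts.append(f"## {agent} review")  # section header only for multi-agent comments
        parts.append(text)
        parts.append(f"<!-- VERDICT:{agent.upper()}:{verdict.upper()} -->")
    return "\n".join(parts)


def extract_verdicts(body: str) -> dict[str, str]:
    """Map upper-case agent name to verdict for every tag found in the body."""
    return {agent: verdict for agent, verdict in VERDICT_TAG.findall(body)}


body = render_review_comment([
    ("Theseus", "Alignment claims are sourced.", "approve"),
    ("Rio", "Fee model needs a citation.", "request_changes"),
])
print(extract_verdicts(body))  # {'THESEUS': 'APPROVE', 'RIO': 'REQUEST_CHANGES'}
```

A round-trip test like this is one way to confirm that a combined two-agent comment does not confuse the existing parser before choosing it over two separate comments.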
5. Contributor and dashboard compatibility
Audit and update:
- lib/contributor.py assumptions that Leo reviews every PR.
- pipeline-health-check.py verdict parsing if needed.
- Any dashboard code assuming only leo_verdict plus domain_verdict.
Avoid broad dashboard redesign in Phase 1b. If dashboards need richer route state, add an audit artifact first and defer UI.
Frontend Work Required
No frontend work is required for Phase 1b.
livingip-web Phase 1c can later reuse the same router as pre-PR guidance, but Phase 1b acceptance is based on decision-engine PR evaluation.
Operator Work Required
Operator or infrastructure owner must provide before production proof:
- Current production deployed SHA for teleo-infrastructure.
- Current production KB target and worktree path.
- Current systemd units and restart commands.
- Staging clone or disposable test runner access.
- Sandbox decision-engine target or clear permission to create one.
- Staging token set with no production mutation authority.
- Rollback SHA and rollback command.
If these are unavailable, implementation can continue locally but production proof must remain blocked.
Expected Runtime And User-Visible Behavior
Single-domain PR:
- Pipeline detects route.
- Required agents has one name.
- Master bot posts one review comment with VERDICT:AGENT:*.
- Existing merge or feedback path continues.
Cross-domain PR:
- Pipeline detects route.
- Required agents has two names.
- Master bot posts one review comment per required agent, or one structured comment with separate verdict sections if that is simpler.
- Merge requires both approvals.
- Any request-changes verdict blocks the merge and feeds back to the contributor.
The user-visible proof is PR comments and final PR disposition.
Staging Proof Contract
Staging must be production-like enough to test pipeline behavior but quarantined from production side effects.
Required staging safety controls:
- Production services disabled before any daemon starts.
- Production GitHub tokens removed or replaced.
- Production OpenRouter/Claude/Hermes keys removed or replaced unless explicitly approved for staging spend.
- Sandbox decision-engine repo configured.
- Auto-merge either disabled or constrained to sandbox repo.
- Hostname clearly changed to staging.
Required proof artifact:
{
  "phase": "1b",
  "environment": "staging",
  "teleo_infrastructure_sha": "...",
  "decision_engine_sha": "...",
  "pipeline_db_schema": 26,
  "feature_flags": {
    "PHASE1B_AGENT_ROUTING_ENABLED": "true"
  },
  "test_prs": [
    {
      "case": "internet-finance",
      "pr": 1,
      "required_agents": ["Rio"],
      "verdicts": {"Rio": "approve"},
      "final_state": "approved"
    }
  ],
  "cross_domain_pr": {
    "required_agents": ["Theseus", "Rio"],
    "final_state": "approved_or_feedback"
  },
  "prod_services_disabled": true,
  "proof_generated_at": "2026-05-29T00:00:00Z"
}
Staging proof does not satisfy the 24-hour production stability gate.
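A proof artifact in this shape could be sanity-checked before it is accepted as staging evidence. The sketch below is one possible checker; the function name and the specific acceptance rules (one single-agent PR per agent, exactly two agents on the cross-domain PR) are assumptions derived from the staging closure criteria.

```python
AGENTS = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")


def check_staging_proof(artifact: dict) -> list[str]:
    """Return a list of problems; an empty list means the artifact looks complete."""
    problems = []
    if artifact.get("environment") != "staging":
        problems.append("environment is not staging")
    if artifact.get("prod_services_disabled") is not True:
        problems.append("production services were not confirmed disabled")
    flags = artifact.get("feature_flags", {})
    if flags.get("PHASE1B_AGENT_ROUTING_ENABLED") != "true":
        problems.append("routing flag was not enabled")
    # Single-domain coverage: one PR with exactly one required agent, per agent.
    covered = {
        pr["required_agents"][0]
        for pr in artifact.get("test_prs", [])
        if len(pr.get("required_agents", [])) == 1
    }
    missing = [agent for agent in AGENTS if agent not in covered]
    if missing:
        problems.append("no single-domain proof for: " + ", ".join(missing))
    cross = artifact.get("cross_domain_pr", {})
    if len(cross.get("required_agents", [])) != 2:
        problems.append("cross-domain PR must list two required agents")
    return problems


partial = {
    "environment": "staging",
    "prod_services_disabled": True,
    "feature_flags": {"PHASE1B_AGENT_ROUTING_ENABLED": "true"},
    "test_prs": [{"case": "internet-finance", "pr": 1, "required_agents": ["Rio"]}],
    "cross_domain_pr": {"required_agents": ["Theseus", "Rio"]},
}
print(check_staging_proof(partial))
# ['no single-domain proof for: Leo, Theseus, Vida, Clay, Astra']
```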
Validation And Test Matrix
Unit tests:
test_agent_routing.py:
- routes six primary ownership cases.
- routes broadened Astra cases: energy, robotics, advanced manufacturing.
- routes Leo meta cases: collective AI goals, teleohumanity strategy.
- routes Theseus AI systems cases.
- routes Rio x402 and internet finance cases.
- caps cross-domain to top 2 agents.
- has deterministic tie breaking.
Parser tests:
- Existing test_eval_parse.py remains valid.
- Add explicit verdict parse coverage for all six agent names.
Mocked eval integration tests:
- One required agent calls one runner and posts one verdict.
- Two required agents call two runners and post two verdicts.
- One request changes blocks aggregate approval.
- Transport failure reopens for retry.
- Default Leo second-review does not run unless Leo is routed.
Batch tests:
- If batching remains enabled, batch grouping must use route decisions, not stale DB domain.
- If batching is disabled for Phase 1b, assert cross-domain and single-domain PRs still process individually.
Smoke commands:
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml
python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
If local pytest is unavailable, that is a tooling blocker for full local proof, not an implementation blocker.
CI/CD, Release, And Pre-Push Gate Contract
Pre-push required:
- python3 -m pytest for the focused routing/eval test set.
- python3 -m ruff check lib tests if dev deps are installed.
- Manual scan that no secrets are printed or committed.
PR required:
- Summary of routing rule.
- Test output.
- Known non-prod proof boundary.
- Statement that production acceptance still requires staging or VPS proof.
Deploy required:
- Exact reviewed SHA.
- Staging proof bundle first.
- Production service restart plan.
- Rollback SHA.
Release phases:
| Phase | Feature flag | Environment | Required proof |
|---|---|---|---|
| Local implementation | Enabled only in tests | Local | Unit and mocked eval tests. |
| Staging shadow | Enabled against sandbox repo | Staging clone or Crabbox-like box | Seven sandbox PR proof artifact. |
| Production shadow | Optional, no merge mutation if supported | Production | Route decisions logged without changing verdict path. |
| Production cutover | Enabled | Production | Real PR verdicts by required agents. |
| Production closure | Enabled | Production | 24-hour stability plus all six agents represented. |
Rollback:
- Flip PHASE1B_AGENT_ROUTING_ENABLED=false.
- Restart teleo-pipeline.service.
- Confirm eval path returns to prior behavior.
- If code rollback is required, deploy the previous exact SHA and restart service.
- Keep proof artifact explaining why rollback occurred.
Pre-push commands:
python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
python3 -m ruff check lib tests
git diff --check
If dev dependencies are missing, install with:
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml
Independent CLI Audit Contract
A reviewer should be able to run:
git diff --stat
git diff -- lib/agent_routing.py lib/domains.py lib/evaluate.py lib/llm.py tests/
python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py
The audit should confirm:
- No direct production credentials are introduced.
- decision-engine is the target in docs/config where Phase 1b needs it.
- No old eval scripts are revived.
- Default Leo second-review is not silently preserved for all PRs.
- Multi-agent PRs require top 2 reviewer approvals.
Outside-The-Box Fix Paths
If identity-scored keyword routing is too noisy:
- Use folder-first routing for strong path evidence and identity scoring only for ambiguous or cross-domain cases.
- Add a cheap LLM classifier in shadow mode only, comparing against deterministic router decisions.
- Require contributors/frontends to include an explicit domain or agent hint in PR metadata.
If live GitHub identity constraints block separate agent comments:
- Keep one master bot account and agent-specific verdict tags.
- Defer separate GitHub identities to Phase 2.
If staging VPS access is delayed:
- Use a disposable Hetzner clone when available.
- Use Crabbox or another remote test box for local dirty checkout proof.
- Use a mocked local fake GitHub/Forgejo API server for the eval loop.
Maintenance Capture
Same-tranche maintenance that is justified now:
- Extract route scoring into a dedicated module if lib/domains.py would become too broad.
- Keep backward-compatible wrappers for existing agent_for_domain and detect_domain_from_diff until downstream callers are migrated.
- Add tests around the existing bug-prone batch grouping surface.
Maintenance to avoid now:
- Full Forgejo-to-GitHub daemon rewrite unless needed for the Phase 1b PR.
- Dashboard redesign.
- Contributor credit redesign beyond removing "Leo reviews every PR" assumptions.
- Separate GitHub identities per agent.
- Payment, wallet, Twitter, or decision-market work.
Parallelization And Fanout
| Workstream | Classification | Owner | Notes |
|---|---|---|---|
| Agent identity router and tests | local_owner | Codex current turn | Core implementation surface. Do not fan out because it owns central route contract. |
| Eval layer integration and mocked tests | local_owner | Codex current turn | Needs tight coupling with router semantics. |
| Staging VPS clone proof | draft_gated | Fwaz or infrastructure owner | Requires VPS/provider access and secret quarantine. |
| GitHub identity model | draft_gated | Fwaz plus m3taversal | Deferred unless master bot account becomes unacceptable. |
| Dashboard/reporting polish | do_not_parallelize | Later | Avoid until route state contract is stable. |
Workstream Sub-Spec: Agent Identity Router
Classification: local_owner
Owned files:
- lib/agent_routing.py if created.
- lib/domains.py compatibility wrappers.
- tests/test_agent_routing.py.
Forbidden files:
- lib/evaluate.py except imports needed for route type compatibility.
- Any runtime secrets.
- Any production config defaults outside route feature flags.
Binary done condition:
- Pure route function returns expected required agents for every row in the proof matrix.
- Tests prove deterministic top-2 behavior and fallback behavior.
Verification commands:
python3 -m pytest tests/test_agent_routing.py
Non-claims:
- Does not prove PR comment posting.
- Does not prove production target wiring.
Prompt-ready handoff:
implement phase 1b agent identity routing in teleo-infrastructure. own only route module and route tests. preserve compatibility wrappers. route decision must be pure, deterministic, evidence-bearing, and top-2 capped for cross-domain cases. do not touch production API or eval state transitions.
Workstream Sub-Spec: Eval Integration
Classification: local_owner
Owned files:
- lib/evaluate.py
- lib/llm.py
- lib/eval_parse.py only if parser normalization is required.
- tests/test_evaluate_agent_routing.py
- tests/test_eval_parse.py
Forbidden files:
- Old deprecated eval shell scripts.
- Deploy scripts unless a feature flag must be exposed.
- Dashboard UI except parser-compatible health checks.
Binary done condition:
- With PHASE1B_AGENT_ROUTING_ENABLED=true, eval invokes only required reviewer agents.
- With flag disabled, prior behavior remains available.
- One request-changes verdict blocks aggregate approval.
- All approve verdicts continue to existing approval path.
Verification commands:
python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
Non-claims:
- Does not prove live GitHub or VPS behavior.
- Does not prove separate agent GitHub identities.
Prompt-ready handoff:
wire phase 1b routing into teleo-infrastructure eval path behind a feature flag. use required agents from the route result, run agent-specific reviews, aggregate verdicts, and preserve merge/feedback semantics. do not revive deprecated scripts or remove rollback path.
Workstream Sub-Spec: Staging Proof
Classification: draft_gated
Owned files and surfaces:
- Staging VPS or disposable remote test box.
- Sandbox decision-engine repo.
- Staging secrets.
- Machine-readable proof artifact.
Forbidden files and surfaces:
- Production VPS services.
- Production GitHub repo.
- Production secrets.
- Mainnet/payment/Twitter surfaces.
Binary done condition:
- Six single-domain PRs and one cross-domain PR produce expected required-agent verdicts and final dispositions in staging.
Verification commands:
systemctl status teleo-pipeline
journalctl -u teleo-pipeline --since "1 hour ago"
sqlite3 /path/to/pipeline.db "select number, status, domain_agent, leo_verdict, domain_verdict from prs order by number desc limit 20;"
gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments
Non-claims:
- Does not prove production 24-hour stability.
Prompt-ready handoff:
create a quarantined staging proof for phase 1b. clone or provision a disposable server, disable production services and secrets before starting pipeline, point to a sandbox decision-engine repo, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not mutate production.
Worker-ready ticket for later staging proof:
title: phase 1b staging proof on cloned vps
owned surfaces: staging vps, sandbox decision-engine repo, staging secrets, proof artifact
forbidden surfaces: production vps services, production github repo, production secrets
done condition: six single-domain prs plus one cross-domain pr produce expected required-agent verdicts and final dispositions
verification commands: systemd status readback, pipeline log scrape, sqlite route query, github pr comment readback
non-claims: does not prove 24h production stability
preferred executor: human/fwaz with codex support
handoff: create staging clone, disable prod services, inject sandbox config, run phase 1b proof script, save machine-readable proof
Acceptance Criteria
Local PR acceptance:
- Focused tests pass.
- Router returns correct single-agent routes.
- Router returns top-2 required agents for cross-domain cases.
- Eval layer invokes only required reviewer agents.
- Verdict aggregation handles the all-approve, request-changes, transport-failure, and missing-verdict cases.
- Existing verdict format remains parseable.
- No production readiness claim is made.
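The four aggregation cases in the list above can be sketched as a small pure function. The spec states that one request-changes verdict blocks aggregate approval and that all-approve continues to the existing approval path. How transport failures and missing verdicts are resolved is not stated in this section, so treating them as a non-approving `INCOMPLETE` outcome is an assumption, as are all type and function names.

```python
from enum import Enum
from typing import Mapping, Optional, Sequence


class Verdict(Enum):
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    TRANSPORT_FAILURE = "transport_failure"


class Aggregate(Enum):
    APPROVE = "approve"
    REQUEST_CHANGES = "request_changes"
    INCOMPLETE = "incomplete"  # assumption: fail closed, never auto-approve


def aggregate(
    required_agents: Sequence[str],
    verdicts: Mapping[str, Optional[Verdict]],
) -> Aggregate:
    """Combine per-agent verdicts for the required agents only."""
    results = [verdicts.get(agent) for agent in required_agents]
    # One request-changes verdict blocks aggregate approval.
    if any(v is Verdict.REQUEST_CHANGES for v in results):
        return Aggregate.REQUEST_CHANGES
    # Missing verdicts, transport failures, or an empty required set
    # are treated as incomplete rather than as approval (assumption).
    if not results or any(v is None or v is Verdict.TRANSPORT_FAILURE for v in results):
        return Aggregate.INCOMPLETE
    # Every required agent approved.
    return Aggregate.APPROVE
```

Verdicts from agents outside `required_agents` are ignored here, which matches the requirement that only required reviewer agents decide the outcome.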
Staging acceptance:
- Staging environment cannot mutate production.
- Six single-domain sandbox PRs complete.
- One cross-domain sandbox PR completes.
- Required reviewer agents match proof matrix.
- Proof artifact is retained.
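One possible shape for the machine-readable proof artifact is sketched below: it compares the required reviewers observed in staging against the proof matrix and records a per-case match flag. The field names, the case labels, and the decision to compare agents as unordered sets are all assumptions. The spec does not define the artifact schema in this section.

```python
import json
from typing import Any, Dict, Mapping, Sequence


def build_proof_artifact(
    expected: Mapping[str, Sequence[str]],
    observed: Mapping[str, Sequence[str]],
) -> Dict[str, Any]:
    """Compare observed reviewer agents to the proof matrix, case by case."""
    rows = []
    for case in sorted(expected):
        want = sorted(expected[case])
        got = sorted(observed.get(case, []))  # a missing case counts as a mismatch
        rows.append(
            {
                "case": case,
                "expected_agents": want,
                "observed_agents": got,
                "match": want == got,
            }
        )
    # An empty matrix proves nothing, so it does not count as a pass.
    return {"cases": rows, "all_match": bool(rows) and all(r["match"] for r in rows)}


def dump_artifact(artifact: Mapping[str, Any]) -> str:
    """Serialize with stable key order so retained artifacts diff cleanly."""
    return json.dumps(artifact, indent=2, sort_keys=True)
```

Writing the output of `dump_artifact` to a file alongside the reviewed SHA would give the operator something to retain, but where and how the artifact is stored is left to the staging proof spec.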
Production exit:
- Exact reviewed SHA deployed.
- All six agents produce at least one verdict in their domain.
- At least one cross-domain PR proves top-2 review behavior.
- Pipeline stable for 24 hours.
Readiness And Claim Boundaries
Allowed claims after local implementation:
- "Route logic is implemented and locally tested."
- "Mocked eval integration proves required-agent invocation and aggregation."
- "The implementation PR is ready for staging proof."
Forbidden claims after local implementation:
- "Phase 1b is complete."
- "Production is ready."
- "All six agents have demonstrated live review cycles."
- "The VPS is safely updated."
Allowed claims after staging proof:
- "Phase 1b passed sandbox staging proof."
- "The exact SHA is eligible for production cutover review."
Forbidden claims after staging proof:
- "Production is stable."
- "Live decision-engine PRs are proven."
Allowed claims after production 24-hour proof:
- "Phase 1b production exit criteria are met."
Spec Quality Self-Audit
Required execution-grade headings present:
- Current Implementation Audit: present.
- Goal-Vs-Repo-Truth Diff: present.
- Completion Percent And Remaining Delta: present.
- Closure, Endpoint, And Deployment Truth: present.
- Critical Assumptions And Invalidators: present.
- State And Truth Contract: present.
- Measurement Contract: present.
- Backend Work Required: present.
- Frontend Work Required: present.
- Expected Runtime And User-Visible Behavior: present.
- Validation And Test Matrix: present.
- CI/CD, Release, And Pre-Push Gate Contract: present.
- Independent CLI Audit Contract: present.
- Outside-The-Box Fix Paths: present.
- Maintenance Capture: present.
- Parallelization And Fanout: present.
Additional spec-of-spec coverage:
- Product Outcome Contract: present.
- Non-Goals: present.
- Program Decomposition: present.
- Priority Matrix: present.
- Score-To-100 Closure Plan: present.
- Workstream sub-specs: present.
- Staging Proof Contract: present.
- Rollback contract: present.
Known incompleteness:
- This spec cannot name the exact production deploy command until Fwaz or VPS truth confirms it.
- This spec cannot name the exact sandbox repo until the operator creates or selects it.
- This spec cannot prove whether production daemon code exactly matches local teleo-infrastructure until VPS readback exists.
Assistant-Added Caveats
This spec intentionally expands B1/B2 from folder-domain routing to identity-scored agent routing because m3taversal clarified that agent identities should route and folders are only signals. That is the right product interpretation, but it increases implementation scope versus the original simple path classifier.
This spec does not claim production readiness without staging or VPS proof.