diff --git a/docs/phase1b-agent-routing-spec.md b/docs/phase1b-agent-routing-spec.md new file mode 100644 index 0000000..23760aa --- /dev/null +++ b/docs/phase1b-agent-routing-spec.md @@ -0,0 +1,997 @@ +# Phase 1b Agent Routing Spec + +Created: 2026-05-29 +Status: active draft +Owner: Epimetheus pipeline implementation, with m3taversal as scope owner and Fwaz as VPS/runtime owner + +## Product Outcome Contract + +Phase 1b makes the knowledge-base evaluation engine behave like a six-agent review system instead of a generic triage stack. + +When a contribution changes the `decision-engine` KB, the pipeline must decide which Hermes agent identity is responsible for judging that change, run the required review or reviews, post agent-specific verdicts, and then let the existing merge or feedback machinery continue. + +The user-visible outcome is not a new frontend. It is a PR review trail showing that the right agent or agents reviewed the right KB mutation. + +## Non-Goals + +This spec does not implement: + +- Twitter/X posting. +- x402, wallet, payment, or funding flows. +- Decision markets, agent bidding, stake-weighted quorum, or prediction-market review. +- Full general user-input routing outside the PR evaluation path. +- Separate GitHub accounts for each agent. +- A full Forgejo-to-GitHub daemon rewrite beyond what Phase 1b needs. +- A dashboard redesign. +- Production deployment without staging or VPS proof. + +## Program Decomposition + +This is a medium-sized control-plane change with five execution lanes: + +1. Agent identity routing. +2. Eval pipeline integration. +3. GitHub identity and bot comment posture. +4. Reporting and contributor compatibility. +5. Staging and production proof. + +The implementation can remain in one PR only if lanes 1 through 4 are tightly tested and the staging proof remains a separate operator task. If the eval integration diff grows beyond the files named in this spec, split into: + +- PR 1: route contract and tests. 
+- PR 2: eval integration and mocked state tests. +- PR 3: GitHub/comment idempotency and reporting compatibility. +- PR 4 or operator runbook: staging proof artifacts. + +Child specs: + +- `docs/phase1b/agent-identity-router-spec.md` +- `docs/phase1b/eval-pipeline-integration-spec.md` +- `docs/phase1b/github-identity-bot-posture-spec.md` +- `docs/phase1b/reporting-contributor-compatibility-spec.md` +- `docs/phase1b/staging-proof-spec.md` + +## Priority Matrix + +| Rank | Workstream | Recurrence | Value | Readiness | Current state | Issue/spec mapping | Thread-claimed status | Verified implementation/proof status | Recommended next move | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| 1 | Canonical repo and eval target | Repeated confusion between `teleo-codex`, `teleo-kb`, and `decision-engine`. | Critical | Ready now | Confirmed by user: `decision-engine`. Some code still has Forgejo/teleo-codex defaults. | This spec, `handoff/phase1-step3-script-migration.md` | Clarified in chat. | Partially reflected in repo; not unified in daemon modules. | Make Phase 1b route/proof explicitly target `decision-engine`. | +| 2 | Agent identity routing | Repeated confusion between domain folders and agent ownership. | Critical | Ready now | Existing `lib/domains.py` is folder-first. | This spec | m3taversal clarified identity-first routing. | Initial local patch is insufficient. | Replace with identity-scored route contract. | +| 3 | Cross-domain review | Raised as scope expansion during clarification. | High | Ready now | Not implemented. | This spec | m3taversal confirmed cap at top 2. | No code proof. | Add top-2 required reviewer aggregation. | +| 4 | Single master bot account | GitHub bot/PAT issue was noted as blocker. | High | Ready now | Phase 1 handoff already documents single `livingIPbot` posture. | `handoff/phase1-step3-script-migration.md` | Separate identities ideal, likely too complex. | Handoff-only. 
| Use master bot comments with agent verdict tags. | +| 5 | Staging proof | User asked how to test without mutating prod VPS. | Critical for production | Draft gated | Needs VPS clone or Crabbox/staging access. | This spec | Proposed, not executed. | No proof. | Run after code PR passes local checks. | + +## Goal + +Implement Phase 1b for the `decision-engine` knowledge base: pipeline-v2 evaluates each incoming KB pull request by routing it to the Hermes agent identity that owns the relevant domain of judgment. + +The implementation lives in `teleo-infrastructure`. The canonical KB repo for this phase is `living-ip/decision-engine`. + +Phase 1b is complete only when single-domain and cross-domain PRs are routed to the expected required reviewer agents, verdicts are posted in the existing `VERDICT:AGENT:*` format, and the merge or feedback path continues from those verdicts. + +## User-Journey Contract + +Contributor or agent flow: + +1. A contributor or agent opens a PR against `living-ip/decision-engine`. +2. The PR changes one or more KB files. +3. Pipeline-v2 discovers the PR and fetches its diff. +4. The router scores Hermes agent identities from the diff, file paths, branch metadata, and eventually PR metadata. +5. The pipeline runs the required reviewer agents. +6. The master bot posts verdict comments that clearly name the agent identity in `VERDICT:AGENT:*` tags. +7. If all required reviewers approve, the existing approval and merge path continues. +8. If any required reviewer requests changes, the existing feedback/retry path continues. + +Operator flow: + +1. Operator can inspect a PR and see why each agent was selected. +2. Operator can inspect pipeline logs or audit rows and see route scores, required agents, verdicts, and aggregate result. +3. Operator can distinguish local proof, staging proof, and production proof. 
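Step 4 of the contributor flow and the operator's ability to see why each agent was selected both rest on a score that carries its own evidence. The following is a minimal sketch, not the branch's `lib/agent_routing.py`. It uses only path and branch signals, and takes the weights (8 and 4), the minimum threshold (4), and the 40 percent second-agent rule from this spec's recommended scoring starter. The folder names in `PATH_OWNERS` are hypothetical (only `domains/internet-finance/` appears in this spec), and the function name `route_pr` and the choice to add path weight once per changed file are assumptions.

```python
# Values taken from the spec's "Recommended scoring starter" and top-2 rule.
PATH_WEIGHT = 8
BRANCH_WEIGHT = 4
MIN_THRESHOLD = 4
SECOND_AGENT_RATIO = 0.4

# Configured order; also used as the deterministic tie-break order.
AGENT_ORDER = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")

# Hypothetical folder prefixes, for illustration only.
PATH_OWNERS = {
    "domains/grand-strategy/": "Leo",
    "domains/ai-alignment/": "Theseus",
    "domains/internet-finance/": "Rio",
    "domains/health/": "Vida",
    "domains/entertainment/": "Clay",
    "domains/space-development/": "Astra",
}


def route_pr(paths, branch):
    """Score agents from changed paths and the branch name; return a route dict."""
    scores = {agent: 0 for agent in AGENT_ORDER}
    evidence = []

    # Path signal: assumption here is one PATH_WEIGHT per matching changed file.
    for path in paths:
        for prefix, agent in PATH_OWNERS.items():
            if path.startswith(prefix):
                scores[agent] += PATH_WEIGHT
                evidence.append(
                    {"agent": agent, "signal": "path", "weight": PATH_WEIGHT, "value": path}
                )

    # Branch signal: prefixes such as "rio/" or "theseus/".
    for agent in AGENT_ORDER:
        if branch.startswith(agent.lower() + "/"):
            scores[agent] += BRANCH_WEIGHT
            evidence.append(
                {"agent": agent, "signal": "branch", "weight": BRANCH_WEIGHT, "value": branch}
            )

    # sorted() is stable, so equal scores keep the configured agent order.
    ranked = sorted(AGENT_ORDER, key=lambda agent: -scores[agent])
    top, second = ranked[0], ranked[1]

    if scores[top] < MIN_THRESHOLD:
        # No confident route: Leo is required.
        evidence.append({"agent": "Leo", "signal": "fallback", "weight": 0, "value": ""})
        return {
            "route_kind": "fallback",
            "primary_agent": "Leo",
            "required_agents": ["Leo"],
            "scores": scores,
            "evidence": evidence,
            "fallback": True,
        }

    required = [top]
    if scores[second] >= MIN_THRESHOLD and scores[second] >= SECOND_AGENT_RATIO * scores[top]:
        required.append(second)  # never more than two required agents

    return {
        "route_kind": "multi" if len(required) == 2 else "single",
        "primary_agent": top,
        "required_agents": required,
        "scores": scores,
        "evidence": evidence,
        "fallback": False,
    }
```

Calling `route_pr(["domains/internet-finance/example.md"], "rio/x402-notes")` yields a `single` route with Rio as the only required agent, and every point in `scores` can be traced to an `evidence` entry an operator could read in an audit row.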
+ +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | Reason | +| --- | --- | --- | --- | +| `handoff/phase1-step3-script-migration.md` | Establishes the Phase 1 move from Forgejo `teleo-codex` toward GitHub `living-ip/decision-engine`, and documents the single master bot account posture. | Reuse as context. | It owns migration history, not the Phase 1b routing implementation. | +| `handoff/deprecated/eval-scripts.md` | Confirms old eval dispatcher/worker scripts are dead and `lib/evaluate.py::evaluate_cycle` owns live eval behavior. | Reuse as context. | It prevents work from targeting retired scripts. | +| `docs/ARCHITECTURE.md` | Describes pipeline-v2 stages, SQLite state, Forgejo-era runtime topology, and existing evaluate/merge loops. | Reuse as context. | It is broader architecture; this spec is a Phase 1b delta spec. | +| `docs/multi-model-eval-architecture.md` | Documents the prior Leo-first plus second-model evaluation theory. | Supersede for Phase 1b eval routing only. | Phase 1b now routes to domain-owner agent identities, with capped top-2 cross-domain review. The old doc remains useful for later calibration. | +| `docs/queue.md` | Mentions domain evolution such as `ai-alignment` to `ai-systems`. | Reuse as signal. | It supports the identity-scored router rather than folder-only routing. | + +## Current Implementation Audit + +Current relevant implementation state: + +- `teleo-pipeline.py` runs pipeline-v2 as a single async daemon. +- `lib/evaluate.py::evaluate_cycle` is the active eval loop. +- `lib/evaluate.py::evaluate_pr` currently detects a domain, runs a domain review, then runs Leo review for non-LIGHT PRs. +- `lib/domains.py` contains a folder-first `DOMAIN_AGENT_MAP`. +- `lib/llm.py` contains prompt templates and `run_domain_review`, `run_batch_domain_review`, and `run_leo_review`. +- `lib/eval_parse.py::parse_verdict` parses `VERDICT:AGENT:APPROVE` and `VERDICT:AGENT:REQUEST_CHANGES`. 
+- `pipeline-health-check.py` is GitHub-oriented and points at `living-ip/decision-engine`. +- `lib/forgejo.py`, `lib/evaluate.py`, and `lib/merge.py` still use Forgejo-named abstractions as the primary API surface. +- Per-agent GitHub identity is deferred; Phase 1 uses one master bot account. + +Fwaz clarification on 2026-05-29: + +- Separate GitHub identities are still ideal and blocked on GitHub/PAT setup; Phase 1b must not require them to land the routed-eval path. +- Current production update behavior is `pull -> services recognize pull -> edit on VPS -> PR to Leo`; this is useful context, not the desired long-term control model. +- New desired rule is no direct production self-upgrades: agents open PRs, and production deploys exact reviewed/tested SHAs approved and signed by Leo. +- Crabbox is acceptable as the long-term disposable staging/test-box direction, while a production-like clone remains the highest-fidelity proof for systemd/VPS paths. + +This branch implementation now includes: + +- `lib/agent_routing.py` with a pure identity-scored route contract. +- `PHASE1B_AGENT_ROUTING_ENABLED`, defaulting off. +- A Phase 1b eval path that runs routed required agents and disables stale domain batching under the flag. +- Focused tests for six-agent routing, top-2 cross-domain routing, verdict parsing, and mocked eval aggregation. + +## Goal-Vs-Repo-Truth Diff + +Desired Phase 1b behavior: + +- Route PRs against `decision-engine`, not `teleo-codex`. +- Classify by agent identity ownership, not only by folder path. +- Run exactly the required reviewer agents. +- Use one master bot account if separate GitHub identities are too complex. +- Preserve the existing verdict comment format. +- Preserve existing merge and feedback behavior. +- Support cross-domain PRs by requiring the top 2 routed agents. + +Pre-implementation repo truth: + +- Pipeline eval still has a two-stage review shape: domain review plus Leo review. 
+- Folder-domain mapping exists, but agent identity scoring does not. +- Cross-domain review is not implemented as multiple required reviewer agents. +- Batch eval can group rows before fetching diffs, which risks routing unclassified rows through `general`. +- GitHub migration is partial: some scripts target GitHub `decision-engine`, but live daemon modules still have Forgejo-era names and assumptions. + +## Completion Percent And Remaining Delta + +Estimated implementation progress on this branch: + +- B1 classifier foundation: 100 percent locally, pending staging calibration. +- B2 routing layer: 70 percent locally behind a default-off feature flag. +- Cross-domain top-2 review: 70 percent locally through mocked eval proof. +- Local proof suite: 80 percent for router/eval/parser scope. +- Staging or VPS proof: 0 percent. + +Remaining delta: + +1. Decide whether the production Phase 1b transport stays Forgejo-first for cutover or switches direct to GitHub `decision-engine` before staging. +2. Add stronger idempotency for duplicate comments on retry after partial multi-agent success. +3. Update reporting/health compatibility beyond `review_records` if staging shows false readiness. +4. Prove against staging before production. +5. Deploy only an exact reviewed/tested SHA after Leo signoff. + +## Closure, Endpoint, And Deployment Truth + +Local closure means: + +- Focused tests pass in `teleo-infrastructure`. +- A PR exists with the Phase 1b routing implementation and proof notes. + +Staging closure means: + +- A cloned or disposable staging runtime is pointed at a sandbox `decision-engine`. +- Six single-domain sandbox PRs and one cross-domain sandbox PR complete the expected eval path. +- A machine-readable proof artifact captures routes, required agents, verdicts, status transitions, git SHAs, and logs. + +Production closure means: + +- The exact reviewed SHA is deployed to the production VPS. 
+- Production pipeline runs real `decision-engine` PRs through Phase 1b routing. +- All six agents have completed at least one live review cycle. +- Pipeline remains stable for at least 24 hours after cutover. + +Without VPS or staging access, only local closure can be claimed. + +## Critical Assumptions And Invalidators + +Assumptions: + +- `decision-engine` is the canonical KB repo for Phase 1b. +- The active eval implementation is `teleo-infrastructure/lib/evaluate.py`, not retired shell scripts. +- One master bot account is acceptable for Phase 1b verdict comments. +- Required reviewer identity is encoded in the verdict tag, not necessarily in the GitHub account identity. +- Agent state files in `decision-engine/agents/{agent}` are the right identity context source when present. + +Invalidators: + +- Production pipeline is still wired to a different canonical repo. +- The VPS runs code not represented by current `teleo-infrastructure`. +- Branch protection requires separate GitHub identities before comments or reviews count. +- Agent identity files are absent or materially different on the VPS. +- Cross-domain review must include more than top 2 reviewers. + +## State And Truth Contract + +The routing implementation must record or expose: + +- PR number. +- Primary agent. +- Required agents. +- Route kind: `single`, `multi`, `fallback`, or `escalated`. +- Route scores by agent. +- Route evidence: path, branch, title, diff keyword, or fallback. +- Verdict per required agent. +- Aggregate result. +- Failure reason for missing or unparseable verdicts. + +This can be stored first in audit log details and test artifacts. A DB schema migration is optional for Phase 1b unless downstream dashboards require queryable route fields. + +### Route Decision Schema + +The route decision should be serializable without importing Python classes. 
Use this JSON shape in audit rows and proof artifacts: + +```json +{ + "pr": 123, + "repo": "living-ip/decision-engine", + "route_version": "phase1b-v1", + "route_kind": "single", + "primary_agent": "Rio", + "required_agents": ["Rio"], + "scores": { + "Leo": 0, + "Theseus": 1, + "Rio": 9, + "Vida": 0, + "Clay": 0, + "Astra": 0 + }, + "evidence": [ + { + "agent": "Rio", + "signal": "path", + "weight": 5, + "value": "domains/internet-finance/example.md" + } + ], + "fallback": false +} +``` + +`route_kind` values: + +- `single`: one required reviewer. +- `multi`: two required reviewers from cross-domain scoring. +- `fallback`: no confident route, Leo required. +- `escalated`: route exceeded simple review bounds and was capped by policy. + +### Verdict State Schema + +Aggregate review state should be serializable as: + +```json +{ + "pr": 123, + "required_agents": ["Theseus", "Rio"], + "agent_verdicts": { + "Theseus": "approve", + "Rio": "request_changes" + }, + "aggregate_verdict": "request_changes", + "blocking_agents": ["Rio"], + "missing_agents": [], + "unparseable_agents": [], + "transport_failed_agents": [] +} +``` + +Aggregate states: + +- `approve`: all required agents approved. +- `request_changes`: at least one required agent requested changes or produced unparseable content. +- `retry`: at least one required review failed for transport reasons and should not burn the PR as a substantive rejection. 
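The aggregate states above can be sketched as a small pure function that produces the verdict state shape. This is an illustration under stated assumptions, not the branch's helper: the name `aggregate_verdict_state` and the per-agent outcome strings (`"approve"`, `"request_changes"`, `"unparseable"`, `"transport_failed"`) are hypothetical, an agent absent from `reviews` is treated as missing, and `retry` is assumed to take precedence over `request_changes` so that a transport failure does not burn the PR.

```python
def aggregate_verdict_state(pr, required_agents, reviews):
    """Build the verdict state dict for one PR.

    `reviews` maps agent name -> outcome string (hypothetical encoding):
    "approve", "request_changes", "unparseable", or "transport_failed".
    """
    if not required_agents:
        raise ValueError("at least one required agent is expected")

    agent_verdicts = {}
    blocking, missing, unparseable, transport_failed = [], [], [], []

    for agent in required_agents:
        outcome = reviews.get(agent)
        if outcome is None:
            missing.append(agent)
        elif outcome == "transport_failed":
            transport_failed.append(agent)
        elif outcome in ("approve", "request_changes"):
            agent_verdicts[agent] = outcome
            if outcome == "request_changes":
                blocking.append(agent)
        else:
            # "unparseable" and any unknown outcome string land here.
            unparseable.append(agent)

    # Assumed precedence: retry first, then request_changes, then approve.
    if transport_failed:
        aggregate = "retry"
    elif blocking or missing or unparseable:
        aggregate = "request_changes"
    else:
        aggregate = "approve"

    return {
        "pr": pr,
        "required_agents": list(required_agents),
        "agent_verdicts": agent_verdicts,
        "aggregate_verdict": aggregate,
        "blocking_agents": blocking,
        "missing_agents": missing,
        "unparseable_agents": unparseable,
        "transport_failed_agents": transport_failed,
    }
```

With `required_agents=["Theseus", "Rio"]` and `reviews={"Theseus": "approve", "Rio": "request_changes"}` for PR 123, the function returns the same structure as the JSON example in this section. The result is plain dicts and lists, so it can be written to an audit row with `json.dumps` without importing any pipeline classes.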
+ +## Measurement Contract + +Minimum metrics: + +- `route_single_count` +- `route_multi_count` +- `route_escalated_count` +- `review_required_agent_count` +- `review_missing_verdict_count` +- `review_request_changes_count` +- `review_approve_count` +- `route_fallback_count` + +Minimum proof matrix: + +| Case | Expected route | +| --- | --- | +| grand strategy PR | Leo | +| ai systems or ai alignment PR | Theseus | +| internet finance or x402 PR | Rio | +| health PR | Vida | +| entertainment PR | Clay | +| space, robotics, energy, or advanced manufacturing PR | Astra | +| ai plus x402 PR | Theseus and Rio | +| collective ai goals PR | Leo and Theseus, if both score in top 2 | + +## Score-To-100 Closure Plan + +Preparedness score before implementation: 35/100. + +| Score band | Closure move | Evidence that moves score | +| --- | --- | --- | +| 35 -> 50 | Route contract implemented and unit-tested. | `test_agent_routing.py` proves six single-agent routes, broadened identity ownership, top-2 cross-domain routes, and fallback behavior. | +| 50 -> 65 | Eval integration mocked locally. | Mocked eval tests prove required agents are invoked, default Leo review is removed, and aggregate verdicts drive approve/request-changes behavior. | +| 65 -> 75 | API/comment compatibility proven locally. | Tests prove all six verdict tags parse and master-bot comment bodies preserve existing parser expectations. | +| 75 -> 85 | Staging clone or disposable test box runs sandbox PR proof. | Six single-domain sandbox PRs plus one cross-domain sandbox PR produce expected comments and state transitions. | +| 85 -> 95 | Production deploy of exact reviewed SHA. | VPS deploy log, service restart readback, and route/proof artifact for first real PRs. | +| 95 -> 100 | 24-hour production stability. | 24-hour daemon readback with no duplicate comments, no stuck review rows, no production fallback spike, and all six agents represented in verdict history. 
| + +The implementation PR can be merged at 65-75 if reviewers accept staging as a deploy gate. It cannot claim Phase 1b complete below 100. + +## Backend Work Required + +### 1. Agent identity router + +Create `lib/agent_routing.py`, or extend the existing `lib/domains.py` only if it remains clearly small enough. + +Define: + +```python +from dataclasses import dataclass + +@dataclass(frozen=True) +class AgentRoute: + primary_agent: str + required_agents: tuple[str, ...] + route_kind: str  # "single", "multi", "fallback", or "escalated" + scores: dict[str, int] + evidence: list[dict] +``` + +Router signals: + +- Path signals from `domains/`, `entities/`, `core/`, `foundations/`, and `agents/`. +- Branch prefix signals such as `rio/`, `theseus/`, `astra/`, `leo/`. +- Keyword signals from path, filename, branch, PR title/body when available, and capped diff text. +- Agent identity ownership map. + +Agent identity ownership map: + +| Agent | Owns | +| --- | --- | +| Leo | grand strategy, teleohumanity goals, collective AI self-understanding, meta strategy, nested collective intelligence concepts | +| Theseus | AI systems, AI alignment, AI governance, agent systems, safety, evaluation | +| Rio | internet finance, living capital, markets, crypto, futarchy, x402, payments, capital formation | +| Vida | health, healthcare, medicine, prevention, clinical systems, mental health, biohealth | +| Clay | entertainment, media, culture, IP, fandom, narrative, consumer attention | +| Astra | space development, robotics, energy, advanced manufacturing, physical frontier infrastructure | + +Routing rules: + +- If only one agent crosses the threshold, require that agent. +- If more than one agent crosses the threshold, require the top 2 agents. +- If no agent crosses the threshold, fall back to Leo with route kind `fallback`. +- Rank agents by score; break ties by deterministic configured order. + +Implementation constraints: + +- The router must be deterministic. +- The router must be pure and side-effect free. +- Route scores must be explainable through evidence entries. 
+- Folder paths should be strong evidence, not the whole classifier. +- Keyword scoring must not require paid inference. +- LLM classification may be added later only as shadow-mode evidence. + +Recommended scoring starter: + +| Signal | Weight | +| --- | --- | +| Path directly under known primary ownership area | 8 | +| Path under broadened ownership area | 6 | +| Branch prefix matches agent | 4 | +| Filename keyword matches ownership | 3 | +| Diff keyword matches ownership | 1 per capped hit | +| PR title/body keyword matches ownership, if available | 2 | + +Top-2 selection: + +- Include the highest-scoring agent. +- Include a second agent only if its score is at least 40 percent of the first score and at least the minimum threshold. +- Minimum threshold starts at 4. +- Never include more than two required agents in Phase 1b. + +### 2. Eval layer integration + +Modify `lib/evaluate.py`: + +- Fetch PR diff. +- Build route from diff and branch. +- Store or audit route decision. +- Run required reviewer agents. +- Aggregate verdicts. +- Remove default Leo second-review for normal single-agent PRs. +- Keep existing bypasses for musings and reweave unless m3taversal changes policy. +- Revisit batch eval: disable batching for Phase 1b or classify before batching. + +Implementation sequence: + +1. Add pure route builder and tests. +2. Add review aggregation helper and tests. +3. Add `run_agent_review` while leaving existing `run_domain_review` and `run_leo_review` intact. +4. Switch individual `evaluate_pr` path to the new router behind a feature flag such as `PHASE1B_AGENT_ROUTING_ENABLED`. +5. Disable batch domain eval when the feature flag is enabled unless route-aware batching is implemented in the same PR. +6. Remove or bypass the default Leo second-review when the feature flag is enabled. +7. Preserve old behavior when the feature flag is disabled. + +Feature flag requirement: + +```text +PHASE1B_AGENT_ROUTING_ENABLED=false by default until staging proof exists. 
+``` + +The PR may set tests against enabled behavior without changing the production default. + +### 3. Agent review runner + +Modify or add in `lib/llm.py`: + +```python +async def run_agent_review(diff: str, files: str, agent: str, route: AgentRoute) -> tuple[str | None, dict]: + ... +``` + +Prompt must include: + +- Agent identity context when available. +- Route evidence. +- Existing eval criteria. +- Required verdict tag for that exact agent. + +Continue using one master bot account for comments. The bot comment body must identify the routed agent via the verdict tag. + +Agent context lookup order: + +1. Runtime-configured KB worktree path, expected to point at `decision-engine`. +2. Existing `config.MAIN_WORKTREE` if production still uses that convention. +3. Explicit test fixture path in unit tests. + +Context files: + +- `agents/{agent}/identity.md` +- `agents/{agent}/beliefs.md` +- `agents/{agent}/reasoning.md` +- `agents/{agent}/skills.md` + +Missing context files: + +- Log a warning. +- Include an audit evidence entry. +- Continue with the generic agent prompt. +- Do not crash the eval cycle. + +### 4. Verdict aggregation + +Add helper: + +```python +aggregate_agent_verdicts(required_agents, reviews) -> AggregateVerdict +``` + +Rules: + +- All required agents approve: approved. +- Any required agent requests changes: request changes. +- Transport failure: reopen for retry. +- Missing or unparseable verdict: request changes unless transport failure is explicit. + +Comment format: + +Preferred for one required agent: + +```text + + + +``` + +Preferred for two required agents: + +```text +## Theseus review + + + + + +## Rio review + + + + +``` + +Two separate comments are acceptable if simpler and less risky for existing parsers. + +### 5. Contributor and dashboard compatibility + +Audit and update: + +- `lib/contributor.py` assumptions that Leo reviews every PR. +- `pipeline-health-check.py` verdict parsing if needed. 
+- Any dashboard code assuming only `leo_verdict` plus `domain_verdict`. + +Avoid broad dashboard redesign in Phase 1b. If dashboards need richer route state, add an audit artifact first and defer UI. + +## Frontend Work Required + +No frontend work is required for Phase 1b. + +`livingip-web` Phase 1c can later reuse the same router as pre-PR guidance, but Phase 1b acceptance is based on `decision-engine` PR evaluation. + +## Operator Work Required + +Operator or infrastructure owner must provide before production proof: + +- Current production deployed SHA for `teleo-infrastructure`. +- Current production KB target and worktree path. +- Current systemd units and restart commands. +- Staging clone or disposable test runner access. +- Sandbox `decision-engine` target or clear permission to create one. +- Staging token set with no production mutation authority. +- Rollback SHA and rollback command. + +If these are unavailable, implementation can continue locally but production proof must remain blocked. + +## Expected Runtime And User-Visible Behavior + +Single-domain PR: + +1. Pipeline detects route. +2. Required agents has one name. +3. Master bot posts one review comment with `VERDICT:AGENT:*`. +4. Existing merge or feedback path continues. + +Cross-domain PR: + +1. Pipeline detects route. +2. Required agents has two names. +3. Master bot posts one review comment per required agent, or one structured comment with separate verdict sections if that is simpler. +4. Merge requires both approvals. +5. Any request changes blocks and feeds back. + +The user-visible proof is PR comments and final PR disposition. + +## Staging Proof Contract + +Staging must be production-like enough to test pipeline behavior but quarantined from production side effects. + +Required staging safety controls: + +- Production services disabled before any daemon starts. +- Production GitHub tokens removed or replaced. 
+- Production OpenRouter/Claude/Hermes keys removed or replaced unless explicitly approved for staging spend. +- Sandbox `decision-engine` repo configured. +- Auto-merge either disabled or constrained to sandbox repo. +- Hostname clearly changed to staging. + +Required proof artifact: + +```json +{ + "phase": "1b", + "environment": "staging", + "teleo_infrastructure_sha": "...", + "decision_engine_sha": "...", + "pipeline_db_schema": 26, + "feature_flags": { + "PHASE1B_AGENT_ROUTING_ENABLED": "true" + }, + "test_prs": [ + { + "case": "internet-finance", + "pr": 1, + "required_agents": ["Rio"], + "verdicts": {"Rio": "approve"}, + "final_state": "approved" + } + ], + "cross_domain_pr": { + "required_agents": ["Theseus", "Rio"], + "final_state": "approved_or_feedback" + }, + "prod_services_disabled": true, + "proof_generated_at": "2026-05-29T00:00:00Z" +} +``` + +Staging proof does not satisfy the 24-hour production stability gate. + +## Validation And Test Matrix + +Unit tests: + +- `test_agent_routing.py` + - routes six primary ownership cases. + - routes broadened Astra cases: energy, robotics, advanced manufacturing. + - routes Leo meta cases: collective AI goals, teleohumanity strategy. + - routes Theseus AI systems cases. + - routes Rio x402 and internet finance cases. + - caps cross-domain to top 2 agents. + - has deterministic tie breaking. + +Parser tests: + +- Existing `test_eval_parse.py` remains valid. +- Add explicit verdict parse coverage for all six agent names. + +Mocked eval integration tests: + +- One required agent calls one runner and posts one verdict. +- Two required agents call two runners and post two verdicts. +- One request changes blocks aggregate approval. +- Transport failure reopens for retry. +- Default Leo second-review does not run unless Leo is routed. + +Batch tests: + +- If batching remains enabled, batch grouping must use route decisions, not stale DB domain. 
+- If batching is disabled for Phase 1b, assert cross-domain and single-domain PRs still process individually. + +Smoke commands: + +```bash +python3 -m venv .venv +. .venv/bin/activate +python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml +python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py +``` + +If local `pytest` is unavailable, that is a tooling blocker for full local proof, not an implementation blocker. + +## CI/CD, Release, And Pre-Push Gate Contract + +Pre-push required: + +- `python3 -m pytest` for the focused routing/eval test set. +- `python3 -m ruff check lib tests` if dev deps are installed. +- Manual scan that no secrets are printed or committed. + +PR required: + +- Summary of routing rule. +- Test output. +- Known non-prod proof boundary. +- Statement that production acceptance still requires staging or VPS proof. + +Deploy required: + +- Exact reviewed SHA. +- Staging proof bundle first. +- Production service restart plan. +- Rollback SHA. + +Release phases: + +| Phase | Feature flag | Environment | Required proof | +| --- | --- | --- | --- | +| Local implementation | Enabled only in tests | Local | Unit and mocked eval tests. | +| Staging shadow | Enabled against sandbox repo | Staging clone or Crabbox-like box | Seven sandbox PR proof artifact. | +| Production shadow | Optional, no merge mutation if supported | Production | Route decisions logged without changing verdict path. | +| Production cutover | Enabled | Production | Real PR verdicts by required agents. | +| Production closure | Enabled | Production | 24-hour stability plus all six agents represented. | + +Rollback: + +- Flip `PHASE1B_AGENT_ROUTING_ENABLED=false`. +- Restart `teleo-pipeline.service`. +- Confirm eval path returns to prior behavior. +- If code rollback is required, deploy the previous exact SHA and restart service. +- Keep proof artifact explaining why rollback occurred. 
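The first rollback step is only safe if an unset or unrecognized flag value is read as off. A minimal sketch of that default-off behavior follows, assuming the flag is read from the process environment (this spec names `PHASE1B_AGENT_ROUTING_ENABLED` but does not say where it is stored). The function name and the accepted truthy strings are hypothetical.

```python
import os

FLAG_NAME = "PHASE1B_AGENT_ROUTING_ENABLED"

# Assumed set of values that turn the flag on; anything else means off.
_TRUTHY = {"1", "true", "yes", "on"}


def phase1b_routing_enabled(env=None):
    """Return True only when the flag is explicitly set to a truthy value.

    `env` defaults to os.environ; tests can pass a plain dict instead.
    """
    source = os.environ if env is None else env
    raw = source.get(FLAG_NAME, "")
    return raw.strip().lower() in _TRUTHY
```

Under this reading, removing the variable, setting it to `false`, or mistyping its value all leave the eval path on prior behavior, so a rollback never depends on spelling the off value correctly.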
+ +Pre-push commands: + +```bash +python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py tests/test_eval_parse.py +python3 -m ruff check lib tests +git diff --check +``` + +If dev dependencies are missing, install with: + +```bash +python3 -m venv .venv +. .venv/bin/activate +python3 -m pip install 'aiohttp>=3.9,<4' 'pytest>=8' 'pytest-asyncio>=0.23' 'ruff>=0.3' pyyaml +``` + +## Independent CLI Audit Contract + +A reviewer should be able to run: + +```bash +git diff --stat +git diff -- lib/agent_routing.py lib/domains.py lib/evaluate.py lib/llm.py tests/ +python3 -m pytest tests/test_agent_routing.py tests/test_evaluate_agent_routing.py +``` + +The audit should confirm: + +- No direct production credentials are introduced. +- `decision-engine` is the target in docs/config where Phase 1b needs it. +- No old eval scripts are revived. +- Default Leo second-review is not silently preserved for all PRs. +- Multi-agent PRs require top 2 reviewer approvals. + +## Outside-The-Box Fix Paths + +If identity-scored keyword routing is too noisy: + +- Use folder-first routing for strong path evidence and identity scoring only for ambiguous or cross-domain cases. +- Add a cheap LLM classifier in shadow mode only, comparing against deterministic router decisions. +- Require contributors/frontends to include an explicit domain or agent hint in PR metadata. + +If live GitHub identity constraints block separate agent comments: + +- Keep one master bot account and agent-specific verdict tags. +- Defer separate GitHub identities to Phase 2. + +If staging VPS access is delayed: + +- Use a disposable Hetzner clone when available. +- Use Crabbox or another remote test box for local dirty checkout proof. +- Use a mocked local fake GitHub/Forgejo API server for the eval loop. + +## Maintenance Capture + +Same-tranche maintenance that is justified now: + +- Extract route scoring into a dedicated module if `lib/domains.py` would become too broad. 
+- Keep backward-compatible wrappers for existing `agent_for_domain` and `detect_domain_from_diff` until downstream callers are migrated. +- Add tests around the existing bug-prone batch grouping surface. + +Maintenance to avoid now: + +- Full Forgejo-to-GitHub daemon rewrite unless needed for the Phase 1b PR. +- Dashboard redesign. +- Contributor credit redesign beyond removing "Leo reviews every PR" assumptions. +- Separate GitHub identities per agent. +- Payment, wallet, Twitter, or decision-market work. + +## Parallelization And Fanout + +| Workstream | Classification | Owner | Notes | +| --- | --- | --- | --- | +| Agent identity router and tests | local_owner | Codex current turn | Core implementation surface. Do not fan out because it owns central route contract. | +| Eval layer integration and mocked tests | local_owner | Codex current turn | Needs tight coupling with router semantics. | +| Staging VPS clone proof | draft_gated | Fwaz or infrastructure owner | Requires VPS/provider access and secret quarantine. | +| GitHub identity model | draft_gated | Fwaz plus m3taversal | Deferred unless master bot account becomes unacceptable. | +| Dashboard/reporting polish | do_not_parallelize | Later | Avoid until route state contract is stable. | + +### Workstream Sub-Spec: Agent Identity Router + +Classification: local_owner + +Owned files: + +- `lib/agent_routing.py` if created. +- `lib/domains.py` compatibility wrappers. +- `tests/test_agent_routing.py`. + +Forbidden files: + +- `lib/evaluate.py` except imports needed for route type compatibility. +- Any runtime secrets. +- Any production config defaults outside route feature flags. + +Binary done condition: + +- Pure route function returns expected required agents for every row in the proof matrix. +- Tests prove deterministic top-2 behavior and fallback behavior. + +Verification commands: + +```bash +python3 -m pytest tests/test_agent_routing.py +``` + +Non-claims: + +- Does not prove PR comment posting. 
+- Does not prove production target wiring. + +Prompt-ready handoff: + +```text +implement phase 1b agent identity routing in teleo-infrastructure. own only route module and route tests. preserve compatibility wrappers. route decision must be pure, deterministic, evidence-bearing, and top-2 capped for cross-domain cases. do not touch production API or eval state transitions. +``` + +### Workstream Sub-Spec: Eval Integration + +Classification: local_owner + +Owned files: + +- `lib/evaluate.py` +- `lib/llm.py` +- `lib/eval_parse.py` only if parser normalization is required. +- `tests/test_evaluate_agent_routing.py` +- `tests/test_eval_parse.py` + +Forbidden files: + +- Old deprecated eval shell scripts. +- Deploy scripts unless a feature flag must be exposed. +- Dashboard UI except parser-compatible health checks. + +Binary done condition: + +- With `PHASE1B_AGENT_ROUTING_ENABLED=true`, eval invokes only required reviewer agents. +- With flag disabled, prior behavior remains available. +- One request-changes verdict blocks aggregate approval. +- All approve verdicts continue to existing approval path. + +Verification commands: + +```bash +python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py +``` + +Non-claims: + +- Does not prove live GitHub or VPS behavior. +- Does not prove separate agent GitHub identities. + +Prompt-ready handoff: + +```text +wire phase 1b routing into teleo-infrastructure eval path behind a feature flag. use required agents from the route result, run agent-specific reviews, aggregate verdicts, and preserve merge/feedback semantics. do not revive deprecated scripts or remove rollback path. +``` + +### Workstream Sub-Spec: Staging Proof + +Classification: draft_gated + +Owned files and surfaces: + +- Staging VPS or disposable remote test box. +- Sandbox `decision-engine` repo. +- Staging secrets. +- Machine-readable proof artifact. + +Forbidden files and surfaces: + +- Production VPS services. +- Production GitHub repo. 
+- Production secrets. +- Mainnet/payment/Twitter surfaces. + +Binary done condition: + +- Six single-domain PRs and one cross-domain PR produce expected required-agent verdicts and final dispositions in staging. + +Verification commands: + +```bash +systemctl status teleo-pipeline +journalctl -u teleo-pipeline --since "1 hour ago" +sqlite3 /path/to/pipeline.db "select number, status, domain_agent, leo_verdict, domain_verdict from prs order by number desc limit 20;" +gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments +``` + +Non-claims: + +- Does not prove production 24-hour stability. + +Prompt-ready handoff: + +```text +create a quarantined staging proof for phase 1b. clone or provision a disposable server, disable production services and secrets before starting pipeline, point to a sandbox decision-engine repo, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not mutate production. +``` + +Worker-ready ticket for later staging proof: + +```text +title: phase 1b staging proof on cloned vps +owned surfaces: staging vps, sandbox decision-engine repo, staging secrets, proof artifact +forbidden surfaces: production vps services, production github repo, production secrets +done condition: six single-domain prs plus one cross-domain pr produce expected required-agent verdicts and final dispositions +verification commands: systemd status readback, pipeline log scrape, sqlite route query, github pr comment readback +non-claims: does not prove 24h production stability +preferred executor: human/fwaz with codex support +handoff: create staging clone, disable prod services, inject sandbox config, run phase 1b proof script, save machine-readable proof +``` + +## Acceptance Criteria + +Local PR acceptance: + +- Focused tests pass. +- Router returns correct single-agent routes. +- Router returns top-2 required agents for cross-domain cases. +- Eval layer invokes only required reviewer agents. 
+- Verdict aggregation handles all approve, request changes, transport failure, and missing verdict. +- Existing verdict format remains parseable. +- No production readiness claim is made. + +Staging acceptance: + +- Staging environment cannot mutate production. +- Six single-domain sandbox PRs complete. +- One cross-domain sandbox PR completes. +- Required reviewer agents match proof matrix. +- Proof artifact is retained. + +Production exit: + +- Exact reviewed SHA deployed. +- All six agents produce at least one verdict in their domain. +- At least one cross-domain PR proves top-2 review behavior. +- Pipeline stable for 24 hours. + +## Readiness And Claim Boundaries + +Allowed claims after local implementation: + +- "Route logic is implemented and locally tested." +- "Mocked eval integration proves required-agent invocation and aggregation." +- "The implementation PR is ready for staging proof." + +Forbidden claims after local implementation: + +- "Phase 1b is complete." +- "Production is ready." +- "All six agents have demonstrated live review cycles." +- "The VPS is safely updated." + +Allowed claims after staging proof: + +- "Phase 1b passed sandbox staging proof." +- "The exact SHA is eligible for production cutover review." + +Forbidden claims after staging proof: + +- "Production is stable." +- "Live `decision-engine` PRs are proven." + +Allowed claims after production 24-hour proof: + +- "Phase 1b production exit criteria are met." + +## Spec Quality Self-Audit + +Required execution-grade headings present: + +- Current Implementation Audit: present. +- Goal-Vs-Repo-Truth Diff: present. +- Completion Percent And Remaining Delta: present. +- Closure, Endpoint, And Deployment Truth: present. +- Critical Assumptions And Invalidators: present. +- State And Truth Contract: present. +- Measurement Contract: present. +- Backend Work Required: present. +- Frontend Work Required: present. +- Expected Runtime And User-Visible Behavior: present. 
+- Validation And Test Matrix: present. +- CI/CD, Release, And Pre-Push Gate Contract: present. +- Independent CLI Audit Contract: present. +- Outside-The-Box Fix Paths: present. +- Maintenance Capture: present. +- Parallelization And Fanout: present. + +Additional spec-of-spec coverage: + +- Product Outcome Contract: present. +- Non-Goals: present. +- Program Decomposition: present. +- Priority Matrix: present. +- Score-To-100 Closure Plan: present. +- Workstream sub-specs: present. +- Staging Proof Contract: present. +- Rollback contract: present. + +Known incompleteness: + +- This spec cannot name the exact production deploy command until Fwaz or VPS truth confirms it. +- This spec cannot name the exact sandbox repo until the operator creates or selects it. +- This spec cannot prove whether production daemon code exactly matches local `teleo-infrastructure` until VPS readback exists. + +## Assistant-Added Caveats + +This spec intentionally expands B1/B2 from folder-domain routing to identity-scored agent routing because m3taversal clarified that agent identities should route and folders are only signals. That is the right product interpretation, but it increases implementation scope versus the original simple path classifier. + +This spec does not claim production readiness without staging or VPS proof. diff --git a/docs/phase1b/README.md b/docs/phase1b/README.md new file mode 100644 index 0000000..3473c3c --- /dev/null +++ b/docs/phase1b/README.md @@ -0,0 +1,30 @@ +# Phase 1b Spec Index + +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Scope + +Phase 1b is the `decision-engine` PR evaluation router. It sends each KB mutation to the owning Hermes agent identity, supports top-2 cross-domain review, posts parseable `VERDICT:AGENT:*` comments through one master bot account, preserves existing merge or feedback behavior, and proves the change in staging before production cutover. 
+ +## Specs + +| Workstream | Spec | Implementation posture | +| --- | --- | --- | +| Agent identity router | `docs/phase1b/agent-identity-router-spec.md` | ready_now | +| Eval pipeline integration | `docs/phase1b/eval-pipeline-integration-spec.md` | ready_now after router contract freezes | +| GitHub identity and bot comments | `docs/phase1b/github-identity-bot-posture-spec.md` | ready_now after canonical target config freezes | +| Reporting and contributor compatibility | `docs/phase1b/reporting-contributor-compatibility-spec.md` | ready_now after verdict state shape freezes | +| Staging proof | `docs/phase1b/staging-proof-spec.md` | draft_gated on staging/VPS or disposable remote access | + +## Execution Order + +1. Implement router contract and tests. +2. Wire eval pipeline to required reviewer agents under a feature flag. +3. Route comments through the canonical GitHub target with idempotency markers. +4. Update reporting and contributor accounting to read reviewer sets rather than fixed Leo plus domain slots. +5. Run staging proof on a clone or disposable remote target before production cutover. + +## Claim Boundary + +These specs plus the Phase 1b branch prove only local implementation behavior. A production completion claim requires merged code, passing tests, staging proof, exact production SHA deployment, Leo signoff, and 24-hour production daemon stability. diff --git a/docs/phase1b/agent-identity-router-spec.md b/docs/phase1b/agent-identity-router-spec.md new file mode 100644 index 0000000..dae9fa2 --- /dev/null +++ b/docs/phase1b/agent-identity-router-spec.md @@ -0,0 +1,338 @@ +# Phase 1b Child Spec: Agent Identity Router + +Created: 2026-05-29 +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Product Outcome Contract + +The router decides which Hermes agent identity should review a `decision-engine` KB PR. It must route by agent ownership, with file paths as strong evidence but not the only source of truth. 
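A minimal sketch of what "file paths as strong evidence" could look like, assuming the folder-to-agent ownership listed in this spec's route fixtures and the path weight of 8 from its example `AgentRoute`. The names `PATH_OWNERS`, `PATH_WEIGHT`, and `path_evidence` are hypothetical and are not taken from `lib/agent_routing.py`; the real router also scores branch and keyword signals, which this sketch leaves out.

```python
from typing import Dict, List, Tuple

# Hypothetical ownership table, copied from the route fixtures in this spec.
PATH_OWNERS: Dict[Tuple[str, str], str] = {
    ("domains", "grand-strategy"): "Leo",
    ("domains", "ai-alignment"): "Theseus",
    ("domains", "internet-finance"): "Rio",
    ("domains", "health"): "Vida",
    ("domains", "entertainment"): "Clay",
    ("domains", "space-development"): "Astra",
    ("domains", "energy"): "Astra",
    ("domains", "robotics"): "Astra",
    ("domains", "manufacturing"): "Astra",
    ("core", "living-capital"): "Rio",
    ("core", "living-agents"): "Theseus",
    ("foundations", "cultural-dynamics"): "Clay",
}

# Assumed weight; it matches the example evidence row in this spec.
PATH_WEIGHT = 8


def path_evidence(changed_paths: List[str]) -> List[dict]:
    """Return one evidence row per changed file that sits under an owned folder."""
    evidence = []
    # Sorting keeps the output independent of the order files were listed in.
    for path in sorted(changed_paths):
        parts = path.split("/")
        if len(parts) < 3:
            continue  # need at least root/folder/file to attribute ownership
        agent = PATH_OWNERS.get((parts[0], parts[1]))
        if agent is not None:
            evidence.append(
                {"agent": agent, "signal": "path", "weight": PATH_WEIGHT, "value": path}
            )
    return evidence
```

Paths outside the table produce no evidence here, which is where other signals or the Leo fallback would take over.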
+ +## Goal + +Implement a pure, deterministic, evidence-bearing route scorer that returns one or two required reviewer agents for a PR. + +## Non-Goals + +- Do not call paid LLMs for routing. +- Do not post PR comments. +- Do not mutate pipeline DB state. +- Do not deploy to VPS. +- Do not implement general user-input routing outside PR evaluation. + +## Current Implementation Audit + +Current relevant code: + +- `lib/domains.py` contains `DOMAIN_AGENT_MAP`, `agent_for_domain`, `detect_domain_from_diff`, and `detect_domain_from_branch`. +- `lib/agent_routing.py` now owns the Phase 1b identity-scored route contract. +- The obsolete local `DomainRoute` folder-first draft and its draft tests were removed before this branch was committed. +- Cross-domain PRs now require the top 2 routed agents locally, with `route_kind="escalated"` when more than two agents scored. + +Existing implementation truth: + +- The repo already has domain detection that can be reused for path signals. +- The new route tests cover six primary agents, broadened ownership domains, top-2 cross-domain routing, fallback, and deterministic repeat behavior. +- The existing map includes adjacent domains such as `mechanisms`, `living-capital`, `living-agents`, `critical-systems`, `collective-intelligence`, `teleological-economics`, and `cultural-dynamics`. +- The product owner clarified that Phase 1b should use agent identities to route, not only folder names. + +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | +| --- | --- | --- | +| `docs/phase1b-agent-routing-spec.md` | Umbrella source of truth. | Reuse. | +| `docs/queue.md` | Notes `ai-alignment` domain evolution. | Reuse as a signal for Theseus ownership. | +| `docs/ARCHITECTURE.md` | Describes eval stage shape. | Context only. | + +## Goal-Vs-Repo-Truth Diff + +Goal: + +- Return `AgentRoute` with `primary_agent`, `required_agents`, `route_kind`, `scores`, and `evidence`. +- Cap cross-domain routes at top 2 agents. 
+- Treat folders as evidence, not the complete classifier. +- Be testable without network, DB, GitHub, or LLM calls. + +Repo truth: + +- Existing classifier returns one folder-domain string or `None`. +- No scores, evidence, or top-2 agent set exist. +- Existing tests do not cover identity-broadened ownership. + +## Completion Percent And Remaining Delta + +Current completion on this branch: 100 percent for local route logic, 0 percent for staging route calibration. + +Remaining delta: + +1. Review the route weights against real recent `decision-engine` PRs. +2. Calibrate ambiguous keyword cases from staging evidence. +3. Decide whether escalated routes should remain top-2 total or become Leo plus top-2 later. + +## Closure, Endpoint, And Deployment Truth + +Local closure: + +- Route tests pass. +- No network or DB dependency exists in route tests. + +Staging closure: + +- Staging proof artifact records route scores and evidence for seven sandbox PRs. + +Production closure: + +- Live PR audit rows show route evidence and required agents. + +This child spec alone cannot prove staging or production behavior. + +## Critical Assumptions And Invalidators + +Assumptions: + +- `decision-engine` file layout is close enough to current local clone for path signals to apply. +- Agent identity ownership from m3taversal is authoritative. +- Top-2 cap is acceptable for cross-domain cases. + +Invalidators: + +- Product owner changes cross-domain rule from top 2 to all touched agents. +- Agent ownership boundaries change materially. +- Production PR metadata lacks branch or changed-file data. 
+ +## State And Truth Contract + +Route output schema: + +```python +AgentRoute( + primary_agent="Rio", + required_agents=("Rio",), + route_kind="single", + scores={"Leo": 0, "Theseus": 1, "Rio": 9, "Vida": 0, "Clay": 0, "Astra": 0}, + evidence=[ + {"agent": "Rio", "signal": "path", "weight": 8, "value": "domains/internet-finance/foo.md"} + ], + fallback=False, +) +``` + +`route_kind` values: + +- `single` +- `multi` +- `fallback` +- `escalated` + +`required_agents` must never contain more than two agents in Phase 1b. + +## Measurement Contract + +Required route fixture cases: + +| Fixture | Expected | +| --- | --- | +| `domains/grand-strategy/foo.md` | Leo | +| `domains/ai-alignment/foo.md` | Theseus | +| `domains/internet-finance/foo.md` | Rio | +| `domains/health/foo.md` | Vida | +| `domains/entertainment/foo.md` | Clay | +| `domains/space-development/foo.md` | Astra | +| `domains/energy/foo.md` | Astra | +| `domains/robotics/foo.md` | Astra | +| `domains/manufacturing/foo.md` | Astra | +| `core/living-capital/foo.md` | Rio | +| `core/living-agents/foo.md` | Theseus | +| `foundations/cultural-dynamics/foo.md` | Clay | +| AI plus x402 diff | Theseus and Rio | +| collective AI goals diff | Leo and Theseus | + +Minimum quality metrics: + +- `route_fixture_pass_rate = 100 percent` +- `fallback_count = 0` for known fixtures +- deterministic repeat count: same input returns same result 100 times + +## Backend Work Required + +Owned files: + +- `lib/agent_routing.py` +- `lib/domains.py` +- `tests/test_agent_routing.py` + +Implementation steps: + +1. Move new identity routing into `lib/agent_routing.py`. +2. Keep `lib/domains.py` as compatibility for domain-oriented callers. +3. Define `AGENT_ORDER = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")`. +4. Define identity signals per agent. +5. Add path signal extraction for `domains`, `entities`, `core`, `foundations`, and `agents`. +6. Add branch prefix signal extraction. +7. 
Add capped keyword scoring from filenames and diff text. +8. Add top-2 selection rule. +9. Add fallback to Leo. +10. Add tests. + +Forbidden files: + +- `lib/evaluate.py` +- `lib/llm.py` +- deploy scripts +- secrets or runtime config outside route feature flag wiring + +## Frontend Work Required + +None. + +## Expected Runtime And User-Visible Behavior + +The router itself has no user-visible UI. Its behavior becomes visible through audit logs, PR comment reviewer selection, and proof artifacts. + +Example: + +```text +input: domains/internet-finance/x402-agent-payments.md +output: required_agents = ["Rio"] +``` + +Cross-domain example: + +```text +input: ai systems claim plus x402 payment claim +output: required_agents = ["Theseus", "Rio"] +``` + +## Validation And Test Matrix + +Commands: + +```bash +python3 -m pytest tests/test_agent_routing.py +python3 -m ruff check lib/agent_routing.py lib/domains.py tests/test_agent_routing.py +git diff --check +``` + +Test classes: + +- primary ownership routes +- broadened ownership routes +- branch fallback routes +- keyword routes +- top-2 cross-domain routes +- fallback routes +- deterministic tie-breaking +- compatibility wrapper behavior + +## CI/CD, Release, And Pre-Push Gate Contract + +Before PR: + +- Route tests pass locally. +- No production config defaults change. +- No network dependency enters route tests. + +Before staging: + +- Eval integration spec consumes the route result without modifying route internals. + +Before production: + +- Route evidence appears in staging proof artifact. + +## Independent CLI Audit Contract + +Reviewer commands: + +```bash +git diff -- lib/agent_routing.py lib/domains.py tests/test_agent_routing.py +python3 -m pytest tests/test_agent_routing.py +``` + +Reviewer checks: + +- Route function is pure. +- Scores are explainable. +- Top-2 cap is enforced. +- Folder paths are not the only signal. +- Old callers still work or have a clear migration path. 
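The reviewer checks above (pure function, explainable scores, enforced top-2 cap) can be illustrated with a small selection sketch. `AGENT_ORDER` and the `route_kind` values come from this spec; the function name `select_required_agents` and the `min_score` threshold are assumptions made for illustration, chosen so that the spec's example scores (Rio 9, Theseus 1) still yield a single-agent route.

```python
from typing import Dict, Tuple

AGENT_ORDER = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra")


def select_required_agents(
    scores: Dict[str, int], min_score: int = 3
) -> Tuple[Tuple[str, ...], str]:
    """Pick at most two required agents from per-agent scores.

    Hypothetical helper: `min_score` is an assumed cutoff for a
    "meaningful" score, not a value taken from the real router.
    """
    scored = [a for a in AGENT_ORDER if scores.get(a, 0) >= min_score]
    if not scored:
        # Nothing scored high enough: fall back to Leo.
        return ("Leo",), "fallback"
    # Highest score first; ties are broken by position in AGENT_ORDER,
    # so the same input always produces the same output.
    ranked = sorted(scored, key=lambda a: (-scores[a], AGENT_ORDER.index(a)))
    if len(ranked) == 1:
        return (ranked[0],), "single"
    kind = "multi" if len(ranked) == 2 else "escalated"
    return tuple(ranked[:2]), kind
```

Because the function reads only its arguments and a module constant, it can be exercised without network, DB, or LLM access, which is the property the audit is meant to confirm.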
+ +## Outside-The-Box Fix Paths + +If keyword scoring is noisy: + +- Disable diff keyword scoring and use path plus branch only. +- Use LLM classifier in shadow mode only. +- Add explicit PR label or frontmatter hint later. + +If identity boundaries are ambiguous: + +- Prefer top-2 over fallback when two agents have meaningful scores. +- Log route evidence for later calibration. + +## Maintenance Capture + +Beneficial now: + +- Keep route logic out of `lib/evaluate.py`. +- Keep compatibility wrappers narrow. + +Avoid now: + +- Large domain taxonomy rewrite. +- Dashboard UI changes. +- Paid classifier calls. + +## Parallelization And Fanout + +Classification: local_owner. + +Do not fan out implementation. This module is a root contract consumed by eval integration. + +Worker-ready prompt: + +```text +implement the phase 1b agent identity router in teleo-infrastructure. own lib/agent_routing.py, lib/domains.py compatibility wrappers, and route tests only. make the route function pure, deterministic, evidence-bearing, and capped at top 2 required agents. do not touch eval integration or deploy code. +``` + +## Acceptance Criteria + +- All required route fixtures pass. +- Route result includes primary agent, required agents, route kind, scores, evidence, and fallback status. +- Cross-domain route never requires more than two agents. +- No LLM, network, DB, or GitHub calls occur in the router. + +## Readiness And Claim Boundaries + +Allowed claim: + +- "Agent identity routing is locally implemented and unit-tested." + +Forbidden claim: + +- "Phase 1b eval is complete." + +## Spec Quality Self-Audit + +Required headings present: + +- Current Implementation Audit: present. +- Goal-Vs-Repo-Truth Diff: present. +- Completion Percent And Remaining Delta: present. +- Closure, Endpoint, And Deployment Truth: present. +- Critical Assumptions And Invalidators: present. +- State And Truth Contract: present. +- Measurement Contract: present. +- Backend Work Required: present. 
+- Frontend Work Required: present. +- Expected Runtime And User-Visible Behavior: present. +- Validation And Test Matrix: present. +- CI/CD, Release, And Pre-Push Gate Contract: present. +- Independent CLI Audit Contract: present. +- Outside-The-Box Fix Paths: present. +- Maintenance Capture: present. +- Parallelization And Fanout: present. + +## Assistant-Added Caveats + +This child spec intentionally keeps routing deterministic and no-spend. That may be less semantically smart than an LLM classifier, but it is the right first implementation for Phase 1b because it is testable, cheap, and auditable. diff --git a/docs/phase1b/eval-pipeline-integration-spec.md b/docs/phase1b/eval-pipeline-integration-spec.md new file mode 100644 index 0000000..616d6d2 --- /dev/null +++ b/docs/phase1b/eval-pipeline-integration-spec.md @@ -0,0 +1,343 @@ +# Phase 1b Child Spec: Eval Pipeline Integration + +Created: 2026-05-29 +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Product Outcome Contract + +Pipeline-v2 must use the Phase 1b route result to run the required Hermes agent reviews for a `decision-engine` PR. The old default shape where every non-LIGHT PR receives a domain review plus Leo review must be bypassed when Phase 1b routing is enabled. + +## Goal + +Integrate agent identity routing into `lib/evaluate.py` behind a feature flag, run one or two required reviewer agents, aggregate verdicts, and preserve existing merge or feedback behavior. + +## Non-Goals + +- Do not remove the old eval path until staging proof exists. +- Do not rewrite the full Forgejo/GitHub API abstraction. +- Do not redesign dashboards. +- Do not implement separate GitHub identities. +- Do not change extraction or validation behavior except as needed for eval tests. + +## Current Implementation Audit + +Current relevant code: + +- `lib/evaluate.py::evaluate_pr` owns single PR evaluation. +- `lib/evaluate.py::evaluate_cycle` selects eligible PRs. 
+- `_build_domain_batches` groups STANDARD PRs by DB domain before fetching diffs. +- `_run_batch_domain_eval` runs batch domain reviews, then individual Leo reviews. +- `run_domain_review` in `lib/llm.py` prompts a domain expert through OpenRouter. +- `run_leo_review` in `lib/llm.py` prompts Leo through OpenRouter or Claude path depending on tier. +- `parse_verdict` in `lib/eval_parse.py` parses reviewer-specific verdict tags. +- `approve_pr`, `reopen_pr`, `close_pr`, and `start_review` handle state transitions. + +Current behavior: + +- Diff path detects a domain. +- `agent_for_domain(domain)` selects one domain agent. +- Domain review runs first. +- Leo review runs after domain approval for non-LIGHT PRs. +- `leo_verdict` and `domain_verdict` are the stored verdict fields. +- Contributor credit logic assumes Leo can be one evaluator and `domain_agent` can be the other. + +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | +| --- | --- | --- | +| `docs/phase1b-agent-routing-spec.md` | Parent route and eval contract. | Reuse. | +| `docs/ARCHITECTURE.md` | Existing pipeline stage model. | Reuse as baseline. | +| `docs/multi-model-eval-architecture.md` | Prior Leo-plus-second-model design. | Supersede for Phase 1b eval path only. | +| `handoff/deprecated/eval-scripts.md` | Confirms shell eval scripts are dead. | Reuse to avoid wrong surface. | + +## Goal-Vs-Repo-Truth Diff + +Goal: + +- `evaluate_pr` calls the route scorer. +- Required agents are the only reviewer agents. +- One required agent means one review. +- Two required agents means two reviews and aggregate verdict. +- Default Leo second-review is removed when the feature flag is enabled. +- Old behavior remains available when the feature flag is disabled. + +Branch truth: + +- Legacy eval is still available when the feature flag is false. 
+- When the feature flag is true, eval invokes the identity route, runs required agents only, writes `review_records`, and projects aggregate state back into legacy `leo_verdict` and `domain_verdict` columns. +- Batch eval is disabled while the feature flag is true because stale DB-domain grouping is not route-aware. +- `run_agent_review` exists, but it uses prompt-level identity context rather than loading full KB identity/belief/reasoning files. + +## Completion Percent And Remaining Delta + +Current completion on this branch: 70 percent local implementation behind a default-off feature flag. + +Remaining delta: + +1. Decide direct GitHub `decision-engine` comment transport versus Forgejo-first cutover compatibility. +2. Add duplicate-comment idempotency for retry after partial multi-agent success. +3. Prove with staging PRs and real daemon logs. +4. Update contributor/dashboard assumptions only where staging or tests prove breakage. + +## Closure, Endpoint, And Deployment Truth + +Local closure: + +- Mocked eval tests prove route-to-review-to-aggregate behavior. + +Staging closure: + +- Staging sandbox PRs receive expected comments and DB state transitions. + +Production closure: + +- Live `decision-engine` PRs are handled by Phase 1b route path for 24 hours. + +This spec cannot claim production closure without VPS proof. + +## Critical Assumptions And Invalidators + +Assumptions: + +- Feature flag rollback is acceptable. +- Existing state fields can support Phase 1b initially by storing primary agent in `domain_agent` and aggregate details in audit rows. +- A DB schema migration is avoidable for the first PR. +- Master bot comments with `VERDICT:AGENT:*` are acceptable. + +Invalidators: + +- Downstream merge logic requires formal reviews from separate GitHub users. +- Dashboards or contributor credit fail hard when Leo is not present. +- Batch eval cannot be safely disabled and must be route-aware from day one. +- Production env cannot set feature flags. 
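The rollback assumption above depends on the flag being off unless an operator turns it on. A minimal sketch of default-off flag parsing follows; the flag name `PHASE1B_AGENT_ROUTING_ENABLED` is the one this spec uses, while the helper name `phase1b_routing_enabled` and the set of accepted truthy strings are assumptions, not the contents of `lib/config.py`.

```python
import os
from typing import Mapping

FLAG_NAME = "PHASE1B_AGENT_ROUTING_ENABLED"

# Assumed set of values that count as "on"; anything else is treated as off.
_TRUTHY = {"1", "true", "yes", "on"}


def phase1b_routing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is explicitly set to a truthy value."""
    raw = env.get(FLAG_NAME, "false")
    return raw.strip().lower() in _TRUTHY
```

Passing the environment as a parameter keeps the helper testable without mutating the real process environment.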
+ +## State And Truth Contract + +Feature flag: + +```text +PHASE1B_AGENT_ROUTING_ENABLED=false +``` + +When false: + +- Existing eval behavior continues. + +When true: + +- Eval route is built for every non-bypass PR. +- Audit log records route JSON. +- Required agent reviews run. +- Aggregate verdict determines approval or feedback. + +Minimal DB field use: + +- `domain`: keep route primary domain or `multi`. +- `domain_agent`: keep primary agent. +- `domain_verdict`: keep aggregate non-Leo review verdict or aggregate verdict. +- `leo_verdict`: set `skipped` unless Leo is a required agent; if Leo is required, store Leo verdict. +- `review_records`: write one row per required reviewer attempt with reviewer agent, model, outcome, and notes. +- audit log: route and all per-agent verdicts. + +This is a compatibility posture, not the ideal long-term schema. + +## Measurement Contract + +Required local assertions: + +- Phase 1b flag disabled uses old runner calls. +- Phase 1b flag enabled calls `run_agent_review` once for single route. +- Phase 1b flag enabled calls `run_agent_review` twice for multi route. +- `run_leo_review` is not called unless Leo is in `required_agents`. +- all approve returns approved aggregate. +- one request changes returns feedback aggregate. +- transport failure reopens for retry. +- retry after a partial multi-agent success does not duplicate existing posted verdict comments. + +## Backend Work Required + +Owned files: + +- `lib/evaluate.py` +- `lib/llm.py` +- `lib/config.py` +- `lib/eval_parse.py` only if parser compatibility needs explicit tests or normalization. +- `tests/test_evaluate_agent_routing.py` +- `tests/test_eval_parse.py` + +Implementation steps: + +1. Add `PHASE1B_AGENT_ROUTING_ENABLED` to `lib/config.py`. +2. Import route scorer. +3. Add `run_agent_review` in `lib/llm.py`. +4. Add helper to load agent context from KB worktree. +5. Add `aggregate_agent_verdicts`. +6. 
In `evaluate_pr`, after bypasses and diff filtering, branch into Phase 1b path when flag is true. +7. In Phase 1b path, run required reviews and post comments through the existing API helper. +8. Update DB fields conservatively. +9. Write `review_records` rows for each required reviewer attempt. +10. Preserve old logic under flag false. +11. Disable `_build_domain_batches` while flag is true or make it route-aware. + +Forbidden files: + +- Deprecated eval shell scripts. +- Deployment scripts unless needed for documenting the flag. +- Runtime secrets. + +## Frontend Work Required + +None. + +## Expected Runtime And User-Visible Behavior + +Single-agent example: + +```text +PR touches internet finance. +route.required_agents = ["Rio"] +pipeline posts a Rio verdict. +merge proceeds if Rio approves. +``` + +Cross-agent example: + +```text +PR touches AI systems and x402 payments. +route.required_agents = ["Theseus", "Rio"] +pipeline posts Theseus and Rio verdicts. +merge proceeds only if both approve. +``` + +Fallback example: + +```text +PR cannot be confidently routed. +route.required_agents = ["Leo"] +pipeline posts Leo verdict. +route_kind = fallback is audited. +``` + +## Validation And Test Matrix + +Commands: + +```bash +python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py +python3 -m ruff check lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py +git diff --check +``` + +Test cases: + +- flag-off old behavior smoke +- flag-on single reviewer approve +- flag-on single reviewer request changes +- flag-on two reviewer approve +- flag-on two reviewer one reject +- missing verdict +- transport failure +- Leo required route +- Leo not required route +- batch disabled or route-aware under flag + +## CI/CD, Release, And Pre-Push Gate Contract + +Before PR: + +- Focused tests pass. +- Old behavior remains behind flag false. +- No production default flips to true. 
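The aggregation cases listed in the test matrix above (all approve, one request-changes, missing verdict, transport failure) can be sketched as one pure function. The name `aggregate_agent_verdicts` comes from the implementation steps, but the signature, the verdict strings, and the return values here are hypothetical. Two assumptions are made explicit: a missing verdict is treated like a transport failure, and an incomplete review set is retried before any request-changes feedback is reported.

```python
from typing import Dict, Optional, Sequence

# Hypothetical verdict labels; the real parser may use different strings.
APPROVE = "approve"
REQUEST_CHANGES = "request_changes"
TRANSPORT_FAILURE = "transport_failure"


def aggregate_agent_verdicts(
    required_agents: Sequence[str], verdicts: Dict[str, Optional[str]]
) -> str:
    """Collapse per-agent verdicts into 'approved', 'feedback', or 'retry'."""
    # Only required agents count; verdicts from other agents are ignored.
    outcomes = [verdicts.get(agent) for agent in required_agents]
    # Assumption: a missing verdict is handled like a transport failure.
    if any(o is None or o == TRANSPORT_FAILURE for o in outcomes):
        return "retry"
    # One request-changes verdict blocks aggregate approval.
    if any(o == REQUEST_CHANGES for o in outcomes):
        return "feedback"
    if outcomes and all(o == APPROVE for o in outcomes):
        return "approved"
    # Unrecognized verdict strings or an empty reviewer set: do not approve.
    return "retry"
```

Keeping this logic separate from `evaluate_pr` lets the mocked tests assert each case without touching DB state or comment posting.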
+ +Before staging: + +- Operator can enable flag in staging env. +- Sandbox repo target is configured. + +Before production: + +- Staging proof artifact exists. +- Rollback command is known. + +## Independent CLI Audit Contract + +Reviewer commands: + +```bash +git diff -- lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py +python3 -m pytest tests/test_evaluate_agent_routing.py +``` + +Reviewer checks: + +- No deprecated scripts revived. +- No secrets introduced. +- Feature flag false preserves old path. +- Feature flag true bypasses default Leo second-review. +- Cross-domain aggregate requires all required reviewers to approve. + +## Outside-The-Box Fix Paths + +If compatibility fields become confusing: + +- Add a narrow DB migration for `route_json` and `agent_verdicts_json`. + +If batch eval blocks safe integration: + +- Disable batch eval under Phase 1b flag for one release. + +If LLM review prompts get too large: + +- Load only identity plus beliefs first, then add reasoning/skills later. + +## Maintenance Capture + +Beneficial now: + +- Isolate Phase 1b logic into helpers instead of expanding `evaluate_pr` deeply. +- Keep rollback path explicit. + +Avoid now: + +- Full eval architecture rewrite. +- Dashboard redesign. +- Broad DB migration unless tests require it. + +## Parallelization And Fanout + +Classification: local_owner. + +Do not fan out before the router contract lands. Eval integration depends tightly on route result semantics. + +Worker-ready prompt: + +```text +wire phase 1b routing into teleo-infrastructure eval behind PHASE1B_AGENT_ROUTING_ENABLED. own lib/evaluate.py, lib/llm.py, lib/config.py, and mocked eval tests. run required agents from the route result, aggregate verdicts, preserve old behavior when the flag is false, and do not revive deprecated scripts. +``` + +## Acceptance Criteria + +- Flag false path remains available. +- Flag true path runs required agents only. +- One or two verdicts aggregate correctly. 
+- Existing merge or feedback path is preserved. +- Focused mocked tests pass. + +## Readiness And Claim Boundaries + +Allowed claim: + +- "Phase 1b eval integration is locally tested behind a feature flag." + +Forbidden claim: + +- "Phase 1b is live." + +## Spec Quality Self-Audit + +All required execution-grade headings are present. This spec intentionally defers exact production commands to the staging/proof child spec because they depend on VPS truth. + +## Assistant-Added Caveats + +The compatibility use of `domain_verdict` and `leo_verdict` is a pragmatic Phase 1b bridge. A cleaner route schema may be worth adding after staging proof, but a premature migration would widen the blast radius. diff --git a/docs/phase1b/github-identity-bot-posture-spec.md b/docs/phase1b/github-identity-bot-posture-spec.md new file mode 100644 index 0000000..0fde97a --- /dev/null +++ b/docs/phase1b/github-identity-bot-posture-spec.md @@ -0,0 +1,296 @@ +# Phase 1b Child Spec: GitHub Identity And Bot Posture + +Created: 2026-05-29 +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Product Outcome Contract + +Phase 1b must post agent-specific verdicts for `decision-engine` PRs without requiring six separate GitHub accounts. Agent identity is represented in the comment content and verdict tags, while a single master bot account owns transport. + +## Goal + +Define and implement the minimum GitHub identity and comment transport posture for Phase 1b: + +- canonical target is `living-ip/decision-engine`; +- one master bot token is acceptable; +- verdict comments preserve `VERDICT:AGENT:*`; +- duplicate comments are prevented; +- old Forgejo or mirror behavior remains rollback-safe until staging proof. + +## Non-Goals + +- Do not create separate GitHub users for all agents. +- Do not require GitHub branch protection to count separate formal reviewers in Phase 1b. +- Do not rewrite every Forgejo-named helper unless needed for Phase 1b comments. 
+- Do not redesign contributor credit. +- Do not revive deprecated eval shell scripts. + +## Current Implementation Audit + +Current truth: + +- `pipeline-health-check.py` targets `https://api.github.com/repos/living-ip/decision-engine`. +- `research/research-session.sh` targets GitHub `living-ip/decision-engine` and `github-admin-token`. +- `handoff/phase1-step3-script-migration.md` documents Phase 1 single `livingIPbot` posture and defers per-agent identities. +- `lib/config.py` still defaults to Forgejo `teleo/teleo-codex`. +- `lib/github_feedback.py` hardcodes `living-ip/teleo-codex` and reads `github-pat`, not `decision-engine` and `github-admin-token`. +- `lib/evaluate.py` posts review comments through Forgejo helpers and per-agent Forgejo tokens. +- `lib/github_feedback.py` is a mirror feedback channel keyed by `prs.github_pr`, not the canonical review transport. +- `deploy/sync-mirror.sh` still references `living-ip/teleo-codex`. +- Fwaz confirmed separate GitHub identities are ideal and blocked on GitHub/PAT setup; Phase 1b implementation should not wait on six distinct accounts if the pipeline can post parseable `VERDICT:AGENT:*` comments through the pipeline bot. + +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | +| --- | --- | --- | +| `docs/phase1b-agent-routing-spec.md` | Parent identity posture. | Reuse. | +| `handoff/phase1-step3-script-migration.md` | Documents single bot token and GitHub `decision-engine` migration for scripts. | Reuse. | +| `handoff/deprecated/eval-scripts.md` | Confirms old eval scripts should not be revived. | Reuse. | + +## Goal-Vs-Repo-Truth Diff + +Goal: + +- One canonical GitHub target for Phase 1b: `living-ip/decision-engine`. +- One master bot token for Phase 1b comments. +- Agent identity lives in verdict tags and comment headings. +- Comment posting supports idempotency by PR, head SHA, and agent. + +Repo truth: + +- GitHub target and token names are split across files. 
+- Eval comments still use Forgejo helpers. +- GitHub feedback is non-fatal mirror feedback, not agent review transport. + +## Completion Percent And Remaining Delta + +Current completion: 15 percent. + +Remaining delta: + +1. Add explicit GitHub target config with staging override. +2. Normalize token file selection or document compatibility. +3. Add Phase 1b comment posting helper for GitHub `decision-engine`. +4. Add idempotency marker. +5. Add tests for URL target, token path, missing token, and duplicate prevention. +6. Decide direct GitHub mode versus Forgejo-mirror mode before staging. + +## Closure, Endpoint, And Deployment Truth + +Local closure: + +- Tests prove comments target `living-ip/decision-engine` and token material is not logged. + +Staging closure: + +- Sandbox PR comments are posted by master bot with agent verdict tags. + +Production closure: + +- Live `decision-engine` PR comments are posted by master bot without duplicates. + +## Critical Assumptions And Invalidators + +Assumptions: + +- One bot account is enough for Phase 1b. +- Agent identity in verdict content satisfies acceptance. +- Formal GitHub reviews from distinct accounts are not required now. +- Per-agent PATs can be added later without changing the route contract. + +Invalidators: + +- Branch protection requires distinct GitHub reviewer identities. +- GitHub org disallows the selected PAT or bot account. +- Production daemon must remain Forgejo-first for the cutover window. +- Direct GitHub PRs lack the DB linkage used by existing `github_feedback`. + +## State And Truth Contract + +Comment idempotency marker: + +```text + +``` + +Verdict marker remains: + +```text + +``` + +Required config: + +```python +GITHUB_OWNER = "living-ip" +GITHUB_REPO = "decision-engine" +GITHUB_TOKEN_FILE = SECRETS_DIR / "github-admin-token" +``` + +Staging must override repo or owner without code changes. 
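The staging-override requirement above could be met by reading the owner and repo from the environment, with the canonical values as defaults. This is a minimal sketch, not the existing `lib/config.py`: the environment variable names `PHASE1B_GITHUB_OWNER` and `PHASE1B_GITHUB_REPO` and the helpers `github_target`, `github_repo_api_url`, and `github_token_file` are assumptions made for illustration.

```python
import os
from pathlib import Path

# Canonical Phase 1b target named in this spec.
DEFAULT_GITHUB_OWNER = "living-ip"
DEFAULT_GITHUB_REPO = "decision-engine"

# Hypothetical override names; these are not existing lib/config.py settings.
OWNER_ENV = "PHASE1B_GITHUB_OWNER"
REPO_ENV = "PHASE1B_GITHUB_REPO"


def github_target(env=None):
    """Return (owner, repo); staging may override either through env vars."""
    env = os.environ if env is None else env
    owner = env.get(OWNER_ENV) or DEFAULT_GITHUB_OWNER
    repo = env.get(REPO_ENV) or DEFAULT_GITHUB_REPO
    return owner, repo


def github_repo_api_url(env=None):
    """Build the REST API base URL for the configured repository."""
    owner, repo = github_target(env)
    return f"https://api.github.com/repos/{owner}/{repo}"


def github_token_file(secrets_dir):
    """Return the token path only; the token value is never read or logged here."""
    return Path(secrets_dir) / "github-admin-token"
```

Passing `env` explicitly keeps the URL-builder and staging-override tests free of process-level state.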
+ +## Measurement Contract + +Minimum tests: + +- URL builder targets `https://api.github.com/repos/living-ip/decision-engine`. +- Staging override changes target. +- Missing token returns non-fatal failure and audit detail. +- Token value is never logged. +- Duplicate marker prevents repeat comment for same PR, SHA, and agent. +- Six agent verdict tags remain parseable. + +## Backend Work Required + +Owned files: + +- `lib/github_feedback.py` or a new `lib/github_reviews.py`. +- `lib/config.py`. +- `lib/evaluate.py` only where the eval integration calls the comment helper. +- `tests/test_github_identity.py` or equivalent. + +Implementation steps: + +1. Add canonical GitHub target config. +2. Add token lookup that prefers `github-admin-token` for Phase 1b and can fall back only if explicitly configured. +3. Add comment helper for agent verdict comments. +4. Add idempotency marker and readback check. +5. Add tests. +6. Wire eval integration to the helper under Phase 1b flag. + +Forbidden files: + +- Deprecated eval shell scripts. +- Production secrets. +- Broad deploy rewrite. + +## Frontend Work Required + +None. + +## Expected Runtime And User-Visible Behavior + +PR comment example: + +```text +## Rio review + + + + + +``` + +The GitHub account may be a master bot. The comment content must show which agent reviewed. + +## Validation And Test Matrix + +Commands: + +```bash +python3 -m pytest tests/test_github_identity.py tests/test_eval_parse.py +python3 -m ruff check lib/github_feedback.py lib/config.py tests/test_github_identity.py +git diff --check +``` + +Test cases: + +- canonical target +- staging override +- missing token +- no token logging +- idempotent comment marker +- all six verdict tags parse + +## CI/CD, Release, And Pre-Push Gate Contract + +Before PR: + +- Local tests prove target and idempotency. + +Before staging: + +- Sandbox repo token exists. +- Production token is not used. 
+ +Before production: + +- Bot account has comment permissions on `decision-engine`. +- Rollback path is old Forgejo or disabled Phase 1b flag. + +## Independent CLI Audit Contract + +Reviewer checks: + +```bash +rg -n "teleo-codex|decision-engine|github-admin-token|github-pat|VERDICT|PHASE1B_REVIEW" lib tests pipeline-health-check.py research deploy +``` + +Audit questions: + +- Which files still target `teleo-codex`? +- Are those files in the Phase 1b runtime path? +- Does any log path expose token values? +- Does idempotency prevent duplicate comments? + +## Outside-The-Box Fix Paths + +If direct GitHub comments are not safe in the first PR: + +- Keep Forgejo review transport and post GitHub mirror feedback only in staging. +- Add a dry-run comment mode that writes the planned body into audit logs. + +If GitHub PAT remains blocked: + +- Use a GitHub App only for comment posting. +- Keep master bot for git push but app token for PR comments. + +## Maintenance Capture + +Beneficial now: + +- Name GitHub target config clearly. +- Avoid proliferating `github-pat` versus `github-admin-token`. + +Avoid now: + +- Separate agent GitHub users. +- Full mirror rewrite. +- Contributor identity overhaul. + +## Parallelization And Fanout + +Classification: ready_now after the implementer explicitly chooses direct GitHub comments or Forgejo-mirror compatibility for the Phase 1b flag path. + +Worker-ready prompt: + +```text +implement phase 1b github review comment posture. use one master bot token, target living-ip/decision-engine with staging override support, add agent-specific verdict comment helper with idempotency marker, and prove no token leakage. do not create separate agent accounts or rewrite deploy/mirror broadly. +``` + +## Acceptance Criteria + +- Phase 1b comment helper targets `decision-engine`. +- Master bot can post agent verdict tags. +- Duplicate comments are prevented. +- Missing token is non-fatal and auditable. 
+- Existing old transport remains rollback-safe. + +## Readiness And Claim Boundaries + +Allowed claim: + +- "Master-bot GitHub verdict comment posture is locally specified/tested." + +Forbidden claim: + +- "Separate agent GitHub identities are solved." + +## Spec Quality Self-Audit + +All required execution-grade headings are present. The exact direct-GitHub versus Forgejo-mirror cutover remains a deliberate implementation decision because current daemon code is Forgejo-first. + +## Assistant-Added Caveats + +The repo has real target drift between `teleo-codex` and `decision-engine`. Do not hide that drift in the eval implementation. The Phase 1b PR should either fix the runtime path it uses or explicitly leave non-runtime references for a later migration. diff --git a/docs/phase1b/reporting-contributor-compatibility-spec.md b/docs/phase1b/reporting-contributor-compatibility-spec.md new file mode 100644 index 0000000..8ead10d --- /dev/null +++ b/docs/phase1b/reporting-contributor-compatibility-spec.md @@ -0,0 +1,275 @@ +# Phase 1b Child Spec: Reporting And Contributor Compatibility + +Created: 2026-05-29 +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Product Outcome Contract + +Phase 1b must not make dashboards, health checks, or contributor credit lie about review state. Reporting may stay minimal, but it must not mark a cross-domain PR as ready before all required agents have reviewed. + +## Goal + +Update compatibility surfaces so Phase 1b required-agent reviews are represented accurately enough for operations, health, and contributor attribution without doing a dashboard redesign. + +## Non-Goals + +- Do not redesign the dashboard UI. +- Do not implement a new leaderboard model. +- Do not require a broad DB migration unless `review_records` is insufficient. +- Do not make production-readiness claims from health-check summaries alone. 
+ +## Current Implementation Audit + +Current truth: + +- `lib/db.py` already has `review_records` with `pr_number`, `domain`, `agent`, `reviewer`, `reviewer_model`, `outcome`, `rejection_reason`, and `notes`. +- `lib/contributor.py` assumes Leo reviews every PR and credits Leo plus one `domain_agent`. +- `lib/health.py` computes approval rates from `domain_verdict` and `leo_verdict`. +- `lib/health.py` builds reviewer strings only from `domain_verdict` and `leo_verdict`. +- `pipeline-health-check.py` can parse arbitrary `VERDICT:AGENT:*` tags, but it has no required-agent concept. +- A cross-domain PR with one approval and one missing required review could be misclassified if reporting only checks "any approve". + +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | +| --- | --- | --- | +| `docs/phase1b-agent-routing-spec.md` | Parent route/verdict state. | Reuse. | +| `docs/ARCHITECTURE.md` | Health/dashboard baseline. | Reuse as context. | +| `docs/DIAGNOSTICS-AGENT-SPEC.md` | Diagnostics philosophy. | Reuse as later direction, not immediate scope. | + +## Goal-Vs-Repo-Truth Diff + +Goal: + +- Required-agent state is visible enough to avoid false readiness. +- Contributor evaluator credit follows actual approved reviewer agents. +- Health and pipeline checks can distinguish incomplete cross-domain review. + +Repo truth: + +- Legacy fields only represent `domain_verdict` plus `leo_verdict`. +- Contributor credit hardcodes Leo as universal reviewer. +- `pipeline-health-check.py` parses comments but does not know required reviewers. + +## Completion Percent And Remaining Delta + +Current completion: 10 percent because `review_records` already exists. + +Remaining delta: + +1. Ensure eval integration writes one `review_records` row per required reviewer. +2. Update contributor attribution to prefer approved `review_records`. +3. Keep legacy fields as projection only. +4. Add optional route marker parsing to `pipeline-health-check.py`. +5. 
Add tests proving no partial-review false readiness. + +## Closure, Endpoint, And Deployment Truth + +Local closure: + +- Tests prove contributor credit and stage classification respect required reviewers. + +Staging closure: + +- Staging proof artifact and health readback agree on required-agent completion. + +Production closure: + +- Production health does not show PRs as ready before all required agents approve. + +## Critical Assumptions And Invalidators + +Assumptions: + +- `review_records` is available in production DB schema. +- Eval integration can write `review_records` for each required reviewer. +- Dashboards can tolerate legacy projections during Phase 1b. + +Invalidators: + +- Production DB lacks `review_records`. +- Contributor code path cannot query `review_records` without performance issues. +- Branch protection or merge logic uses legacy fields directly for readiness. + +## State And Truth Contract + +`review_records` becomes the compatibility source for per-agent reviewer history. + +Required eval write: + +```text +one review_records row per required reviewer per PR attempt +``` + +Legacy projection: + +- `domain_agent = primary_agent` +- `domain_verdict = aggregate_verdict` +- `leo_verdict = actual Leo verdict when Leo is required, else skipped` + +Route/audit JSON remains the source for `required_agents`. + +## Measurement Contract + +Minimum compatibility metrics: + +- `review_records_written_count` +- `required_reviews_missing_count` +- `partial_review_not_ready_count` +- `contributor_evaluator_credit_count_by_agent` + +Minimum proof: + +- A two-agent PR with one approval and one missing verdict is not classified as ready. +- A two-agent PR with two approvals is classified as ready. +- Contributor credit includes both approved reviewers. + +## Backend Work Required + +Owned files: + +- `lib/contributor.py` +- `lib/health.py` +- `pipeline-health-check.py` +- `tests/test_contributor.py` or new focused test. 
+- `tests/test_pipeline_health_phase1b.py` if added. + +Implementation steps: + +1. Confirm `review_records` exists in local schema and migrations. +2. Update eval integration spec to write review records per required reviewer. +3. Update contributor credit to prefer approved `review_records.reviewer` rows. +4. Fall back to legacy `leo_verdict` and `domain_verdict` for old data. +5. Update health output to include review records or route audit fields where available. +6. Update pipeline health check to read required-agent markers if present. +7. Add tests. + +Forbidden work: + +- Dashboard redesign. +- New leaderboard model. +- Broad schema migration before proof requires it. + +## Frontend Work Required + +None. + +## Expected Runtime And User-Visible Behavior + +Operators should see: + +- Per-agent reviewer outcomes when available. +- Cross-domain PRs not marked ready until all required reviewers approve. +- Contributor credit reflecting actual approved reviewer agents. + +Existing dashboard layout can remain unchanged if data is honest. + +## Validation And Test Matrix + +Commands: + +```bash +python3 -m pytest tests/test_contributor.py tests/test_pipeline_health_phase1b.py +python3 -m ruff check lib/contributor.py lib/health.py pipeline-health-check.py tests +git diff --check +``` + +Test cases: + +- old data fallback credits Leo/domain reviewer. +- new `review_records` data credits all approved required reviewers. +- request-changes reviewer receives no evaluator credit. +- one missing required reviewer blocks ready classification. +- all required reviewers approve enables ready classification. + +## CI/CD, Release, And Pre-Push Gate Contract + +Before PR: + +- Compatibility tests pass or are documented as not runnable due to missing dev deps. + +Before staging: + +- Staging proof includes health and contributor-readback commands. + +Before production: + +- Operator verifies no partial-review false readiness in logs/health readback. 
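The contributor-credit rule in this spec (prefer approved `review_records` rows, fall back to the legacy verdict fields for old data) might look like the sketch below. It is illustrative only: the function name `evaluator_credit`, the dict-shaped rows, and the literal values `"approved"` and `"approve"` are assumptions, since the actual `outcome` and verdict vocabularies are not stated here.

```python
# Assumed vocabularies; confirm against lib/db.py and lib/contributor.py.
APPROVED_OUTCOME = "approved"
LEGACY_APPROVE = "approve"


def evaluator_credit(review_rows, legacy_pr):
    """Return reviewer agents that should receive evaluator credit.

    review_rows: dicts with at least "reviewer" and "outcome" keys.
    legacy_pr: dict with "leo_verdict", "domain_verdict", "domain_agent".
    """
    credited = []
    if review_rows:
        # New data: credit only reviewers whose recorded outcome is an approval.
        for row in review_rows:
            reviewer = row.get("reviewer")
            if row.get("outcome") == APPROVED_OUTCOME and reviewer and reviewer not in credited:
                credited.append(reviewer)
        return credited

    # Old data: no review_records rows, so use the legacy two-verdict fields.
    if legacy_pr.get("leo_verdict") == LEGACY_APPROVE:
        credited.append("Leo")
    domain_agent = legacy_pr.get("domain_agent")
    if (
        legacy_pr.get("domain_verdict") == LEGACY_APPROVE
        and domain_agent
        and domain_agent not in credited
    ):
        credited.append(domain_agent)
    return credited
```

When any `review_records` rows exist, the legacy fields are ignored entirely, so a request-changes reviewer cannot pick up credit through the projection.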
+ +## Independent CLI Audit Contract + +Reviewer commands: + +```bash +rg -n "Leo reviews every PR|leo_verdict|domain_verdict|review_records|required_agents|VERDICT" lib pipeline-health-check.py tests +sqlite3 /path/to/pipeline.db ".schema review_records" +``` + +Reviewer checks: + +- `review_records` is preferred for new evaluator credit. +- Legacy fallback remains for old rows. +- Health does not rely on any-approve for multi-review readiness. + +## Outside-The-Box Fix Paths + +If `review_records` is insufficient: + +- Add additive `route_json` and `agent_verdicts_json` columns to `prs`. + +If `pipeline-health-check.py` cannot read route markers: + +- Treat cross-domain PRs as awaiting review unless all verdict tags expected by route artifact are present. + +If contributor credit is too risky for Phase 1b: + +- Defer credit mutation and emit review-record-only proof until after eval stability. + +## Maintenance Capture + +Beneficial now: + +- Replace comments claiming "Leo reviews every PR." +- Add focused tests for the compatibility projection. + +Avoid now: + +- Dashboard UI rewrite. +- Historical backfill. +- Leaderboard redesign. + +## Parallelization And Fanout + +Classification: ready_now after eval integration establishes review record writes. + +Worker-ready prompt: + +```text +make reporting and contributor attribution phase 1b-compatible. prefer review_records for new evaluator credit, preserve legacy fallback, and prevent health/pipeline checks from marking cross-domain prs ready before all required agents approve. do not redesign dashboards or add broad schema migrations unless tests prove necessary. +``` + +## Acceptance Criteria + +- No code path claims Leo reviews every new Phase 1b PR. +- Approved `review_records` can credit all required reviewer agents. +- Health/check logic avoids partial-review false readiness. +- Legacy data still renders. 
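The "no partial-review false readiness" criterion reduces to an all-required-approve check rather than an any-approve check. A minimal sketch follows, assuming verdicts are available as a mapping from agent name to verdict string and that `"approve"` is the approval value; `missing_reviews` and `is_ready` are hypothetical helper names, not existing functions in `lib/health.py`.

```python
def missing_reviews(required_agents, verdicts):
    """Return required agents that have not posted any verdict yet."""
    return [agent for agent in required_agents if agent not in verdicts]


def is_ready(required_agents, verdicts):
    """A PR is ready only when every required agent has approved.

    An empty required list is treated as not ready, so a PR with an
    unknown route is never reported as reviewed.
    """
    if not required_agents:
        return False
    return all(verdicts.get(agent) == "approve" for agent in required_agents)
```

For a cross-domain PR requiring Theseus and Rio, a single approval from Theseus leaves `is_ready` false and `missing_reviews` returns `["Rio"]`.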
+ +## Readiness And Claim Boundaries + +Allowed claim: + +- "Reporting compatibility is updated to avoid false readiness and credit loss." + +Forbidden claim: + +- "Dashboards are redesigned for Phase 1b." + +## Spec Quality Self-Audit + +All required execution-grade headings are present. This spec is intentionally compatibility-scoped and does not attempt a full reporting product redesign. + +## Assistant-Added Caveats + +The safest first move is to write accurate `review_records` and route audit JSON. Rich dashboards should wait until production behavior proves stable. diff --git a/docs/phase1b/staging-proof-spec.md b/docs/phase1b/staging-proof-spec.md new file mode 100644 index 0000000..123c30f --- /dev/null +++ b/docs/phase1b/staging-proof-spec.md @@ -0,0 +1,356 @@ +# Phase 1b Child Spec: Staging Proof + +Created: 2026-05-29 +Status: active draft +Parent spec: `docs/phase1b-agent-routing-spec.md` + +## Product Outcome Contract + +Phase 1b must be tested without mutating the production VPS or production `decision-engine` PRs. A staging clone or disposable remote test box must prove routing, verdict posting, and merge or feedback behavior against a sandbox target before production cutover. + +## Goal + +Define the staging proof path for Phase 1b: provision an isolated production-like runtime, disable production authority, run six single-domain PR cycles plus one cross-domain PR cycle, save a machine-readable proof artifact, then destroy or shut down the staging environment. + +## Non-Goals + +- Do not mutate production PRs. +- Do not use production GitHub tokens in staging. +- Do not prove 24-hour production stability. +- Do not promote a mutated staging server as production. +- Do not test payment, wallet, Twitter, or mainnet flows. + +## Current Implementation Audit + +Known repo truth: + +- `systemd/teleo-pipeline.service` defines the production-style pipeline service. +- `deploy/` contains deployment and mirror scripts. 
+- `docs/ARCHITECTURE.md` documents VPS path assumptions and SQLite state. +- `docs/INFRASTRUCTURE.md` documents production as Hetzner `77.42.65.182`, root path `/opt/teleo-eval`, diagnostics on port `8081`, and health on port `8080`. +- `deploy/auto-deploy.sh` pulls from `/opt/teleo-eval/workspaces/deploy-infra`, syncs code into runtime paths, restarts changed Python services, and updates `/opt/teleo-eval/.last-deploy-sha` after smoke checks. +- `systemd/teleo-pipeline.service` expects `/opt/teleo-eval/pipeline/fix-ownership.sh`, while this repo stores that script under `deploy/fix-ownership.sh`; staging bootstrap must verify the live runtime path before assuming the unit works. +- `handoff/phase1-step3-script-migration.md` documents GitHub migration posture and `decision-engine` target for scripts. +- `handoff/deprecated/eval-scripts.md` confirms old eval scripts are dead. +- Fwaz described the current production update path as `pull -> services recognize pull -> edit on VPS -> PR to Leo`; staging proof must treat that as an unsafe legacy behavior to replace, not as a release gate. +- Fwaz approved Crabbox as the long-term disposable staging/test-box direction. + +Unknown production truth: + +- Exact current deployed SHA. +- Whether production service files match this repo. +- Whether production still points at Forgejo in the live daemon. +- Exact restart/deploy commands used by Fwaz or agents. +- Current secrets layout. +- Current `systemctl cat` output for `teleo-pipeline`, `teleo-diagnostics`, auto-deploy timers, cron-like research jobs, Telegram-related services, and any agent daemons. +- Whether production has uncommitted hotfixes, generated scripts, or local service patches under `/opt/teleo-eval`. +- Read-only live access is not available in this workspace; the infrastructure audit attempted SSH readback and hit authentication denial, so no production SHA or service state should be claimed from this spec. 
+ +## Existing-Spec Inventory + +| Existing doc | Relevance | Decision | +| --- | --- | --- | +| `docs/phase1b-agent-routing-spec.md` | Parent proof requirements. | Reuse. | +| `docs/ARCHITECTURE.md` | VPS topology and service assumptions. | Reuse with current-readback requirement. | +| `systemd/teleo-pipeline.service` | Service command template. | Reuse as staging baseline. | +| `handoff/phase1-step3-script-migration.md` | GitHub `decision-engine` target context. | Reuse. | + +## Goal-Vs-Repo-Truth Diff + +Goal: + +- Staging proof runs against sandbox `decision-engine`. +- Production services and secrets are disabled before test daemon starts. +- Proof artifact captures routes, verdicts, final PR states, SHAs, DB schema, feature flags, and logs. + +Repo truth: + +- Staging automation does not exist. +- No proof script exists for seven PR cases. +- No machine-readable Phase 1b proof schema exists outside the umbrella spec. + +## Completion Percent And Remaining Delta + +Current completion: 0 percent. + +Remaining delta: + +1. Choose staging substrate: Hetzner snapshot clone, Crabbox, or another disposable test box. +2. Define sandbox repo. +3. Define staging secrets. +4. Write or run proof sequence. +5. Retain proof artifact. +6. Confirm staging cannot mutate production. + +## Closure, Endpoint, And Deployment Truth + +Staging closure means: + +- Staging environment is isolated. +- Sandbox PRs are created and processed. +- Required reviewer verdicts appear in PR comments. +- Pipeline state transitions match expected behavior. +- Proof artifact exists. + +Production closure is separate and requires exact reviewed SHA deployment plus 24-hour readback. + +## Critical Assumptions And Invalidators + +Assumptions: + +- A VPS snapshot or disposable equivalent can run the pipeline. +- Production secrets can be removed or replaced before daemon start. +- A sandbox GitHub repo can be used. 
+- The proof can run without real production inference spend, or spend is explicitly approved. + +Invalidators: + +- Clone boots production services before quarantine. +- Sandbox target cannot receive PRs/comments. +- No operator has cloud or VPS access. +- Secrets cannot be separated from production. +- Service paths on production are materially different from repo docs. + +## State And Truth Contract + +Proof artifact path should be under staging, then copied back into the PR or retained artifact location. Suggested filename: + +```text +proof/phase1b-staging-proof-YYYYMMDD-HHMMSS.json +``` + +Required JSON fields: + +```json +{ + "phase": "1b", + "schema_version": 1, + "environment": { + "kind": "hetzner_snapshot|crabbox|disposable_remote", + "host": "...", + "snapshot_id": "...", + "created_from_prod_host": "77.42.65.182" + }, + "teleo_infrastructure_sha": "...", + "decision_engine_target": "living-ip/decision-engine-sandbox", + "pipeline_db_schema": 26, + "feature_flags": {"PHASE1B_AGENT_ROUTING_ENABLED": "true"}, + "safety": { + "prod_services_disabled": true, + "prod_timers_disabled": true, + "prod_crons_disabled": true, + "prod_secrets_removed": true, + "auto_merge_constrained": true + }, + "test_cases": [], + "verification_outputs": { + "service_status_path": "...", + "journal_excerpt_path": "...", + "db_snapshot_path": "...", + "github_comments_path": "..." + }, + "rollback": { + "production_sha_before": "...", + "candidate_sha": "...", + "rollback_command": "..." + }, + "created_at": "..." 
+} +``` + +Each test case: + +```json +{ + "case": "internet-finance", + "pr": 12, + "required_agents": ["Rio"], + "posted_verdicts": {"Rio": "approve"}, + "final_state": "approved", + "route_kind": "single" +} +``` + +## Measurement Contract + +Minimum staging cases: + +- grand strategy -> Leo +- ai systems or ai alignment -> Theseus +- internet finance -> Rio +- health -> Vida +- entertainment -> Clay +- space, robotics, energy, or advanced manufacturing -> Astra +- cross-domain ai plus x402 -> Theseus and Rio + +Pass criteria: + +- 7 of 7 route decisions match expected required agents. +- 7 of 7 PRs receive parseable verdict comments. +- No production repo receives comments. +- No production service remains enabled during staging run. + +## Backend Work Required + +Owned surfaces: + +- Staging host. +- Sandbox repo. +- Staging env/config. +- Proof artifact generator or manual proof script. + +Implementation steps: + +1. Snapshot or provision staging environment. +2. Block public/prod access. +3. Disable production services. +4. Remove production secrets. +5. Set hostname to staging. +6. Configure sandbox target. +7. Deploy exact implementation SHA. +8. Enable Phase 1b feature flag. +9. Create seven sandbox PRs. +10. Run pipeline until verdicts and states are visible. +11. Save proof artifact. +12. Shut down or destroy staging. + +## Frontend Work Required + +None. + +## Expected Runtime And User-Visible Behavior + +Operator sees: + +- Staging service status. +- Sandbox PR comments with agent verdict tags. +- SQLite rows or logs showing route decisions. +- Proof artifact summarizing pass/fail. + +No production user-visible behavior should change during staging. + +## Validation And Test Matrix + +Commands will vary by staging substrate. 
Baseline readback: + +```bash +hostname +git -C /opt/teleo-eval/workspaces/deploy-infra rev-parse HEAD +cat /opt/teleo-eval/.last-deploy-sha +systemctl is-active teleo-pipeline teleo-diagnostics teleo-auto-deploy.timer +systemctl list-timers | grep -E 'teleo|sync|extract|research' || true +curl -s localhost:8080/health | python3 -m json.tool +journalctl -u teleo-pipeline --since "1 hour ago" --no-pager +sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select max(version) from schema_version;" +sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select number,status,domain,domain_agent,leo_verdict,domain_verdict,auto_merge,github_pr from prs order by number desc limit 20;" +gh pr list --repo living-ip/decision-engine-sandbox --state all +gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments +``` + +Safety checks: + +```bash +systemctl is-enabled teleo-pipeline +systemctl cat teleo-pipeline +systemctl cat teleo-diagnostics +grep -R "github-admin-token" /opt/teleo-eval/secrets 2>/dev/null +git -C /opt/teleo-eval/workspaces/main remote -v +``` + +## CI/CD, Release, And Pre-Push Gate Contract + +Before staging: + +- Code PR has passed local tests. +- Sandbox target selected. +- Staging secrets prepared. + +Before production: + +- Staging proof artifact exists. +- Exact SHA to deploy is recorded. +- Rollback path is recorded. +- Leo approval/signoff for the exact reviewed SHA is recorded. +- The production cutover avoids direct agent self-edits on the VPS. + +## Independent CLI Audit Contract + +Auditor should verify: + +- Staging host is not production. +- Production services were disabled before test daemon start. +- GitHub target is sandbox. +- Proof artifact PR IDs belong to sandbox repo. +- Logs show no production mutation. + +## Outside-The-Box Fix Paths + +If Hetzner snapshot clone is too risky: + +- Use Crabbox with a synced checkout and fake/sandbox services. +- Use a fresh Hetzner server and repo checkout instead of disk clone. 
+- Use local fake GitHub/Forgejo API for pure pipeline proof. + +Substrate guidance: + +- Prefer a Hetzner snapshot clone for canonical staging proof because it exercises `/opt/teleo-eval`, systemd units, timers, runtime user ownership, SQLite path assumptions, and deploy scripts. +- Crabbox is acceptable and preferred long-term as `disposable_remote` proof for command/test execution, but it does not count as VPS-clone fidelity unless it recreates the same unit files, runtime paths, service user, database path, and deploy flow. +- A local fake GitHub/Forgejo API can prove parser and state logic, but it cannot close the staging acceptance gate for real GitHub comments. + +If inference spend is a concern: + +- Mock agent review responses in staging. +- Use a staging-specific cheap model. +- Run only one real model call after mocked proof passes. + +## Maintenance Capture + +Beneficial now: + +- Add a reusable `proof/phase1b` script later if manual staging repeats. +- Record exact service and config readback. + +Avoid now: + +- Building a full deployment platform. +- Giving Crabbox or staging production secrets. +- Replacing production with staging server. + +## Parallelization And Fanout + +Classification: draft_gated. + +This can be delegated to Fwaz or the infrastructure owner after code PR exists. + +Worker-ready prompt: + +```text +run phase 1b staging proof without mutating production. provision or clone a staging box, disable production services and secrets before starting the daemon, point the runtime at a sandbox decision-engine repo, enable phase 1b routing, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not touch production prs or production secrets. +``` + +## Acceptance Criteria + +- Staging is isolated. +- Seven sandbox PR cases run. +- Required agents match expected matrix. +- Verdicts are parseable. +- Proof artifact exists. +- Staging is stopped or destroyed after proof. 
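The pass criteria in this spec can be checked mechanically against the proof artifact. The sketch below is a hedged illustration: apart from `"internet-finance"`, which appears in the example test case, the case names in `EXPECTED_REQUIRED_AGENTS` are assumptions, and `check_proof` is a hypothetical helper rather than an existing script.

```python
# Expected routing matrix from the Measurement Contract.
# Case names other than "internet-finance" are illustrative assumptions.
EXPECTED_REQUIRED_AGENTS = {
    "grand-strategy": {"Leo"},
    "ai-systems": {"Theseus"},
    "internet-finance": {"Rio"},
    "health": {"Vida"},
    "entertainment": {"Clay"},
    "space-development": {"Astra"},
    "cross-domain-ai-x402": {"Theseus", "Rio"},
}

# Safety flags that must be true before any staging result counts.
REQUIRED_SAFETY_FLAGS = (
    "prod_services_disabled",
    "prod_timers_disabled",
    "prod_crons_disabled",
    "prod_secrets_removed",
)


def check_proof(artifact):
    """Return a list of failure strings; an empty list means the proof passes."""
    failures = []
    safety = artifact.get("safety", {})
    for flag in REQUIRED_SAFETY_FLAGS:
        if safety.get(flag) is not True:
            failures.append(f"safety flag not true: {flag}")

    cases = {case.get("case"): case for case in artifact.get("test_cases", [])}
    for name, expected in EXPECTED_REQUIRED_AGENTS.items():
        case = cases.get(name)
        if case is None:
            failures.append(f"{name}: case missing")
            continue
        if set(case.get("required_agents", [])) != expected:
            failures.append(f"{name}: required_agents mismatch")
        posted = case.get("posted_verdicts", {})
        for agent in sorted(expected):
            if agent not in posted:
                failures.append(f"{name}: no verdict from {agent}")
    return failures
```

Returning failure strings rather than a boolean lets the operator attach the exact mismatches to the proof record.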
+ +## Readiness And Claim Boundaries + +Allowed claim: + +- "Phase 1b passed staging proof." + +Forbidden claim: + +- "Production Phase 1b is complete." + +## Spec Quality Self-Audit + +All required execution-grade headings are present. Exact production commands remain unknown until VPS truth is read back. + +## Assistant-Added Caveats + +Crabbox is useful here only as a disposable staging/test substrate. It should not receive production secrets until there is a deliberate security review. diff --git a/lib/agent_routing.py b/lib/agent_routing.py new file mode 100644 index 0000000..c78b176 --- /dev/null +++ b/lib/agent_routing.py @@ -0,0 +1,287 @@ +"""Phase 1b Hermes agent routing. + +Routes knowledge-base PRs to the agent identity that owns the changed domain. +This module is deliberately pure: no network, database, LLM, or filesystem IO. +""" + +from __future__ import annotations + +import re +from dataclasses import asdict, dataclass + +AGENT_ORDER: tuple[str, ...] = ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra") +_AGENT_RANK = {agent: idx for idx, agent in enumerate(AGENT_ORDER)} + +DOMAIN_AGENT_MAP: dict[str, str] = { + "grand-strategy": "Leo", + "strategy": "Leo", + "teleohumanity": "Leo", + "collective-intelligence": "Leo", + "ai-alignment": "Theseus", + "ai-systems": "Theseus", + "living-agents": "Theseus", + "critical-systems": "Theseus", + "internet-finance": "Rio", + "mechanisms": "Rio", + "living-capital": "Rio", + "teleological-economics": "Rio", + "health": "Vida", + "entertainment": "Clay", + "cultural-dynamics": "Clay", + "space-development": "Astra", + "space": "Astra", + "robotics": "Astra", + "energy": "Astra", + "manufacturing": "Astra", + "advanced-manufacturing": "Astra", +} + +_AGENT_PRIMARY_DOMAIN: dict[str, str] = { + "leo": "grand-strategy", + "theseus": "ai-systems", + "rio": "internet-finance", + "vida": "health", + "clay": "entertainment", + "astra": "space-development", +} + +_INGESTION_SOURCE_DOMAIN: dict[str, str] = { + 
"futardio": "internet-finance", + "metadao": "internet-finance", + "x402": "internet-finance", +} + +_DOMAIN_PATH_RE = re.compile(r"^(?:domains|entities|core|foundations)/([^/]+)/") +_AGENT_PATH_RE = re.compile(r"^agents/([^/]+)/") + +_KEYWORDS: dict[str, tuple[str, ...]] = { + "Leo": ( + "grand strategy", + "collective ai", + "collective ais", + "collective goals", + "goal of the collective", + "self-understanding", + "self understanding", + "teleohumanity", + "meta-governance", + ), + "Theseus": ( + "ai alignment", + "ai systems", + "ai safety", + "agent alignment", + "prompt injection", + "model behavior", + "llm", + "hermes runtime", + ), + "Rio": ( + "internet finance", + "x402", + "wallet", + "payment", + "payments", + "onchain", + "defi", + "futarchy", + "metadao", + "prediction market", + "decision market", + "stablecoin", + ), + "Vida": ( + "health", + "medicine", + "clinical", + "patient", + "doctor", + "disease", + "longevity", + "biotech", + "glp-1", + ), + "Clay": ( + "entertainment", + "game", + "games", + "media", + "story", + "film", + "music", + "culture", + ), + "Astra": ( + "space", + "robotics", + "robot", + "energy", + "manufacturing", + "advanced manufacturing", + "hardware", + "satellite", + "rocket", + "nuclear", + ), +} + + +@dataclass(frozen=True) +class RouteEvidence: + agent: str + signal: str + weight: int + value: str + + +@dataclass(frozen=True) +class AgentRoute: + primary_agent: str + required_agents: tuple[str, ...] + route_kind: str + scores: dict[str, int] + evidence: tuple[RouteEvidence, ...] + fallback: bool = False + touched_domains: tuple[str, ...] 
= () + + def to_audit_dict(self) -> dict: + return { + "primary_agent": self.primary_agent, + "required_agents": list(self.required_agents), + "route_kind": self.route_kind, + "scores": self.scores, + "evidence": [asdict(item) for item in self.evidence], + "fallback": self.fallback, + "touched_domains": list(self.touched_domains), + } + + +def _changed_paths(diff: str) -> tuple[str, ...]: + paths: list[str] = [] + for line in diff.splitlines(): + if not line.startswith("diff --git "): + continue + match = re.match(r"diff --git a/(.*?) b/(.*)$", line) + if match: + paths.append(match.group(2)) + return tuple(paths) + + +def _add_score( + scores: dict[str, int], + evidence: list[RouteEvidence], + agent: str, + signal: str, + weight: int, + value: str, +) -> None: + if agent not in scores: + return + scores[agent] += weight + evidence.append(RouteEvidence(agent=agent, signal=signal, weight=weight, value=value)) + + +def _domain_for_branch(branch: str) -> str | None: + prefix = branch.split("/")[0].lower() if "/" in branch else "" + if prefix in _AGENT_PRIMARY_DOMAIN: + return _AGENT_PRIMARY_DOMAIN[prefix] + if prefix == "ingestion": + rest = branch.split("/", 1)[1].lower() if "/" in branch else "" + for source_key, domain in _INGESTION_SOURCE_DOMAIN.items(): + if source_key in rest: + return domain + return None + + +def _keyword_hits(agent: str, text: str) -> list[str]: + hits = [] + for keyword in _KEYWORDS[agent]: + pattern = rf"(? 
AgentRoute: + """Classify a PR into one or two required Hermes reviewer agents.""" + max_required_agents = max(1, min(max_required_agents, 2)) + scores = {agent: 0 for agent in AGENT_ORDER} + evidence: list[RouteEvidence] = [] + touched_domains: list[str] = [] + path_signal_found = False + + for path in _changed_paths(diff): + domain_match = _DOMAIN_PATH_RE.match(path) + if domain_match: + domain = domain_match.group(1).lower() + if domain in DOMAIN_AGENT_MAP: + agent = DOMAIN_AGENT_MAP[domain] + _add_score(scores, evidence, agent, "path", 8, path) + touched_domains.append(domain) + path_signal_found = True + continue + + agent_match = _AGENT_PATH_RE.match(path) + if agent_match: + agent_key = agent_match.group(1).lower() + for agent in AGENT_ORDER: + if agent.lower() == agent_key: + _add_score(scores, evidence, agent, "agent_path", 8, path) + path_signal_found = True + break + + if branch and not path_signal_found: + branch_domain = _domain_for_branch(branch) + if branch_domain: + agent = DOMAIN_AGENT_MAP[branch_domain] + _add_score(scores, evidence, agent, "branch", 4, branch) + touched_domains.append(branch_domain) + + keyword_text = "\n".join(part for part in (title or "", body or "", branch or "", diff) if part).lower() + for agent in AGENT_ORDER: + hits = _keyword_hits(agent, keyword_text) + for keyword in hits[:4]: + _add_score(scores, evidence, agent, "keyword", 2, keyword) + + ranked = sorted( + (agent for agent, score in scores.items() if score > 0), + key=lambda agent: (-scores[agent], _AGENT_RANK[agent]), + ) + + if not ranked: + evidence.append(RouteEvidence(agent="Leo", signal="fallback", weight=0, value="no route signal")) + return AgentRoute( + primary_agent="Leo", + required_agents=("Leo",), + route_kind="fallback", + scores=scores, + evidence=tuple(evidence), + fallback=True, + touched_domains=(), + ) + + primary = ranked[0] + required = tuple(ranked[:max_required_agents]) + if len(ranked) > max_required_agents: + route_kind = "escalated" + elif 
len(required) > 1: + route_kind = "multi" + else: + route_kind = "single" + + return AgentRoute( + primary_agent=primary, + required_agents=required, + route_kind=route_kind, + scores=scores, + evidence=tuple(evidence), + fallback=False, + touched_domains=tuple(dict.fromkeys(touched_domains)), + ) diff --git a/lib/config.py b/lib/config.py index b408fd4..7775293 100644 --- a/lib/config.py +++ b/lib/config.py @@ -192,6 +192,11 @@ SAMPLE_AUDIT_MODEL = MODEL_OPUS # Opus for audit — different family from Haik BATCH_EVAL_MAX_PRS = int(os.environ.get("BATCH_EVAL_MAX_PRS", "5")) BATCH_EVAL_MAX_DIFF_BYTES = int(os.environ.get("BATCH_EVAL_MAX_DIFF_BYTES", "100000")) # 100KB +# --- Phase 1b agent routing --- +# When enabled, eval uses the identity router to run exactly the routed Hermes +# reviewer agents instead of the legacy domain review + default Leo review path. +PHASE1B_AGENT_ROUTING_ENABLED = os.environ.get("PHASE1B_AGENT_ROUTING_ENABLED", "false").lower() == "true" + # --- Tier logic --- # LIGHT_SKIP_LLM: when True, LIGHT PRs skip domain+Leo review entirely (auto-approve on Tier 0 pass). # Set False for shadow mode (domain review runs but logs only). Flip True after 24h validation (Rhea). 
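The `PHASE1B_AGENT_ROUTING_ENABLED` flag in the `lib/config.py` hunk above is read with `os.environ.get(..., "false").lower() == "true"`. A minimal sketch of that comparison follows; the helper name `flag_enabled` and the variable `EXAMPLE_FLAG` are hypothetical and used only for illustration. It suggests that only the literal string `true` (in any letter case) turns the flag on, so values such as `1` or `yes` would leave routing disabled.

```python
import os


def flag_enabled(name: str, default: str = "false") -> bool:
    """Mirror the comparison used in the config hunk: lowercase, then == "true"."""
    return os.environ.get(name, default).lower() == "true"


# Illustrative values only; EXAMPLE_FLAG is not a real pipeline setting.
os.environ["EXAMPLE_FLAG"] = "True"
enabled_mixed_case = flag_enabled("EXAMPLE_FLAG")

os.environ["EXAMPLE_FLAG"] = "1"
enabled_numeric = flag_enabled("EXAMPLE_FLAG")

del os.environ["EXAMPLE_FLAG"]
enabled_unset = flag_enabled("EXAMPLE_FLAG")
```

When writing the staging environment file, it is probably safest to set the value to exactly `true` and read it back from the running service.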
diff --git a/lib/domains.py b/lib/domains.py index bb1979a..825a84a 100644 --- a/lib/domains.py +++ b/lib/domains.py @@ -12,14 +12,20 @@ DOMAIN_AGENT_MAP: dict[str, str] = { "entertainment": "Clay", "health": "Vida", "ai-alignment": "Theseus", + "ai-systems": "Theseus", "space-development": "Astra", + "space": "Astra", + "robotics": "Astra", + "energy": "Astra", + "manufacturing": "Astra", + "advanced-manufacturing": "Astra", "mechanisms": "Rio", "living-capital": "Rio", "living-agents": "Theseus", "teleohumanity": "Leo", "grand-strategy": "Leo", "critical-systems": "Theseus", - "collective-intelligence": "Theseus", + "collective-intelligence": "Leo", "teleological-economics": "Rio", "cultural-dynamics": "Clay", } @@ -31,7 +37,7 @@ VALID_DOMAINS: frozenset[str] = frozenset(DOMAIN_AGENT_MAP.keys()) _AGENT_PRIMARY_DOMAIN: dict[str, str] = { "rio": "internet-finance", "clay": "entertainment", - "theseus": "ai-alignment", + "theseus": "ai-systems", "vida": "health", "astra": "space-development", "leo": "grand-strategy", diff --git a/lib/evaluate.py b/lib/evaluate.py index ddb7de5..530dcf4 100644 --- a/lib/evaluate.py +++ b/lib/evaluate.py @@ -24,7 +24,9 @@ import random from datetime import datetime, timezone from . 
import config, db +from .agent_routing import AgentRoute, classify_pr_route from .domains import agent_for_domain, detect_domain_from_branch, detect_domain_from_diff +from .eval_actions import dispose_rejected_pr, post_formal_approvals, terminate_pr from .eval_parse import ( deterministic_tier, diff_contains_claim_type, @@ -38,12 +40,10 @@ from .eval_parse import ( ) from .forgejo import api as forgejo_api from .forgejo import get_agent_token, get_pr_diff, repo_path -from .merge import PIPELINE_OWNED_PREFIXES -from .llm import run_batch_domain_review, run_domain_review, run_leo_review, triage_pr -from .eval_actions import dispose_rejected_pr, post_formal_approvals, terminate_pr from .github_feedback import on_eval_complete +from .llm import run_agent_review, run_batch_domain_review, run_domain_review, run_leo_review, triage_pr +from .merge import PIPELINE_OWNED_PREFIXES from .pr_state import approve_pr, close_pr, reopen_pr, start_review -from .validate import load_existing_claims logger = logging.getLogger("pipeline.evaluate") @@ -57,6 +57,216 @@ logger = logging.getLogger("pipeline.evaluate") # ─── Single PR evaluation ───────────────────────────────────────────────── +def _phase1b_domain_for_route(route: AgentRoute) -> str: + if route.route_kind in ("multi", "escalated"): + return "multi" + if route.touched_domains: + return route.touched_domains[0] + return "general" + + +def _phase1b_review_model(agent: str, tier: str) -> str: + if agent == "Leo": + return config.EVAL_LEO_STANDARD_MODEL + return config.EVAL_DOMAIN_MODEL + + +def _phase1b_compat_verdicts(agent_verdicts: dict[str, str]) -> tuple[str, str]: + """Project arbitrary routed verdicts into legacy leo/domain columns.""" + leo_verdict = agent_verdicts.get("Leo", "skipped") + non_leo = [verdict for agent, verdict in agent_verdicts.items() if agent != "Leo"] + aggregate = "request_changes" if "request_changes" in agent_verdicts.values() else "approve" + domain_verdict = aggregate if non_leo else "skipped" + 
return leo_verdict, domain_verdict + + +async def _evaluate_pr_phase1b( + conn, + pr_number: int, + *, + tier: str, + diff: str, + review_diff: str, + files: str, + branch_name: str, + eval_attempts: int, + pr_cost: float, +) -> dict: + """Evaluate a PR using the Phase 1b identity router.""" + from . import costs + + route = classify_pr_route(diff, branch=branch_name) + domain = _phase1b_domain_for_route(route) + route_context = json.dumps(route.to_audit_dict(), sort_keys=True) + + conn.execute( + "UPDATE prs SET domain = ?, domain_agent = ? WHERE number = ?", + (domain, route.primary_agent, pr_number), + ) + db.audit( + conn, + "evaluate", + "phase1b_route", + json.dumps({"pr": pr_number, "tier": tier, "route": route.to_audit_dict()}), + ) + + reviews: dict[str, str] = {} + agent_verdicts: dict[str, str] = {} + usage_by_agent: dict[str, dict] = {} + + for agent in route.required_agents: + logger.info("PR #%d: Phase 1b %s review (tier=%s, route=%s)", pr_number, agent, tier, route.route_kind) + review_text, usage = await run_agent_review(review_diff, files, agent, route_context, tier=tier) + if review_text is None: + reopen_pr(conn, pr_number) + if pr_cost > 0: + conn.execute("UPDATE prs SET cost_usd = cost_usd + ? 
WHERE number = ?", (pr_cost, pr_number)) + return { + "pr": pr_number, + "skipped": True, + "reason": "phase1b_agent_review_failed", + "agent": agent, + } + + verdict = parse_verdict(review_text, agent) + reviews[agent] = review_text + agent_verdicts[agent] = verdict + usage_by_agent[agent] = usage + + await forgejo_api( + "POST", + repo_path(f"issues/{pr_number}/comments"), + {"body": review_text}, + ) + + db.record_review( + conn, + pr_number, + "approved" if verdict == "approve" else "rejected", + domain=domain, + agent=route.primary_agent, + reviewer=agent, + reviewer_model=_phase1b_review_model(agent, tier), + rejection_reason=",".join(parse_issues(review_text)) if verdict == "request_changes" else None, + notes=review_text, + ) + + aggregate_approve = all(verdict == "approve" for verdict in agent_verdicts.values()) + leo_verdict, domain_verdict = _phase1b_compat_verdicts(agent_verdicts) + conn.execute( + "UPDATE prs SET leo_verdict = ?, domain_verdict = ?, domain_model = ? WHERE number = ?", + (leo_verdict, domain_verdict, "phase1b-agent-routing", pr_number), + ) + + for agent, usage in usage_by_agent.items(): + model = _phase1b_review_model(agent, tier) + pr_cost += costs.record_usage( + conn, + model, + "eval_agent", + input_tokens=usage.get("prompt_tokens", 0), + output_tokens=usage.get("completion_tokens", 0), + backend="openrouter", + ) + + if aggregate_approve: + pr_info = await forgejo_api("GET", repo_path(f"pulls/{pr_number}")) + pr_author = pr_info.get("user", {}).get("login", "") if pr_info else "" + await post_formal_approvals(pr_number, pr_author) + + is_agent_pr = not branch_name.startswith(PIPELINE_OWNED_PREFIXES) + approve_pr( + conn, + pr_number, + domain=domain, + auto_merge=1 if is_agent_pr else 0, + leo_verdict=leo_verdict, + domain_verdict=domain_verdict, + ) + db.audit( + conn, + "evaluate", + "phase1b_approved", + json.dumps( + { + "pr": pr_number, + "tier": tier, + "route": route.to_audit_dict(), + "agent_verdicts": agent_verdicts, + 
"auto_merge": is_agent_pr, + } + ), + ) + try: + await on_eval_complete(conn, pr_number, outcome="approved", review_text="\n\n".join(reviews.values())) + except Exception: + logger.exception("PR #%d: GitHub eval feedback failed (non-fatal)", pr_number) + else: + all_issues: list[str] = [] + for agent, verdict in agent_verdicts.items(): + if verdict == "request_changes": + all_issues.extend(parse_issues(reviews[agent])) + + reopen_pr( + conn, + pr_number, + leo_verdict=leo_verdict, + domain_verdict=domain_verdict, + last_error="phase1b agent review requested changes", + eval_issues=json.dumps(all_issues), + ) + feedback = { + "route": route.to_audit_dict(), + "agent_verdicts": agent_verdicts, + "tier": tier, + "issues": all_issues, + } + conn.execute( + "UPDATE sources SET feedback = ? WHERE path = (SELECT source_path FROM prs WHERE number = ?)", + (json.dumps(feedback), pr_number), + ) + db.audit( + conn, + "evaluate", + "phase1b_changes_requested", + json.dumps( + { + "pr": pr_number, + "tier": tier, + "route": route.to_audit_dict(), + "agent_verdicts": agent_verdicts, + "issues": all_issues, + } + ), + ) + await dispose_rejected_pr(conn, pr_number, eval_attempts, all_issues) + try: + await on_eval_complete( + conn, + pr_number, + outcome="rejected", + review_text="\n\n".join(reviews.values()), + issues=all_issues, + ) + except Exception: + logger.exception("PR #%d: GitHub eval feedback failed (non-fatal)", pr_number) + + if pr_cost > 0: + conn.execute("UPDATE prs SET cost_usd = cost_usd + ? WHERE number = ?", (pr_cost, pr_number)) + + return { + "pr": pr_number, + "tier": tier, + "domain": domain, + "phase1b": True, + "route": route.to_audit_dict(), + "agent_verdicts": agent_verdicts, + "approved": aggregate_approve, + "leo_verdict": leo_verdict, + "domain_verdict": domain_verdict, + } + + async def evaluate_pr(conn, pr_number: int, tier: str = None) -> dict: """Evaluate a single PR. Returns result dict.""" from . 
import costs @@ -201,6 +411,19 @@ async def evaluate_pr(conn, pr_number: int, tier: str = None) -> dict: (pr_number,), ) + if config.PHASE1B_AGENT_ROUTING_ENABLED: + return await _evaluate_pr_phase1b( + conn, + pr_number, + tier=tier, + diff=diff, + review_diff=review_diff, + files=files, + branch_name=branch_name, + eval_attempts=eval_attempts, + pr_cost=pr_cost, + ) + # Check if domain review already completed (resuming after Leo rate limit) existing = conn.execute("SELECT domain_verdict, leo_verdict FROM prs WHERE number = ?", (pr_number,)).fetchone() existing_domain_verdict = existing["domain_verdict"] if existing else "pending" @@ -543,7 +766,7 @@ async def _run_batch_domain_eval( "diff": review_diff, "files": files, "full_diff": diff, # kept for Leo review - "file_count": len([l for l in files.split("\n") if l.strip()]), + "file_count": len([line for line in files.split("\n") if line.strip()]), }) claimed_prs.append(pr_num) @@ -581,7 +804,7 @@ async def _run_batch_domain_eval( "UPDATE prs SET domain = COALESCE(domain, ?), domain_agent = ? WHERE number IN ({})".format( ",".join("?" * len(claimed_prs)) ), - [domain, agent] + claimed_prs, + [domain, agent, *claimed_prs], ) # Step 2: Run batch domain review @@ -859,8 +1082,12 @@ async def evaluate_cycle(conn, max_workers=None) -> tuple[int, int]: succeeded = 0 failed = 0 - # Group STANDARD PRs by domain for batch eval - domain_batches, individual_prs = _build_domain_batches(rows, conn) + # Phase 1b routes per PR by identity and supports cross-domain top-2 review, + # so stale DB-domain batching is disabled while the feature flag is on. 
+ if config.PHASE1B_AGENT_ROUTING_ENABLED: + domain_batches, individual_prs = {}, list(rows) + else: + domain_batches, individual_prs = _build_domain_batches(rows, conn) # Process batch domain reviews first for domain, batch_prs in domain_batches.items(): diff --git a/lib/llm.py b/lib/llm.py index 1e72c0e..d4044fa 100644 --- a/lib/llm.py +++ b/lib/llm.py @@ -117,6 +117,48 @@ End your review with exactly one of: --- CHANGED FILES --- {files}""" +AGENT_REVIEW_PROMPT = """You are {agent}, a Hermes evaluator for TeleoHumanity's knowledge base. + +You are reviewing this PR because the Phase 1b router assigned it to your agent identity. +Route context: +{route_context} + +IMPORTANT — This PR may contain different content types: +- **Claims** (type: claim): arguable assertions with confidence levels. Review fully. +- **Entities** (type: entity, files in entities/): descriptive records of projects, people, protocols. Do NOT reject entities for missing confidence or source fields — they have a different schema. +- **Sources** (files in inbox/): archive metadata. Auto-approve these. + +Review this PR through your assigned identity. For EACH criterion below, write one sentence stating what you found: + +1. **Domain ownership** — Is this change inside your area of responsibility? If not, still review the portion relevant to your routed responsibility. +2. **Factual accuracy** — Are the claims/entities factually correct? Name any specific errors. +3. **Confidence calibration** — For claims only. Is the confidence level right for the evidence? +4. **System impact** — Does this change alter how agents, domains, or the collective understand goals, incentives, or operating assumptions? +5. **Wiki links** — Note broken [[wiki links]], but do NOT let them affect your verdict. Broken links are expected. + +VERDICT RULES: +- APPROVE if claims are factually correct and evidence supports them. +- APPROVE entity files unless they contain factual errors. 
+- APPROVE even if wiki links are broken. +- REQUEST_CHANGES only for blocking factual errors, duplicated evidence, clear confidence miscalibration, or a materially wrong domain/system implication. + +{style_guide} + +If requesting changes, tag the specific issues using ONLY these tags (do not invent new tags): + + +Valid tags: frontmatter_schema, title_overclaims, confidence_miscalibration, date_errors, factual_discrepancy, near_duplicate, scope_error + +End your review with exactly one of: + + + +--- PR DIFF --- +{diff} + +--- CHANGED FILES --- +{files}""" + LEO_PROMPT_STANDARD = """You are Leo, the lead evaluator for TeleoHumanity's knowledge base. IMPORTANT — Content types have DIFFERENT schemas: @@ -420,6 +462,28 @@ async def run_domain_review(diff: str, files: str, domain: str, agent: str) -> t return result, usage +async def run_agent_review( + diff: str, + files: str, + agent: str, + route_context: str = "", + tier: str = "STANDARD", +) -> tuple[str | None, dict]: + """Run a Phase 1b routed Hermes agent review via OpenRouter.""" + prompt = AGENT_REVIEW_PROMPT.format( + agent=agent, + agent_upper=agent.upper(), + route_context=route_context or "(no route context)", + style_guide=REVIEW_STYLE_GUIDE, + diff=diff, + files=files, + ) + model = config.EVAL_LEO_STANDARD_MODEL if agent == "Leo" else config.EVAL_DOMAIN_MODEL + timeout = config.EVAL_TIMEOUT_OPUS if tier == "DEEP" and agent == "Leo" else config.EVAL_TIMEOUT + result, usage = await openrouter_call(model, prompt, timeout_sec=timeout) + return result, usage + + async def run_leo_review(diff: str, files: str, tier: str) -> tuple[str | None, dict]: """Run Leo review. DEEP → Opus (Claude Max, queue if limited). STANDARD → GPT-4o (OpenRouter). 
diff --git a/tests/test_agent_routing.py b/tests/test_agent_routing.py new file mode 100644 index 0000000..8ec23a9 --- /dev/null +++ b/tests/test_agent_routing.py @@ -0,0 +1,129 @@ +"""Tests for Phase 1b identity-based agent routing.""" + +from lib.agent_routing import AGENT_ORDER, classify_pr_route + + +def _diff_for(*paths_and_lines: tuple[str, str] | str) -> str: + chunks = [] + for item in paths_and_lines: + if isinstance(item, tuple): + path, line = item + else: + path, line = item, "+content" + chunks.append(f"diff --git a/{path} b/{path}\n{line}") + return "\n".join(chunks) + + +def test_six_primary_domains_route_to_expected_agents(): + expected = { + "grand-strategy": "Leo", + "ai-alignment": "Theseus", + "internet-finance": "Rio", + "health": "Vida", + "entertainment": "Clay", + "space-development": "Astra", + } + + for domain, agent in expected.items(): + route = classify_pr_route(_diff_for(f"domains/{domain}/claim.md")) + assert route.primary_agent == agent + assert route.required_agents == (agent,) + assert route.route_kind == "single" + assert route.fallback is False + + +def test_broadened_identity_domains_route_to_owners(): + expected = { + "ai-systems": "Theseus", + "living-agents": "Theseus", + "living-capital": "Rio", + "collective-intelligence": "Leo", + "cultural-dynamics": "Clay", + "energy": "Astra", + "robotics": "Astra", + "manufacturing": "Astra", + "advanced-manufacturing": "Astra", + } + + for domain, agent in expected.items(): + route = classify_pr_route(_diff_for(f"foundations/{domain}/claim.md")) + assert route.primary_agent == agent + assert route.required_agents == (agent,) + + +def test_cross_domain_ai_and_x402_requires_theseus_and_rio(): + route = classify_pr_route( + _diff_for( + ("domains/ai-alignment/agent-wallets.md", "+AI systems route agents around x402 payments"), + ("domains/internet-finance/x402.md", "+x402 payment rail for onchain agent transactions"), + ) + ) + + assert route.primary_agent == "Rio" + assert 
set(route.required_agents) == {"Theseus", "Rio"} + assert len(route.required_agents) == 2 + assert route.route_kind == "multi" + + +def test_collective_ai_goals_routes_to_leo_and_theseus(): + route = classify_pr_route( + _diff_for( + ( + "foundations/collective-intelligence/collective-ai-goals.md", + "+Collective AI goals and AI systems self-understanding need review.", + ) + ) + ) + + assert route.primary_agent == "Leo" + assert route.required_agents == ("Leo", "Theseus") + assert route.route_kind == "multi" + + +def test_too_many_touched_domains_caps_at_two_and_marks_escalated(): + route = classify_pr_route( + _diff_for( + "domains/internet-finance/a.md", + "domains/internet-finance/b.md", + "domains/health/c.md", + "domains/entertainment/d.md", + "domains/space-development/e.md", + ) + ) + + assert route.primary_agent == "Rio" + assert route.required_agents == ("Rio", "Vida") + assert route.route_kind == "escalated" + assert len(route.required_agents) == 2 + + +def test_branch_prefix_used_when_diff_has_no_route_path(): + route = classify_pr_route(_diff_for("inbox/archive/source.md"), branch="vida/research-glp1") + + assert route.primary_agent == "Vida" + assert route.required_agents == ("Vida",) + assert route.route_kind == "single" + + +def test_unknown_route_falls_back_to_leo(): + route = classify_pr_route(_diff_for("docs/readme.md"), branch="misc/update") + + assert route.primary_agent == "Leo" + assert route.required_agents == ("Leo",) + assert route.route_kind == "fallback" + assert route.fallback is True + + +def test_routing_is_deterministic_for_repeated_inputs(): + diff = _diff_for( + ("domains/health/agent-care.md", "+AI systems and health medicine review"), + ("domains/ai-systems/care-agent.md", "+clinical model behavior"), + ) + first = classify_pr_route(diff) + + for _ in range(100): + assert classify_pr_route(diff) == first + + +def test_agent_order_is_stable(): + assert AGENT_ORDER == ("Leo", "Theseus", "Rio", "Vida", "Clay", "Astra") diff --git 
a/tests/test_eval_parse.py b/tests/test_eval_parse.py index 786d5a6..6f0f781 100644 --- a/tests/test_eval_parse.py +++ b/tests/test_eval_parse.py @@ -170,6 +170,11 @@ class TestParseVerdict: def test_case_insensitive_reviewer(self): assert parse_verdict("VERDICT:LEO:APPROVE", "leo") == "approve" + @pytest.mark.parametrize("agent", ["LEO", "THESEUS", "RIO", "VIDA", "CLAY", "ASTRA"]) + def test_phase1b_agent_verdicts(self, agent): + assert parse_verdict(f"", agent) == "approve" + assert parse_verdict(f"", agent) == "request_changes" + # --------------------------------------------------------------------------- # normalize_tag diff --git a/tests/test_evaluate_agent_routing.py b/tests/test_evaluate_agent_routing.py new file mode 100644 index 0000000..af61a38 --- /dev/null +++ b/tests/test_evaluate_agent_routing.py @@ -0,0 +1,214 @@ +"""Tests for Phase 1b eval integration.""" + +import sqlite3 +from unittest.mock import AsyncMock + +import pytest + +from lib import config +from lib.evaluate import _evaluate_pr_phase1b, evaluate_pr + + +@pytest.fixture +def phase1b_conn(): + conn = sqlite3.connect(":memory:") + conn.row_factory = sqlite3.Row + conn.executescript( + """ + CREATE TABLE prs ( + number INTEGER PRIMARY KEY, + source_path TEXT, + branch TEXT, + status TEXT NOT NULL DEFAULT 'open', + domain TEXT, + agent TEXT, + tier TEXT, + tier0_pass INTEGER, + leo_verdict TEXT DEFAULT 'pending', + domain_verdict TEXT DEFAULT 'pending', + domain_agent TEXT, + domain_model TEXT, + eval_attempts INTEGER DEFAULT 0, + eval_issues TEXT DEFAULT '[]', + merge_cycled INTEGER DEFAULT 0, + last_error TEXT, + last_attempt TEXT, + cost_usd REAL DEFAULT 0, + auto_merge INTEGER DEFAULT 0, + created_at TEXT DEFAULT (datetime('now')), + merged_at TEXT + ); + CREATE TABLE sources ( + path TEXT PRIMARY KEY, + status TEXT DEFAULT 'extracted', + feedback TEXT + ); + CREATE TABLE audit_log ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + stage TEXT, + event TEXT, + detail TEXT + ); + CREATE TABLE 
review_records ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + pr_number INTEGER NOT NULL, + claim_path TEXT, + domain TEXT, + agent TEXT, + reviewer TEXT, + reviewer_model TEXT, + outcome TEXT NOT NULL, + rejection_reason TEXT, + disagreement_type TEXT, + notes TEXT, + batch_id TEXT, + claims_in_batch INTEGER, + reviewed_at TEXT DEFAULT (datetime('now')) + ); + CREATE TABLE costs ( + date TEXT, + model TEXT, + stage TEXT, + calls INTEGER DEFAULT 0, + input_tokens INTEGER DEFAULT 0, + output_tokens INTEGER DEFAULT 0, + cost_usd REAL DEFAULT 0, + duration_ms INTEGER DEFAULT 0, + cache_read_tokens INTEGER DEFAULT 0, + cache_write_tokens INTEGER DEFAULT 0, + cost_estimate_usd REAL DEFAULT 0, + PRIMARY KEY (date, model, stage) + ); + """ + ) + yield conn + conn.close() + + +def _diff_for(*paths: str) -> str: + return "\n".join(f"diff --git a/{path} b/{path}\n+type: claim\n+description: test" for path in paths) + + +def _insert_pr(conn, number=1, branch="rio/test", source_path="inbox/archive/test.md"): + conn.execute("INSERT INTO sources (path, status) VALUES (?, ?)", (source_path, "extracted")) + conn.execute( + """INSERT INTO prs + (number, source_path, branch, status, tier, tier0_pass, leo_verdict, domain_verdict, eval_attempts) + VALUES (?, ?, ?, 'open', 'STANDARD', 1, 'pending', 'pending', 0)""", + (number, source_path, branch), + ) + + +async def _fake_agent_review(_diff, _files, agent, _route_context, tier="STANDARD"): + return f"{agent} review\n", { + "prompt_tokens": 10, + "completion_tokens": 5, + } + + +async def _fake_agent_review_reject_vida(_diff, _files, agent, _route_context, tier="STANDARD"): + verdict = "REQUEST_CHANGES" if agent == "Vida" else "APPROVE" + issues = "\n" if verdict == "REQUEST_CHANGES" else "" + return f"{agent} review{issues}\n", { + "prompt_tokens": 10, + "completion_tokens": 5, + } + + +async def _fake_forgejo_api(method, path, body=None, token=None): + if method == "GET" and "pulls/" in path: + return {"user": {"login": "contributor"}} 
+ return {"id": 1} + + +@pytest.mark.asyncio +async def test_phase1b_cross_domain_approves_after_all_required_agents(phase1b_conn, monkeypatch): + conn = phase1b_conn + _insert_pr(conn, branch="rio/ai-x402") + monkeypatch.setattr("lib.evaluate.run_agent_review", _fake_agent_review) + monkeypatch.setattr("lib.evaluate.forgejo_api", _fake_forgejo_api) + post_formal = AsyncMock() + monkeypatch.setattr("lib.evaluate.post_formal_approvals", post_formal) + monkeypatch.setattr("lib.evaluate.on_eval_complete", AsyncMock()) + + diff = _diff_for("domains/ai-systems/agent-wallets.md", "domains/internet-finance/x402.md") + result = await _evaluate_pr_phase1b( + conn, + 1, + tier="STANDARD", + diff=diff, + review_diff=diff, + files="domains/ai-systems/agent-wallets.md\ndomains/internet-finance/x402.md", + branch_name="rio/ai-x402", + eval_attempts=1, + pr_cost=0, + ) + + assert result["approved"] is True + assert set(result["agent_verdicts"]) == {"Theseus", "Rio"} + row = conn.execute("SELECT status, domain, domain_agent, leo_verdict, domain_verdict FROM prs WHERE number = 1").fetchone() + assert row["status"] == "approved" + assert row["domain"] == "multi" + assert row["leo_verdict"] == "skipped" + assert row["domain_verdict"] == "approve" + assert row["domain_agent"] in {"Theseus", "Rio"} + review_count = conn.execute("SELECT COUNT(*) AS n FROM review_records WHERE pr_number = 1").fetchone()["n"] + assert review_count == 2 + post_formal.assert_awaited_once() + + +@pytest.mark.asyncio +async def test_phase1b_request_changes_blocks_merge(phase1b_conn, monkeypatch): + conn = phase1b_conn + _insert_pr(conn, branch="vida/health") + monkeypatch.setattr("lib.evaluate.run_agent_review", _fake_agent_review_reject_vida) + monkeypatch.setattr("lib.evaluate.forgejo_api", _fake_forgejo_api) + monkeypatch.setattr("lib.evaluate.post_formal_approvals", AsyncMock()) + dispose = AsyncMock() + monkeypatch.setattr("lib.evaluate.dispose_rejected_pr", dispose) + 
monkeypatch.setattr("lib.evaluate.on_eval_complete", AsyncMock()) + + diff = _diff_for("domains/health/claim.md") + result = await _evaluate_pr_phase1b( + conn, + 1, + tier="STANDARD", + diff=diff, + review_diff=diff, + files="domains/health/claim.md", + branch_name="vida/health", + eval_attempts=1, + pr_cost=0, + ) + + assert result["approved"] is False + assert result["agent_verdicts"] == {"Vida": "request_changes"} + row = conn.execute("SELECT status, domain_agent, domain_verdict, eval_issues FROM prs WHERE number = 1").fetchone() + assert row["status"] == "open" + assert row["domain_agent"] == "Vida" + assert row["domain_verdict"] == "request_changes" + assert "factual_discrepancy" in row["eval_issues"] + dispose.assert_awaited_once() + + +@pytest.mark.asyncio +async def test_evaluate_pr_flag_uses_phase1b_and_not_legacy_reviewers(phase1b_conn, monkeypatch): + conn = phase1b_conn + _insert_pr(conn, branch="rio/x402") + monkeypatch.setattr(config, "PHASE1B_AGENT_ROUTING_ENABLED", True) + monkeypatch.setattr("lib.evaluate.get_pr_diff", AsyncMock(return_value=_diff_for("domains/internet-finance/x402.md"))) + monkeypatch.setattr("lib.evaluate.run_agent_review", _fake_agent_review) + legacy_domain = AsyncMock() + legacy_leo = AsyncMock() + monkeypatch.setattr("lib.evaluate.run_domain_review", legacy_domain) + monkeypatch.setattr("lib.evaluate.run_leo_review", legacy_leo) + monkeypatch.setattr("lib.evaluate.forgejo_api", _fake_forgejo_api) + monkeypatch.setattr("lib.evaluate.post_formal_approvals", AsyncMock()) + monkeypatch.setattr("lib.evaluate.on_eval_complete", AsyncMock()) + + result = await evaluate_pr(conn, 1, tier="STANDARD") + + assert result["phase1b"] is True + assert result["agent_verdicts"] == {"Rio": "approve"} + legacy_domain.assert_not_awaited() + legacy_leo.assert_not_awaited()
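The tests above assert on the legacy `leo_verdict` and `domain_verdict` columns. The projection they exercise can be restated standalone as below; the logic follows `_phase1b_compat_verdicts` from the `lib/evaluate.py` hunk in this diff, while the function name `compat_verdicts` and the sample inputs are illustrative only.

```python
from __future__ import annotations


def compat_verdicts(agent_verdicts: dict[str, str]) -> tuple[str, str]:
    """Project routed agent verdicts into the legacy (leo, domain) columns."""
    # Leo's own verdict, or "skipped" when Leo was not a routed reviewer.
    leo_verdict = agent_verdicts.get("Leo", "skipped")
    non_leo = [verdict for agent, verdict in agent_verdicts.items() if agent != "Leo"]
    # The aggregate looks at every verdict, including Leo's.
    aggregate = "request_changes" if "request_changes" in agent_verdicts.values() else "approve"
    # The domain column is only filled when a non-Leo agent reviewed.
    domain_verdict = aggregate if non_leo else "skipped"
    return leo_verdict, domain_verdict


cross_domain = compat_verdicts({"Theseus": "approve", "Rio": "approve"})
vida_rejects = compat_verdicts({"Vida": "request_changes"})
leo_only = compat_verdicts({"Leo": "approve"})
leo_rejects = compat_verdicts({"Leo": "request_changes", "Theseus": "approve"})
```

One consequence worth noting when reading reports: because the aggregate is computed over all verdicts, a Leo `request_changes` alongside an approving domain agent appears to set `domain_verdict` to `request_changes` as well, even though the domain agent approved.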