Phase 1b Child Spec: Eval Pipeline Integration
Created: 2026-05-29
Status: active draft
Parent spec: docs/phase1b-agent-routing-spec.md
Product Outcome Contract
Pipeline-v2 must use the Phase 1b route result to run the required Hermes agent reviews for a decision-engine PR. The old default shape where every non-LIGHT PR receives a domain review plus Leo review must be bypassed when Phase 1b routing is enabled.
Goal
Integrate agent identity routing into lib/evaluate.py behind a feature flag, run one or two required reviewer agents, aggregate verdicts, and preserve existing merge or feedback behavior.
Non-Goals
- Do not remove the old eval path until staging proof exists.
- Do not rewrite the full Forgejo/GitHub API abstraction.
- Do not redesign dashboards.
- Do not implement separate GitHub identities.
- Do not change extraction or validation behavior except as needed for eval tests.
Current Implementation Audit
Current relevant code:
- lib/evaluate.py::evaluate_pr owns single PR evaluation.
- lib/evaluate.py::evaluate_cycle selects eligible PRs.
- _build_domain_batches groups STANDARD PRs by DB domain before fetching diffs.
- _run_batch_domain_eval runs batch domain reviews, then individual Leo reviews.
- run_domain_review in lib/llm.py prompts a domain expert through OpenRouter.
- run_leo_review in lib/llm.py prompts Leo through the OpenRouter or Claude path, depending on tier.
- parse_verdict in lib/eval_parse.py parses reviewer-specific verdict tags.
- approve_pr, reopen_pr, close_pr, and start_review handle state transitions.
Current behavior:
- Diff path detects a domain.
- agent_for_domain(domain) selects one domain agent.
- Domain review runs first.
- Leo review runs after domain approval for non-LIGHT PRs.
- leo_verdict and domain_verdict are the stored verdict fields.
- Contributor credit logic assumes Leo can be one evaluator and domain_agent can be the other.
Existing-Spec Inventory
| Existing doc | Relevance | Decision |
|---|---|---|
| docs/phase1b-agent-routing-spec.md | Parent route and eval contract. | Reuse. |
| docs/ARCHITECTURE.md | Existing pipeline stage model. | Reuse as baseline. |
| docs/multi-model-eval-architecture.md | Prior Leo-plus-second-model design. | Supersede for Phase 1b eval path only. |
| handoff/deprecated/eval-scripts.md | Confirms shell eval scripts are dead. | Reuse to avoid wrong surface. |
Goal-Vs-Repo-Truth Diff
Goal:
- evaluate_pr calls the route scorer.
- Required agents are the only reviewer agents.
- One required agent means one review.
- Two required agents means two reviews and aggregate verdict.
- Default Leo second-review is removed when the feature flag is enabled.
- Old behavior remains available when the feature flag is disabled.
Branch truth:
- Legacy eval is still available when the feature flag is false.
- When the feature flag is true, eval invokes the identity route, runs required agents only, writes review_records, and projects aggregate state back into the legacy leo_verdict and domain_verdict columns.
- Batch eval is disabled while the feature flag is true because stale DB-domain grouping is not route-aware.
- run_agent_review exists, but it uses prompt-level identity context rather than loading full KB identity/belief/reasoning files.
Completion Percent And Remaining Delta
Current completion on this branch: 75 percent local implementation behind a default-off feature flag.
Remaining delta:
- Decide direct GitHub decision-engine comment transport versus Forgejo-first cutover compatibility.
- Prove with staging PRs and real daemon logs.
- Update contributor/dashboard assumptions only where staging or tests prove breakage.
Closure, Endpoint, And Deployment Truth
Local closure:
- Mocked eval tests prove route-to-review-to-aggregate behavior.
Staging closure:
- Staging sandbox PRs receive expected comments and DB state transitions.
Production closure:
- Live decision-engine PRs are handled by the Phase 1b route path for 24 hours.
This spec cannot claim production closure without VPS proof.
Critical Assumptions And Invalidators
Assumptions:
- Feature flag rollback is acceptable.
- Existing state fields can support Phase 1b initially by storing the primary agent in domain_agent and aggregate details in audit rows.
- A DB schema migration is avoidable for the first PR.
- Master bot comments with VERDICT:AGENT:* are acceptable.
Invalidators:
- Downstream merge logic requires formal reviews from separate GitHub users.
- Dashboards or contributor credit fail hard when Leo is not present.
- Batch eval cannot be safely disabled and must be route-aware from day one.
- Production env cannot set feature flags.
State And Truth Contract
Feature flag:
PHASE1B_AGENT_ROUTING_ENABLED=false
When false:
- Existing eval behavior continues.
When true:
- Eval route is built for every non-bypass PR.
- Audit log records route JSON.
- Required agent reviews run.
- Aggregate verdict determines approval or feedback.
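A minimal sketch of a default-off flag, assuming it is read from an environment variable of the same name. The helper name env_flag and the set of accepted truthy strings are illustrative assumptions, not the project's actual lib/config.py code.

```python
import os

# Hypothetical set of truthy strings; the real config may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Return a boolean feature flag, falling back to `default` when unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Anything outside the truthy set counts as off, so typos fail closed.
    return raw.strip().lower() in _TRUTHY


# Default-off: an unset or unrecognized value keeps the legacy eval path.
PHASE1B_AGENT_ROUTING_ENABLED = env_flag("PHASE1B_AGENT_ROUTING_ENABLED")
```

Treating unrecognized values as false keeps rollback simple: removing the variable or mistyping it both leave the existing eval behavior in place.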
Minimal DB field use:
- domain: keep route primary domain or multi.
- domain_agent: keep primary agent.
- domain_verdict: keep aggregate non-Leo review verdict or aggregate verdict.
- leo_verdict: set skipped unless Leo is a required agent; if Leo is required, store Leo verdict.
- review_records: write one row per required reviewer attempt with reviewer agent, model, outcome, and notes.
- review comments include a PHASE1B_REVIEW marker and the current local helper suppresses duplicate posts for the same PR and agent.
- audit log: route and all per-agent verdicts.
This is a compatibility posture, not the ideal long-term schema.
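The field mapping above can be sketched as a small pure function. The function name, the argument shapes, the verdict strings, and the choice to treat the first required agent as the primary agent are all assumptions for illustration; the real route result may expose the primary agent and domain differently.

```python
from typing import Dict, List


def project_legacy_fields(
    required_agents: List[str],
    domains: List[str],
    verdicts: Dict[str, str],
    aggregate: str,
) -> Dict[str, str]:
    """Map a Phase 1b result onto the legacy PR columns."""
    if not required_agents:
        raise ValueError("a route must name at least one required agent")
    leo_required = "Leo" in required_agents
    return {
        # One routed domain keeps its name; several collapse to "multi".
        "domain": domains[0] if len(domains) == 1 else "multi",
        # Assumption: the first required agent is the primary agent.
        "domain_agent": required_agents[0],
        # Assumption: the aggregate verdict is what gets stored here.
        "domain_verdict": aggregate,
        # Leo's own verdict is stored only when Leo is a required reviewer.
        "leo_verdict": verdicts["Leo"] if leo_required else "skipped",
    }
```

Keeping the projection in one helper makes it easier to replace later if a dedicated route schema is added.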
Measurement Contract
Required local assertions:
- Phase 1b flag disabled uses old runner calls.
- Phase 1b flag enabled calls run_agent_review once for single route.
- Phase 1b flag enabled calls run_agent_review twice for multi route.
- run_leo_review is not called unless Leo is in required_agents.
- all approve returns approved aggregate.
- one request changes returns feedback aggregate.
- transport failure reopens for retry.
- retry after a partial multi-agent success does not duplicate existing posted verdict comments.
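The retry assertion depends on recognizing verdict comments that were already posted. A minimal sketch, assuming each review comment embeds a per-PR, per-agent marker: only the PHASE1B_REVIEW token comes from this spec, while the HTML-comment layout and both function names are hypothetical.

```python
from typing import Iterable, List

MARKER = "PHASE1B_REVIEW"


def review_marker(pr_number: int, agent: str) -> str:
    # Hypothetical layout; the trailing " -->" keeps "Rio" from matching "Rio2".
    return f"<!-- {MARKER} pr={pr_number} agent={agent} -->"


def agents_needing_comment(
    pr_number: int,
    required_agents: List[str],
    existing_comments: Iterable[str],
) -> List[str]:
    """Return the required agents with no marked comment on this PR yet."""
    bodies = list(existing_comments)
    return [
        agent
        for agent in required_agents
        if not any(review_marker(pr_number, agent) in body for body in bodies)
    ]
```

On a retry after a partial two-agent run, only the agent without a marked comment would be posted again.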
Backend Work Required
Owned files:
- lib/evaluate.py
- lib/llm.py
- lib/config.py
- lib/eval_parse.py only if parser compatibility needs explicit tests or normalization.
- tests/test_evaluate_agent_routing.py
- tests/test_eval_parse.py
Implementation steps:
- Add PHASE1B_AGENT_ROUTING_ENABLED to lib/config.py.
- Import route scorer.
- Add run_agent_review in lib/llm.py.
- Add helper to load agent context from KB worktree.
- Add aggregate_agent_verdicts.
- In evaluate_pr, after bypasses and diff filtering, branch into Phase 1b path when flag is true.
- In Phase 1b path, run required reviews and post comments through the existing API helper.
- Update DB fields conservatively.
- Write review_records rows for each required reviewer attempt.
- Preserve old logic under flag false.
- Disable _build_domain_batches while flag is true or make it route-aware.
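The branch described in the steps above can be sketched with injected callables, which also makes the call-count assertions easy to mock. The function name evaluate_with_flag, the single-argument reviewer callable, and the returned dict shape are assumptions; the real evaluate_pr and run_agent_review signatures are not shown in this spec.

```python
from typing import Callable, Dict, List


def evaluate_with_flag(
    flag_enabled: bool,
    required_agents: List[str],
    run_agent_review: Callable[[str], str],
    run_legacy_eval: Callable[[], str],
) -> Dict[str, object]:
    """Dispatch to the legacy path or the Phase 1b path based on the flag."""
    if not flag_enabled:
        # Flag false: the old domain-plus-Leo behavior runs untouched.
        return {"path": "legacy", "result": run_legacy_eval()}
    # Flag true: review with the required agents only, in route order.
    verdicts = {agent: run_agent_review(agent) for agent in required_agents}
    return {"path": "phase1b", "verdicts": verdicts}
```

Passing the reviewers in as parameters keeps the Phase 1b logic in a helper rather than growing evaluate_pr, and lets tests count calls without network access.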
Forbidden files:
- Deprecated eval shell scripts.
- Deployment scripts unless needed for documenting the flag.
- Runtime secrets.
Frontend Work Required
None.
Expected Runtime And User-Visible Behavior
Single-agent example:
PR touches internet finance.
route.required_agents = ["Rio"]
pipeline posts a Rio verdict.
merge proceeds if Rio approves.
Cross-agent example:
PR touches AI systems and x402 payments.
route.required_agents = ["Theseus", "Rio"]
pipeline posts Theseus and Rio verdicts.
merge proceeds only if both approve.
Fallback example:
PR cannot be confidently routed.
route.required_agents = ["Leo"]
pipeline posts Leo verdict.
route_kind = fallback is audited.
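The three examples reduce to one aggregation rule: every required agent must approve. A minimal sketch under stated assumptions: the verdict strings, the returned labels, and the use of None for a review that failed in transport are illustrative, and the real aggregate_agent_verdicts may differ. Giving a transport failure precedence over a request for changes is also an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical verdict string; the real parser may emit a different token.
APPROVE = "approve"


def aggregate_agent_verdicts(
    required_agents: List[str],
    verdicts: Dict[str, Optional[str]],
) -> str:
    """Return "approved", "feedback", or "retry" for a set of reviews."""
    if not required_agents:
        raise ValueError("a route must name at least one required agent")
    results = [verdicts.get(agent) for agent in required_agents]
    # A missing entry or None stands for a transport failure: reopen and retry.
    if any(result is None for result in results):
        return "retry"
    # Merge proceeds only when every required agent approves.
    if all(result == APPROVE for result in results):
        return "approved"
    return "feedback"
```

The empty-route guard matters because all() over an empty list is true, which would otherwise approve a PR that nobody reviewed.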
Validation And Test Matrix
Commands:
python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
python3 -m ruff check lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
git diff --check
Test cases:
- flag-off old behavior smoke
- flag-on single reviewer approve
- flag-on single reviewer request changes
- flag-on two reviewer approve
- flag-on two reviewer one reject
- missing verdict
- transport failure
- Leo required route
- Leo not required route
- batch disabled or route-aware under flag
CI/CD, Release, And Pre-Push Gate Contract
Before PR:
- Focused tests pass.
- Old behavior remains behind flag false.
- No production default flips to true.
Before staging:
- Operator can enable flag in staging env.
- Sandbox repo target is configured.
Before production:
- Staging proof artifact exists.
- Rollback command is known.
Independent CLI Audit Contract
Reviewer commands:
git diff -- lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
python3 -m pytest tests/test_evaluate_agent_routing.py
Reviewer checks:
- No deprecated scripts revived.
- No secrets introduced.
- Feature flag false preserves old path.
- Feature flag true bypasses default Leo second-review.
- Cross-domain aggregate requires all required reviewers to approve.
Outside-The-Box Fix Paths
If compatibility fields become confusing:
- Add a narrow DB migration for route_json and agent_verdicts_json.
If batch eval blocks safe integration:
- Disable batch eval under Phase 1b flag for one release.
If LLM review prompts get too large:
- Load only identity plus beliefs first, then add reasoning/skills later.
Maintenance Capture
Beneficial now:
- Isolate Phase 1b logic into helpers instead of expanding evaluate_pr deeply.
- Keep rollback path explicit.
Avoid now:
- Full eval architecture rewrite.
- Dashboard redesign.
- Broad DB migration unless tests require it.
Parallelization And Fanout
Classification: local_owner.
Do not fan out before the router contract lands. Eval integration depends tightly on route result semantics.
Worker-ready prompt:
wire phase 1b routing into teleo-infrastructure eval behind PHASE1B_AGENT_ROUTING_ENABLED. own lib/evaluate.py, lib/llm.py, lib/config.py, and mocked eval tests. run required agents from the route result, aggregate verdicts, preserve old behavior when the flag is false, and do not revive deprecated scripts.
Acceptance Criteria
- Flag false path remains available.
- Flag true path runs required agents only.
- One or two verdicts aggregate correctly.
- Existing merge or feedback path is preserved.
- Focused mocked tests pass.
Readiness And Claim Boundaries
Allowed claim:
- "Phase 1b eval integration is locally tested behind a feature flag."
Forbidden claim:
- "Phase 1b is live."
Spec Quality Self-Audit
All required execution-grade headings are present. This spec intentionally defers exact production commands to the staging/proof child spec because they depend on VPS truth.
Assistant-Added Caveats
The compatibility use of domain_verdict and leo_verdict is a pragmatic Phase 1b bridge. A cleaner route schema may be worth adding after staging proof, but a premature migration would widen the blast radius.