343 lines
11 KiB
Markdown
343 lines
11 KiB
Markdown
# Phase 1b Child Spec: Eval Pipeline Integration
|
|
|
|
Created: 2026-05-29
|
|
Status: active draft
|
|
Parent spec: `docs/phase1b-agent-routing-spec.md`
|
|
|
|
## Product Outcome Contract
|
|
|
|
Pipeline-v2 must use the Phase 1b route result to run the required Hermes agent reviews for a `decision-engine` PR. The old default shape where every non-LIGHT PR receives a domain review plus Leo review must be bypassed when Phase 1b routing is enabled.
|
|
|
|
## Goal
|
|
|
|
Integrate agent identity routing into `lib/evaluate.py` behind a feature flag, run one or two required reviewer agents, aggregate verdicts, and preserve existing merge or feedback behavior.
|
|
|
|
## Non-Goals
|
|
|
|
- Do not remove the old eval path until staging proof exists.
|
|
- Do not rewrite the full Forgejo/GitHub API abstraction.
|
|
- Do not redesign dashboards.
|
|
- Do not implement separate GitHub identities.
|
|
- Do not change extraction or validation behavior except as needed for eval tests.
|
|
|
|
## Current Implementation Audit
|
|
|
|
Current relevant code:
|
|
|
|
- `lib/evaluate.py::evaluate_pr` owns single PR evaluation.
|
|
- `lib/evaluate.py::evaluate_cycle` selects eligible PRs.
|
|
- `_build_domain_batches` groups STANDARD PRs by DB domain before fetching diffs.
|
|
- `_run_batch_domain_eval` runs batch domain reviews, then individual Leo reviews.
|
|
- `run_domain_review` in `lib/llm.py` prompts a domain expert through OpenRouter.
|
|
- `run_leo_review` in `lib/llm.py` prompts Leo through OpenRouter or Claude path depending on tier.
|
|
- `parse_verdict` in `lib/eval_parse.py` parses reviewer-specific verdict tags.
|
|
- `approve_pr`, `reopen_pr`, `close_pr`, and `start_review` handle state transitions.
|
|
|
|
Current behavior:
|
|
|
|
- Diff path detects a domain.
|
|
- `agent_for_domain(domain)` selects one domain agent.
|
|
- Domain review runs first.
|
|
- Leo review runs after domain approval for non-LIGHT PRs.
|
|
- `leo_verdict` and `domain_verdict` are the stored verdict fields.
|
|
- Contributor credit logic assumes Leo can be one evaluator and `domain_agent` can be the other.
|
|
|
|
## Existing-Spec Inventory
|
|
|
|
| Existing doc | Relevance | Decision |
|
|
| --- | --- | --- |
|
|
| `docs/phase1b-agent-routing-spec.md` | Parent route and eval contract. | Reuse. |
|
|
| `docs/ARCHITECTURE.md` | Existing pipeline stage model. | Reuse as baseline. |
|
|
| `docs/multi-model-eval-architecture.md` | Prior Leo-plus-second-model design. | Supersede for Phase 1b eval path only. |
|
|
| `handoff/deprecated/eval-scripts.md` | Confirms shell eval scripts are dead. | Reuse to avoid wrong surface. |
|
|
|
|
## Goal-Vs-Repo-Truth Diff
|
|
|
|
Goal:
|
|
|
|
- `evaluate_pr` calls the route scorer.
|
|
- Required agents are the only reviewer agents.
|
|
- One required agent means one review.
|
|
- Two required agents means two reviews and aggregate verdict.
|
|
- Default Leo second-review is removed when the feature flag is enabled.
|
|
- Old behavior remains available when the feature flag is disabled.
|
|
|
|
Branch truth:
|
|
|
|
- Legacy eval is still available when the feature flag is false.
|
|
- When the feature flag is true, eval invokes the identity route, runs required agents only, writes `review_records`, and projects aggregate state back into legacy `leo_verdict` and `domain_verdict` columns.
|
|
- Batch eval is disabled while the feature flag is true because stale DB-domain grouping is not route-aware.
|
|
- `run_agent_review` exists, but it uses prompt-level identity context rather than loading full KB identity/belief/reasoning files.
|
|
|
|
## Completion Percent And Remaining Delta
|
|
|
|
Current completion on this branch: 75 percent local implementation behind a default-off feature flag.
|
|
|
|
Remaining delta:
|
|
|
|
1. Decide direct GitHub `decision-engine` comment transport versus Forgejo-first cutover compatibility.
|
|
2. Prove with staging PRs and real daemon logs.
|
|
3. Update contributor/dashboard assumptions only where staging or tests prove breakage.
|
|
|
|
## Closure, Endpoint, And Deployment Truth
|
|
|
|
Local closure:
|
|
|
|
- Mocked eval tests prove route-to-review-to-aggregate behavior.
|
|
|
|
Staging closure:
|
|
|
|
- Staging sandbox PRs receive expected comments and DB state transitions.
|
|
|
|
Production closure:
|
|
|
|
- Live `decision-engine` PRs are handled by Phase 1b route path for 24 hours.
|
|
|
|
This spec cannot claim production closure without VPS proof.
|
|
|
|
## Critical Assumptions And Invalidators
|
|
|
|
Assumptions:
|
|
|
|
- Feature flag rollback is acceptable.
|
|
- Existing state fields can support Phase 1b initially by storing primary agent in `domain_agent` and aggregate details in audit rows.
|
|
- A DB schema migration is avoidable for the first PR.
|
|
- Master bot comments with `VERDICT:AGENT:*` are acceptable.
|
|
|
|
Invalidators:
|
|
|
|
- Downstream merge logic requires formal reviews from separate GitHub users.
|
|
- Dashboards or contributor credit fail hard when Leo is not present.
|
|
- Batch eval cannot be safely disabled and must be route-aware from day one.
|
|
- Production env cannot set feature flags.
|
|
|
|
## State And Truth Contract
|
|
|
|
Feature flag:
|
|
|
|
```text
|
|
PHASE1B_AGENT_ROUTING_ENABLED=false
|
|
```
|
|
|
|
When false:
|
|
|
|
- Existing eval behavior continues.
|
|
|
|
When true:
|
|
|
|
- Eval route is built for every non-bypass PR.
|
|
- Audit log records route JSON.
|
|
- Required agent reviews run.
|
|
- Aggregate verdict determines approval or feedback.
|
|
|
|
Minimal DB field use:
|
|
|
|
- `domain`: keep route primary domain or `multi`.
|
|
- `domain_agent`: keep primary agent.
|
|
- `domain_verdict`: keep aggregate non-Leo review verdict or aggregate verdict.
|
|
- `leo_verdict`: set `skipped` unless Leo is a required agent; if Leo is required, store Leo verdict.
|
|
- `review_records`: write one row per required reviewer attempt with reviewer agent, model, outcome, and notes.
|
|
- review comments include a `PHASE1B_REVIEW` marker and the current local helper suppresses duplicate posts for the same PR and agent.
|
|
- audit log: route and all per-agent verdicts.
|
|
|
|
This is a compatibility posture, not the ideal long-term schema.
|
|
|
|
## Measurement Contract
|
|
|
|
Required local assertions:
|
|
|
|
- Phase 1b flag disabled uses old runner calls.
|
|
- Phase 1b flag enabled calls `run_agent_review` once for single route.
|
|
- Phase 1b flag enabled calls `run_agent_review` twice for multi route.
|
|
- `run_leo_review` is not called unless Leo is in `required_agents`.
|
|
- all approve returns approved aggregate.
|
|
- one request changes returns feedback aggregate.
|
|
- transport failure reopens for retry.
|
|
- retry after a partial multi-agent success does not duplicate existing posted verdict comments.
|
|
|
|
## Backend Work Required
|
|
|
|
Owned files:
|
|
|
|
- `lib/evaluate.py`
|
|
- `lib/llm.py`
|
|
- `lib/config.py`
|
|
- `lib/eval_parse.py` only if parser compatibility needs explicit tests or normalization.
|
|
- `tests/test_evaluate_agent_routing.py`
|
|
- `tests/test_eval_parse.py`
|
|
|
|
Implementation steps:
|
|
|
|
1. Add `PHASE1B_AGENT_ROUTING_ENABLED` to `lib/config.py`.
|
|
2. Import route scorer.
|
|
3. Add `run_agent_review` in `lib/llm.py`.
|
|
4. Add helper to load agent context from KB worktree.
|
|
5. Add `aggregate_agent_verdicts`.
|
|
6. In `evaluate_pr`, after bypasses and diff filtering, branch into Phase 1b path when flag is true.
|
|
7. In Phase 1b path, run required reviews and post comments through the existing API helper.
|
|
8. Update DB fields conservatively.
|
|
9. Write `review_records` rows for each required reviewer attempt.
|
|
10. Preserve old logic under flag false.
|
|
11. Disable `_build_domain_batches` while flag is true or make it route-aware.
|
|
|
|
Forbidden files:
|
|
|
|
- Deprecated eval shell scripts.
|
|
- Deployment scripts unless needed for documenting the flag.
|
|
- Runtime secrets.
|
|
|
|
## Frontend Work Required
|
|
|
|
None.
|
|
|
|
## Expected Runtime And User-Visible Behavior
|
|
|
|
Single-agent example:
|
|
|
|
```text
|
|
PR touches internet finance.
|
|
route.required_agents = ["Rio"]
|
|
pipeline posts a Rio verdict.
|
|
merge proceeds if Rio approves.
|
|
```
|
|
|
|
Cross-agent example:
|
|
|
|
```text
|
|
PR touches AI systems and x402 payments.
|
|
route.required_agents = ["Theseus", "Rio"]
|
|
pipeline posts Theseus and Rio verdicts.
|
|
merge proceeds only if both approve.
|
|
```
|
|
|
|
Fallback example:
|
|
|
|
```text
|
|
PR cannot be confidently routed.
|
|
route.required_agents = ["Leo"]
|
|
pipeline posts Leo verdict.
|
|
route_kind = fallback is audited.
|
|
```
|
|
|
|
## Validation And Test Matrix
|
|
|
|
Commands:
|
|
|
|
```bash
|
|
python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
|
|
python3 -m ruff check lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
|
|
git diff --check
|
|
```
|
|
|
|
Test cases:
|
|
|
|
- flag-off old behavior smoke
|
|
- flag-on single reviewer approve
|
|
- flag-on single reviewer request changes
|
|
- flag-on two reviewer approve
|
|
- flag-on two reviewer one reject
|
|
- missing verdict
|
|
- transport failure
|
|
- Leo required route
|
|
- Leo not required route
|
|
- batch disabled or route-aware under flag
|
|
|
|
## CI/CD, Release, And Pre-Push Gate Contract
|
|
|
|
Before PR:
|
|
|
|
- Focused tests pass.
|
|
- Old behavior remains behind flag false.
|
|
- No production default flips to true.
|
|
|
|
Before staging:
|
|
|
|
- Operator can enable flag in staging env.
|
|
- Sandbox repo target is configured.
|
|
|
|
Before production:
|
|
|
|
- Staging proof artifact exists.
|
|
- Rollback command is known.
|
|
|
|
## Independent CLI Audit Contract
|
|
|
|
Reviewer commands:
|
|
|
|
```bash
|
|
git diff -- lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
|
|
python3 -m pytest tests/test_evaluate_agent_routing.py
|
|
```
|
|
|
|
Reviewer checks:
|
|
|
|
- No deprecated scripts revived.
|
|
- No secrets introduced.
|
|
- Feature flag false preserves old path.
|
|
- Feature flag true bypasses default Leo second-review.
|
|
- Cross-domain aggregate requires all required reviewers to approve.
|
|
|
|
## Outside-The-Box Fix Paths
|
|
|
|
If compatibility fields become confusing:
|
|
|
|
- Add a narrow DB migration for `route_json` and `agent_verdicts_json`.
|
|
|
|
If batch eval blocks safe integration:
|
|
|
|
- Disable batch eval under Phase 1b flag for one release.
|
|
|
|
If LLM review prompts get too large:
|
|
|
|
- Load only identity plus beliefs first, then add reasoning/skills later.
|
|
|
|
## Maintenance Capture
|
|
|
|
Beneficial now:
|
|
|
|
- Isolate Phase 1b logic into helpers instead of expanding `evaluate_pr` deeply.
|
|
- Keep rollback path explicit.
|
|
|
|
Avoid now:
|
|
|
|
- Full eval architecture rewrite.
|
|
- Dashboard redesign.
|
|
- Broad DB migration unless tests require it.
|
|
|
|
## Parallelization And Fanout
|
|
|
|
Classification: local_owner.
|
|
|
|
Do not fan out before the router contract lands. Eval integration depends tightly on route result semantics.
|
|
|
|
Worker-ready prompt:
|
|
|
|
```text
|
|
wire phase 1b routing into teleo-infrastructure eval behind PHASE1B_AGENT_ROUTING_ENABLED. own lib/evaluate.py, lib/llm.py, lib/config.py, and mocked eval tests. run required agents from the route result, aggregate verdicts, preserve old behavior when the flag is false, and do not revive deprecated scripts.
|
|
```
|
|
|
|
## Acceptance Criteria
|
|
|
|
- Flag false path remains available.
|
|
- Flag true path runs required agents only.
|
|
- One or two verdicts aggregate correctly.
|
|
- Existing merge or feedback path is preserved.
|
|
- Focused mocked tests pass.
|
|
|
|
## Readiness And Claim Boundaries
|
|
|
|
Allowed claim:
|
|
|
|
- "Phase 1b eval integration is locally tested behind a feature flag."
|
|
|
|
Forbidden claim:
|
|
|
|
- "Phase 1b is live."
|
|
|
|
## Spec Quality Self-Audit
|
|
|
|
All required execution-grade headings are present. This spec intentionally defers exact production commands to the staging/proof child spec because they depend on VPS truth.
|
|
|
|
## Assistant-Added Caveats
|
|
|
|
The compatibility use of `domain_verdict` and `leo_verdict` is a pragmatic Phase 1b bridge. A cleaner route schema may be worth adding after staging proof, but a premature migration would widen the blast radius.
|