teleo-infrastructure/docs/phase1b/eval-pipeline-integration-spec.md
2026-05-29 14:16:12 +02:00

# Phase 1b Child Spec: Eval Pipeline Integration
Created: 2026-05-29
Status: active draft
Parent spec: `docs/phase1b-agent-routing-spec.md`
## Product Outcome Contract
Pipeline-v2 must use the Phase 1b route result to run the required Hermes agent reviews for a `decision-engine` PR. The old default shape where every non-LIGHT PR receives a domain review plus Leo review must be bypassed when Phase 1b routing is enabled.
## Goal
Integrate agent identity routing into `lib/evaluate.py` behind a feature flag, run one or two required reviewer agents, aggregate verdicts, and preserve existing merge or feedback behavior.
## Non-Goals
- Do not remove the old eval path until staging proof exists.
- Do not rewrite the full Forgejo/GitHub API abstraction.
- Do not redesign dashboards.
- Do not implement separate GitHub identities.
- Do not change extraction or validation behavior except as needed for eval tests.
## Current Implementation Audit
Current relevant code:
- `lib/evaluate.py::evaluate_pr` owns single PR evaluation.
- `lib/evaluate.py::evaluate_cycle` selects eligible PRs.
- `_build_domain_batches` groups STANDARD PRs by DB domain before fetching diffs.
- `_run_batch_domain_eval` runs batch domain reviews, then individual Leo reviews.
- `run_domain_review` in `lib/llm.py` prompts a domain expert through OpenRouter.
- `run_leo_review` in `lib/llm.py` prompts Leo through OpenRouter or the Claude path, depending on tier.
- `parse_verdict` in `lib/eval_parse.py` parses reviewer-specific verdict tags.
- `approve_pr`, `reopen_pr`, `close_pr`, and `start_review` handle state transitions.
Current behavior:
- Diff path detects a domain.
- `agent_for_domain(domain)` selects one domain agent.
- Domain review runs first.
- Leo review runs after domain approval for non-LIGHT PRs.
- `leo_verdict` and `domain_verdict` are the stored verdict fields.
- Contributor credit logic assumes Leo can be one evaluator and `domain_agent` can be the other.
## Existing-Spec Inventory
| Existing doc | Relevance | Decision |
| --- | --- | --- |
| `docs/phase1b-agent-routing-spec.md` | Parent route and eval contract. | Reuse. |
| `docs/ARCHITECTURE.md` | Existing pipeline stage model. | Reuse as baseline. |
| `docs/multi-model-eval-architecture.md` | Prior Leo-plus-second-model design. | Supersede for Phase 1b eval path only. |
| `handoff/deprecated/eval-scripts.md` | Confirms shell eval scripts are dead. | Reuse to avoid wrong surface. |
## Goal-Vs-Repo-Truth Diff
Goal:
- `evaluate_pr` calls the route scorer.
- Required agents are the only reviewer agents.
- One required agent means one review.
- Two required agents means two reviews and aggregate verdict.
- Default Leo second-review is removed when the feature flag is enabled.
- Old behavior remains available when the feature flag is disabled.
Branch truth:
- Legacy eval is still available when the feature flag is false.
- When the feature flag is true, eval invokes the identity route, runs required agents only, writes `review_records`, and projects aggregate state back into legacy `leo_verdict` and `domain_verdict` columns.
- Batch eval is disabled while the feature flag is true because stale DB-domain grouping is not route-aware.
- `run_agent_review` exists, but it uses prompt-level identity context rather than loading full KB identity/belief/reasoning files.
## Completion Percent And Remaining Delta
Current completion on this branch: 75 percent local implementation behind a default-off feature flag.
Remaining delta:
1. Decide direct GitHub `decision-engine` comment transport versus Forgejo-first cutover compatibility.
2. Prove with staging PRs and real daemon logs.
3. Update contributor/dashboard assumptions only where staging or tests prove breakage.
## Closure, Endpoint, And Deployment Truth
Local closure:
- Mocked eval tests prove route-to-review-to-aggregate behavior.
Staging closure:
- Staging sandbox PRs receive expected comments and DB state transitions.
Production closure:
- Live `decision-engine` PRs are handled by the Phase 1b route path for 24 hours.
This spec cannot claim production closure without VPS proof.
## Critical Assumptions And Invalidators
Assumptions:
- Feature flag rollback is acceptable.
- Existing state fields can support Phase 1b initially by storing primary agent in `domain_agent` and aggregate details in audit rows.
- A DB schema migration is avoidable for the first PR.
- Master bot comments with `VERDICT:AGENT:*` are acceptable.
Invalidators:
- Downstream merge logic requires formal reviews from separate GitHub users.
- Dashboards or contributor credit fail hard when Leo is not present.
- Batch eval cannot be safely disabled and must be route-aware from day one.
- Production env cannot set feature flags.
## State And Truth Contract
Feature flag:
```text
PHASE1B_AGENT_ROUTING_ENABLED=false
```
When false:
- Existing eval behavior continues.
When true:
- Eval route is built for every non-bypass PR.
- Audit log records route JSON.
- Required agent reviews run.
- Aggregate verdict determines approval or feedback.
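A minimal sketch of how the default-off flag might be read in `lib/config.py`. The helper name `read_bool_env` and the accepted truthy spellings are assumptions for illustration; the spec only fixes the variable name and its `false` default.

```python
import os

# Spellings accepted as "enabled" (assumption; the spec does not define them).
_TRUE_VALUES = {"1", "true", "yes", "on"}


def read_bool_env(name: str, default: bool = False, environ=None) -> bool:
    """Parse an environment variable as a boolean, falling back to `default`."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUE_VALUES


# Default-off: an unset or unrecognised value keeps the legacy eval path.
PHASE1B_AGENT_ROUTING_ENABLED = read_bool_env("PHASE1B_AGENT_ROUTING_ENABLED")
```

Treating unrecognised values as false keeps a typo in the environment from silently enabling the new path.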
Minimal DB field use:
- `domain`: keep route primary domain or `multi`.
- `domain_agent`: keep primary agent.
- `domain_verdict`: keep aggregate non-Leo review verdict or aggregate verdict.
- `leo_verdict`: set `skipped` unless Leo is a required agent; if Leo is required, store Leo verdict.
- `review_records`: write one row per required reviewer attempt with reviewer agent, model, outcome, and notes.
- Review comments: include a `PHASE1B_REVIEW` marker; the current local helper suppresses duplicate posts for the same PR and agent.
- Audit log: route and all per-agent verdicts.
This is a compatibility posture, not the ideal long-term schema.
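The compatibility projection above can be sketched as a pure function. The function name, the argument shapes, the "first required agent is primary" rule, and the "more than one agent means `multi`" rule are assumptions for illustration, not the branch's actual code.

```python
from typing import Dict, Mapping, Sequence


def project_legacy_fields(
    primary_domain: str,
    required_agents: Sequence[str],
    verdicts: Mapping[str, str],
    aggregate: str,
) -> Dict[str, str]:
    """Map a Phase 1b review result onto the legacy PR columns."""
    leo_required = "Leo" in required_agents
    return {
        # A route with more than one required agent is recorded as "multi" (assumption).
        "domain": primary_domain if len(required_agents) == 1 else "multi",
        # The first required agent is treated as the primary agent (assumption).
        "domain_agent": required_agents[0],
        # The aggregate verdict is projected into the legacy domain column.
        "domain_verdict": aggregate,
        # Leo's column is "skipped" unless the route actually required Leo.
        "leo_verdict": verdicts["Leo"] if leo_required else "skipped",
    }
```

Keeping the projection free of DB calls makes it easy to assert on in mocked tests before any schema decision is made.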
## Measurement Contract
Required local assertions:
- Phase 1b flag disabled uses old runner calls.
- Phase 1b flag enabled calls `run_agent_review` once for single route.
- Phase 1b flag enabled calls `run_agent_review` twice for multi route.
- `run_leo_review` is not called unless Leo is in `required_agents`.
- All required reviewers approving returns an approved aggregate.
- A single request-changes verdict returns a feedback aggregate.
- A transport failure reopens the PR for retry.
- A retry after a partial multi-agent success does not duplicate verdict comments that were already posted.
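The no-duplicate-comment assertion could be backed by a helper along these lines. This is a sketch only: the hidden-comment marker format and the function names are hypothetical; the spec states only that review comments carry a `PHASE1B_REVIEW` marker and that duplicates are suppressed per PR and agent.

```python
from typing import Iterable, List, Sequence

MARKER = "PHASE1B_REVIEW"


def review_marker(agent: str) -> str:
    """Hidden HTML comment tagging a review comment with its agent (hypothetical format)."""
    return f"<!-- {MARKER} agent={agent} -->"


def agents_still_to_post(
    required_agents: Sequence[str],
    existing_comment_bodies: Iterable[str],
) -> List[str]:
    """Return the required agents that have no marked comment on this PR yet."""
    bodies = list(existing_comment_bodies)
    return [
        agent
        for agent in required_agents
        if not any(review_marker(agent) in body for body in bodies)
    ]
```

On a retry, the caller would fetch the PR's existing comments once and post only for the agents this function returns.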
## Backend Work Required
Owned files:
- `lib/evaluate.py`
- `lib/llm.py`
- `lib/config.py`
- `lib/eval_parse.py` only if parser compatibility needs explicit tests or normalization.
- `tests/test_evaluate_agent_routing.py`
- `tests/test_eval_parse.py`
Implementation steps:
1. Add `PHASE1B_AGENT_ROUTING_ENABLED` to `lib/config.py`.
2. Import route scorer.
3. Add `run_agent_review` in `lib/llm.py`.
4. Add helper to load agent context from KB worktree.
5. Add `aggregate_agent_verdicts`.
6. In `evaluate_pr`, after bypasses and diff filtering, branch into Phase 1b path when flag is true.
7. In Phase 1b path, run required reviews and post comments through the existing API helper.
8. Update DB fields conservatively.
9. Write `review_records` rows for each required reviewer attempt.
10. Preserve old logic under flag false.
11. Disable `_build_domain_batches` while flag is true or make it route-aware.
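Step 5's `aggregate_agent_verdicts` might look like the sketch below. The verdict strings, the three aggregate outcomes, and the choice to treat a missing verdict as retryable are assumptions; the spec requires only that all approvals yield approval and that one request for changes yields feedback.

```python
from typing import Mapping, Optional, Sequence

APPROVE = "approve"
REQUEST_CHANGES = "request_changes"


def aggregate_agent_verdicts(
    required_agents: Sequence[str],
    verdicts: Mapping[str, Optional[str]],
) -> str:
    """Return "approved", "feedback", or "retry" for the required agents."""
    if not required_agents:
        raise ValueError("route must name at least one required agent")
    collected = [verdicts.get(agent) for agent in required_agents]
    # Any explicit request for changes sends the PR back for feedback.
    if REQUEST_CHANGES in collected:
        return "feedback"
    # A missing or unrecognised verdict is treated as retryable (assumption).
    if any(v != APPROVE for v in collected):
        return "retry"
    return "approved"
```

Only verdicts from `required_agents` are consulted, so a stray verdict from a non-required agent cannot approve or block a PR.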
Forbidden files:
- Deprecated eval shell scripts.
- Deployment scripts unless needed for documenting the flag.
- Runtime secrets.
## Frontend Work Required
None.
## Expected Runtime And User-Visible Behavior
Single-agent example:
```text
PR touches internet finance.
route.required_agents = ["Rio"]
pipeline posts a Rio verdict.
merge proceeds if Rio approves.
```
Cross-agent example:
```text
PR touches AI systems and x402 payments.
route.required_agents = ["Theseus", "Rio"]
pipeline posts Theseus and Rio verdicts.
merge proceeds only if both approve.
```
Fallback example:
```text
PR cannot be confidently routed.
route.required_agents = ["Leo"]
pipeline posts Leo verdict.
route_kind = fallback is audited.
```
## Validation And Test Matrix
Commands:
```bash
python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
python3 -m ruff check lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
git diff --check
```
Test cases:
- flag-off old behavior smoke
- flag-on single reviewer approve
- flag-on single reviewer request changes
- flag-on two reviewer approve
- flag-on two reviewer one reject
- missing verdict
- transport failure
- Leo required route
- Leo not required route
- batch disabled or route-aware under flag
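A few of these cases can be sketched with `unittest.mock`. The `run_required_reviews` function below is a stand-in written for this example, not the real `evaluate_pr` branch, and the runner call signature (agent name only) and the use of `run_leo_review` for a Leo route are assumptions.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def run_required_reviews(route, run_agent_review, run_leo_review):
    """Stand-in for the Phase 1b branch: review with required agents only."""
    verdicts = {}
    for agent in route.required_agents:
        # Leo's runner is used only when the route actually requires Leo.
        runner = run_leo_review if agent == "Leo" else run_agent_review
        verdicts[agent] = runner(agent)
    return verdicts


def test_multi_route_calls_each_required_agent_once():
    run_agent_review = Mock(return_value="approve")
    run_leo_review = Mock(return_value="approve")
    route = SimpleNamespace(required_agents=["Theseus", "Rio"])

    verdicts = run_required_reviews(route, run_agent_review, run_leo_review)

    assert run_agent_review.call_count == 2
    run_leo_review.assert_not_called()
    assert verdicts == {"Theseus": "approve", "Rio": "approve"}


def test_fallback_route_uses_leo_only():
    run_agent_review = Mock(return_value="approve")
    run_leo_review = Mock(return_value="request_changes")
    route = SimpleNamespace(required_agents=["Leo"])

    verdicts = run_required_reviews(route, run_agent_review, run_leo_review)

    run_agent_review.assert_not_called()
    run_leo_review.assert_called_once_with("Leo")
    assert verdicts == {"Leo": "request_changes"}
```

In the real suite the mocks would replace the runners in `lib/llm.py`, for example through pytest's `monkeypatch` fixture, rather than being passed as arguments.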
## CI/CD, Release, And Pre-Push Gate Contract
Before PR:
- Focused tests pass.
- Old behavior remains behind flag false.
- No production default flips to true.
Before staging:
- Operator can enable flag in staging env.
- Sandbox repo target is configured.
Before production:
- Staging proof artifact exists.
- Rollback command is known.
## Independent CLI Audit Contract
Reviewer commands:
```bash
git diff -- lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
python3 -m pytest tests/test_evaluate_agent_routing.py
```
Reviewer checks:
- No deprecated scripts revived.
- No secrets introduced.
- Feature flag false preserves old path.
- Feature flag true bypasses default Leo second-review.
- Cross-domain aggregate requires all required reviewers to approve.
## Outside-The-Box Fix Paths
If compatibility fields become confusing:
- Add a narrow DB migration for `route_json` and `agent_verdicts_json`.
If batch eval blocks safe integration:
- Disable batch eval under Phase 1b flag for one release.
If LLM review prompts get too large:
- Load only identity plus beliefs first, then add reasoning/skills later.
## Maintenance Capture
Beneficial now:
- Isolate Phase 1b logic into helpers instead of expanding `evaluate_pr` deeply.
- Keep rollback path explicit.
Avoid now:
- Full eval architecture rewrite.
- Dashboard redesign.
- Broad DB migration unless tests require it.
## Parallelization And Fanout
Classification: local_owner.
Do not fan out before the router contract lands. Eval integration depends tightly on route result semantics.
Worker-ready prompt:
```text
wire phase 1b routing into teleo-infrastructure eval behind PHASE1B_AGENT_ROUTING_ENABLED. own lib/evaluate.py, lib/llm.py, lib/config.py, and mocked eval tests. run required agents from the route result, aggregate verdicts, preserve old behavior when the flag is false, and do not revive deprecated scripts.
```
## Acceptance Criteria
- Flag false path remains available.
- Flag true path runs required agents only.
- One or two verdicts aggregate correctly.
- Existing merge or feedback path is preserved.
- Focused mocked tests pass.
## Readiness And Claim Boundaries
Allowed claim:
- "Phase 1b eval integration is locally tested behind a feature flag."
Forbidden claim:
- "Phase 1b is live."
## Spec Quality Self-Audit
All required execution-grade headings are present. This spec intentionally defers exact production commands to the staging/proof child spec because they depend on VPS truth.
## Assistant-Added Caveats
The compatibility use of `domain_verdict` and `leo_verdict` is a pragmatic Phase 1b bridge. A cleaner route schema may be worth adding after staging proof, but a premature migration would widen the blast radius.