teleo-infrastructure/docs/phase1b/eval-pipeline-integration-spec.md
2026-05-29 14:16:12 +02:00

Phase 1b Child Spec: Eval Pipeline Integration

Created: 2026-05-29
Status: active draft
Parent spec: docs/phase1b-agent-routing-spec.md

Product Outcome Contract

Pipeline-v2 must use the Phase 1b route result to run the required Hermes agent reviews for a decision-engine PR. The old default shape where every non-LIGHT PR receives a domain review plus Leo review must be bypassed when Phase 1b routing is enabled.

Goal

Integrate agent identity routing into lib/evaluate.py behind a feature flag, run one or two required reviewer agents, aggregate verdicts, and preserve existing merge or feedback behavior.

Non-Goals

  • Do not remove the old eval path until staging proof exists.
  • Do not rewrite the full Forgejo/GitHub API abstraction.
  • Do not redesign dashboards.
  • Do not implement separate GitHub identities.
  • Do not change extraction or validation behavior except as needed for eval tests.

Current Implementation Audit

Current relevant code:

  • lib/evaluate.py::evaluate_pr owns single PR evaluation.
  • lib/evaluate.py::evaluate_cycle selects eligible PRs.
  • _build_domain_batches groups STANDARD PRs by DB domain before fetching diffs.
  • _run_batch_domain_eval runs batch domain reviews, then individual Leo reviews.
  • run_domain_review in lib/llm.py prompts a domain expert through OpenRouter.
  • run_leo_review in lib/llm.py prompts Leo through OpenRouter or the Claude path, depending on tier.
  • parse_verdict in lib/eval_parse.py parses reviewer-specific verdict tags.
  • approve_pr, reopen_pr, close_pr, and start_review handle state transitions.

Current behavior:

  • Diff path detects a domain.
  • agent_for_domain(domain) selects one domain agent.
  • Domain review runs first.
  • Leo review runs after domain approval for non-LIGHT PRs.
  • leo_verdict and domain_verdict are the stored verdict fields.
  • Contributor credit logic assumes Leo can be one evaluator and domain_agent can be the other.

Existing-Spec Inventory

| Existing doc | Relevance | Decision |
|---|---|---|
| docs/phase1b-agent-routing-spec.md | Parent route and eval contract. | Reuse. |
| docs/ARCHITECTURE.md | Existing pipeline stage model. | Reuse as baseline. |
| docs/multi-model-eval-architecture.md | Prior Leo-plus-second-model design. | Supersede for Phase 1b eval path only. |
| handoff/deprecated/eval-scripts.md | Confirms shell eval scripts are dead. | Reuse to avoid wrong surface. |

Goal-Vs-Repo-Truth Diff

Goal:

  • evaluate_pr calls the route scorer.
  • Required agents are the only reviewer agents.
  • One required agent means one review.
  • Two required agents means two reviews and aggregate verdict.
  • Default Leo second-review is removed when the feature flag is enabled.
  • Old behavior remains available when the feature flag is disabled.

Branch truth:

  • Legacy eval is still available when the feature flag is false.
  • When the feature flag is true, eval invokes the identity route, runs required agents only, writes review_records, and projects aggregate state back into legacy leo_verdict and domain_verdict columns.
  • Batch eval is disabled while the feature flag is true because stale DB-domain grouping is not route-aware.
  • run_agent_review exists, but it uses prompt-level identity context rather than loading full KB identity/belief/reasoning files.
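
The goal and branch truth above can be sketched as a small dispatch function. This is a minimal illustration, not the code in lib/evaluate.py: `evaluate_pr_sketch`, `batch_eval_allowed`, the injected callables, and the dict shapes are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable, Dict


def evaluate_pr_sketch(
    pr: Dict,
    flag_enabled: bool,
    legacy_eval: Callable[[Dict], str],
    score_route: Callable[[Dict], Dict],
    review_as: Callable[[str, Dict], str],
) -> Dict:
    """Pick the legacy or Phase 1b path for one PR (illustrative only)."""
    if not flag_enabled:
        # Flag false: the legacy domain-plus-Leo behavior stays untouched.
        return {"path": "legacy", "result": legacy_eval(pr)}

    # Flag true: only the agents named by the route are asked to review.
    route = score_route(pr)
    verdicts = {agent: review_as(agent, pr) for agent in route["required_agents"]}
    return {"path": "phase1b", "route": route, "verdicts": verdicts}


def batch_eval_allowed(flag_enabled: bool) -> bool:
    """Batch grouping by DB domain is not route-aware, so skip it under the flag."""
    return not flag_enabled
```

Passing the reviewers in as callables is only a convenience for the sketch; it makes the flag-off and flag-on paths easy to exercise with fakes.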

Completion Percent And Remaining Delta

Current completion on this branch: 75 percent local implementation behind a default-off feature flag.

Remaining delta:

  1. Decide between direct GitHub comment transport for decision-engine PRs and Forgejo-first cutover compatibility.
  2. Prove with staging PRs and real daemon logs.
  3. Update contributor/dashboard assumptions only where staging or tests prove breakage.

Closure, Endpoint, And Deployment Truth

Local closure:

  • Mocked eval tests prove route-to-review-to-aggregate behavior.

Staging closure:

  • Staging sandbox PRs receive expected comments and DB state transitions.

Production closure:

  • Live decision-engine PRs are handled by the Phase 1b route path for 24 hours.

This spec cannot claim production closure without VPS proof.

Critical Assumptions And Invalidators

Assumptions:

  • Feature flag rollback is acceptable.
  • Existing state fields can support Phase 1b initially by storing primary agent in domain_agent and aggregate details in audit rows.
  • A DB schema migration is avoidable for the first PR.
  • Master bot comments with VERDICT:AGENT:* are acceptable.
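
To make the last assumption concrete, here is a hedged sketch of reading a per-agent verdict out of a bot comment. The spec only names the `VERDICT:AGENT:*` pattern; the exact tag grammar assumed here (`VERDICT:<AGENT>:<OUTCOME>` in upper case) and the helper `find_agent_verdict` are illustrative guesses, not the behavior of parse_verdict in lib/eval_parse.py.

```python
import re
from typing import Optional

# Assumed tag shape: VERDICT:<AGENT>:<OUTCOME>, for example VERDICT:RIO:APPROVE.
_VERDICT_TAG = re.compile(r"VERDICT:([A-Z]+):([A-Z_]+)")


def find_agent_verdict(comment_body: str, agent: str) -> Optional[str]:
    """Return the last outcome tagged for `agent`, lowercased, or None if absent."""
    found = None
    for tag_agent, outcome in _VERDICT_TAG.findall(comment_body.upper()):
        if tag_agent == agent.upper():
            found = outcome.lower()
    return found
```

Returning None when no tag is present keeps the "missing verdict" case distinguishable from an explicit rejection.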

Invalidators:

  • Downstream merge logic requires formal reviews from separate GitHub users.
  • Dashboards or contributor credit fail hard when Leo is not present.
  • Batch eval cannot be safely disabled and must be route-aware from day one.
  • Production env cannot set feature flags.

State And Truth Contract

Feature flag:

PHASE1B_AGENT_ROUTING_ENABLED=false

When false:

  • Existing eval behavior continues.

When true:

  • Eval route is built for every non-bypass PR.
  • Audit log records route JSON.
  • Required agent reviews run.
  • Aggregate verdict determines approval or feedback.
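
A minimal sketch of reading the flag with a default-off posture follows. Only the variable name `PHASE1B_AGENT_ROUTING_ENABLED` comes from this spec; the helper name `phase1b_enabled` and the accepted truthy spellings are assumptions, and lib/config.py may parse the value differently.

```python
import os
from typing import Mapping, Optional


def phase1b_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only for explicit truthy values; anything else stays off."""
    env = os.environ if env is None else env
    raw = env.get("PHASE1B_AGENT_ROUTING_ENABLED", "false")
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

Treating unknown values as false means a typo in the environment falls back to the legacy path rather than silently enabling the new one.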

Minimal DB field use:

  • domain: store the route's primary domain, or multi.
  • domain_agent: store the primary agent.
  • domain_verdict: store the aggregate verdict of the non-Leo reviews, or the overall aggregate verdict.
  • leo_verdict: set to skipped unless Leo is a required agent; if Leo is required, store Leo's verdict.
  • review_records: write one row per required reviewer attempt with reviewer agent, model, outcome, and notes.
  • review comments: include a PHASE1B_REVIEW marker; the current local helper suppresses duplicate posts for the same PR and agent.
  • audit log: route and all per-agent verdicts.

This is a compatibility posture, not the ideal long-term schema.
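
The compatibility mapping can be sketched as a pure function. Everything here is an assumption layered on the field list above: the function name, the argument shapes, treating the first required agent as the primary agent, and using multi whenever more than one agent is required.

```python
from typing import Dict, List


def project_to_legacy_columns(
    required_agents: List[str],
    verdicts: Dict[str, str],
    aggregate: str,
    primary_domain: str,
) -> Dict[str, str]:
    """Map Phase 1b results onto the existing columns (sketch, not the real schema code)."""
    leo_required = "Leo" in required_agents
    return {
        # Assumption: cross-agent routes collapse to "multi".
        "domain": primary_domain if len(required_agents) == 1 else "multi",
        # Assumption: the first required agent is the primary agent.
        "domain_agent": required_agents[0],
        "domain_verdict": aggregate,
        # Leo's column stays "skipped" unless the route actually required Leo.
        "leo_verdict": verdicts.get("Leo", "skipped") if leo_required else "skipped",
    }
```

Keeping the projection in one helper means a later migration to dedicated route columns only has to replace this function.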

Measurement Contract

Required local assertions:

  • With the Phase 1b flag disabled, eval uses the old runner calls.
  • With the Phase 1b flag enabled, eval calls run_agent_review once for a single route.
  • With the Phase 1b flag enabled, eval calls run_agent_review twice for a multi route.
  • run_leo_review is not called unless Leo is in required_agents.
  • All approvals return an approved aggregate.
  • A single request-changes verdict returns a feedback aggregate.
  • A transport failure reopens the PR for retry.
  • A retry after a partial multi-agent success does not duplicate verdict comments that were already posted.
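
The retry assertion depends on duplicate suppression, which might look like the sketch below. Only the `PHASE1B_REVIEW` token comes from this spec; the marker layout as an HTML comment, the helper names, and the injected `post_comment` callable are hypothetical.

```python
from typing import Callable, Iterable


def review_marker(agent: str) -> str:
    # Hypothetical marker layout; only the PHASE1B_REVIEW token is from the spec.
    return f"<!-- PHASE1B_REVIEW agent={agent} -->"


def post_verdict_once(
    agent: str,
    body: str,
    existing_comments: Iterable[str],
    post_comment: Callable[[str], None],
) -> bool:
    """Post the verdict unless a comment on this PR already carries the agent's marker."""
    marker = review_marker(agent)
    if any(marker in comment for comment in existing_comments):
        return False  # A previous attempt already posted for this agent.
    post_comment(f"{marker}\n{body}")
    return True
```

With this shape, a retry after Theseus succeeded and Rio failed would skip Theseus and post only Rio's verdict.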

Backend Work Required

Owned files:

  • lib/evaluate.py
  • lib/llm.py
  • lib/config.py
  • lib/eval_parse.py only if parser compatibility needs explicit tests or normalization.
  • tests/test_evaluate_agent_routing.py
  • tests/test_eval_parse.py

Implementation steps:

  1. Add PHASE1B_AGENT_ROUTING_ENABLED to lib/config.py.
  2. Import route scorer.
  3. Add run_agent_review in lib/llm.py.
  4. Add helper to load agent context from KB worktree.
  5. Add aggregate_agent_verdicts.
  6. In evaluate_pr, after bypasses and diff filtering, branch into Phase 1b path when flag is true.
  7. In Phase 1b path, run required reviews and post comments through the existing API helper.
  8. Update DB fields conservatively.
  9. Write review_records rows for each required reviewer attempt.
  10. Preserve old logic under flag false.
  11. Disable _build_domain_batches while flag is true or make it route-aware.
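
Step 5 can be sketched as follows. The name aggregate_agent_verdicts is taken from the step list, but the signature, the verdict strings, and the three return values are assumptions made for illustration. Treating a reviewer with no verdict as a retry is also an assumption; the spec states only that a transport failure reopens for retry.

```python
from typing import Dict, List, Optional

APPROVE = "approve"  # Assumed verdict string.


def aggregate_agent_verdicts(
    required_agents: List[str],
    verdicts: Dict[str, Optional[str]],
) -> str:
    """Combine per-agent verdicts into 'approved', 'feedback', or 'retry'."""
    # A required agent with no verdict (for example, a transport failure) blocks aggregation.
    if any(verdicts.get(agent) is None for agent in required_agents):
        return "retry"
    # Approval requires every required agent to approve.
    if all(verdicts[agent] == APPROVE for agent in required_agents):
        return "approved"
    # Any other verdict from any required agent sends the PR back with feedback.
    return "feedback"
```

Verdicts from agents outside `required_agents` are ignored, which matches the rule that required agents are the only reviewers that count.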

Forbidden files:

  • Deprecated eval shell scripts.
  • Deployment scripts unless needed for documenting the flag.
  • Runtime secrets.

Frontend Work Required

None.

Expected Runtime And User-Visible Behavior

Single-agent example:

PR touches internet finance.
route.required_agents = ["Rio"]
pipeline posts a Rio verdict.
merge proceeds if Rio approves.

Cross-agent example:

PR touches AI systems and x402 payments.
route.required_agents = ["Theseus", "Rio"]
pipeline posts Theseus and Rio verdicts.
merge proceeds only if both approve.

Fallback example:

PR cannot be confidently routed.
route.required_agents = ["Leo"]
pipeline posts Leo verdict.
route_kind = fallback is audited.
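
The three examples above share one shape, which can be sketched as a small record plus an audit serializer. The fields `required_agents` and `route_kind` appear in the examples; the class name `RouteResult`, the `single` and `multi` kind values, and the JSON layout are assumptions rather than the route scorer's actual output.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class RouteResult:
    required_agents: Tuple[str, ...]
    route_kind: str  # "fallback" is from the spec; "single" and "multi" are assumed.


SINGLE = RouteResult(required_agents=("Rio",), route_kind="single")
CROSS = RouteResult(required_agents=("Theseus", "Rio"), route_kind="multi")
FALLBACK = RouteResult(required_agents=("Leo",), route_kind="fallback")


def route_audit_json(route: RouteResult) -> str:
    """Serialize the route for an audit-log entry with stable key order."""
    return json.dumps(asdict(route), sort_keys=True)
```

Sorting keys keeps audit entries byte-stable across runs, which makes them easier to diff during staging proof.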

Validation And Test Matrix

Commands:

python3 -m pytest tests/test_evaluate_agent_routing.py tests/test_eval_parse.py
python3 -m ruff check lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
git diff --check

Test cases:

  • flag-off old behavior smoke
  • flag-on single reviewer approve
  • flag-on single reviewer request changes
  • flag-on two reviewer approve
  • flag-on two reviewer one reject
  • missing verdict
  • transport failure
  • Leo required route
  • Leo not required route
  • batch disabled or route-aware under flag
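
A few of these cases can be sketched with unittest.mock. The function `run_required_reviews` is a hypothetical stand-in for the Phase 1b path, and routing Leo through run_leo_review when Leo is required is an assumption. The real tests in tests/test_evaluate_agent_routing.py would patch the actual functions in lib/llm.py instead.

```python
from unittest.mock import Mock


def run_required_reviews(required_agents, run_agent_review, run_leo_review):
    """Stand-in for the Phase 1b path: one review call per required agent."""
    verdicts = {}
    for agent in required_agents:
        if agent == "Leo":
            # Assumption: a required Leo review still goes through the Leo runner.
            verdicts[agent] = run_leo_review()
        else:
            verdicts[agent] = run_agent_review(agent)
    return verdicts


def test_single_route_calls_agent_review_once():
    agent_review, leo_review = Mock(return_value="approve"), Mock()
    run_required_reviews(["Rio"], agent_review, leo_review)
    agent_review.assert_called_once_with("Rio")
    leo_review.assert_not_called()


def test_multi_route_calls_agent_review_twice():
    agent_review, leo_review = Mock(return_value="approve"), Mock()
    run_required_reviews(["Theseus", "Rio"], agent_review, leo_review)
    assert agent_review.call_count == 2
    leo_review.assert_not_called()


def test_leo_runs_only_when_required():
    agent_review, leo_review = Mock(), Mock(return_value="approve")
    verdicts = run_required_reviews(["Leo"], agent_review, leo_review)
    leo_review.assert_called_once_with()
    agent_review.assert_not_called()
    assert verdicts == {"Leo": "approve"}
```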

CI/CD, Release, And Pre-Push Gate Contract

Before PR:

  • Focused tests pass.
  • Old behavior remains behind flag false.
  • No production default flips to true.

Before staging:

  • Operator can enable flag in staging env.
  • Sandbox repo target is configured.

Before production:

  • Staging proof artifact exists.
  • Rollback command is known.

Independent CLI Audit Contract

Reviewer commands:

git diff -- lib/evaluate.py lib/llm.py lib/config.py tests/test_evaluate_agent_routing.py
python3 -m pytest tests/test_evaluate_agent_routing.py

Reviewer checks:

  • No deprecated scripts revived.
  • No secrets introduced.
  • Feature flag false preserves old path.
  • Feature flag true bypasses default Leo second-review.
  • Cross-domain aggregate requires all required reviewers to approve.

Outside-The-Box Fix Paths

If compatibility fields become confusing:

  • Add a narrow DB migration for route_json and agent_verdicts_json.

If batch eval blocks safe integration:

  • Disable batch eval under Phase 1b flag for one release.

If LLM review prompts get too large:

  • Load only identity plus beliefs first, then add reasoning/skills later.

Maintenance Capture

Beneficial now:

  • Isolate Phase 1b logic into helpers instead of expanding evaluate_pr deeply.
  • Keep rollback path explicit.

Avoid now:

  • Full eval architecture rewrite.
  • Dashboard redesign.
  • Broad DB migration unless tests require it.

Parallelization And Fanout

Classification: local_owner.

Do not fan out before the router contract lands. Eval integration depends tightly on route result semantics.

Worker-ready prompt:

wire phase 1b routing into teleo-infrastructure eval behind PHASE1B_AGENT_ROUTING_ENABLED. own lib/evaluate.py, lib/llm.py, lib/config.py, and mocked eval tests. run required agents from the route result, aggregate verdicts, preserve old behavior when the flag is false, and do not revive deprecated scripts.

Acceptance Criteria

  • Flag false path remains available.
  • Flag true path runs required agents only.
  • One or two verdicts aggregate correctly.
  • Existing merge or feedback path is preserved.
  • Focused mocked tests pass.

Readiness And Claim Boundaries

Allowed claim:

  • "Phase 1b eval integration is locally tested behind a feature flag."

Forbidden claim:

  • "Phase 1b is live."

Spec Quality Self-Audit

All required execution-grade headings are present. This spec intentionally defers exact production commands to the staging/proof child spec because they depend on VPS truth.

Assistant-Added Caveats

The compatibility use of domain_verdict and leo_verdict is a pragmatic Phase 1b bridge. A cleaner route schema may be worth adding after staging proof, but a premature migration would widen the blast radius.