twentyOne2x 7390e1e843 Implement phase 1b agent routing

2026-05-29 14:00:13 +02:00

8.8 KiB

Raw Blame History

Phase 1b Child Spec: Reporting And Contributor Compatibility

Created: 2026-05-29 Status: active draft Parent spec: docs/phase1b-agent-routing-spec.md

Product Outcome Contract

Phase 1b must not make dashboards, health checks, or contributor credit lie about review state. Reporting may stay minimal, but it must not mark a cross-domain PR as ready before all required agents have reviewed.

Goal

Update compatibility surfaces so Phase 1b required-agent reviews are represented accurately enough for operations, health, and contributor attribution without doing a dashboard redesign.

Non-Goals

Do not redesign the dashboard UI.
Do not implement a new leaderboard model.
Do not require a broad DB migration unless review_records is insufficient.
Do not make production-readiness claims from health-check summaries alone.

Current Implementation Audit

Current truth:

lib/db.py already has review_records with pr_number, domain, agent, reviewer, reviewer_model, outcome, rejection_reason, and notes.
lib/contributor.py assumes Leo reviews every PR and credits Leo plus one domain_agent.
lib/health.py computes approval rates from domain_verdict and leo_verdict.
lib/health.py builds reviewer strings only from domain_verdict and leo_verdict.
pipeline-health-check.py can parse arbitrary VERDICT:AGENT:* tags, but it has no required-agent concept.
A cross-domain PR with one approval and one missing required review could be misclassified if reporting only checks "any approve".

Existing-Spec Inventory

Existing doc	Relevance	Decision
`docs/phase1b-agent-routing-spec.md`	Parent route/verdict state.	Reuse.
`docs/ARCHITECTURE.md`	Health/dashboard baseline.	Reuse as context.
`docs/DIAGNOSTICS-AGENT-SPEC.md`	Diagnostics philosophy.	Reuse as later direction, not immediate scope.

Goal-Vs-Repo-Truth Diff

Goal:

Required-agent state is visible enough to avoid false readiness.
Contributor evaluator credit follows actual approved reviewer agents.
Health and pipeline checks can distinguish incomplete cross-domain review.

Repo truth:

Legacy fields only represent domain_verdict plus leo_verdict.
Contributor credit hardcodes Leo as universal reviewer.
pipeline-health-check.py parses comments but does not know required reviewers.

Completion Percent And Remaining Delta

Current completion: 10 percent because review_records already exists.

Remaining delta:

Ensure eval integration writes one review_records row per required reviewer.
Update contributor attribution to prefer approved review_records.
Keep legacy fields as projection only.
Add optional route marker parsing to pipeline-health-check.py.
Add tests proving no partial-review false readiness.

Closure, Endpoint, And Deployment Truth

Local closure:

Tests prove contributor credit and stage classification respect required reviewers.

Staging closure:

Staging proof artifact and health readback agree on required-agent completion.

Production closure:

Production health does not show PRs as ready before all required agents approve.

Critical Assumptions And Invalidators

Assumptions:

review_records is available in production DB schema.
Eval integration can write review_records for each required reviewer.
Dashboards can tolerate legacy projections during Phase 1b.

Invalidators:

Production DB lacks review_records.
Contributor code path cannot query review_records without performance issues.
Branch protection or merge logic uses legacy fields directly for readiness.

State And Truth Contract

review_records becomes the compatibility source for per-agent reviewer history.

Required eval write:

one review_records row per required reviewer per PR attempt

Legacy projection:

domain_agent = primary_agent
domain_verdict = aggregate_verdict
leo_verdict = actual Leo verdict when Leo is required, else skipped

Route/audit JSON remains the source for required_agents.

Measurement Contract

Minimum compatibility metrics:

review_records_written_count
required_reviews_missing_count
partial_review_not_ready_count
contributor_evaluator_credit_count_by_agent

Minimum proof:

A two-agent PR with one approval and one missing verdict is not classified as ready.
A two-agent PR with two approvals is classified as ready.
Contributor credit includes both approved reviewers.

Backend Work Required

Owned files:

lib/contributor.py
lib/health.py
pipeline-health-check.py
tests/test_contributor.py or new focused test.
tests/test_pipeline_health_phase1b.py if added.

Implementation steps:

Confirm review_records exists in local schema and migrations.
Update eval integration spec to write review records per required reviewer.
Update contributor credit to prefer approved review_records.reviewer rows.
Fall back to legacy leo_verdict and domain_verdict for old data.
Update health output to include review records or route audit fields where available.
Update pipeline health check to read required-agent markers if present.
Add tests.

Forbidden work:

Dashboard redesign.
New leaderboard model.
Broad schema migration before proof requires it.

Frontend Work Required

None.

Expected Runtime And User-Visible Behavior

Operators should see:

Per-agent reviewer outcomes when available.
Cross-domain PRs not marked ready until all required reviewers approve.
Contributor credit reflecting actual approved reviewer agents.

Existing dashboard layout can remain unchanged if data is honest.

Validation And Test Matrix

Commands:

python3 -m pytest tests/test_contributor.py tests/test_pipeline_health_phase1b.py
python3 -m ruff check lib/contributor.py lib/health.py pipeline-health-check.py tests
git diff --check

Test cases:

old data fallback credits Leo/domain reviewer.
new review_records data credits all approved required reviewers.
request-changes reviewer receives no evaluator credit.
one missing required reviewer blocks ready classification.
all required reviewers approve enables ready classification.

CI/CD, Release, And Pre-Push Gate Contract

Before PR:

Compatibility tests pass or are documented as not runnable due missing dev deps.

Before staging:

Staging proof includes health and contributor-readback commands.

Before production:

Operator verifies no partial-review false readiness in logs/health readback.

Independent CLI Audit Contract

Reviewer commands:

rg -n "Leo reviews every PR|leo_verdict|domain_verdict|review_records|required_agents|VERDICT" lib pipeline-health-check.py tests
sqlite3 /path/to/pipeline.db ".schema review_records"

Reviewer checks:

review_records is preferred for new evaluator credit.
Legacy fallback remains for old rows.
Health does not rely on any-approve for multi-review readiness.

Outside-The-Box Fix Paths

If review_records is insufficient:

Add additive route_json and agent_verdicts_json columns to prs.

If pipeline-health-check.py cannot read route markers:

Treat cross-domain PRs as awaiting review unless all verdict tags expected by route artifact are present.

If contributor credit is too risky for Phase 1b:

Defer credit mutation and emit review-record-only proof until after eval stability.

Maintenance Capture

Beneficial now:

Replace comments claiming "Leo reviews every PR."
Add focused tests for the compatibility projection.

Avoid now:

Dashboard UI rewrite.
Historical backfill.
Leaderboard redesign.

Parallelization And Fanout

Classification: ready_now after eval integration establishes review record writes.

Worker-ready prompt:

make reporting and contributor attribution phase 1b-compatible. prefer review_records for new evaluator credit, preserve legacy fallback, and prevent health/pipeline checks from marking cross-domain prs ready before all required agents approve. do not redesign dashboards or add broad schema migrations unless tests prove necessary.

Acceptance Criteria

No code path claims Leo reviews every new Phase 1b PR.
Approved review_records can credit all required reviewer agents.
Health/check logic avoids partial-review false readiness.
Legacy data still renders.

Readiness And Claim Boundaries

Allowed claim:

"Reporting compatibility is updated to avoid false readiness and credit loss."

Forbidden claim:

"Dashboards are redesigned for Phase 1b."

Spec Quality Self-Audit

All required execution-grade headings are present. This spec is intentionally compatibility-scoped and does not attempt a full reporting product redesign.

Assistant-Added Caveats

The safest first move is to write accurate review_records and route audit JSON. Rich dashboards should wait until production behavior proves stable.

8.8 KiB Raw Blame History