275 lines
8.8 KiB
Markdown
275 lines
8.8 KiB
Markdown
# Phase 1b Child Spec: Reporting And Contributor Compatibility
|
|
|
|
Created: 2026-05-29
|
|
Status: active draft
|
|
Parent spec: `docs/phase1b-agent-routing-spec.md`
|
|
|
|
## Product Outcome Contract
|
|
|
|
Phase 1b must not make dashboards, health checks, or contributor credit lie about review state. Reporting may stay minimal, but it must not mark a cross-domain PR as ready before all required agents have reviewed.
|
|
|
|
## Goal
|
|
|
|
Update compatibility surfaces so Phase 1b required-agent reviews are represented accurately enough for operations, health, and contributor attribution without doing a dashboard redesign.
|
|
|
|
## Non-Goals
|
|
|
|
- Do not redesign the dashboard UI.
|
|
- Do not implement a new leaderboard model.
|
|
- Do not require a broad DB migration unless `review_records` is insufficient.
|
|
- Do not make production-readiness claims from health-check summaries alone.
|
|
|
|
## Current Implementation Audit
|
|
|
|
Current truth:
|
|
|
|
- `lib/db.py` already has `review_records` with `pr_number`, `domain`, `agent`, `reviewer`, `reviewer_model`, `outcome`, `rejection_reason`, and `notes`.
|
|
- `lib/contributor.py` assumes Leo reviews every PR and credits Leo plus one `domain_agent`.
|
|
- `lib/health.py` computes approval rates from `domain_verdict` and `leo_verdict`.
|
|
- `lib/health.py` builds reviewer strings only from `domain_verdict` and `leo_verdict`.
|
|
- `pipeline-health-check.py` can parse arbitrary `VERDICT:AGENT:*` tags, but it has no required-agent concept.
|
|
- A cross-domain PR with one approval and one missing required review could be misclassified if reporting only checks "any approve".
|
|
|
|
## Existing-Spec Inventory
|
|
|
|
| Existing doc | Relevance | Decision |
|
|
| --- | --- | --- |
|
|
| `docs/phase1b-agent-routing-spec.md` | Parent route/verdict state. | Reuse. |
|
|
| `docs/ARCHITECTURE.md` | Health/dashboard baseline. | Reuse as context. |
|
|
| `docs/DIAGNOSTICS-AGENT-SPEC.md` | Diagnostics philosophy. | Reuse as later direction, not immediate scope. |
|
|
|
|
## Goal-Vs-Repo-Truth Diff
|
|
|
|
Goal:
|
|
|
|
- Required-agent state is visible enough to avoid false readiness.
|
|
- Contributor evaluator credit follows actual approved reviewer agents.
|
|
- Health and pipeline checks can distinguish incomplete cross-domain review.
|
|
|
|
Repo truth:
|
|
|
|
- Legacy fields only represent `domain_verdict` plus `leo_verdict`.
|
|
- Contributor credit hardcodes Leo as universal reviewer.
|
|
- `pipeline-health-check.py` parses comments but does not know required reviewers.
|
|
|
|
## Completion Percent And Remaining Delta
|
|
|
|
Current completion: 10 percent because `review_records` already exists.
|
|
|
|
Remaining delta:
|
|
|
|
1. Ensure eval integration writes one `review_records` row per required reviewer.
|
|
2. Update contributor attribution to prefer approved `review_records`.
|
|
3. Keep legacy fields as projection only.
|
|
4. Add optional route marker parsing to `pipeline-health-check.py`.
|
|
5. Add tests proving no partial-review false readiness.
|
|
|
|
## Closure, Endpoint, And Deployment Truth
|
|
|
|
Local closure:
|
|
|
|
- Tests prove contributor credit and stage classification respect required reviewers.
|
|
|
|
Staging closure:
|
|
|
|
- Staging proof artifact and health readback agree on required-agent completion.
|
|
|
|
Production closure:
|
|
|
|
- Production health does not show PRs as ready before all required agents approve.
|
|
|
|
## Critical Assumptions And Invalidators
|
|
|
|
Assumptions:
|
|
|
|
- `review_records` is available in production DB schema.
|
|
- Eval integration can write `review_records` for each required reviewer.
|
|
- Dashboards can tolerate legacy projections during Phase 1b.
|
|
|
|
Invalidators:
|
|
|
|
- Production DB lacks `review_records`.
|
|
- Contributor code path cannot query `review_records` without performance issues.
|
|
- Branch protection or merge logic uses legacy fields directly for readiness.
|
|
|
|
## State And Truth Contract
|
|
|
|
`review_records` becomes the compatibility source for per-agent reviewer history.
|
|
|
|
Required eval write:
|
|
|
|
```text
|
|
one review_records row per required reviewer per PR attempt
|
|
```
|
|
|
|
Legacy projection:
|
|
|
|
- `domain_agent = primary_agent`
|
|
- `domain_verdict = aggregate_verdict`
|
|
- `leo_verdict = actual Leo verdict when Leo is required, else skipped`
|
|
|
|
Route/audit JSON remains the source for `required_agents`.
|
|
|
|
## Measurement Contract
|
|
|
|
Minimum compatibility metrics:
|
|
|
|
- `review_records_written_count`
|
|
- `required_reviews_missing_count`
|
|
- `partial_review_not_ready_count`
|
|
- `contributor_evaluator_credit_count_by_agent`
|
|
|
|
Minimum proof:
|
|
|
|
- A two-agent PR with one approval and one missing verdict is not classified as ready.
|
|
- A two-agent PR with two approvals is classified as ready.
|
|
- Contributor credit includes both approved reviewers.
|
|
|
|
## Backend Work Required
|
|
|
|
Owned files:
|
|
|
|
- `lib/contributor.py`
|
|
- `lib/health.py`
|
|
- `pipeline-health-check.py`
|
|
- `tests/test_contributor.py` or new focused test.
|
|
- `tests/test_pipeline_health_phase1b.py` if added.
|
|
|
|
Implementation steps:
|
|
|
|
1. Confirm `review_records` exists in local schema and migrations.
|
|
2. Update eval integration spec to write review records per required reviewer.
|
|
3. Update contributor credit to prefer approved `review_records.reviewer` rows.
|
|
4. Fall back to legacy `leo_verdict` and `domain_verdict` for old data.
|
|
5. Update health output to include review records or route audit fields where available.
|
|
6. Update pipeline health check to read required-agent markers if present.
|
|
7. Add tests.
|
|
|
|
Forbidden work:
|
|
|
|
- Dashboard redesign.
|
|
- New leaderboard model.
|
|
- Broad schema migration before proof requires it.
|
|
|
|
## Frontend Work Required
|
|
|
|
None.
|
|
|
|
## Expected Runtime And User-Visible Behavior
|
|
|
|
Operators should see:
|
|
|
|
- Per-agent reviewer outcomes when available.
|
|
- Cross-domain PRs not marked ready until all required reviewers approve.
|
|
- Contributor credit reflecting actual approved reviewer agents.
|
|
|
|
Existing dashboard layout can remain unchanged if data is honest.
|
|
|
|
## Validation And Test Matrix
|
|
|
|
Commands:
|
|
|
|
```bash
|
|
python3 -m pytest tests/test_contributor.py tests/test_pipeline_health_phase1b.py
|
|
python3 -m ruff check lib/contributor.py lib/health.py pipeline-health-check.py tests
|
|
git diff --check
|
|
```
|
|
|
|
Test cases:
|
|
|
|
- old data fallback credits Leo/domain reviewer.
|
|
- new `review_records` data credits all approved required reviewers.
|
|
- request-changes reviewer receives no evaluator credit.
|
|
- one missing required reviewer blocks ready classification.
|
|
- all required reviewers approve enables ready classification.
|
|
|
|
## CI/CD, Release, And Pre-Push Gate Contract
|
|
|
|
Before PR:
|
|
|
|
- Compatibility tests pass or are documented as not runnable due missing dev deps.
|
|
|
|
Before staging:
|
|
|
|
- Staging proof includes health and contributor-readback commands.
|
|
|
|
Before production:
|
|
|
|
- Operator verifies no partial-review false readiness in logs/health readback.
|
|
|
|
## Independent CLI Audit Contract
|
|
|
|
Reviewer commands:
|
|
|
|
```bash
|
|
rg -n "Leo reviews every PR|leo_verdict|domain_verdict|review_records|required_agents|VERDICT" lib pipeline-health-check.py tests
|
|
sqlite3 /path/to/pipeline.db ".schema review_records"
|
|
```
|
|
|
|
Reviewer checks:
|
|
|
|
- `review_records` is preferred for new evaluator credit.
|
|
- Legacy fallback remains for old rows.
|
|
- Health does not rely on any-approve for multi-review readiness.
|
|
|
|
## Outside-The-Box Fix Paths
|
|
|
|
If `review_records` is insufficient:
|
|
|
|
- Add additive `route_json` and `agent_verdicts_json` columns to `prs`.
|
|
|
|
If `pipeline-health-check.py` cannot read route markers:
|
|
|
|
- Treat cross-domain PRs as awaiting review unless all verdict tags expected by route artifact are present.
|
|
|
|
If contributor credit is too risky for Phase 1b:
|
|
|
|
- Defer credit mutation and emit review-record-only proof until after eval stability.
|
|
|
|
## Maintenance Capture
|
|
|
|
Beneficial now:
|
|
|
|
- Replace comments claiming "Leo reviews every PR."
|
|
- Add focused tests for the compatibility projection.
|
|
|
|
Avoid now:
|
|
|
|
- Dashboard UI rewrite.
|
|
- Historical backfill.
|
|
- Leaderboard redesign.
|
|
|
|
## Parallelization And Fanout
|
|
|
|
Classification: ready_now after eval integration establishes review record writes.
|
|
|
|
Worker-ready prompt:
|
|
|
|
```text
|
|
make reporting and contributor attribution phase 1b-compatible. prefer review_records for new evaluator credit, preserve legacy fallback, and prevent health/pipeline checks from marking cross-domain prs ready before all required agents approve. do not redesign dashboards or add broad schema migrations unless tests prove necessary.
|
|
```
|
|
|
|
## Acceptance Criteria
|
|
|
|
- No code path claims Leo reviews every new Phase 1b PR.
|
|
- Approved `review_records` can credit all required reviewer agents.
|
|
- Health/check logic avoids partial-review false readiness.
|
|
- Legacy data still renders.
|
|
|
|
## Readiness And Claim Boundaries
|
|
|
|
Allowed claim:
|
|
|
|
- "Reporting compatibility is updated to avoid false readiness and credit loss."
|
|
|
|
Forbidden claim:
|
|
|
|
- "Dashboards are redesigned for Phase 1b."
|
|
|
|
## Spec Quality Self-Audit
|
|
|
|
All required execution-grade headings are present. This spec is intentionally compatibility-scoped and does not attempt a full reporting product redesign.
|
|
|
|
## Assistant-Added Caveats
|
|
|
|
The safest first move is to write accurate `review_records` and route audit JSON. Rich dashboards should wait until production behavior proves stable.
|