Three changes:
1. Drop underscore prefixes in eval_parse.py — functions are now the public
API of the module (filter_diff, parse_verdict, classify_issues, etc.).
All 12 functions renamed, imports updated in evaluate.py and tests.
2. Extract eval_actions.py from evaluate.py — 3 async PR disposition functions:
- post_formal_approvals: submit Forgejo reviews from 2 agents
- terminate_pr: close PR, post rejection comment, requeue source
- dispose_rejected_pr: disposition logic for rejected PRs on attempt 2+
evaluate.py drops from ~1140 to 911 lines.
3. 14 new tests in test_eval_actions.py covering all three functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical bug fix: close_pr now checks forgejo_api return value and
skips DB update on Forgejo failure, preventing ghost PRs (DB closed,
Forgejo open). Returns bool so callers can handle failures.
_terminate_pr checks return value — skips source requeue on failure.
stale_pr.py migrated from raw Forgejo+DB to close_pr (last raw close
transition eliminated).
eval_parse.py: 15 pure parsing functions extracted from evaluate.py
(~370 lines removed). Zero I/O, zero async, independently testable.
evaluate.py drops from ~1510 to ~1140 lines.
Tests: 295 passed (42 new eval_parse + 2 new close_pr), zero regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 38 hand-crafted UPDATE prs SET status calls across evaluate.py
and merge.py with 7 centralized functions that enforce invariants:
- close_pr: always syncs Forgejo (opt-out for reconciliation)
- approve_pr: raises ValueError on empty domain (prevents NULL bugs)
- mark_merged: always sets merged_at, clears last_error
- mark_conflict: always increments merge_failures, sets merge_cycled
- mark_conflict_permanent: terminal conflict state
- reopen_pr: handles all reopen scenarios (transient, rejection, reeval)
- start_review: atomic claim with bool return
This eliminates the class of bugs that produced 3 incidents:
1. Domain NULL on musings bypass (7 PRs stuck, 20h zero throughput)
2. Forgejo ghost PRs (70 PRs open on Forgejo but closed in DB)
3. Merge_cycled missing on various close paths
Also fixes: 3 close paths in merge.py had DB update before Forgejo call
(reversed order). close_pr does Forgejo first, then DB.
Only remaining raw status transition: _claim_next_pr (approved→merging)
which is an atomic subquery and doesn't have invariant requirements.
20 new tests, 264 total passing, 0 regressions. Net -101 lines in
evaluate.py + merge.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two code paths set status='closed' in the pipeline DB without calling
the Forgejo API to close the PR. This caused 50 ghost PRs to accumulate
on Forgejo (dashboard shows review backlog) while the pipeline considered
them done.
- evaluate.py: no-diff stale branch close now calls Forgejo PATCH
- merge.py: permanent conflict close now calls Forgejo PATCH
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Musings bypass and batch both_approve set status='approved' without
domain or auto_merge. Merge gate requires domain IS NOT NULL and
prefix match OR auto_merge=1. Result: agent PRs deadlocked for 20+ hours.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Atomic extract-and-connect (lib/connect.py):
- After extraction writes claim files, each new claim is embedded via
OpenRouter, searched against Qdrant, and top-5 neighbors (cosine > 0.55)
are added as `related` edges in the claim's frontmatter
- Edges written on NEW claim only — avoids merge conflicts
- Cross-domain connections enabled, non-fatal on Qdrant failure
- Wired into openrouter-extract-v2.py post-extraction step
Stale PR monitor (lib/stale_pr.py):
- Every watchdog cycle checks open extract/* PRs
- If open >30 min AND 0 claim files → auto-close with comment
- After 2 stale closures → marks source as extraction_failed
- Wired into watchdog.py as check #6
Response audit system:
- response_audit table (migration v8), persistent audit conn in bot.py
- 90-day retention cleanup, tool_calls JSON column
- Confidence tag stripping, systemd ReadWritePaths for pipeline.db
Supporting infrastructure:
- reweave.py: nightly edge reconnection for orphan claims
- reconcile-sources.py: source status reconciliation
- backfill-domains.py: domain classification backfill
- ops/reconcile-source-status.sh: operational reconciliation script
- Attribution improvements, post-extract enrichments, merge improvements
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gate 3 in batch-extract-50.sh: query pipeline.db for closed PRs before
re-extracting. Sources with >=3 closed PRs are skipped (zombie protection).
Cost tracking: openrouter_call() now returns (text, usage) tuple with
prompt_tokens and completion_tokens from the OpenRouter API response.
All callers updated to unpack and pass tokens to costs.record_usage().
Added missing triage cost recording. Fixed batch domain review recording
cost once per batch instead of once per PR.
Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Claim-shape detector: if YAML has type: claim, force STANDARD minimum (Theseus)
- Random pre-merge promotion: 15% of LIGHT → STANDARD before eval (Rio)
- LIGHT_SKIP_LLM config flag: skip domain+Leo review for LIGHT (Rhea: env var rollback)
- Updated both_approve: domain_verdict=skipped is valid for LIGHT auto-approve
- Cost recording: only charge for reviews that actually ran
- SAMPLE_AUDIT_RATE bumped 0.10 → 0.15, audit model = Opus (Leo: different family from Haiku)
Multi-agent design review: Rio (gaming vectors, model diversity), Theseus (correlated
blindspots, claim-shape guard), Rhea (shadow mode, config flag, deployment), Leo (approval).
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>
Unevaluated PRs (eval_attempts=0) now sort before re-evals in the
eval cycle query. Fresh PRs have a higher chance of passing (~12%)
vs re-evals of already-rejected PRs. Prevents migration-reset PRs
from consuming eval slots that fresh PRs could use.
Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>
If source_path is NULL, the source requeue silently matches nothing.
Log a warning so we catch orphaned terminations in monitoring.
Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>
Schema migration v3: adds eval_attempts (INTEGER) and eval_issues (TEXT/JSON)
columns to prs table.
Retry budget logic (Ganymede-approved design):
- Increment eval_attempts on each evaluate_pr() call
- Hard cap: eval_attempts >= 3 → terminal (close PR, tag source needs_human)
- Attempt 1: normal — back to open, wait for fix
- Attempt 2: classify issues as mechanical/substantive
- Mechanical only (schema, wiki links, dedup): keep open for one more try
- Substantive (factual, confidence, scope, title): close PR, requeue source
- Issue tags parsed from reviewer comments, stored in eval_issues column
- SHA-based reset: new commits on PR branch → eval_attempts=0, verdicts reset
- Post-migration stagger: LIMIT 5 for first batch to avoid OpenRouter spike
- Cost recording updated: domain review → OpenRouter, Leo → tier-dependent
Stops the 32-PR infinite loop burning ~$0.03/cycle with no terminal state.
Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>
- Domain review → GPT-4o (OpenRouter), Leo STANDARD → Sonnet (OpenRouter),
Leo DEEP → Opus (Claude Max). Two model families = no correlated blind spots.
- Opus reserved for DEEP eval only — protects rate limit for overnight research.
- Review prompts calibrated: require per-criterion evidence, blocking-vs-observation
verdict rules. Moved from 100% rubber-stamp approval to 12% pass rate.
- OpenRouter failures classified as openrouter_failed (not rate_limited) to avoid
spurious 15-min Opus backoff.
- merge.py: pre-check PR state before merge API call (prevents 405 on re-merge).
Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>
- Fix#12: domain_review undefined on resume path — initialize to None,
guard _parse_issues() call. Prevents NameError on PRs resuming after
partial eval (76 PRs in this state right now).
- Fix#11: concurrent eval workers can duplicate reviews — add atomic
UPDATE SET status='reviewing' WHERE status='open' at top of
evaluate_pr(). Check rowcount, skip if already claimed.
- Fix#8: subprocess tracking for graceful shutdown — _active_subprocesses
set in evaluate module, tracked in _claude_cli_call, exposed via
kill_active_subprocesses(). Replaces dead code in teleo-pipeline.py.
- Fix health.py divide-by-zero — guard all metabolic metric reads against
None from NULLIF/empty result set. Prevents TypeError on /health when
no PRs have been evaluated in 24h.
Also includes Leo's existing hot-fixes:
- Rate limit detection checks stdout regardless of exit code
- 15-minute cycle-level backoff on rate limit
Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>