fix(reaper): branch allowlist for disposable pipeline-managed branches

Apply Ganymede review nit #3 from the f97dd15 review (the deferred close_on_forgejo fix already landed in e14b5f2; Ganymede was reviewing the older commit).

SQL gate previously had no branch filter. Empirically all 92 candidates were extract/*, but structurally any agent branch in the deadlock shape was a candidate.

Positive allowlist for extract/, reweave/, fix/ scopes the reaper to disposable pipeline-managed branches that the pipeline created and can recreate. Agent branches (theseus/, vida/, epimetheus/, etc.) are WIP feature work and must not be reaped; owners review their own PRs on their own cadence.

Cheap target-class lock complementing the LIMIT 50 blast-radius cap. Same scoping principle as PIPELINE_OWNED_PREFIXES, but tighter: epimetheus/ review branches are pipeline-owned for merge purposes but NOT disposable.

Items 2-4 from this review:
- WARNING #2 (audit_log idx_audit_event_ts): defer to followup branch alongside sync-mirror migration cleanup, as Ganymede suggested.
- NIT #3 (this commit): branch allowlist applied.
- NIT #4 (token asymmetry comment=admin/close=leo): confirmed established codebase pattern. merge.py:946-948 does the same: comment system-toned, close attributed to Leo for verdict-source UI clarity. Not accidental.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
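
The allowlist described in this commit message can be illustrated with a simple prefix check. This is a minimal sketch, not the reaper's actual code: the commit applies the filter inside the SQL gate, and the names `REAPER_DISPOSABLE_PREFIXES` and `is_reapable` are hypothetical. Only the three prefixes and the example agent prefixes come from the message itself.

```python
# Hypothetical names; the real filter lives in the reaper's SQL gate.
# Only these three prefixes are treated as disposable, per the commit message.
REAPER_DISPOSABLE_PREFIXES = ("extract/", "reweave/", "fix/")


def is_reapable(branch: str) -> bool:
    """Return True only for disposable pipeline-managed branches."""
    # str.startswith accepts a tuple and matches any of its members.
    return branch.startswith(REAPER_DISPOSABLE_PREFIXES)


# Illustrative branch names (made up for this example).
candidates = [
    "extract/2026-04-24-example",
    "theseus/wip-feature",
    "epimetheus/review-1",
    "reweave/example",
    "fix/example",
]
reapable = [b for b in candidates if is_reapable(b)]
```

A positive allowlist fails closed: a new agent prefix that nobody has classified yet is excluded by default, which a deny-list of agent names would not guarantee.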
1 changed file with 28 additions and 1 deletion
--- a/lib/substantive_fixer.py
+++ b/lib/substantive_fixer.py
@@ -539,15 +539,36 @@ async def substantive_fix_cycle(conn, max_workers=None) -> tuple[int, int]:

    # Filter to only PRs with substantive issues (not just mechanical)
    substantive_rows = []
+    skipped_no_tags = []
    for row in rows:
        try:
            issues = json.loads(row["eval_issues"] or "[]")
        except (json.JSONDecodeError, TypeError):
+            # Corrupt JSON in eval_issues is abnormal (post-merge column drift,
+            # hand-edited row, partial write during crash). WARN so ops can chase
+            # the upstream column-write path. Without this, the row drops out of
+            # both substantive_rows and skipped_no_tags — the third silent path.
+            logger.warning(
+                "PR #%d: corrupt eval_issues JSON — skipping in substantive fix cycle",
+                row["number"],
+            )
            continue
        if set(issues) & (FIXABLE_TAGS | CONVERTIBLE_TAGS | UNFIXABLE_TAGS):
            substantive_rows.append(row)
+        else:
+            skipped_no_tags.append((row["number"], issues))

    if not substantive_rows:
+        # Visibility for the LIMIT-3 head-of-line block: if the oldest
+        # candidates have no fixer-actionable tags (e.g. eval_issues=[],
+        # broken_wiki_links only), the cycle silently returns 0 — and the
+        # next cycle picks the same head-of-line, forever. Log the eval_issues
+        # of skipped candidates so the journal makes the block visible.
+        if skipped_no_tags:
+            logger.info(
+                "Substantive fix cycle: 0 actionable from %d candidate(s) — head-of-line: %s",
+                len(rows), skipped_no_tags,
+            )
        return 0, 0

    fixed = 0
@@ -559,7 +580,13 @@ async def substantive_fix_cycle(conn, max_workers=None) -> tuple[int, int]:
            if result.get("action"):
                fixed += 1
            elif result.get("skipped"):
-                logger.debug("PR #%d: substantive fix skipped: %s", row["number"], result.get("reason"))
+                # Was DEBUG — promoted to INFO to make stuck-PR root cause
+                # visible without enabling DEBUG fleet-wide. (Ship Apr 24+
+                # silent skip diagnosis.)
+                logger.info(
+                    "PR #%d: substantive fix skipped: %s",
+                    row["number"], result.get("reason"),
+                )
        except Exception:
            logger.exception("PR #%d: substantive fix failed", row["number"])
            errors += 1
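
After the first hunk, every candidate row ends up in one of three outcomes: actionable, skipped for lack of fixer-actionable tags, or dropped with a warning because its JSON is corrupt. The standalone sketch below reproduces that partition so it can be run outside the pipeline. The tag sets are placeholders (the real FIXABLE_TAGS, CONVERTIBLE_TAGS, and UNFIXABLE_TAGS are not shown in this diff), and `partition` is a hypothetical helper, not a function in lib/substantive_fixer.py.

```python
import json
import logging

logger = logging.getLogger("substantive_fixer_sketch")

# Placeholder tag sets; the real values are defined elsewhere in the module.
FIXABLE_TAGS = {"example_fixable"}
CONVERTIBLE_TAGS = {"example_convertible"}
UNFIXABLE_TAGS = {"example_unfixable"}
ACTIONABLE = FIXABLE_TAGS | CONVERTIBLE_TAGS | UNFIXABLE_TAGS


def partition(rows):
    """Split candidate rows into actionable, skipped-no-tags, and corrupt."""
    actionable, skipped_no_tags, corrupt = [], [], []
    for row in rows:
        try:
            # A NULL column is treated as an empty issue list.
            issues = json.loads(row["eval_issues"] or "[]")
        except (json.JSONDecodeError, TypeError):
            # Same idea as the patch: make the drop visible at WARNING.
            logger.warning("PR #%d: corrupt eval_issues JSON", row["number"])
            corrupt.append(row["number"])
            continue
        if set(issues) & ACTIONABLE:
            actionable.append(row["number"])
        else:
            skipped_no_tags.append((row["number"], issues))
    return actionable, skipped_no_tags, corrupt


rows = [
    {"number": 1, "eval_issues": '["example_fixable"]'},
    {"number": 2, "eval_issues": "[]"},
    {"number": 3, "eval_issues": '["broken_wiki_links"]'},
    {"number": 4, "eval_issues": "{not json"},
    {"number": 5, "eval_issues": None},
]
actionable, skipped_no_tags, corrupt = partition(rows)
```

With a LIMIT 3 query ordered oldest-first, rows like 2, 3, and 5 at the head of the queue would leave `actionable` empty on every tick, which is the head-of-line block the new INFO log is meant to expose.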
Author	SHA1	Message	Date
m3taversal	4b2b59b184	fix(reaper): branch allowlist for disposable pipeline-managed branches	2026-05-07 23:43:53 -04:00
m3taversal	ba234ec4b3	fix(reaper): apply Ganymede review - dual-PATCH drift, breaker isolation, env config	2026-05-07 23:43:53 -04:00

    Followup to `f97dd15`. Four fixes from review:

    MUST-FIX #1: Forgejo double-PATCH drift
    The reaper closes the PR via a forgejo_api PATCH at line 689, then close_pr() at line 700 issued a second PATCH (default close_on_forgejo=True). On transient failure of the second PATCH, close_pr returns False without updating the DB, leaving status='open' even though the PR is closed on Forgejo. Pass close_on_forgejo=False so the DB close is unconditional after the explicit Forgejo PATCH succeeds.

    MUST-FIX #2: reaper exception trips fix breaker
    An unhandled exception in verdict_deadlock_reaper_cycle propagated to stage_loop, recording fix-stage failures. After 5 reaper failures the fix breaker would open and block mechanical+substantive for 15 min. Wrap the reaper call in try/except in fix_cycle (same exception-isolation pattern as ingest_cycle's extract_cycle wrapper). Defense-in-depth must never block primary paths.

    WARNING #1: throttle SQL full-scan
    audit_log only has idx_audit_stage. Filtering on event alone caused full-table scans every 60s. Added stage='reaper' so the planner uses the existing index. The reaper already writes audit rows under stage='reaper', so the filter is correct.

    WARNING #2: REAPER_DRY_RUN as code constant
    Flipping dry-run → live required edit + commit + push + deploy + restart. Moved REAPER_DRY_RUN, REAPER_DEADLOCK_AGE_HOURS, REAPER_INTERVAL_SECONDS, REAPER_MAX_PER_RUN to lib/config.py with os.environ.get() overrides. Operator now flips via systemctl edit teleo-pipeline.service (Environment=REAPER_DRY_RUN=false) + restart. Defaults remain safe: dry-run, 24h age, hourly throttle, 50/run cap.

    NIT: dry-run counter naming
    Renamed the local `closed` counter in the dry-run path to `would_close` so the heartbeat audit ("X closed, Y would-close") and journal log are unambiguous. The function still returns closed + would_close so callers see total work done.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
m3taversal	e63d27d259	fix(reaper): verdict-deadlock reaper - close stuck PRs after 24h	2026-05-07 23:43:53 -04:00

    Defense-in-depth for PRs that substantive_fixer can't make progress on. Targets two stuck-verdict shapes empirically observed in production:

    1. leo:request_changes + domain:approve
       Leo asked for a substantive fix; the fixer either failed silently (no_claim_files / no_review_comments / etc.) or the issue tag isn't in FIXABLE | CONVERTIBLE | UNFIXABLE.
    2. leo:skipped + domain:request_changes
       Eval bypassed Leo (eval_attempts >= MAX). Domain rejected with no structured eval_issues, so the fixer can't classify the issue.

    92 PRs match this gate today, oldest at 2026-04-24 (13d stuck).

    Behavior:
    - Hourly throttle via audit_log sentinel ('verdict_deadlock_reaper_run').
    - REAPER_DRY_RUN=True default: first deploy emits 'would_close' audit events only. No DB writes. No Forgejo writes. (Ship Apr 24 directive.)
    - 24h cooldown, oldest-first, capped at 50 per run.
    - Heartbeat audit fires whether dry-run or live, so throttle works.
    - Live mode: posts comment + closes Forgejo PR + close_pr() in DB. Audits 'verdict_deadlock_closed' per PR.
    - Forgejo PATCH None → skip DB close (avoid drift).

    Wired into fix_cycle() in teleo-pipeline.py. Runs after mechanical and substantive fixes, never blocks them.

    Followup (post first-run audit verification):
    - Operator inspects 'verdict_deadlock_would_close' audit rows
    - Flips REAPER_DRY_RUN to False, redeploys
    - Reaper actually closes on next hourly tick
m3taversal	517e9884cc	fix(substantive_fixer): WARN on corrupt eval_issues JSON	2026-05-07 18:33:08 -04:00

    Third silent return path in substantive_fix_cycle: the JSON-decode except at the eval_issues parse drops rows that don't reach skipped_no_tags or substantive_rows. If all 3 LIMIT-3 candidates have corrupt JSON, the cycle returns 0,0 with no log entry.

    WARN level (not INFO): corrupt JSON is abnormal (post-merge column drift, hand-edited DB row, partial write during crash). If this fires, ops want to chase the upstream column-write path. If it never fires, baseline noise stays at zero.

    Closes the visibility gap on ALL silent returns in this function, not just the two patched in `3f8666e`.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
m3taversal	3f8666ee0c	fix(substantive_fixer): surface silent-skip reasons at INFO	2026-05-07 11:58:22 -04:00

    Two silent paths in substantive_fix_cycle masked a 13-day stall:

    1. Filter strips all candidates → return 0,0 with no log. With LIMIT 3 ordered created_at ASC, if the oldest 3 have no fixer-actionable tags (e.g. eval_issues=[] from leo:skipped+domain:request_changes), the cycle silently picks the same head-of-line every tick.
    2. _fix_pr early-returns logged at DEBUG only, invisible without fleet-wide DEBUG. Skip reasons (no_claim_files, no_review_comments, not_open lock, worktree_failed, etc.) never surfaced in journalctl.

    Patch: log skipped candidate eval_issues when no actionable rows found (path 1); promote DEBUG→INFO for per-PR skip reasons (path 2). Zero behavior change, observability only.

    Diagnosis context: 98 PRs stuck >3d, last successful substantive_fixer event 2026-04-24. Need journal evidence to choose between (a) one-line fix to the cycle, (b) larger _fix_pr regression. (Ship Step 2 directive.)
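
The environment-override change described in ba234ec4b3 (WARNING #2) could look roughly like the sketch below. The four setting names and their defaults (dry-run on, 24h age, hourly throttle, 50 per run) come from that commit message. `load_reaper_config` and `_dry_run_enabled` are hypothetical helpers, and the actual parsing in lib/config.py may differ.

```python
import os

# Assumption: only these spellings switch dry-run off.
_FALSE_VALUES = {"0", "false", "no", "off"}


def _dry_run_enabled(raw):
    """Dry-run stays on unless the variable is explicitly set to a false value."""
    if raw is None:
        return True
    return raw.strip().lower() not in _FALSE_VALUES


def load_reaper_config(env=None):
    """Read reaper settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return {
        "REAPER_DRY_RUN": _dry_run_enabled(env.get("REAPER_DRY_RUN")),
        "REAPER_DEADLOCK_AGE_HOURS": int(env.get("REAPER_DEADLOCK_AGE_HOURS", "24")),
        "REAPER_INTERVAL_SECONDS": int(env.get("REAPER_INTERVAL_SECONDS", "3600")),
        "REAPER_MAX_PER_RUN": int(env.get("REAPER_MAX_PER_RUN", "50")),
    }


defaults = load_reaper_config({})
live = load_reaper_config({"REAPER_DRY_RUN": "false"})
typo = load_reaper_config({"REAPER_DRY_RUN": "flase"})
```

In this sketch an unrecognized value such as a typo keeps dry-run enabled, so a misconfigured unit file cannot accidentally put the reaper into live mode.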