Commit graph

11 commits

Author SHA1 Message Date
fc002354d4 fix(substantive_fixer): json_valid guard in front of json_each
Ganymede review of 5db6a02 (msg 2 of 3): json_each(invalid_json) throws
'malformed JSON' and propagates up through EXISTS, failing the SELECT.
The fix-cycle call site at teleo-pipeline.py:104 isn't try/except wrapped
(the reaper at lines 109-116 is, the substantive cycle isn't), so a single
corrupt eval_issues row would trip the fix-stage breaker after 5 occurrences.

Fix is one line: add AND json_valid(eval_issues) before the EXISTS clause.
json_valid(NULL) returns NULL (false in WHERE), json_valid(invalid) returns 0,
json_valid(valid) returns 1. json_valid has shipped since SQLite 3.9, long
before the 3.45.1 running on the VPS.

WARN-on-corrupt-JSON path kept per Ganymede's Q3: json_valid and json.loads
use technically distinct parsers, the cost is ~3 rows × parse-empty-string per
cycle, and the journal entry names the failure mode if SQLite ever surfaces a
row that passes both SQL guards but fails json.loads.

Comment updated to reflect new guard ordering.
2026-05-08 13:12:25 -04:00
5db6a0248c fix(substantive_fixer): SQL-side actionable-tag filter, eliminate head-of-line
Step 4 of the stuck-PR triage. Push the FIXABLE/CONVERTIBLE/UNFIXABLE_TAGS
intersection from a post-fetch Python loop into the SELECT WHERE clause via
json_each + EXISTS. LIMIT 3 now always returns 3 actionable rows (or fewer if
that's all there are), eliminating the head-of-line block where 3 oldest
empty-eval_issues PRs occupied the slots forever.

Background: 11 hours of post-deploy logs showed substantive_fix_cycle stuck
emitting "0 actionable from 3 candidate(s) — head-of-line: [(3922, []), (3926,
[]), (3940, [])]" every cycle. Reaper closed those three on schedule, then a
new triple of empty-eval_issues PRs took their place. Reaper-as-primary-clearance
worked but is defense-in-depth, not the right architecture. Source of the block
is upstream in this SELECT.

Implementation choice: json_each + EXISTS over LIKE. Robust against tag-name
substring overlap, future-proof against tag renames, and SQLite 3.45.1 on VPS
fully supports it. Verified live: returns 13 of 28 currently-stuck PRs as
actionable, 15 fall through to reaper as before.

Tag list builds from the routing constants at runtime so adding a new tag
auto-updates the SELECT filter — no two-place edit footgun.
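
Sketch of the SQL-side filter under stated assumptions: the constant names mirror the ones named above, but the tag values, the `prs` table, and its `number` column are placeholders. It shows the tag list being derived from the constants at call time, so the filter and the routing tables cannot drift apart.

```python
import sqlite3

# Placeholder tag values; the real routing constants are not shown in this log.
FIXABLE_TAGS = {"tag_fixable"}
CONVERTIBLE_TAGS = {"tag_convertible"}
UNFIXABLE_TAGS = {"tag_unfixable"}


def select_actionable(conn, limit=3):
    """Return the oldest PR numbers that carry at least one routable tag.

    The IN list is built from the constants on every call, so adding a
    tag to any of the three sets changes the SELECT automatically.
    """
    tags = sorted(FIXABLE_TAGS | CONVERTIBLE_TAGS | UNFIXABLE_TAGS)
    placeholders = ",".join("?" * len(tags))
    sql = f"""
        SELECT number FROM prs
        WHERE EXISTS (
            SELECT 1 FROM json_each(prs.eval_issues)
            WHERE json_each.value IN ({placeholders})
        )
        ORDER BY created_at ASC
        LIMIT ?
    """
    return [row[0] for row in conn.execute(sql, (*tags, limit))]
```

Because the filter runs before LIMIT, rows with an empty eval_issues array never occupy a slot, however old they are.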

WARN-on-corrupt-JSON path retained as defense-in-depth (json_each and
json.loads use different parsers; technically possible for a row to pass one
but not the other).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:52:12 -04:00
4b2b59b184 fix(reaper): branch allowlist for disposable pipeline-managed branches
Apply Ganymede review nit #3 from f97dd15 review (the deferred close_on_forgejo
fix already landed in e14b5f2 — Ganymede was reviewing the older commit).

SQL gate previously had no branch filter — empirically all 92 candidates were
extract/* but structurally any agent branch in the deadlock shape was a
candidate. Positive allowlist for extract/, reweave/, fix/ scopes the reaper
to disposable pipeline-managed branches that the pipeline created and can
recreate. Agent branches (theseus/, vida/, epimetheus/, etc.) are WIP feature
work and must not be reaped — owners review their own PRs on their own cadence.

Cheap target-class lock complementing the LIMIT 50 blast-radius cap.
Same scoping principle as PIPELINE_OWNED_PREFIXES, but tighter — epimetheus/
review branches are pipeline-owned for merge purposes but NOT disposable.
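
Sketch of a positive branch allowlist, assuming a hypothetical constant name and a hypothetical `branch` column. The three prefixes are the ones listed above; everything else is illustrative.

```python
# Prefixes from the commit message; the constant name is hypothetical.
REAPABLE_BRANCH_PREFIXES = ("extract/", "reweave/", "fix/")


def is_reapable_branch(branch):
    """True only for disposable pipeline-managed branches."""
    return branch.startswith(REAPABLE_BRANCH_PREFIXES)


def branch_allowlist_sql(column="branch"):
    """Build a parameterized SQL fragment equivalent to is_reapable_branch.

    Returns (clause, params). Note that SQLite's LIKE is case-insensitive
    for ASCII by default, so this is slightly looser than startswith().
    """
    clause = " OR ".join(f"{column} LIKE ?" for _ in REAPABLE_BRANCH_PREFIXES)
    params = [prefix + "%" for prefix in REAPABLE_BRANCH_PREFIXES]
    return f"({clause})", params
```

An allowlist fails closed: a new agent prefix is protected by default until someone deliberately adds it.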

Items 2-4 from this review:
- WARNING #2 (audit_log idx_audit_event_ts): defer to followup branch alongside
  sync-mirror migration cleanup, as Ganymede suggested.
- NIT #3 (this commit): branch allowlist applied.
- NIT #4 (token asymmetry comment=admin/close=leo): confirmed established
  codebase pattern. merge.py:946-948 does the same — comment system-toned,
  close attributed to Leo for verdict-source UI clarity. Not accidental.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 23:43:53 -04:00
ba234ec4b3 fix(reaper): apply Ganymede review — dual-PATCH drift, breaker isolation, env config
Followup to f97dd15. Four fixes and one nit from review:

MUST-FIX #1 — Forgejo double-PATCH drift
  reaper closes PR via forgejo_api PATCH at line 689, then close_pr() at
  line 700 issued a second PATCH (default close_on_forgejo=True). On
  transient failure of the second PATCH, close_pr returns False without
  updating the DB → status='open' even though Forgejo is closed. Pass
  close_on_forgejo=False so DB close is unconditional after the explicit
  Forgejo PATCH succeeds.
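
Sketch of the single-PATCH ordering, with both collaborators injected as stand-ins. The function name and the callables' signatures are assumptions; only the `close_on_forgejo` keyword and the "None means failure" convention come from these messages.

```python
def close_stuck_pr(pr_number, forgejo_patch_close, close_pr):
    """Close on Forgejo first, then close in the DB without a second PATCH.

    forgejo_patch_close and close_pr are hypothetical stand-ins for the
    real API helper and DB helper.
    """
    response = forgejo_patch_close(pr_number)
    if response is None:
        # Forgejo close failed: leave the DB row alone to avoid drift.
        return False
    # Forgejo is already closed, so the DB close must not PATCH again.
    close_pr(pr_number, close_on_forgejo=False)
    return True
```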

MUST-FIX #2 — reaper exception trips fix breaker
  Unhandled exception in verdict_deadlock_reaper_cycle propagated to
  stage_loop, recording fix-stage failures. After 5 reaper failures the
  fix breaker would open and block mechanical+substantive for 15 min.
  Wrap reaper call in try/except in fix_cycle (same exception-isolation
  pattern as ingest_cycle's extract_cycle wrapper). Defense-in-depth
  must never block primary paths.
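
Sketch of the exception-isolation pattern, assuming a simplified `fix_cycle` that takes its three stages as callables. The real function's signature is not shown in this log.

```python
import logging

logger = logging.getLogger("fix_cycle_sketch")


def fix_cycle(mechanical, substantive, reaper):
    """Run the primary fix stages, then the reaper in isolation.

    All three arguments are hypothetical zero-argument callables. Only
    the reaper is wrapped: a failure there is logged and swallowed so it
    cannot count against the fix-stage breaker.
    """
    fixed = mechanical() + substantive()
    try:
        reaper()
    except Exception:
        logger.exception("reaper failed; primary fix results are unaffected")
    return fixed
```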

WARNING #1 — throttle SQL full-scan
  audit_log only has idx_audit_stage. Filtering on event alone caused
  full-table scans every 60s. Added stage='reaper' so the planner uses
  the existing index — reaper writes audit rows under stage='reaper'
  already so the filter is correct.
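
Sketch for checking the planner's choice with EXPLAIN QUERY PLAN. The `audit_log` columns (`stage`, `event`, `ts`) and both query strings are assumptions; only the index name and the `stage='reaper'` filter come from this message.

```python
import sqlite3

# Hypothetical throttle queries, before and after the fix.
EVENT_ONLY = "SELECT MAX(ts) FROM audit_log WHERE event = ?"
STAGE_AND_EVENT = (
    "SELECT MAX(ts) FROM audit_log WHERE stage = 'reaper' AND event = ?"
)


def make_audit_db():
    """In-memory audit_log with only the stage index, as described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE audit_log (stage TEXT, event TEXT, ts INTEGER)")
    conn.execute("CREATE INDEX idx_audit_stage ON audit_log(stage)")
    return conn


def plan_details(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
```

Comparing `plan_details` for the two queries shows whether `idx_audit_stage` appears in the plan.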

WARNING #2 — REAPER_DRY_RUN as code constant
  Flipping dry-run → live required edit + commit + push + deploy +
  restart. Moved REAPER_DRY_RUN, REAPER_DEADLOCK_AGE_HOURS,
  REAPER_INTERVAL_SECONDS, REAPER_MAX_PER_RUN to lib/config.py with
  os.environ.get() overrides. Operator now flips via systemctl edit
  teleo-pipeline.service (Environment=REAPER_DRY_RUN=false) + restart.
  Defaults remain safe: dry-run, 24h age, hourly throttle, 50/run cap.
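
Sketch of env-overridable settings with safe defaults. The helper names and the accepted truthy strings are assumptions; the setting names and default values are the ones listed above.

```python
import os


def env_bool(name, default, env=None):
    """Read a boolean setting; unset means `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def env_int(name, default, env=None):
    """Read an integer setting; unset means `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    return default if raw is None else int(raw)


def load_reaper_config(env=None):
    """Collect the reaper settings, falling back to the safe defaults."""
    return {
        "REAPER_DRY_RUN": env_bool("REAPER_DRY_RUN", True, env),
        "REAPER_DEADLOCK_AGE_HOURS": env_int("REAPER_DEADLOCK_AGE_HOURS", 24, env),
        "REAPER_INTERVAL_SECONDS": env_int("REAPER_INTERVAL_SECONDS", 3600, env),
        "REAPER_MAX_PER_RUN": env_int("REAPER_MAX_PER_RUN", 50, env),
    }
```

Passing a plain dict as `env` keeps the helpers testable without touching the process environment.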

NIT — dry-run counter naming
  Renamed local `closed` counter in dry-run path to `would_close` so the
  heartbeat audit ("X closed, Y would-close") and journal log are
  unambiguous. Function still returns closed + would_close so callers
  see total work done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 23:43:53 -04:00
e63d27d259 fix(reaper): verdict-deadlock reaper — close stuck PRs after 24h
Defense-in-depth for PRs that substantive_fixer can't make progress on.
Targets two stuck-verdict shapes empirically observed in production:

  1. leo:request_changes + domain:approve
     Leo asked for substantive fix; fixer either failed silently
     (no_claim_files / no_review_comments / etc.) or the issue tag isn't
     in FIXABLE | CONVERTIBLE | UNFIXABLE.

  2. leo:skipped + domain:request_changes
     Eval bypassed Leo (eval_attempts >= MAX). Domain rejected with no
     structured eval_issues. fixer can't classify the issue.

92 PRs match this gate today, oldest at 2026-04-24 (13d stuck).

Behavior:
  - Hourly throttle via audit_log sentinel ('verdict_deadlock_reaper_run').
  - REAPER_DRY_RUN=True default — first deploy emits 'would_close' audit
    events only. No DB writes. No Forgejo writes. (Ship Apr 24 directive.)
  - 24h cooldown, oldest-first, capped at 50 per run.
  - Heartbeat audit fires whether dry-run or live, so throttle works.
  - Live mode: posts comment + closes Forgejo PR + close_pr() in DB.
    Audits 'verdict_deadlock_closed' per PR.
  - If the Forgejo PATCH returns None, skip the DB close (avoid drift).
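
Sketch of the hourly throttle via an audit_log sentinel, assuming a hypothetical `audit_log(stage, event, ts)` table with `ts` in epoch seconds. The event name is the one given above; the function names are illustrative.

```python
import sqlite3

SENTINEL_EVENT = "verdict_deadlock_reaper_run"


def reaper_due(conn, now, interval_seconds=3600):
    """True if no heartbeat exists or the last one is old enough."""
    row = conn.execute(
        "SELECT MAX(ts) FROM audit_log WHERE stage = 'reaper' AND event = ?",
        (SENTINEL_EVENT,),
    ).fetchone()
    last_run = row[0]
    return last_run is None or now - last_run >= interval_seconds


def record_heartbeat(conn, now):
    """Write the sentinel row; done in dry-run and live mode alike."""
    conn.execute(
        "INSERT INTO audit_log (stage, event, ts) VALUES ('reaper', ?, ?)",
        (SENTINEL_EVENT, now),
    )
```

Writing the heartbeat even in dry-run is what keeps the throttle effective before the reaper goes live.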

Wired into fix_cycle() in teleo-pipeline.py. Runs after mechanical
and substantive fixes, never blocks them.

Followup (post first-run audit verification):
  - Operator inspects 'verdict_deadlock_would_close' audit rows
  - Flips REAPER_DRY_RUN to False, redeploys
  - Reaper actually closes on next hourly tick
2026-05-07 23:43:53 -04:00
517e9884cc fix(substantive_fixer): WARN on corrupt eval_issues JSON
Third silent return path in substantive_fix_cycle — JSON-decode except
at the eval_issues parse drops rows that don't reach skipped_no_tags
or substantive_rows. If all 3 LIMIT-3 candidates have corrupt JSON,
cycle returns 0,0 with no log entry.

WARN level (not INFO): corrupt JSON is abnormal (post-merge column
drift, hand-edited DB row, partial write during crash). If this fires,
ops want to chase the upstream column-write path. If it never fires,
baseline noise stays at zero.
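
Sketch of a parse step that never returns silently. The function name, logger name, and message format are assumptions, not the shipped code.

```python
import json
import logging

logger = logging.getLogger("substantive_fixer_sketch")


def parse_eval_issues(pr_number, raw):
    """Return the eval_issues list, or None after logging a WARNING.

    Covers undecodable text, a NULL column (json.loads(None) raises
    TypeError), and valid JSON that is not a list.
    """
    try:
        issues = json.loads(raw)
    except (TypeError, ValueError):
        logger.warning("PR #%s: corrupt eval_issues JSON: %r", pr_number, raw)
        return None
    if not isinstance(issues, list):
        logger.warning("PR #%s: eval_issues is not a list: %r", pr_number, raw)
        return None
    return issues
```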

Closes the visibility gap on ALL silent returns in this function, not
just the two patched in 3f8666e.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 18:33:08 -04:00
3f8666ee0c fix(substantive_fixer): surface silent-skip reasons at INFO
Two silent paths in substantive_fix_cycle masked a 13-day stall:

1. Filter strips all candidates → return 0,0 with no log. With LIMIT 3
   ordered created_at ASC, if the oldest 3 have no fixer-actionable tags
   (e.g. eval_issues=[] from leo:skipped+domain:request_changes), the
   cycle silently picks the same head-of-line every tick.

2. _fix_pr early-returns logged at DEBUG only — invisible without
   fleet-wide DEBUG. Skip reasons (no_claim_files, no_review_comments,
   not_open lock, worktree_failed, etc.) never surfaced in journalctl.

Patch: log skipped candidate eval_issues when no actionable rows
found (path 1); promote DEBUG→INFO for per-PR skip reasons (path 2).
Zero behavior change — observability only.

Diagnosis context: 98 PRs stuck >3d, last successful substantive_fixer
event 2026-04-24. Need journal evidence to choose between (a) one-line
fix to the cycle, (b) larger _fix_pr regression. (Ship Step 2 directive.)
2026-05-07 11:58:22 -04:00
c8a08023f9 refactor: Phase 2 — wire pr_state into fixer.py and substantive_fixer.py
Fix 4 Forgejo ghost PR bugs flagged by Ganymede:
- fixer.py GC close: DB update ran outside try/except, closing DB even on Forgejo failure
- substantive_fixer.py droppable: NO Forgejo close at all
- substantive_fixer.py auto-enrichment: DB update before Forgejo (reversed order)
- substantive_fixer.py close_and_reextract: replace manual Forgejo+DB with close_pr()

Add start_fixing() and reset_for_reeval() to pr_state.py:
- start_fixing: atomic claim + fix_attempts increment in one statement
- reset_for_reeval: clears all eval state for re-evaluation after fix
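
Sketch of an atomic claim in a single UPDATE, assuming a hypothetical `prs` table and a hypothetical 'fixing' status value. Only the function name, `fix_attempts`, and status 'open' appear in these messages.

```python
import sqlite3


def start_fixing(conn, pr_number):
    """Claim an open PR and bump fix_attempts in one statement.

    The WHERE clause doubles as the lock: if another worker already
    claimed the row, no row matches and rowcount is 0.
    """
    cursor = conn.execute(
        "UPDATE prs SET status = 'fixing', fix_attempts = fix_attempts + 1 "
        "WHERE number = ? AND status = 'open'",
        (pr_number,),
    )
    return cursor.rowcount == 1
```

Doing the check and the increment in one statement leaves no window between "read status" and "write status" for a second worker to slip through.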

Also fixes stale line number comment in merge.py (Ganymede nit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:21:40 +01:00
681afad506 Consolidate pipeline code from teleo-codex + VPS into single repo
Some checks failed
CI / lint-and-test (push) Has been cancelled
Sources merged:
- teleo-codex/ops/pipeline-v2/ (11 newer lib files, 5 new lib modules)
- teleo-codex/ops/ (agent-state, diagnostics expansion, systemd units, ops scripts)
- VPS /opt/teleo-eval/telegram/ (10 new bot files, agent configs)
- VPS /opt/teleo-eval/pipeline/ops/ (vector-gc, backfill-descriptions)
- VPS /opt/teleo-eval/sync-mirror.sh (Bug 2 + Step 2.5 fixes)

Non-trivial merges:
- connect.py: kept codex threshold (0.65) + added infra domain parameter
- watchdog.py: kept infra version (stale_pr integration, superset of codex)
- deploy.sh: codex rsync version (interim, until VPS git clone migration)
- diagnostics/app.py: codex decomposed dashboard (14 new route modules)

81 files changed, +17105/-200 lines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 16:52:26 +01:00
0457c49094 fix: zombie retry loop + cost tracking
Gate 3 in batch-extract-50.sh: query pipeline.db for closed PRs before
re-extracting. Sources with >=3 closed PRs are skipped (zombie protection).

Cost tracking: openrouter_call() now returns (text, usage) tuple with
prompt_tokens and completion_tokens from the OpenRouter API response.
All callers updated to unpack and pass tokens to costs.record_usage().
Added missing triage cost recording. Fixed batch domain review recording
cost once per batch instead of once per PR.

Pentagon-Agent: Epimetheus <0144398e-4ed3-4fe2-95a3-3d72e1abf887>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:29:58 +00:00
d79ff60689 epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features
Pipeline reliability (8 fixes, reviewed by Ganymede+Rhea+Leo+Rio):
1. Merge API recovery — pre-flight approval check, transient/permanent distinction, jitter
2. Ghost PR detection — ls-remote branch check in reconciliation, network guard
3. Source status contract — directory IS status, no code change needed
4. Batch-state markers eliminated — two-gate skip (archive-check + batched branch-check)
5. Branch SHA tracking — batched ls-remote, auto-reset verdicts, dismiss stale reviews
6. Mirror pre-flight permissions — chown check in sync-mirror.sh
7. Telegram archive commit-after-write — git add/commit/push with rebase --abort fallback
8. Post-merge source archiving — queue/ → archive/{domain}/ after merge

Pipeline fixes:
- merge_cycled flag — eval attempts preserved during merge-failure cycling (Ganymede+Rhea)
- merge_failures diagnostic counter
- Startup recovery preserves eval_attempts (was incorrectly resetting to 0)
- No-diff PRs auto-closed by eval (root cause of 17 zombie PRs)
- GC threshold aligned with substantive fixer budget (was 2, now 4)
- Conflict retry with 3-attempt budget + permanent conflict handler
- Local ff-merge fallback for Forgejo 405 errors

Telegram bot:
- KB retrieval: 3-layer (entity resolution → claim search → agent context)
- Reply-to-bot handler (context.bot.id check)
- Tag regex: @teleo|@futairdbot
- Prompt rewrite for natural analyst voice
- Market data API integration (Ben's token price endpoint)
- Conversation windows (5-message unanswered counter, per-user-per-chat)
- Conversation history in prompt (last 5 exchanges)
- Worktree file lock for archive writes

Infrastructure:
- worktree_lock.py — file-based lock (flock) for main worktree coordination
- backfill-sources.py — source DB registration for Argus funnel
- batch-extract-50.sh v3 — two-gate skip, batched ls-remote, network guard
- sync-mirror.sh — auto-PR creation for mirrored GitHub branches, permission pre-flight
- Argus dashboard — conflicts + reviewing in backlog, queue count in funnel
- Enrichment-inside-frontmatter bug fix (regex anchor, not --- split)
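
Sketch of a flock-based file lock of the kind worktree_lock.py is described as providing. The function names and the lock-file handling are assumptions, and `fcntl` limits this to Unix-like systems.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def worktree_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the with-block."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def is_locked(lock_path):
    """Probe the lock without blocking; True if someone else holds it."""
    with open(lock_path, "a") as handle:
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(handle, fcntl.LOCK_UN)
        return False
```

The lock is advisory: it only coordinates writers that all go through the same lock file.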

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-20 20:17:27 +00:00