teleo/teleo-infrastructure

Author	SHA1	Message	Date
m3taversal	e14b5f2f05	fix(reaper): apply Ganymede review — dual-PATCH drift, breaker isolation, env config. Followup to `f97dd15`. Four fixes from review: MUST-FIX #1 — Forgejo double-PATCH drift: reaper closes PR via forgejo_api PATCH at line 689, then close_pr() at line 700 issued a second PATCH (default close_on_forgejo=True). On transient failure of the second PATCH, close_pr returns False without updating the DB → status='open' even though Forgejo is closed. Pass close_on_forgejo=False so DB close is unconditional after the explicit Forgejo PATCH succeeds. MUST-FIX #2 — reaper exception trips fix breaker: Unhandled exception in verdict_deadlock_reaper_cycle propagated to stage_loop, recording fix-stage failures. After 5 reaper failures the fix breaker would open and block mechanical+substantive for 15 min. Wrap reaper call in try/except in fix_cycle (same exception-isolation pattern as ingest_cycle's extract_cycle wrapper). Defense-in-depth must never block primary paths. WARNING #1 — throttle SQL full-scan: audit_log only has idx_audit_stage. Filtering on event alone caused full-table scans every 60s. Added stage='reaper' so the planner uses the existing index — reaper writes audit rows under stage='reaper' already so the filter is correct. WARNING #2 — REAPER_DRY_RUN as code constant: Flipping dry-run → live required edit + commit + push + deploy + restart. Moved REAPER_DRY_RUN, REAPER_DEADLOCK_AGE_HOURS, REAPER_INTERVAL_SECONDS, REAPER_MAX_PER_RUN to lib/config.py with os.environ.get() overrides. Operator now flips via systemctl edit teleo-pipeline.service (Environment=REAPER_DRY_RUN=false) + restart. Defaults remain safe: dry-run, 24h age, hourly throttle, 50/run cap. NIT — dry-run counter naming: Renamed local `closed` counter in dry-run path to `would_close` so the heartbeat audit ("X closed, Y would-close") and journal log are unambiguous. Function still returns closed + would_close so callers see total work done. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 18:18:32 -04:00
m3taversal	f97dd15349	fix(reaper): verdict-deadlock reaper — close stuck PRs after 24h. Defense-in-depth for PRs that substantive_fixer can't make progress on. Targets two stuck-verdict shapes empirically observed in production: 1. leo:request_changes + domain:approve: Leo asked for substantive fix; fixer either failed silently (no_claim_files / no_review_comments / etc.) or the issue tag isn't in FIXABLE | CONVERTIBLE | UNFIXABLE. 2. leo:skipped + domain:request_changes: Eval bypassed Leo (eval_attempts >= MAX). Domain rejected with no structured eval_issues. fixer can't classify the issue. 92 PRs match this gate today, oldest at 2026-04-24 (13d stuck). Behavior: - Hourly throttle via audit_log sentinel ('verdict_deadlock_reaper_run'). - REAPER_DRY_RUN=True default — first deploy emits 'would_close' audit events only. No DB writes. No Forgejo writes. (Ship Apr 24 directive.) - 24h cooldown, oldest-first, capped at 50 per run. - Heartbeat audit fires whether dry-run or live, so throttle works. - Live mode: posts comment + closes Forgejo PR + close_pr() in DB. Audits 'verdict_deadlock_closed' per PR. - Forgejo PATCH None → skip DB close (avoid drift). Wired into fix_cycle() in teleo-pipeline.py. Runs after mechanical and substantive fixes, never blocks them. Followup (post first-run audit verification): - Operator inspects 'verdict_deadlock_would_close' audit rows - Flips REAPER_DRY_RUN to False, redeploys - Reaper actually closes on next hourly tick	2026-05-07 12:03:29 -04:00
m3taversal	cde92d3db1	fix: wrap breaker calls in stage_loop to prevent permanent task death. A transient DB lock in breaker.record_failure() inside an except handler killed the asyncio coroutine permanently — snapshot_cycle died Apr 18 and never recovered. All three breaker call sites now have their own try/except. Also includes HTML injection fix for github_feedback review_text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-20 12:37:28 +01:00
m3taversal	681afad506	Consolidate pipeline code from teleo-codex + VPS into single repo. Sources merged: - teleo-codex/ops/pipeline-v2/ (11 newer lib files, 5 new lib modules) - teleo-codex/ops/ (agent-state, diagnostics expansion, systemd units, ops scripts) - VPS /opt/teleo-eval/telegram/ (10 new bot files, agent configs) - VPS /opt/teleo-eval/pipeline/ops/ (vector-gc, backfill-descriptions) - VPS /opt/teleo-eval/sync-mirror.sh (Bug 2 + Step 2.5 fixes) Non-trivial merges: - connect.py: kept codex threshold (0.65) + added infra domain parameter - watchdog.py: kept infra version (stale_pr integration, superset of codex) - deploy.sh: codex rsync version (interim, until VPS git clone migration) - diagnostics/app.py: codex decomposed dashboard (14 new route modules) 81 files changed, +17105/-200 lines Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:52:26 +01:00
m3taversal	d79ff60689	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features Pipeline reliability (8 fixes, reviewed by Ganymede+Rhea+Leo+Rio): 1. Merge API recovery — pre-flight approval check, transient/permanent distinction, jitter 2. Ghost PR detection — ls-remote branch check in reconciliation, network guard 3. Source status contract — directory IS status, no code change needed 4. Batch-state markers eliminated — two-gate skip (archive-check + batched branch-check) 5. Branch SHA tracking — batched ls-remote, auto-reset verdicts, dismiss stale reviews 6. Mirror pre-flight permissions — chown check in sync-mirror.sh 7. Telegram archive commit-after-write — git add/commit/push with rebase --abort fallback 8. Post-merge source archiving — queue/ → archive/{domain}/ after merge Pipeline fixes: - merge_cycled flag — eval attempts preserved during merge-failure cycling (Ganymede+Rhea) - merge_failures diagnostic counter - Startup recovery preserves eval_attempts (was incorrectly resetting to 0) - No-diff PRs auto-closed by eval (root cause of 17 zombie PRs) - GC threshold aligned with substantive fixer budget (was 2, now 4) - Conflict retry with 3-attempt budget + permanent conflict handler - Local ff-merge fallback for Forgejo 405 errors Telegram bot: - KB retrieval: 3-layer (entity resolution → claim search → agent context) - Reply-to-bot handler (context.bot.id check) - Tag regex: @teleo|@futairdbot - Prompt rewrite for natural analyst voice - Market data API integration (Ben's token price endpoint) - Conversation windows (5-message unanswered counter, per-user-per-chat) - Conversation history in prompt (last 5 exchanges) - Worktree file lock for archive writes Infrastructure: - worktree_lock.py — file-based lock (flock) for main worktree coordination - backfill-sources.py — source DB registration for Argus funnel - batch-extract-50.sh v3 — two-gate skip, batched ls-remote, network guard - sync-mirror.sh — auto-PR creation for mirrored GitHub branches, permission pre-flight - Argus dashboard — conflicts + reviewing in backlog, queue count in funnel - Enrichment-inside-frontmatter bug fix (regex anchor, not --- split) Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>	2026-03-20 20:17:27 +00:00
m3taversal	85b86a918a	ganymede: extract lib/llm.py from evaluate.py (Phase 3c) - What: LLM transport (OpenRouter, Claude CLI), prompt templates (triage/domain/Leo), and review runner functions moved to lib/llm.py. evaluate.py retains PR lifecycle orchestration, SQLite state, Forgejo posting, rate limit backoff, and evaluate_cycle. - Why: evaluate.py was 734 lines mixing orchestration with LLM concerns. Now 455 lines orchestration + 250 lines LLM transport. Each module has a single responsibility. - Connections: completes Phase 3 structural refactor (forgejo.py + domains.py + llm.py). teleo-pipeline.py updated to import kill_active_subprocesses from lib.llm. Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>	2026-03-13 15:40:18 +00:00
m3taversal	a7251d7529	ganymede: add dev infrastructure — pyproject.toml, CI, deploy script. Phase 2 of pipeline refactoring: - pyproject.toml: Python >=3.11, aiohttp dep, dev extras (pytest, pytest-asyncio, ruff). Ruff configured with sane defaults + ignore rules for existing code patterns (implicit Optional, timezone.utc). - .forgejo/workflows/ci.yml: Forgejo Actions CI — syntax check, ruff lint, ruff format, pytest on every PR and push to main. - deploy.sh: Pull + venv update + syntax check + optional restart. Replaces ad-hoc scp workflow. - tests/conftest.py: Shared fixture for in-memory SQLite with full schema. Ready for Phase 4 test suite. - .gitignore: Added venv, pytest cache, coverage, build artifacts. - Ruff auto-fixes: import sorting, unused imports removed across all modules. All files pass ruff check + ruff format. Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>	2026-03-13 14:24:27 +00:00
m3taversal	f166db4f62	ganymede: fix 4 critical bugs before pipeline restart - Fix #12: domain_review undefined on resume path — initialize to None, guard _parse_issues() call. Prevents NameError on PRs resuming after partial eval (76 PRs in this state right now). - Fix #11: concurrent eval workers can duplicate reviews — add atomic UPDATE SET status='reviewing' WHERE status='open' at top of evaluate_pr(). Check rowcount, skip if already claimed. - Fix #8: subprocess tracking for graceful shutdown — _active_subprocesses set in evaluate module, tracked in _claude_cli_call, exposed via kill_active_subprocesses(). Replaces dead code in teleo-pipeline.py. - Fix health.py divide-by-zero — guard all metabolic metric reads against None from NULLIF/empty result set. Prevents TypeError on /health when no PRs have been evaluated in 24h. Also includes Leo's existing hot-fixes: - Rate limit detection checks stdout regardless of exit code - 15-minute cycle-level backoff on rate limit Pentagon-Agent: Ganymede <F99EBFA6-547B-4096-BEEA-1D59C3E4028A>	2026-03-13 14:13:25 +00:00
m3taversal	799249d470	Initial commit: Pipeline v2 daemon + infrastructure docs - teleo-pipeline.py: async daemon with 4 stage loops (ingest/validate/evaluate/merge) - lib/: config, db, evaluate, validate, merge, breaker, costs, health, log modules - INFRASTRUCTURE.md: comprehensive deep-dive for onboarding - teleo-pipeline.service: systemd unit file Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>	2026-03-12 14:11:18 +00:00
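
Fix #11 in f166db4f62 describes claiming a PR with an atomic UPDATE and a rowcount check so that concurrent eval workers cannot both review it. A minimal sketch of that idea, assuming a hypothetical `prs` table with `number` and `status` columns (the real schema and the body of evaluate_pr() are not shown in this log):

```python
import sqlite3


def claim_pr(conn: sqlite3.Connection, pr_number: int) -> bool:
    """Try to move one PR from 'open' to 'reviewing'.

    Returns True only for the caller whose UPDATE actually changed the row.
    Table and column names here are illustrative assumptions.
    """
    cur = conn.execute(
        "UPDATE prs SET status = 'reviewing' WHERE number = ? AND status = 'open'",
        (pr_number,),
    )
    conn.commit()
    # rowcount is 1 if this call claimed the PR, 0 if it was already claimed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prs (number INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO prs VALUES (1, 'open')")
conn.commit()

first_claim = claim_pr(conn, 1)   # the first worker wins the claim
second_claim = claim_pr(conn, 1)  # a later worker finds status != 'open' and skips
```

Because the status check and the write happen in a single statement, there is no window between "read status" and "set status" in which a second worker could slip in.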

9 commits
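
WARNING #2 in e14b5f2f05 moves the reaper settings into lib/config.py with os.environ.get() overrides and safe defaults (dry-run, 24h age, hourly throttle, 50 per run). The sketch below shows one way such overrides could look; it is not the repository's code. The `load_reaper_config` helper, the accepted "false" spellings, and the reading of "hourly" as 3600 seconds are assumptions made for illustration.

```python
import os
from typing import Mapping

# Spellings treated as an explicit "turn dry-run off" (an assumption of this sketch).
_FALSE_VALUES = {"0", "false", "no", "off"}


def load_reaper_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read reaper settings from the environment, falling back to safe defaults.

    Hypothetical helper: the variable names come from the commit message,
    the parsing rules do not.
    """
    raw_dry_run = env.get("REAPER_DRY_RUN", "true").strip().lower()
    return {
        # Fail safe: anything other than an explicit false value keeps dry-run on.
        "REAPER_DRY_RUN": raw_dry_run not in _FALSE_VALUES,
        "REAPER_DEADLOCK_AGE_HOURS": int(env.get("REAPER_DEADLOCK_AGE_HOURS", "24")),
        "REAPER_INTERVAL_SECONDS": int(env.get("REAPER_INTERVAL_SECONDS", "3600")),
        "REAPER_MAX_PER_RUN": int(env.get("REAPER_MAX_PER_RUN", "50")),
    }


defaults = load_reaper_config({})
live = load_reaper_config({"REAPER_DRY_RUN": "false"})
```

Treating unrecognised values as "stay in dry-run" means a typo in the systemd override cannot accidentally switch the reaper to live mode.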