teleo-infrastructure/handoff/phase1-step3-script-migration.md
Fawaz 377924dabe
feat(phase1-step3): rewire critical scripts Forgejo -> GitHub (decision-engine)
Phase 1 Step 3 — migrate research-session.sh and pipeline-health-check.py off Forgejo onto GitHub living-ip/decision-engine. eval-dispatcher.sh / eval-worker.sh documented as dead code (replaced by daemon).
2026-05-22 21:43:08 -04:00

# Phase 1 Step 3: Script Migration to GitHub
## Summary
Migrated critical-path scripts from Forgejo (`git.livingip.xyz` / `teleo/teleo-codex`) to GitHub (`living-ip/decision-engine`). Audit found two of the four planned scripts are dead code; scope reduced from 4 scripts to 2.
| Script | Status | Action |
|---|---|---|
| `research/research-session.sh` | live (cron paused 2026-05-12 pending Hermes) | migrated this PR |
| `pipeline-health-check.py` (VPS root, unversioned) | live, cron every 2h | migrated, deploy notes below |
| `eval/eval-dispatcher.sh` | dead since 2026-03-12 | deprecated, see `handoff/deprecated/eval-scripts.md` |
| `eval/eval-worker.sh` | dead since 2026-03-12 | deprecated, see `handoff/deprecated/eval-scripts.md` |
## What changed in `research/research-session.sh`
Forgejo → GitHub rewire. Same control flow, same Claude invocation, same agent-state hooks. Only external integrations swapped.
| Change | Before | After |
|---|---|---|
| API base | `http://localhost:3000` (Forgejo) | `https://api.github.com` |
| Repo | `teleo/teleo-codex` | `living-ip/decision-engine` |
| Token file | `/opt/teleo-eval/secrets/forgejo-${AGENT}-token` (per-agent), fallback to admin | `/opt/teleo-eval/secrets/github-admin-token` (single livingIPbot, per Option A) |
| REST API auth | `?token=<pat>` query or `Authorization: token <pat>` header | `Authorization: Bearer <pat>` + GitHub API version header |
| Git auth | `http.extraHeader: Authorization: token <pat>` | `url.<base>.insteadOf` rewrite injecting `x-access-token:<pat>@github.com` |
| PR list query | `pulls?state=open` then jq filter | `pulls?state=open&head=living-ip:<branch>` (server-side filter) |
| PR create | `POST /api/v1/repos/.../pulls` | `POST /repos/.../pulls` + GitHub API version header |
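The REST-side rows of the table can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names (`pr_list_url`, `gh_get`, `open_pr_count`) are hypothetical, and the API version value `2022-11-28` is an assumption, since the table only says "GitHub API version header".

```bash
#!/usr/bin/env bash
# Sketch only. Helper names and API_VERSION are assumptions for illustration.

GITHUB_API="https://api.github.com"
REPO="living-ip/decision-engine"
TOKEN_FILE="/opt/teleo-eval/secrets/github-admin-token"
API_VERSION="2022-11-28"   # assumed value for the X-GitHub-Api-Version header

# Build the PR list URL with the server-side head= filter (owner:branch).
pr_list_url() {
  local branch="$1"
  printf '%s/repos/%s/pulls?state=open&head=%s:%s\n' \
    "$GITHUB_API" "$REPO" "${REPO%%/*}" "$branch"
}

# Authenticated GET using Bearer auth plus the API version header.
gh_get() {
  local url="$1" token
  token="$(cat "$TOKEN_FILE")"
  curl -sS --fail \
    -H "Authorization: Bearer ${token}" \
    -H "Accept: application/vnd.github+json" \
    -H "X-GitHub-Api-Version: ${API_VERSION}" \
    "$url"
}

# Number of open PRs whose head is the given branch (requires jq).
open_pr_count() {
  gh_get "$(pr_list_url "$1")" | jq 'length'
}
```

Usage, with a hypothetical branch name: `open_pr_count "rio/research-2026-05-22"` prints `0` when no PR is open for that branch, which replaces the old client-side jq filter over all open PRs.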
## Per-agent identity (deferred)
Phase 1 uses Option A: single `livingIPbot` PAT for all agents. The `AGENT_TOKEN` variable remains as a placeholder so per-agent elevation in Phase 2 is a one-line change.
When Billy elevates: generate `github-${AGENT}-token` files at `/opt/teleo-eval/secrets/` and switch the PR-creation curl to use `AGENT_TOKEN`. Git operations stay on the bot token (it's the one with push access to all agent branches). Per-agent VERDICT comments / PR opens then become visible in the PR history as separate authors.
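One way the Phase 2 switch could look is a small resolver that prefers a per-agent token file and falls back to the shared bot token. This is a sketch under assumptions: the function name `resolve_agent_token_file` is hypothetical, and the fallback behavior mirrors the Forgejo-era "fallback to admin" rule rather than anything this PR implements.

```bash
#!/usr/bin/env bash
# Sketch only. resolve_agent_token_file is a hypothetical helper.

# Print the token file to use for an agent: the per-agent file when it
# exists and is readable, otherwise the shared livingIPbot token file.
resolve_agent_token_file() {
  local agent="$1"
  local secrets_dir="${2:-/opt/teleo-eval/secrets}"
  local per_agent="${secrets_dir}/github-${agent}-token"
  if [ -r "$per_agent" ]; then
    printf '%s\n' "$per_agent"
  else
    printf '%s\n' "${secrets_dir}/github-admin-token"
  fi
}
```

In the script this would feed the placeholder, for example `AGENT_TOKEN="$(cat "$(resolve_agent_token_file "$AGENT")")"`, while git operations keep reading the bot token directly.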
## Security note: token in URL rewrite
The `insteadOf` rewrite injects the PAT into the URL only at command-execution time. It does NOT persist in `.git/config` or `git remote -v`. Verified: post-push `remote -v` shows the clean `https://github.com/living-ip/decision-engine.git` URL.
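A minimal sketch of a per-invocation rewrite, assuming the script passes it with `git -c` (the source does not show the exact command). The helper names `git_rewrite_config` and `git_with_token` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only. Assumes the rewrite is supplied via `git -c`, which applies
# configuration to a single invocation and writes nothing to .git/config.

# Print the key=value pair that rewrites https://github.com/ URLs so they
# carry the token as x-access-token:<pat>@github.com.
git_rewrite_config() {
  local token="$1"
  printf 'url.https://x-access-token:%s@github.com/.insteadOf=https://github.com/\n' "$token"
}

# Run any git subcommand with the rewrite active for that command only.
git_with_token() {
  local token_file="$1"; shift
  local token
  token="$(cat "$token_file")"
  git -c "$(git_rewrite_config "$token")" "$@"
}
```

Usage: `git_with_token /opt/teleo-eval/secrets/github-admin-token clone --depth=1 https://github.com/living-ip/decision-engine.git`. The `-c` must come before the subcommand. `git clone -c <key>=<value>` (after `clone`) would instead write the setting into the new repository's config and persist the token.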
Risk surfaces that remain:
- `ps auxf` during the git command shows the rewrite arg with the token
- If the script's log file gets verbose enough, token could appear in error output
Mitigation for Billy: switch to a git credential helper (`git-credential-store` or a custom helper that reads from the secrets file) to remove the in-flight exposure entirely. Out of scope for Phase 1.
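The custom-helper option could look roughly like this. It is a sketch of git's credential helper protocol (attributes arrive as `key=value` lines on stdin; the helper answers with `username=` and `password=` lines), not a tested production helper. The function name `credential_helper` and the choice to answer only for `github.com` are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only. credential_helper is a hypothetical name. Git invokes a
# helper with an action argument (get, store, or erase) and sends the
# request attributes on stdin, terminated by a blank line or EOF.

credential_helper() {
  local action="$1" token_file="$2" line host=""
  # Only answer lookups; store/erase are ignored because the secrets
  # file is managed outside git.
  [ "$action" = "get" ] || return 0
  while IFS= read -r line && [ -n "$line" ]; do
    case "$line" in
      host=*) host="${line#host=}" ;;
    esac
  done
  # Stay silent for other hosts so git falls through to other helpers.
  [ "$host" = "github.com" ] || return 0
  printf 'username=x-access-token\npassword=%s\n' "$(cat "$token_file")"
}
```

Saved as an executable script that calls `credential_helper "$1" /opt/teleo-eval/secrets/github-admin-token`, it could be registered with `git config credential.https://github.com.helper /path/to/helper` (path hypothetical). The token then travels over the helper's stdout pipe and does not appear in the git command line shown by `ps`.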
## Smoke test results
Performed against `living-ip/decision-engine` end-to-end, without invoking Claude:
```
✅ git clone (depth=1) via insteadOf rewrite
✅ branch create + commit
✅ git push (authenticated)
✅ PR list API (server-side head= filter)
✅ remote -v shows clean URL (token not persisted)
✅ branch cleanup
```
Static checks: `bash -n` passes, no residual Forgejo references in the file.
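The static checks can be reproduced with something like the sketch below. The function name `static_check` and the exact grep pattern are assumptions; the pattern is built from the Forgejo identifiers named in this document.

```bash
#!/usr/bin/env bash
# Sketch only. static_check and its pattern list are assumptions.

# Returns 0 when the script parses and has no Forgejo references,
# 1 on a syntax error, 2 when a residual Forgejo reference is found.
static_check() {
  local script="$1"
  bash -n "$script" || return 1
  if grep -Eiq 'forgejo|git\.livingip\.xyz|teleo/teleo-codex|localhost:3000' "$script"; then
    return 2
  fi
  return 0
}
```

Usage: `static_check research/research-session.sh && echo "static checks pass"`.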
## `pipeline-health-check.py` — deploy notes (NOT auto-deployed)
This script lives at `/opt/teleo-eval/pipeline-health-check.py` on the VPS — **NOT in this repo**. It was never added to teleo-infrastructure; lives only as a VPS-local script.
The migrated version is at `/tmp/pipeline-health-check.py.new` on the VPS. To go live:
```bash
# Backup current
cp /opt/teleo-eval/pipeline-health-check.py /opt/teleo-eval/pipeline-health-check.py.bak-pre-github
# Promote new version
cp /tmp/pipeline-health-check.py.new /opt/teleo-eval/pipeline-health-check.py
chmod +x /opt/teleo-eval/pipeline-health-check.py
# Cron continues to run it every 2h; no cron change needed.
```
Before promoting: confirm with Fawaz/m3ta whether the script should also be added to this repo for versioning. Recommended yes; out of scope for this PR.
Until promoted, the live VPS script keeps reading from Forgejo. That is fine during the cutover window, but it will produce empty/stale metrics once Forgejo is decommissioned (Step 7) if the new version is not promoted by then.
## Auto-deploy of research-session.sh
`research/research-session.sh` is in the repo's `research/` directory. The auto-deploy job (`teleo-auto-deploy.timer`) rsyncs the repo into `/opt/teleo-eval/pipeline/`. Check whether `research/` is in the rsync manifest. If it is not, the migrated script won't reach the runtime path that cron used to invoke (`/opt/teleo-eval/research-session.sh`).
If `research/` is NOT in the rsync manifest (or the runtime path differs from `pipeline/research/research-session.sh`), Billy should add it during productionization. Until then, the migrated script needs a manual `cp` to `/opt/teleo-eval/research-session.sh`.
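A quick way to tell whether the manual `cp` is still needed is to compare the repo copy against the runtime path. This is a sketch: the function name `deploy_status` is hypothetical, and it only compares file contents without inspecting the rsync manifest itself.

```bash
#!/usr/bin/env bash
# Sketch only. deploy_status is a hypothetical helper.

# Print "missing" if the runtime copy does not exist, "current" if it is
# byte-identical to the source, and "stale" otherwise.
deploy_status() {
  local src="$1" dst="$2"
  if [ ! -e "$dst" ]; then
    echo "missing"
  elif cmp -s "$src" "$dst"; then
    echo "current"
  else
    echo "stale"
  fi
}
```

Usage: `deploy_status research/research-session.sh /opt/teleo-eval/research-session.sh`. Anything other than `current` means the migrated script has not reached the path cron invokes.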
This was a pre-existing topology issue; not introduced by this PR.
## When the cron gets re-enabled
The research-session crons were paused 2026-05-12 with comment `PAUSED 2026-05-12 (architecture change)`. They should stay paused until Phase 1 Step 4 (Leo on Hermes) is verified — Hermes-Leo's research loop replaces this script for Leo.
For the other 5 agents (Theseus, Rio, Vida, Clay, Astra): this script remains the fallback path during the Hermes rollout. Billy uses Leo as the pattern and can either re-enable cron or invoke from Hermes per agent.
## Hermes runtime note (Step 4 preview)
While auditing the repo, found `hermes-agent/` directory in teleo-infrastructure root. Not investigated as part of Step 3. Will audit during Step 4.
## Files changed in this PR
- `research/research-session.sh` — migrated (+29 / -14 lines)
- `handoff/phase1-step3-script-migration.md` — this file (new)
- `handoff/deprecated/eval-scripts.md` — deprecation notes (new)