teleo-infrastructure/docs/external-contributor-merge-flow.md
m3taversal 537cfd5ed7 docs(external-contrib): v3 — locked Phase 2 decisions, verbose merge msg (Ship Msg 3)
Cleanup pass on the design doc per Ship's Msg 3 review. Two changes, no architecture shifts:

1. Merge commit message updated to locked verbose form:
     "Merge external GitHub PR #{N}: {branch_slug}"
   N derived from gh-pr-{N}/ regex; branch_slug = post-prefix portion. Embeds
   the GitHub PR number explicitly so `git log --merges --grep "#90"` surfaces
   the merge from PR number alone. Drops the v2 "{branch}"-only form.

2. "Open questions for Ship" section collapsed to "Locked decisions" —
   the three Phase 2 items (backout flag, rebase-after-fix scope, merge commit
   message format) were locked in Ship's reply on Msg 2 but the v2 doc still
   framed them as open. Now restated as decisions so Ganymede doesn't re-debate
   resolved items at line-level review.

Diffs reviewable against v2 (commit fa6cceb).
2026-04-28 13:04:09 +01:00

375 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# External Contributor Merge Flow — Design Doc
**Author:** Epimetheus
**Architecture review:** Ship (owns merge.py, sync-mirror.sh)
**Code review:** Ganymede (line-level, post-design-approval)
**Status:** Phase 1 sweep-only scope locked. Phase 2 architecture decisions locked (Ship Msg 3). Awaiting Phase 1 line-level code review.
## Revision log
- **v3 (this revision):** Cleanup pass per Ship Msg 3 — merge commit message updated to locked verbose form (`"Merge external GitHub PR #{N}: {branch_slug}"`), open-questions section collapsed to "Locked Phase 2 decisions" restating the three resolved outcomes (no longer questions).
- **v2:** Phase 1 simplified per Ship's Msg 2 — backfill script dropped (sweep IS the backfill), sweep placement specified explicitly as the very first action after initial fetch (ahead of branch-mirror loop AND auto-create-PR block at line ~250). Phase 2 architecture unchanged.
- **v1:** Initial draft.
## Problem statement
External GitHub contributors submit PRs via the `living-ip/teleo-codex` mirror.
Pipeline accepts the claim and merges the content into Forgejo `main`, but the
GitHub PR shows "open with no diff" — it looks abandoned to the contributor.
Two compounding bugs intersect on this path:
1. **Cherry-pick breaks GitHub merge detection.** `lib/merge.py::_cherry_pick_onto_main`
creates a new SHA on Forgejo main. GitHub's "is PR head SHA an ancestor of
main?" check returns false. `merged: false, merge_commit_sha: null` forever.
2. **`prs.github_pr` not populated for fork PRs.** `sync-mirror.sh` Step 4.5 looks
up GitHub PR number via `?head=living-ip:$branch`, but fork PR heads are
`FwazB:contributor/...` (or `<fork-owner>:<branch>`), not `living-ip:`. The
filter misses, `github_pr` stays NULL, and `lib/github_feedback.py::on_merged`
returns early (no comment, no close) because `_get_github_pr` requires non-NULL.
Empirical:
| PR | head | merge mech | github_pr | merged badge | comment posted |
|---|---|---|---|---|---|
| #87 (own-repo) | `living-ip:fix/...` | `git merge --no-ff` | populated | ✓ true | ✓ |
| #90 (FwazB fork) | `FwazB:contributor/...` | cherry-pick | NULL | ✗ false | ✗ |
Both bugs need fixes. Bug #1 is the structural one (load-bearing for the badge).
Bug #2 is a sync-mirror filter issue (load-bearing for the comment/close).
## Goal
External GitHub contributor opens a PR → pipeline ingests, evaluates, merges →
GitHub PR shows `merged: true` with badge → bot comment posted → PR closed
cleanly. No human in the loop on the success path. Failure modes (eval reject,
auto-fix, contributor force-push) handled gracefully.
## Out of scope
- Agent-extraction PRs (`extract/*`, `reweave/*`, `epimetheus/*`, etc.) — keep
cherry-pick. They merge 70+/day, have no contributor UX surface, and the
cherry-pick → linear-history rationale (auto-fixer rebase pattern, bisect
friendliness) holds.
- `/api/contributors` legacy endpoint — separate work, deferred.
- PAT-in-URL credential pattern — separate security follow-up.
## Design — branch-prefix conditional, scoped to `gh-pr-*`
### Bug #1 fix: `_merge_no_ff_external` for `gh-pr-*` branches
**Dispatch site (`lib/merge.py::_merge_domain_queue`, currently lines 736-738):**
```python
# Reweave: per-file frontmatter union (existing)
if branch.startswith("reweave/"):
merge_fn = _merge_reweave_pr(branch)
# External GitHub fork PRs: true merge with --no-ff so contributor SHA lands
# in main's history → GitHub recognizes "merged" badge.
elif branch.startswith("gh-pr-"):
merge_fn = _merge_no_ff_external(branch)
# Default: cherry-pick (extraction commits ADD new files, applies cleanly,
# linear history preserved for the bulk-extraction flow).
else:
merge_fn = _cherry_pick_onto_main(branch)
```
**New function (`lib/merge.py`):**
```python
async def _merge_no_ff_external(branch: str) -> tuple[bool, str]:
"""Merge an external GitHub PR with --no-ff so contributor SHA lands in main.
Why this differs from _cherry_pick_onto_main:
- Cherry-pick rewrites SHA → GitHub never recognizes the PR as merged.
- --no-ff preserves the contributor's commit SHA in main's history.
- sync-mirror's Forgejo→GitHub propagation already handles merge commits
(verified empirically: PR #87 round-tripped cleanly with merge_commit_sha
preserved).
Mechanics:
1. Fetch latest origin/main and origin/{branch}
2. Create scratch worktree at HEAD of origin/main
3. Derive: gh_pr_num = re.match(r"gh-pr-(\d+)/", branch).group(1)
branch_slug = branch[len(f"gh-pr-{gh_pr_num}/"):]
4. git merge --no-ff origin/{branch} \
-m f"Merge external GitHub PR #{gh_pr_num}: {branch_slug}"
5. git push origin HEAD:main
6. Cleanup worktree
Conflict handling:
- Entity conflicts: same auto-resolve pattern as cherry-pick
(--ours = main HEAD, --theirs = branch). External claims rarely touch
entities so this is a low-frequency path.
- Other conflicts: abort, return False with conflict detail. Caller marks
conflict_permanent. Manual resolution or contributor rebase required.
Idempotency: caller already gates on PR status, so re-running on a merged
PR fails at the merge step (already merged), which is the right behavior.
Returns (success, message).
"""
```
The merge commit message format `"Merge external GitHub PR #{N}: {branch_slug}"`
embeds the GitHub PR number explicitly so `git log --merges --grep "#90"`
surfaces the merge from PR number alone (no branch-name guessing). Branch slug
is the post-`gh-pr-{N}/` portion of the branch (e.g.,
`contributor/arcium-confidential-computing-challenge`) — already in scope at
merge time, no claim-file read needed.
### Bug #2 fix: self-healing sweep at top of sync-mirror cycle
**Root cause:** The one-shot link UPDATE in Step 4.5 (lines ~250-294) runs once
per branch creation — if it fails (race with PR row insertion, transient API
hiccup, transient lock), the row is permanently stuck at `github_pr=NULL`. No
retry path. FwazB's PR 4066 is the visible artifact of this class of failure.
The structural fix is a self-healing sweep: each cron tick, scan for any
`gh-pr-*` PR rows missing `github_pr` and link them. Idempotent, zero-cost when
clean, retries forever until the row is healed. The sweep IS the backfill — no
separate one-shot script needed (Ship's simplification, Msg 2). First cron tick
after deploy picks up 4066 automatically, same SELECT path that handles all
future races.
**Placement (load-bearing):** the sweep runs **as the very first action after
the initial Forgejo+GitHub fetch**, ahead of both:
- The branch-mirror loop
- The auto-create-PR block at line ~250
This sequencing matters because a fresh-cycle race — PR created in the *current*
cycle, link UPDATE in `Step 4.5` fails — self-heals on the very next iteration's
sweep, not 2 minutes later. Same-cycle convergence vs cross-cycle convergence.
```bash
# Step 0: self-heal any gh-pr-* PR rows missing github_pr.
# Runs FIRST — before branch-mirror loop, before auto-create-PR block.
# Idempotent: SELECT returns empty when clean.
# The branch name encodes the GitHub PR number (gh-pr-{N}/...) so no API
# round-trip needed to recover the number. Source-of-truth derivation.
sqlite3 -separator '|' "$PIPELINE_DB" \
"SELECT number, branch FROM prs WHERE branch LIKE 'gh-pr-%' AND github_pr IS NULL" \
| while IFS='|' read -r pr_num branch; do
gh_pr_num=$(echo "$branch" | sed -n 's|^gh-pr-\([0-9]*\)/.*|\1|p')
[ -z "$gh_pr_num" ] && continue
sqlite3 "$PIPELINE_DB" \
"UPDATE prs SET github_pr = $gh_pr_num, source_channel = 'github' WHERE number = $pr_num;"
log "self-heal: linked Forgejo PR #$pr_num → GitHub PR #$gh_pr_num"
done
```
~10 lines. No API call required (branch name carries the number deterministically).
No `.fork-pr-map` file, no `--script` flag, no manual deploy step.
**Why this approach beats the earlier `.fork-pr-map` proposal:**
- The map approach repaired the write path but left already-stuck rows orphaned
- A separate one-shot backfill script for the orphans is one more code path to maintain
- The sweep collapses both paths: it's the failure recovery AND the historical backfill
- Future races automatically self-heal without code changes
**Why this approach beats the earlier `?head=living-ip:` filter:**
- That filter never matched fork PRs at all — by design, fork PRs come from a
different owner. Pre-existing nit Ganymede flagged in multi-repo-mirror review
- The branch name encoding bypasses the GitHub API entirely for fork PRs
**Why no API verification before UPDATE:**
- Branch name is deterministic source — the only writer of `gh-pr-N/...` is
Step 2.1, which only writes after fetching the GitHub PR ref. If the branch
exists, the GitHub PR exists.
- Avoids API rate-limit pressure on every cron tick (every 2 min × 24h × 7d ≈ 5040
calls/week even when no work needed).
- Sanity-check via API is achievable cheaply if Ship wants it; current scope
matches "zero-cost when clean."
### Auto-fixer mode='append' for `gh-pr-*` branches
**Current behavior (`lib/fixer.py`):** Worktree-based fix → commit → push (regular
push, not force). When the PR was created by the pipeline (extract/* branches),
this works because the LLM-extractor's commit is at HEAD and we're appending on
top — push succeeds.
**External PR behavior (today):** Same code path runs. Fork PR has FwazB's commit
at HEAD; auto-fixer creates a fix commit on top; pushes via Forgejo's branch
ref (the fork PR was mirrored as `refs/heads/gh-pr-90/contributor/...` on
Forgejo). Push works. Eval reset fires. Eval re-runs.
Wait — re-reading the existing fixer, it's actually already append-only. Good.
The cherry-pick at merge time was the only thing rewriting SHAs. So **no fixer
change required** for Option 2. The fixer commit is already at HEAD~1 from
FwazB's commit on Forgejo. When we `git merge --no-ff` instead of cherry-pick,
the merge commit's parent chain includes BOTH the fix commit AND FwazB's
original commit. GitHub sees FwazB's SHA in ancestry → "merged" badge.
**Cross-check:** Verified PR 4066's existing branch state.
```
$ git log refs/heads/gh-pr-90/contributor/arcium-confidential-computing-challenge --oneline
d7916d65 auto-fix: strip 2 broken wiki links ← fixer's commit
f6a59d7d claim: confidential computing reshapes... ← FwazB's commit
```
Both commits already on Forgejo. When merge.py cherry-picked, it picked both
commits onto main as new SHAs. With merge --no-ff, both stay intact, the merge
commit references them, GitHub sees `f6a59d7d` (FwazB's HEAD on his fork) in
main's ancestry, marks merged.
**This means scope shrinks:** the design is purely a merge.py change + a
sync-mirror Step 2.1/4.5 plumbing tweak. No fixer module changes.
### Edge case: contributor rebases their fork after fixer appended
**Scenario:** FwazB's PR is in eval. Pipeline auto-fixer pushes a fix commit to
Forgejo `gh-pr-90/...`. FwazB notices the original wiki-link issue, fixes it
locally, force-pushes to his fork. sync-mirror's next cycle fetches his new
SHA → tries to update the Forgejo branch → push from sync-mirror is regular
(not --force) → fails because Forgejo branch has diverged from FwazB's fork.
**Today's behavior:** sync-mirror logs a warning. Forgejo branch keeps the
appended-fix state. Eval continues against that state. FwazB's most recent
fork state is silently ignored.
**Acceptable risk for hackathon.** The eval-reset-on-tip-change gate will
re-trigger eval if anyone force-pushes the Forgejo branch later (e.g., manual
re-sync). Documented; not fixing in this PR.
**Followup:** sync-mirror could detect the divergence, log a structured alert,
and post a comment on the GitHub PR ("we detected your force-push but our
appended fix has diverged; please rebase against `<sha>` or we'll close").
Out of scope for this branch.
## Test plan
| # | Scenario | Expected | Validation |
|---|---|---|---|
| 1 | External PR, clean (no auto-fix needed) | Merged with `--no-ff`, contributor SHA in main, GitHub badge `merged: true`, on_merged comment + close | curl GitHub API for PR state after merge |
| 2 | External PR with broken wiki links | auto-fixer appends commit, eval re-runs and approves, merge --no-ff brings BOTH commits in via merge commit, GitHub badge `merged: true` | log line trace + GitHub API |
| 3 | External PR rejected by eval (substantive issue) | terminate_pr fires existing path, on_closed posts rejection comment + closes GitHub PR | GitHub API after eval cycle |
| 4 | sync-mirror github_pr backfill on fork PR | `prs.github_pr` populated within one cron cycle (≤2 min) of mirror PR creation | sqlite3 SELECT after sync |
| 5 | Contributor force-pushes fork mid-eval | sync-mirror logs warning, eval continues against pre-rebase state (documented behavior, not regression) | journalctl |
| 6 | Re-running merge on already-merged PR | --no-ff fails cleanly (already up to date), caller handles as no-op | manual replay |
Test 1 and 2 are the critical-path tests. 3 verifies the rejection path didn't
regress. 4 isolates the github_pr backfill fix. 5 is acceptance criteria for
the documented edge case. 6 is idempotency.
**Production smoke test:** after deploy, manually create a tiny test PR from a
secondary GitHub account. Walk it through the full lifecycle. Tear down before
hackathon. Cost: 5 minutes, catches integration-level issues that unit tests
miss.
## Backout procedure
If Option 2 misbehaves in production, revert path is one config-line toggle:
```python
# lib/merge.py — gate on a feature flag for fast disable
if branch.startswith("gh-pr-") and config.EXTERNAL_PR_NO_FF_MERGE:
merge_fn = _merge_no_ff_external(branch)
elif ...
```
Default flag value: `True` after deploy. Set to `False` via env var on VPS to
fall back to cherry-pick path immediately if anything breaks. No code revert
required for a fast cutout.
Forgejo→GitHub merge commits already in main when the flag flips can't be
un-merged (they're real commits), but the failure mode is the same as today:
GitHub PR shows merged because the SHA is in history. No worse than current.
## Migration / cleanup
**FwazB's PR #90:** existing artifact, can't retroactively un-cherry-pick.
Manual close with explanatory comment (Ship's option b). Cory-approved.
```
We've merged your claim into the knowledge base via cherry-pick (commit f6a59d7d
on main). Future external PRs will use `git merge --no-ff` so the GitHub merge
badge fires correctly. This PR is being closed manually as the one historical
case before the fix lands. Thanks for the contribution!
— LivingIP pipeline
```
Posted via `_post_comment` + `_close_github_pr` from a one-off script
(`scripts/close-fwazb-pr-90.py`).
## Implementation order
### Phase 1 — self-healing sweep in sync-mirror.sh (Bug #2, ~10 lines)
**Single block in `deploy/sync-mirror.sh`:**
- Inserted as **Step 0** — runs after the initial Forgejo+GitHub fetch but
**before** the branch-mirror loop and the auto-create-PR block (line ~250)
- SELECT `prs` rows where `branch LIKE 'gh-pr-%' AND github_pr IS NULL`
- Parse PR number from branch name (regex on `gh-pr-{N}/...`)
- UPDATE `github_pr` and `source_channel='github'`
- Audit-friendly log line per healed row
**Properties:**
- Idempotent (SELECT empty when clean)
- Zero-cost path when no rows match
- No API calls (branch name is the source of truth)
- Self-healing: same SELECT/UPDATE pattern recovers from race AND backfills
historical orphans (PR 4066 picked up on first cron tick post-deploy)
- Same-cycle convergence: a freshly-mirrored PR whose Step 4.5 link UPDATE
fails gets healed on the next cron tick's sweep, not delayed multiple cycles
**Branch:** `epimetheus/external-merge-flow-bug2` (revision in progress on
existing `epimetheus/external-merge-flow-design`).
**Deploys independently** of Phase 2 — no dependency on Bug #1 fix. Once Phase 1
is live, comments + close fire correctly even on cherry-pick-merged PRs (the
half-fix state). Phase 2 then layers in the "merged" badge.
**Smoke test (cost: 1 cron tick = 2 min):** after deploy, verify FwazB's
`prs.github_pr` populates from NULL → 90, then `_get_github_pr` resolves on
next merge action.
### Phase 2 — merge.py --no-ff for gh-pr-* (Bug #1, ~120 lines)
- Add `_merge_no_ff_external` function (architecture unchanged from v1)
- Add branch-prefix dispatch case in `_merge_domain_queue`
- Add config flag `EXTERNAL_PR_NO_FF_MERGE` for backout
- Branch: `epimetheus/external-merge-flow-bug1`
- Depends on Phase 1 being live (otherwise `on_merged` still no-ops on fork PRs)
- Smoke: end-to-end test PR
### Phase 3 — FwazB cleanup (~10 lines)
- Manual one-off script for PR #90 (option b: explanatory comment + close)
- Independent of Phase 1/2 deploy
- Can run any time after Phase 1 lands (Phase 1 populates github_pr=90 on PR 4066,
enabling `_get_github_pr` to resolve for the comment script)
**Two phases, separately reviewable, separately deployable.** Phase 1 alone gives
us contributor comments on closed-via-cherry-pick PRs (already partially solves
the UX). Phase 2 adds the "merged" badge. Phase 3 is post-Phase-1 cleanup.
## Locked decisions
**Phase 1 (resolved in Msg 2):**
- `.fork-pr-map` file location → moot, sweep replaces map approach
- Phase 1 vs Phase 2 sequencing → separate deploys, sweep-only Phase 1
- One-shot backfill script → dropped, sweep IS the backfill
- Sweep placement → first action after initial fetch, before branch-mirror loop
AND auto-create-PR block
**Phase 2 (resolved in Msg 2 / Ship reply):**
1. **Backout flag:** `EXTERNAL_PR_NO_FF_MERGE` config flag included, default
`True` after deploy. One config branch in dispatch is cheap insurance with
hackathon timing risk one week out.
2. **Rebase-after-fix scope:** silent-ignore as documented in §"Edge case".
Eval-reset semantics already handle the recovery path — no structured
alert in this scope. Re-evaluate after first 5-10 external contributors.
3. **Merge commit message format:** verbose with PR number —
`"Merge external GitHub PR #{N}: {branch_slug}"`. Branch slug derived from
the `gh-pr-{N}/` prefix strip; already in scope at merge time, no claim-file
read needed. Searchability via `git log --merges --grep "#90"` is the value.
Ganymede gets line-level review of the Phase 1 sweep code on
`epimetheus/sync-mirror-self-heal` once Ship signs off on this revision.