teleo-infrastructure/docs/external-contributor-merge-flow.md
m3taversal 537cfd5ed7 docs(external-contrib): v3 — locked Phase 2 decisions, verbose merge msg (Ship Msg 3)
Cleanup pass on the design doc per Ship's Msg 3 review. Two changes, no architecture shifts:

1. Merge commit message updated to locked verbose form:
     "Merge external GitHub PR #{N}: {branch_slug}"
   N derived from gh-pr-{N}/ regex; branch_slug = post-prefix portion. Embeds
   the GitHub PR number explicitly so `git log --merges --grep "#90"` surfaces
   the merge from PR number alone. Drops the v2 "{branch}"-only form.

2. "Open questions for Ship" section collapsed to "Locked decisions" —
   the three Phase 2 items (backout flag, rebase-after-fix scope, merge commit
   message format) were locked in Ship's reply on Msg 2 but the v2 doc still
   framed them as open. Now restated as decisions so Ganymede doesn't re-debate
   resolved items at line-level review.

Diffs reviewable against v2 (commit fa6cceb).
2026-04-28 13:04:09 +01:00

18 KiB
Raw Permalink Blame History

External Contributor Merge Flow — Design Doc

Author: Epimetheus Architecture review: Ship (owns merge.py, sync-mirror.sh) Code review: Ganymede (line-level, post-design-approval) Status: Phase 1 sweep-only scope locked. Phase 2 architecture decisions locked (Ship Msg 3). Awaiting Phase 1 line-level code review.

Revision log

  • v3 (this revision): Cleanup pass per Ship Msg 3 — merge commit message updated to locked verbose form ("Merge external GitHub PR #{N}: {branch_slug}"), open-questions section collapsed to "Locked Phase 2 decisions" restating the three resolved outcomes (no longer questions).
  • v2: Phase 1 simplified per Ship's Msg 2 — backfill script dropped (sweep IS the backfill), sweep placement specified explicitly as the very first action after initial fetch (ahead of branch-mirror loop AND auto-create-PR block at line ~250). Phase 2 architecture unchanged.
  • v1: Initial draft.

Problem statement

External GitHub contributors submit PRs via the living-ip/teleo-codex mirror. Pipeline accepts the claim and merges the content into Forgejo main, but the GitHub PR shows "open with no diff" — it looks abandoned to the contributor.

Two compounding bugs intersect on this path:

  1. Cherry-pick breaks GitHub merge detection. lib/merge.py::_cherry_pick_onto_main creates a new SHA on Forgejo main. GitHub's "is PR head SHA an ancestor of main?" check returns false. merged: false, merge_commit_sha: null forever.

  2. prs.github_pr not populated for fork PRs. sync-mirror.sh Step 4.5 looks up GitHub PR number via ?head=living-ip:$branch, but fork PR heads are FwazB:contributor/... (or <fork-owner>:<branch>), not living-ip:. The filter misses, github_pr stays NULL, and lib/github_feedback.py::on_merged returns early (no comment, no close) because _get_github_pr requires non-NULL.

Empirical:

PR head merge mech github_pr merged badge comment posted
#87 (own-repo) living-ip:fix/... git merge --no-ff populated ✓ true
#90 (FwazB fork) FwazB:contributor/... cherry-pick NULL ✗ false

Both bugs need fixes. Bug #1 is the structural one (load-bearing for the badge). Bug #2 is a sync-mirror filter issue (load-bearing for the comment/close).

Goal

External GitHub contributor opens a PR → pipeline ingests, evaluates, merges → GitHub PR shows merged: true with badge → bot comment posted → PR closed cleanly. No human in the loop on the success path. Failure modes (eval reject, auto-fix, contributor force-push) handled gracefully.

Out of scope

  • Agent-extraction PRs (extract/*, reweave/*, epimetheus/*, etc.) — keep cherry-pick. They merge 70+/day, have no contributor UX surface, and the cherry-pick → linear-history rationale (auto-fixer rebase pattern, bisect friendliness) holds.
  • /api/contributors legacy endpoint — separate work, deferred.
  • PAT-in-URL credential pattern — separate security follow-up.

Design — branch-prefix conditional, scoped to gh-pr-*

Bug #1 fix: _merge_no_ff_external for gh-pr-* branches

Dispatch site (lib/merge.py::_merge_domain_queue, currently lines 736-738):

# Reweave: per-file frontmatter union (existing)
if branch.startswith("reweave/"):
    merge_fn = _merge_reweave_pr(branch)
# External GitHub fork PRs: true merge with --no-ff so contributor SHA lands
# in main's history → GitHub recognizes "merged" badge.
elif branch.startswith("gh-pr-"):
    merge_fn = _merge_no_ff_external(branch)
# Default: cherry-pick (extraction commits ADD new files, applies cleanly,
# linear history preserved for the bulk-extraction flow).
else:
    merge_fn = _cherry_pick_onto_main(branch)

New function (lib/merge.py):

async def _merge_no_ff_external(branch: str) -> tuple[bool, str]:
    """Merge an external GitHub PR with --no-ff so contributor SHA lands in main.

    Why this differs from _cherry_pick_onto_main:
    - Cherry-pick rewrites SHA → GitHub never recognizes the PR as merged.
    - --no-ff preserves the contributor's commit SHA in main's history.
    - sync-mirror's Forgejo→GitHub propagation already handles merge commits
      (verified empirically: PR #87 round-tripped cleanly with merge_commit_sha
      preserved).

    Mechanics:
    1. Fetch latest origin/main and origin/{branch}
    2. Create scratch worktree at HEAD of origin/main
    3. Derive: gh_pr_num = re.match(r"gh-pr-(\d+)/", branch).group(1)
              branch_slug = branch[len(f"gh-pr-{gh_pr_num}/"):]
    4. git merge --no-ff origin/{branch} \
         -m f"Merge external GitHub PR #{gh_pr_num}: {branch_slug}"
    5. git push origin HEAD:main
    6. Cleanup worktree

    Conflict handling:
    - Entity conflicts: same auto-resolve pattern as cherry-pick
      (--ours = main HEAD, --theirs = branch). External claims rarely touch
      entities so this is a low-frequency path.
    - Other conflicts: abort, return False with conflict detail. Caller marks
      conflict_permanent. Manual resolution or contributor rebase required.

    Idempotency: caller already gates on PR status, so re-running on a merged
    PR fails at the merge step (already merged), which is the right behavior.

    Returns (success, message).
    """

The merge commit message format "Merge external GitHub PR #{N}: {branch_slug}" embeds the GitHub PR number explicitly so git log --merges --grep "#90" surfaces the merge from PR number alone (no branch-name guessing). Branch slug is the post-gh-pr-{N}/ portion of the branch (e.g., contributor/arcium-confidential-computing-challenge) — already in scope at merge time, no claim-file read needed.

Bug #2 fix: self-healing sweep at top of sync-mirror cycle

Root cause: The one-shot link UPDATE in Step 4.5 (lines ~250-294) runs once per branch creation — if it fails (race with PR row insertion, transient API hiccup, transient lock), the row is permanently stuck at github_pr=NULL. No retry path. FwazB's PR 4066 is the visible artifact of this class of failure.

The structural fix is a self-healing sweep: each cron tick, scan for any gh-pr-* PR rows missing github_pr and link them. Idempotent, zero-cost when clean, retries forever until the row is healed. The sweep IS the backfill — no separate one-shot script needed (Ship's simplification, Msg 2). First cron tick after deploy picks up 4066 automatically, same SELECT path that handles all future races.

Placement (load-bearing): the sweep runs as the very first action after the initial Forgejo+GitHub fetch, ahead of both:

  • The branch-mirror loop
  • The auto-create-PR block at line ~250

This sequencing matters because a fresh-cycle race — PR created in the current cycle, link UPDATE in Step 4.5 fails — self-heals on the very next iteration's sweep, not 2 minutes later. Same-cycle convergence vs cross-cycle convergence.

# Step 0: self-heal any gh-pr-* PR rows missing github_pr.
# Runs FIRST — before branch-mirror loop, before auto-create-PR block.
# Idempotent: SELECT returns empty when clean.
# The branch name encodes the GitHub PR number (gh-pr-{N}/...) so no API
# round-trip needed to recover the number. Source-of-truth derivation.
sqlite3 -separator '|' "$PIPELINE_DB" \
  "SELECT number, branch FROM prs WHERE branch LIKE 'gh-pr-%' AND github_pr IS NULL" \
| while IFS='|' read -r pr_num branch; do
    gh_pr_num=$(echo "$branch" | sed -n 's|^gh-pr-\([0-9]*\)/.*|\1|p')
    [ -z "$gh_pr_num" ] && continue
    sqlite3 "$PIPELINE_DB" \
      "UPDATE prs SET github_pr = $gh_pr_num, source_channel = 'github' WHERE number = $pr_num;"
    log "self-heal: linked Forgejo PR #$pr_num → GitHub PR #$gh_pr_num"
done

~10 lines. No API call required (branch name carries the number deterministically). No .fork-pr-map file, no --script flag, no manual deploy step.

Why this approach beats the earlier .fork-pr-map proposal:

  • The map approach repaired the write path but left already-stuck rows orphaned
  • A separate one-shot backfill script for the orphans is one more code path to maintain
  • The sweep collapses both paths: it's the failure recovery AND the historical backfill
  • Future races automatically self-heal without code changes

Why this approach beats the earlier ?head=living-ip: filter:

  • That filter never matched fork PRs at all — by design, fork PRs come from a different owner. Pre-existing nit Ganymede flagged in multi-repo-mirror review
  • The branch name encoding bypasses the GitHub API entirely for fork PRs

Why no API verification before UPDATE:

  • Branch name is deterministic source — the only writer of gh-pr-N/... is Step 2.1, which only writes after fetching the GitHub PR ref. If the branch exists, the GitHub PR exists.
  • Avoids API rate-limit pressure on every cron tick (every 2 min × 24h × 7d ≈ 5040 calls/week even when no work needed).
  • Sanity-check via API is achievable cheaply if Ship wants it; current scope matches "zero-cost when clean."

Auto-fixer mode='append' for gh-pr-* branches

Current behavior (lib/fixer.py): Worktree-based fix → commit → push (regular push, not force). When the PR was created by the pipeline (extract/* branches), this works because the LLM-extractor's commit is at HEAD and we're appending on top — push succeeds.

External PR behavior (today): Same code path runs. Fork PR has FwazB's commit at HEAD; auto-fixer creates a fix commit on top; pushes via Forgejo's branch ref (the fork PR was mirrored as refs/heads/gh-pr-90/contributor/... on Forgejo). Push works. Eval reset fires. Eval re-runs.

Wait — re-reading the existing fixer, it's actually already append-only. Good. The cherry-pick at merge time was the only thing rewriting SHAs. So no fixer change required for Option 2. The fixer commit is already at HEAD~1 from FwazB's commit on Forgejo. When we git merge --no-ff instead of cherry-pick, the merge commit's parent chain includes BOTH the fix commit AND FwazB's original commit. GitHub sees FwazB's SHA in ancestry → "merged" badge.

Cross-check: Verified PR 4066's existing branch state.

$ git log refs/heads/gh-pr-90/contributor/arcium-confidential-computing-challenge --oneline
d7916d65 auto-fix: strip 2 broken wiki links     ← fixer's commit
f6a59d7d claim: confidential computing reshapes... ← FwazB's commit

Both commits already on Forgejo. When merge.py cherry-picked, it picked both commits onto main as new SHAs. With merge --no-ff, both stay intact, the merge commit references them, GitHub sees f6a59d7d (FwazB's HEAD on his fork) in main's ancestry, marks merged.

This means scope shrinks: the design is purely a merge.py change + a sync-mirror Step 2.1/4.5 plumbing tweak. No fixer module changes.

Edge case: contributor rebases their fork after fixer appended

Scenario: FwazB's PR is in eval. Pipeline auto-fixer pushes a fix commit to Forgejo gh-pr-90/.... FwazB notices the original wiki-link issue, fixes it locally, force-pushes to his fork. sync-mirror's next cycle fetches his new SHA → tries to update the Forgejo branch → push from sync-mirror is regular (not --force) → fails because Forgejo branch has diverged from FwazB's fork.

Today's behavior: sync-mirror logs a warning. Forgejo branch keeps the appended-fix state. Eval continues against that state. FwazB's most recent fork state is silently ignored.

Acceptable risk for hackathon. The eval-reset-on-tip-change gate will re-trigger eval if anyone force-pushes the Forgejo branch later (e.g., manual re-sync). Documented; not fixing in this PR.

Followup: sync-mirror could detect the divergence, log a structured alert, and post a comment on the GitHub PR ("we detected your force-push but our appended fix has diverged; please rebase against <sha> or we'll close"). Out of scope for this branch.

Test plan

# Scenario Expected Validation
1 External PR, clean (no auto-fix needed) Merged with --no-ff, contributor SHA in main, GitHub badge merged: true, on_merged comment + close curl GitHub API for PR state after merge
2 External PR with broken wiki links auto-fixer appends commit, eval re-runs and approves, merge --no-ff brings BOTH commits in via merge commit, GitHub badge merged: true log line trace + GitHub API
3 External PR rejected by eval (substantive issue) terminate_pr fires existing path, on_closed posts rejection comment + closes GitHub PR GitHub API after eval cycle
4 sync-mirror github_pr backfill on fork PR prs.github_pr populated within one cron cycle (≤2 min) of mirror PR creation sqlite3 SELECT after sync
5 Contributor force-pushes fork mid-eval sync-mirror logs warning, eval continues against pre-rebase state (documented behavior, not regression) journalctl
6 Re-running merge on already-merged PR --no-ff fails cleanly (already up to date), caller handles as no-op manual replay

Test 1 and 2 are the critical-path tests. 3 verifies the rejection path didn't regress. 4 isolates the github_pr backfill fix. 5 is acceptance criteria for the documented edge case. 6 is idempotency.

Production smoke test: after deploy, manually create a tiny test PR from a secondary GitHub account. Walk it through the full lifecycle. Tear down before hackathon. Cost: 5 minutes, catches integration-level issues that unit tests miss.

Backout procedure

If Option 2 misbehaves in production, revert path is one config-line toggle:

# lib/merge.py — gate on a feature flag for fast disable
if branch.startswith("gh-pr-") and config.EXTERNAL_PR_NO_FF_MERGE:
    merge_fn = _merge_no_ff_external(branch)
elif ...

Default flag value: True after deploy. Set to False via env var on VPS to fall back to cherry-pick path immediately if anything breaks. No code revert required for a fast cutout.

Forgejo→GitHub merge commits already in main when the flag flips can't be un-merged (they're real commits), but the failure mode is the same as today: GitHub PR shows merged because the SHA is in history. No worse than current.

Migration / cleanup

FwazB's PR #90: existing artifact, can't retroactively un-cherry-pick. Manual close with explanatory comment (Ship's option b). Cory-approved.

We've merged your claim into the knowledge base via cherry-pick (commit f6a59d7d
on main). Future external PRs will use `git merge --no-ff` so the GitHub merge
badge fires correctly. This PR is being closed manually as the one historical
case before the fix lands. Thanks for the contribution!

— LivingIP pipeline

Posted via _post_comment + _close_github_pr from a one-off script (scripts/close-fwazb-pr-90.py).

Implementation order

Phase 1 — self-healing sweep in sync-mirror.sh (Bug #2, ~10 lines)

Single block in deploy/sync-mirror.sh:

  • Inserted as Step 0 — runs after the initial Forgejo+GitHub fetch but before the branch-mirror loop and the auto-create-PR block (line ~250)
  • SELECT prs rows where branch LIKE 'gh-pr-%' AND github_pr IS NULL
  • Parse PR number from branch name (regex on gh-pr-{N}/...)
  • UPDATE github_pr and source_channel='github'
  • Audit-friendly log line per healed row

Properties:

  • Idempotent (SELECT empty when clean)
  • Zero-cost path when no rows match
  • No API calls (branch name is the source of truth)
  • Self-healing: same SELECT/UPDATE pattern recovers from race AND backfills historical orphans (PR 4066 picked up on first cron tick post-deploy)
  • Same-cycle convergence: a freshly-mirrored PR whose Step 4.5 link UPDATE fails gets healed on the next cron tick's sweep, not delayed multiple cycles

Branch: epimetheus/external-merge-flow-bug2 (revision in progress on existing epimetheus/external-merge-flow-design).

Deploys independently of Phase 2 — no dependency on Bug #1 fix. Once Phase 1 is live, comments + close fire correctly even on cherry-pick-merged PRs (the half-fix state). Phase 2 then layers in the "merged" badge.

Smoke test (cost: 1 cron tick = 2 min): after deploy, verify FwazB's prs.github_pr populates from NULL → 90, then _get_github_pr resolves on next merge action.

Phase 2 — merge.py --no-ff for gh-pr-* (Bug #1, ~120 lines)

  • Add _merge_no_ff_external function (architecture unchanged from v1)
  • Add branch-prefix dispatch case in _merge_domain_queue
  • Add config flag EXTERNAL_PR_NO_FF_MERGE for backout
  • Branch: epimetheus/external-merge-flow-bug1
  • Depends on Phase 1 being live (otherwise on_merged still no-ops on fork PRs)
  • Smoke: end-to-end test PR

Phase 3 — FwazB cleanup (~10 lines)

  • Manual one-off script for PR #90 (option b: explanatory comment + close)
  • Independent of Phase 1/2 deploy
  • Can run any time after Phase 1 lands (Phase 1 populates github_pr=90 on PR 4066, enabling _get_github_pr to resolve for the comment script)

Two phases, separately reviewable, separately deployable. Phase 1 alone gives us contributor comments on closed-via-cherry-pick PRs (already partially solves the UX). Phase 2 adds the "merged" badge. Phase 3 is post-Phase-1 cleanup.

Locked decisions

Phase 1 (resolved in Msg 2):

  • .fork-pr-map file location → moot, sweep replaces map approach
  • Phase 1 vs Phase 2 sequencing → separate deploys, sweep-only Phase 1
  • One-shot backfill script → dropped, sweep IS the backfill
  • Sweep placement → first action after initial fetch, before branch-mirror loop AND auto-create-PR block

Phase 2 (resolved in Msg 2 / Ship reply):

  1. Backout flag: EXTERNAL_PR_NO_FF_MERGE config flag included, default True after deploy. One config branch in dispatch is cheap insurance with hackathon timing risk one week out.

  2. Rebase-after-fix scope: silent-ignore as documented in §"Edge case". Eval-reset semantics already handle the recovery path — no structured alert in this scope. Re-evaluate after first 5-10 external contributors.

  3. Merge commit message format: verbose with PR number — "Merge external GitHub PR #{N}: {branch_slug}". Branch slug derived from the gh-pr-{N}/ prefix strip; already in scope at merge time, no claim-file read needed. Searchability via git log --merges --grep "#90" is the value.

Ganymede gets line-level review of the Phase 1 sweep code on epimetheus/sync-mirror-self-heal once Ship signs off on this revision.