teleo-infrastructure/docs/external-contributor-merge-flow.md
m3taversal fa6cceb9cd docs(external-contrib): v2 — sweep-only Phase 1 (Ship Msg 2 simplification)
Phase 1 simplified per Ship's architectural review:
- Drop the one-shot backfill script — the self-healing sweep IS the backfill.
  First cron tick post-deploy picks up FwazB PR 4066 automatically via the
  same SELECT/UPDATE path that handles all future races.
- Specify sweep placement as Step 0: runs as the very first action after
  initial Forgejo+GitHub fetch, ahead of branch-mirror loop AND auto-create-PR
  block (line ~250). Same-cycle convergence on fresh-cycle races.
- Replace `.fork-pr-map` design with branch-name-encoded sweep — branch name
  carries the GitHub PR number deterministically (gh-pr-{N}/...), no API
  call required, no map file, no flag plumbing.

Phase 1 final scope: ~10-line sweep block in sync-mirror.sh, idempotent,
zero-cost when clean.

Phase 2 architecture unchanged. Ship's open questions for Phase 2 (backout
flag, rebase-after-fix scope, merge commit message format) remain open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:57:35 +01:00

17 KiB
Raw Blame History

External Contributor Merge Flow — Design Doc

Author: Epimetheus Architecture review: Ship (owns merge.py, sync-mirror.sh) Code review: Ganymede (line-level, post-design-approval) Status: Phase 1 sweep-only scope locked (Ship Msg 2/2). Awaiting code review.

Revision log

  • v2 (this revision): Phase 1 simplified per Ship's Msg 2 — backfill script dropped (sweep IS the backfill), sweep placement specified explicitly as the very first action after initial fetch (ahead of branch-mirror loop AND auto-create-PR block at line ~250). Phase 2 architecture unchanged.
  • v1: Initial draft.

Problem statement

External GitHub contributors submit PRs via the living-ip/teleo-codex mirror. Pipeline accepts the claim and merges the content into Forgejo main, but the GitHub PR shows "open with no diff" — it looks abandoned to the contributor.

Two compounding bugs intersect on this path:

  1. Cherry-pick breaks GitHub merge detection. lib/merge.py::_cherry_pick_onto_main creates a new SHA on Forgejo main. GitHub's "is PR head SHA an ancestor of main?" check returns false. merged: false, merge_commit_sha: null forever.

  2. prs.github_pr not populated for fork PRs. sync-mirror.sh Step 4.5 looks up GitHub PR number via ?head=living-ip:$branch, but fork PR heads are FwazB:contributor/... (or <fork-owner>:<branch>), not living-ip:. The filter misses, github_pr stays NULL, and lib/github_feedback.py::on_merged returns early (no comment, no close) because _get_github_pr requires non-NULL.

Empirical:

PR head merge mech github_pr merged badge comment posted
#87 (own-repo) living-ip:fix/... git merge --no-ff populated ✓ true
#90 (FwazB fork) FwazB:contributor/... cherry-pick NULL ✗ false

Both bugs need fixes. Bug #1 is the structural one (load-bearing for the badge). Bug #2 is a sync-mirror filter issue (load-bearing for the comment/close).

Goal

External GitHub contributor opens a PR → pipeline ingests, evaluates, merges → GitHub PR shows merged: true with badge → bot comment posted → PR closed cleanly. No human in the loop on the success path. Failure modes (eval reject, auto-fix, contributor force-push) handled gracefully.

Out of scope

  • Agent-extraction PRs (extract/*, reweave/*, epimetheus/*, etc.) — keep cherry-pick. They merge 70+/day, have no contributor UX surface, and the cherry-pick → linear-history rationale (auto-fixer rebase pattern, bisect friendliness) holds.
  • /api/contributors legacy endpoint — separate work, deferred.
  • PAT-in-URL credential pattern — separate security follow-up.

Design — branch-prefix conditional, scoped to gh-pr-*

Bug #1 fix: _merge_no_ff_external for gh-pr-* branches

Dispatch site (lib/merge.py::_merge_domain_queue, currently lines 736-738):

# Reweave: per-file frontmatter union (existing)
if branch.startswith("reweave/"):
    merge_fn = _merge_reweave_pr(branch)
# External GitHub fork PRs: true merge with --no-ff so contributor SHA lands
# in main's history → GitHub recognizes "merged" badge.
elif branch.startswith("gh-pr-"):
    merge_fn = _merge_no_ff_external(branch)
# Default: cherry-pick (extraction commits ADD new files, applies cleanly,
# linear history preserved for the bulk-extraction flow).
else:
    merge_fn = _cherry_pick_onto_main(branch)

New function (lib/merge.py):

async def _merge_no_ff_external(branch: str) -> tuple[bool, str]:
    """Merge an external GitHub PR with --no-ff so contributor SHA lands in main.

    Why this differs from _cherry_pick_onto_main:
    - Cherry-pick rewrites SHA → GitHub never recognizes the PR as merged.
    - --no-ff preserves the contributor's commit SHA in main's history.
    - sync-mirror's Forgejo→GitHub propagation already handles merge commits
      (verified empirically: PR #87 round-tripped cleanly with merge_commit_sha
      preserved).

    Mechanics:
    1. Fetch latest origin/main and origin/{branch}
    2. Create scratch worktree at HEAD of origin/main
    3. git merge --no-ff origin/{branch} -m "Merge PR: {branch}"
    4. git push origin HEAD:main
    5. Cleanup worktree

    Conflict handling:
    - Entity conflicts: same auto-resolve pattern as cherry-pick
      (--ours = main HEAD, --theirs = branch). External claims rarely touch
      entities so this is a low-frequency path.
    - Other conflicts: abort, return False with conflict detail. Caller marks
      conflict_permanent. Manual resolution or contributor rebase required.

    Idempotency: caller already gates on PR status, so re-running on a merged
    PR fails at the merge step (already merged), which is the right behavior.

    Returns (success, message).
    """

The merge commit message uses the branch name (which contains the GitHub PR number via the gh-pr-{N}/... convention) so post-merge processing can correlate back to the GitHub PR if needed.

Bug #2 fix: self-healing sweep at top of sync-mirror cycle

Root cause: The one-shot link UPDATE in Step 4.5 (lines ~250-294) runs once per branch creation — if it fails (race with PR row insertion, transient API hiccup, transient lock), the row is permanently stuck at github_pr=NULL. No retry path. FwazB's PR 4066 is the visible artifact of this class of failure.

The structural fix is a self-healing sweep: each cron tick, scan for any gh-pr-* PR rows missing github_pr and link them. Idempotent, zero-cost when clean, retries forever until the row is healed. The sweep IS the backfill — no separate one-shot script needed (Ship's simplification, Msg 2). First cron tick after deploy picks up 4066 automatically, same SELECT path that handles all future races.

Placement (load-bearing): the sweep runs as the very first action after the initial Forgejo+GitHub fetch, ahead of both:

  • The branch-mirror loop
  • The auto-create-PR block at line ~250

This sequencing matters because a fresh-cycle race — PR created in the current cycle, link UPDATE in Step 4.5 fails — self-heals on the very next iteration's sweep, not 2 minutes later. Same-cycle convergence vs cross-cycle convergence.

# Step 0: self-heal any gh-pr-* PR rows missing github_pr.
# Runs FIRST — before branch-mirror loop, before auto-create-PR block.
# Idempotent: SELECT returns empty when clean.
# The branch name encodes the GitHub PR number (gh-pr-{N}/...) so no API
# round-trip needed to recover the number. Source-of-truth derivation.
sqlite3 -separator '|' "$PIPELINE_DB" \
  "SELECT number, branch FROM prs WHERE branch LIKE 'gh-pr-%' AND github_pr IS NULL" \
| while IFS='|' read -r pr_num branch; do
    gh_pr_num=$(echo "$branch" | sed -n 's|^gh-pr-\([0-9]*\)/.*|\1|p')
    [ -z "$gh_pr_num" ] && continue
    sqlite3 "$PIPELINE_DB" \
      "UPDATE prs SET github_pr = $gh_pr_num, source_channel = 'github' WHERE number = $pr_num;"
    log "self-heal: linked Forgejo PR #$pr_num → GitHub PR #$gh_pr_num"
done

~10 lines. No API call required (branch name carries the number deterministically). No .fork-pr-map file, no --script flag, no manual deploy step.

Why this approach beats the earlier .fork-pr-map proposal:

  • The map approach repaired the write path but left already-stuck rows orphaned
  • A separate one-shot backfill script for the orphans is one more code path to maintain
  • The sweep collapses both paths: it's the failure recovery AND the historical backfill
  • Future races automatically self-heal without code changes

Why this approach beats the earlier ?head=living-ip: filter:

  • That filter never matched fork PRs at all — by design, fork PRs come from a different owner. Pre-existing nit Ganymede flagged in multi-repo-mirror review
  • The branch name encoding bypasses the GitHub API entirely for fork PRs

Why no API verification before UPDATE:

  • Branch name is deterministic source — the only writer of gh-pr-N/... is Step 2.1, which only writes after fetching the GitHub PR ref. If the branch exists, the GitHub PR exists.
  • Avoids API rate-limit pressure on every cron tick (every 2 min × 24h × 7d ≈ 5040 calls/week even when no work needed).
  • Sanity-check via API is achievable cheaply if Ship wants it; current scope matches "zero-cost when clean."

Auto-fixer mode='append' for gh-pr-* branches

Current behavior (lib/fixer.py): Worktree-based fix → commit → push (regular push, not force). When the PR was created by the pipeline (extract/* branches), this works because the LLM-extractor's commit is at HEAD and we're appending on top — push succeeds.

External PR behavior (today): Same code path runs. Fork PR has FwazB's commit at HEAD; auto-fixer creates a fix commit on top; pushes via Forgejo's branch ref (the fork PR was mirrored as refs/heads/gh-pr-90/contributor/... on Forgejo). Push works. Eval reset fires. Eval re-runs.

Wait — re-reading the existing fixer, it's actually already append-only. Good. The cherry-pick at merge time was the only thing rewriting SHAs. So no fixer change required for Option 2. The fixer commit is already at HEAD~1 from FwazB's commit on Forgejo. When we git merge --no-ff instead of cherry-pick, the merge commit's parent chain includes BOTH the fix commit AND FwazB's original commit. GitHub sees FwazB's SHA in ancestry → "merged" badge.

Cross-check: Verified PR 4066's existing branch state.

$ git log refs/heads/gh-pr-90/contributor/arcium-confidential-computing-challenge --oneline
d7916d65 auto-fix: strip 2 broken wiki links     ← fixer's commit
f6a59d7d claim: confidential computing reshapes... ← FwazB's commit

Both commits already on Forgejo. When merge.py cherry-picked, it picked both commits onto main as new SHAs. With merge --no-ff, both stay intact, the merge commit references them, GitHub sees f6a59d7d (FwazB's HEAD on his fork) in main's ancestry, marks merged.

This means scope shrinks: the design is purely a merge.py change + a sync-mirror Step 2.1/4.5 plumbing tweak. No fixer module changes.

Edge case: contributor rebases their fork after fixer appended

Scenario: FwazB's PR is in eval. Pipeline auto-fixer pushes a fix commit to Forgejo gh-pr-90/.... FwazB notices the original wiki-link issue, fixes it locally, force-pushes to his fork. sync-mirror's next cycle fetches his new SHA → tries to update the Forgejo branch → push from sync-mirror is regular (not --force) → fails because Forgejo branch has diverged from FwazB's fork.

Today's behavior: sync-mirror logs a warning. Forgejo branch keeps the appended-fix state. Eval continues against that state. FwazB's most recent fork state is silently ignored.

Acceptable risk for hackathon. The eval-reset-on-tip-change gate will re-trigger eval if anyone force-pushes the Forgejo branch later (e.g., manual re-sync). Documented; not fixing in this PR.

Followup: sync-mirror could detect the divergence, log a structured alert, and post a comment on the GitHub PR ("we detected your force-push but our appended fix has diverged; please rebase against <sha> or we'll close"). Out of scope for this branch.

Test plan

# Scenario Expected Validation
1 External PR, clean (no auto-fix needed) Merged with --no-ff, contributor SHA in main, GitHub badge merged: true, on_merged comment + close curl GitHub API for PR state after merge
2 External PR with broken wiki links auto-fixer appends commit, eval re-runs and approves, merge --no-ff brings BOTH commits in via merge commit, GitHub badge merged: true log line trace + GitHub API
3 External PR rejected by eval (substantive issue) terminate_pr fires existing path, on_closed posts rejection comment + closes GitHub PR GitHub API after eval cycle
4 sync-mirror github_pr backfill on fork PR prs.github_pr populated within one cron cycle (≤2 min) of mirror PR creation sqlite3 SELECT after sync
5 Contributor force-pushes fork mid-eval sync-mirror logs warning, eval continues against pre-rebase state (documented behavior, not regression) journalctl
6 Re-running merge on already-merged PR --no-ff fails cleanly (already up to date), caller handles as no-op manual replay

Test 1 and 2 are the critical-path tests. 3 verifies the rejection path didn't regress. 4 isolates the github_pr backfill fix. 5 is acceptance criteria for the documented edge case. 6 is idempotency.

Production smoke test: after deploy, manually create a tiny test PR from a secondary GitHub account. Walk it through the full lifecycle. Tear down before hackathon. Cost: 5 minutes, catches integration-level issues that unit tests miss.

Backout procedure

If Option 2 misbehaves in production, revert path is one config-line toggle:

# lib/merge.py — gate on a feature flag for fast disable
if branch.startswith("gh-pr-") and config.EXTERNAL_PR_NO_FF_MERGE:
    merge_fn = _merge_no_ff_external(branch)
elif ...

Default flag value: True after deploy. Set to False via env var on VPS to fall back to cherry-pick path immediately if anything breaks. No code revert required for a fast cutout.

Forgejo→GitHub merge commits already in main when the flag flips can't be un-merged (they're real commits), but the failure mode is the same as today: GitHub PR shows merged because the SHA is in history. No worse than current.

Migration / cleanup

FwazB's PR #90: existing artifact, can't retroactively un-cherry-pick. Manual close with explanatory comment (Ship's option b). Cory-approved.

We've merged your claim into the knowledge base via cherry-pick (commit f6a59d7d
on main). Future external PRs will use `git merge --no-ff` so the GitHub merge
badge fires correctly. This PR is being closed manually as the one historical
case before the fix lands. Thanks for the contribution!

— LivingIP pipeline

Posted via _post_comment + _close_github_pr from a one-off script (scripts/close-fwazb-pr-90.py).

Implementation order

Phase 1 — self-healing sweep in sync-mirror.sh (Bug #2, ~10 lines)

Single block in deploy/sync-mirror.sh:

  • Inserted as Step 0 — runs after the initial Forgejo+GitHub fetch but before the branch-mirror loop and the auto-create-PR block (line ~250)
  • SELECT prs rows where branch LIKE 'gh-pr-%' AND github_pr IS NULL
  • Parse PR number from branch name (regex on gh-pr-{N}/...)
  • UPDATE github_pr and source_channel='github'
  • Audit-friendly log line per healed row

Properties:

  • Idempotent (SELECT empty when clean)
  • Zero-cost path when no rows match
  • No API calls (branch name is the source of truth)
  • Self-healing: same SELECT/UPDATE pattern recovers from race AND backfills historical orphans (PR 4066 picked up on first cron tick post-deploy)
  • Same-cycle convergence: a freshly-mirrored PR whose Step 4.5 link UPDATE fails gets healed on the next cron tick's sweep, not delayed multiple cycles

Branch: epimetheus/external-merge-flow-bug2 (revision in progress on existing epimetheus/external-merge-flow-design).

Deploys independently of Phase 2 — no dependency on Bug #1 fix. Once Phase 1 is live, comments + close fire correctly even on cherry-pick-merged PRs (the half-fix state). Phase 2 then layers in the "merged" badge.

Smoke test (cost: 1 cron tick = 2 min): after deploy, verify FwazB's prs.github_pr populates from NULL → 90, then _get_github_pr resolves on next merge action.

Phase 2 — merge.py --no-ff for gh-pr-* (Bug #1, ~120 lines)

  • Add _merge_no_ff_external function (architecture unchanged from v1)
  • Add branch-prefix dispatch case in _merge_domain_queue
  • Add config flag EXTERNAL_PR_NO_FF_MERGE for backout
  • Branch: epimetheus/external-merge-flow-bug1
  • Depends on Phase 1 being live (otherwise on_merged still no-ops on fork PRs)
  • Smoke: end-to-end test PR

Phase 3 — FwazB cleanup (~10 lines)

  • Manual one-off script for PR #90 (option b: explanatory comment + close)
  • Independent of Phase 1/2 deploy
  • Can run any time after Phase 1 lands (Phase 1 populates github_pr=90 on PR 4066, enabling _get_github_pr to resolve for the comment script)

Two phases, separately reviewable, separately deployable. Phase 1 alone gives us contributor comments on closed-via-cherry-pick PRs (already partially solves the UX). Phase 2 adds the "merged" badge. Phase 3 is post-Phase-1 cleanup.

Open questions for Ship

Resolved in Msg 2:

  • .fork-pr-map file location → moot, sweep replaces map approach
  • Phase 1 vs Phase 2 sequencing → separate deploys, sweep-only Phase 1 confirmed
  • One-shot backfill script → dropped, sweep IS the backfill
  • Sweep placement → confirmed: first action after initial fetch, before branch-mirror loop AND auto-create-PR block

Still open (Phase 2 architecture):

  1. Backout flag necessary? I'm including EXTERNAL_PR_NO_FF_MERGE config flag for fast cutout. Overkill given low-traffic external-PR path? Or prudent given Accelerate Solana pressure?

  2. Test plan #5 (rebase-after-fix) scope. Documenting the silent-ignore as acceptable for now. Sufficient for hackathon, or do you want me to add the structured-alert path before May 5?

  3. merge --no-ff commit message format. I have "Merge PR: {branch}". Better signal to derive the GitHub PR number from the branch and write "Merge external GitHub PR #{N}: {claim_title_or_branch_slug}"? Trades verbosity for searchability in git log --merges.

Sending revised doc directly via Pentagon. Ganymede gets line-level review of the Phase 1 code once Ship signs off on the revised design.