Phase 1 simplified per Ship's architectural review:
- Drop the one-shot backfill script — the self-healing sweep IS the backfill.
First cron tick post-deploy picks up FwazB PR 4066 automatically via the
same SELECT/UPDATE path that handles all future races.
- Specify sweep placement as Step 0: runs as the very first action after
initial Forgejo+GitHub fetch, ahead of branch-mirror loop AND auto-create-PR
block (line ~250). Same-cycle convergence on fresh-cycle races.
- Replace `.fork-pr-map` design with branch-name-encoded sweep — branch name
carries the GitHub PR number deterministically (gh-pr-{N}/...), no API
call required, no map file, no flag plumbing.
Phase 1 final scope: ~10-line sweep block in sync-mirror.sh, idempotent,
zero-cost when clean.
Phase 2 architecture unchanged. Ship's open questions for Phase 2 (backout
flag, rebase-after-fix scope, merge commit message format) remain open.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
368 lines
17 KiB
Markdown
368 lines
17 KiB
Markdown
# External Contributor Merge Flow — Design Doc
|
||
|
||
**Author:** Epimetheus
|
||
**Architecture review:** Ship (owns merge.py, sync-mirror.sh)
|
||
**Code review:** Ganymede (line-level, post-design-approval)
|
||
**Status:** Phase 1 sweep-only scope locked (Ship Msg 2/2). Awaiting code review.
|
||
|
||
## Revision log
|
||
|
||
- **v2 (this revision):** Phase 1 simplified per Ship's Msg 2 — backfill script dropped (sweep IS the backfill), sweep placement specified explicitly as the very first action after initial fetch (ahead of branch-mirror loop AND auto-create-PR block at line ~250). Phase 2 architecture unchanged.
|
||
- **v1:** Initial draft.
|
||
|
||
## Problem statement
|
||
|
||
External GitHub contributors submit PRs via the `living-ip/teleo-codex` mirror.
|
||
Pipeline accepts the claim and merges the content into Forgejo `main`, but the
|
||
GitHub PR shows "open with no diff" — it looks abandoned to the contributor.
|
||
|
||
Two compounding bugs intersect on this path:
|
||
|
||
1. **Cherry-pick breaks GitHub merge detection.** `lib/merge.py::_cherry_pick_onto_main`
|
||
creates a new SHA on Forgejo main. GitHub's "is PR head SHA an ancestor of
|
||
main?" check returns false. `merged: false, merge_commit_sha: null` forever.
|
||
|
||
2. **`prs.github_pr` not populated for fork PRs.** `sync-mirror.sh` Step 4.5 looks
|
||
up GitHub PR number via `?head=living-ip:$branch`, but fork PR heads are
|
||
`FwazB:contributor/...` (or `<fork-owner>:<branch>`), not `living-ip:`. The
|
||
filter misses, `github_pr` stays NULL, and `lib/github_feedback.py::on_merged`
|
||
returns early (no comment, no close) because `_get_github_pr` requires non-NULL.
|
||
|
||
Empirical:
|
||
|
||
| PR | head | merge mech | github_pr | merged badge | comment posted |
|
||
|---|---|---|---|---|---|
|
||
| #87 (own-repo) | `living-ip:fix/...` | `git merge --no-ff` | populated | ✓ true | ✓ |
|
||
| #90 (FwazB fork) | `FwazB:contributor/...` | cherry-pick | NULL | ✗ false | ✗ |
|
||
|
||
Both bugs need fixes. Bug #1 is the structural one (load-bearing for the badge).
|
||
Bug #2 is a sync-mirror filter issue (load-bearing for the comment/close).
|
||
|
||
## Goal
|
||
|
||
External GitHub contributor opens a PR → pipeline ingests, evaluates, merges →
|
||
GitHub PR shows `merged: true` with badge → bot comment posted → PR closed
|
||
cleanly. No human in the loop on the success path. Failure modes (eval reject,
|
||
auto-fix, contributor force-push) handled gracefully.
|
||
|
||
## Out of scope
|
||
|
||
- Agent-extraction PRs (`extract/*`, `reweave/*`, `epimetheus/*`, etc.) — keep
|
||
cherry-pick. They merge 70+/day, have no contributor UX surface, and the
|
||
cherry-pick → linear-history rationale (auto-fixer rebase pattern, bisect
|
||
friendliness) holds.
|
||
- `/api/contributors` legacy endpoint — separate work, deferred.
|
||
- PAT-in-URL credential pattern — separate security follow-up.
|
||
|
||
## Design — branch-prefix conditional, scoped to `gh-pr-*`
|
||
|
||
### Bug #1 fix: `_merge_no_ff_external` for `gh-pr-*` branches
|
||
|
||
**Dispatch site (`lib/merge.py::_merge_domain_queue`, currently lines 736-738):**
|
||
|
||
```python
|
||
# Reweave: per-file frontmatter union (existing)
|
||
if branch.startswith("reweave/"):
|
||
merge_fn = _merge_reweave_pr(branch)
|
||
# External GitHub fork PRs: true merge with --no-ff so contributor SHA lands
|
||
# in main's history → GitHub recognizes "merged" badge.
|
||
elif branch.startswith("gh-pr-"):
|
||
merge_fn = _merge_no_ff_external(branch)
|
||
# Default: cherry-pick (extraction commits ADD new files, applies cleanly,
|
||
# linear history preserved for the bulk-extraction flow).
|
||
else:
|
||
merge_fn = _cherry_pick_onto_main(branch)
|
||
```
|
||
|
||
**New function (`lib/merge.py`):**
|
||
|
||
```python
|
||
async def _merge_no_ff_external(branch: str) -> tuple[bool, str]:
|
||
"""Merge an external GitHub PR with --no-ff so contributor SHA lands in main.
|
||
|
||
Why this differs from _cherry_pick_onto_main:
|
||
- Cherry-pick rewrites SHA → GitHub never recognizes the PR as merged.
|
||
- --no-ff preserves the contributor's commit SHA in main's history.
|
||
- sync-mirror's Forgejo→GitHub propagation already handles merge commits
|
||
(verified empirically: PR #87 round-tripped cleanly with merge_commit_sha
|
||
preserved).
|
||
|
||
Mechanics:
|
||
1. Fetch latest origin/main and origin/{branch}
|
||
2. Create scratch worktree at HEAD of origin/main
|
||
3. git merge --no-ff origin/{branch} -m "Merge PR: {branch}"
|
||
4. git push origin HEAD:main
|
||
5. Cleanup worktree
|
||
|
||
Conflict handling:
|
||
- Entity conflicts: same auto-resolve pattern as cherry-pick
|
||
(--ours = main HEAD, --theirs = branch). External claims rarely touch
|
||
entities so this is a low-frequency path.
|
||
- Other conflicts: abort, return False with conflict detail. Caller marks
|
||
conflict_permanent. Manual resolution or contributor rebase required.
|
||
|
||
Idempotency: caller already gates on PR status, so re-running on a merged
|
||
PR fails at the merge step (already merged), which is the right behavior.
|
||
|
||
Returns (success, message).
|
||
"""
|
||
```
|
||
|
||
The merge commit message uses the branch name (which contains the GitHub PR
|
||
number via the `gh-pr-{N}/...` convention) so post-merge processing can
|
||
correlate back to the GitHub PR if needed.
|
||
|
||
### Bug #2 fix: self-healing sweep at top of sync-mirror cycle
|
||
|
||
**Root cause:** The one-shot link UPDATE in Step 4.5 (lines ~250-294) runs once
|
||
per branch creation — if it fails (race with PR row insertion, transient API
|
||
hiccup, transient lock), the row is permanently stuck at `github_pr=NULL`. No
|
||
retry path. FwazB's PR 4066 is the visible artifact of this class of failure.
|
||
|
||
The structural fix is a self-healing sweep: each cron tick, scan for any
|
||
`gh-pr-*` PR rows missing `github_pr` and link them. Idempotent, zero-cost when
|
||
clean, retries forever until the row is healed. The sweep IS the backfill — no
|
||
separate one-shot script needed (Ship's simplification, Msg 2). First cron tick
|
||
after deploy picks up 4066 automatically, same SELECT path that handles all
|
||
future races.
|
||
|
||
**Placement (load-bearing):** the sweep runs **as the very first action after
|
||
the initial Forgejo+GitHub fetch**, ahead of both:
|
||
- The branch-mirror loop
|
||
- The auto-create-PR block at line ~250
|
||
|
||
This sequencing matters because a fresh-cycle race — PR created in the *current*
|
||
cycle, link UPDATE in `Step 4.5` fails — self-heals on the very next iteration's
|
||
sweep, not 2 minutes later. Same-cycle convergence vs cross-cycle convergence.
|
||
|
||
```bash
|
||
# Step 0: self-heal any gh-pr-* PR rows missing github_pr.
|
||
# Runs FIRST — before branch-mirror loop, before auto-create-PR block.
|
||
# Idempotent: SELECT returns empty when clean.
|
||
# The branch name encodes the GitHub PR number (gh-pr-{N}/...) so no API
|
||
# round-trip needed to recover the number. Source-of-truth derivation.
|
||
sqlite3 -separator '|' "$PIPELINE_DB" \
|
||
"SELECT number, branch FROM prs WHERE branch LIKE 'gh-pr-%' AND github_pr IS NULL" \
|
||
| while IFS='|' read -r pr_num branch; do
|
||
gh_pr_num=$(echo "$branch" | sed -n 's|^gh-pr-\([0-9]*\)/.*|\1|p')
|
||
[ -z "$gh_pr_num" ] && continue
|
||
sqlite3 "$PIPELINE_DB" \
|
||
"UPDATE prs SET github_pr = $gh_pr_num, source_channel = 'github' WHERE number = $pr_num;"
|
||
log "self-heal: linked Forgejo PR #$pr_num → GitHub PR #$gh_pr_num"
|
||
done
|
||
```
|
||
|
||
~10 lines. No API call required (branch name carries the number deterministically).
|
||
No `.fork-pr-map` file, no `--script` flag, no manual deploy step.
|
||
|
||
**Why this approach beats the earlier `.fork-pr-map` proposal:**
|
||
- The map approach repaired the write path but left already-stuck rows orphaned
|
||
- A separate one-shot backfill script for the orphans is one more code path to maintain
|
||
- The sweep collapses both paths: it's the failure recovery AND the historical backfill
|
||
- Future races automatically self-heal without code changes
|
||
|
||
**Why this approach beats the earlier `?head=living-ip:` filter:**
|
||
- That filter never matched fork PRs at all — by design, fork PRs come from a
|
||
different owner. Pre-existing nit Ganymede flagged in multi-repo-mirror review
|
||
- The branch name encoding bypasses the GitHub API entirely for fork PRs
|
||
|
||
**Why no API verification before UPDATE:**
|
||
- Branch name is deterministic source — the only writer of `gh-pr-N/...` is
|
||
Step 2.1, which only writes after fetching the GitHub PR ref. If the branch
|
||
exists, the GitHub PR exists.
|
||
- Avoids API rate-limit pressure on every cron tick (every 2 min × 24h × 7d ≈ 5040
|
||
calls/week even when no work needed).
|
||
- Sanity-check via API is achievable cheaply if Ship wants it; current scope
|
||
matches "zero-cost when clean."
|
||
|
||
### Auto-fixer mode='append' for `gh-pr-*` branches
|
||
|
||
**Current behavior (`lib/fixer.py`):** Worktree-based fix → commit → push (regular
|
||
push, not force). When the PR was created by the pipeline (extract/* branches),
|
||
this works because the LLM-extractor's commit is at HEAD and we're appending on
|
||
top — push succeeds.
|
||
|
||
**External PR behavior (today):** Same code path runs. Fork PR has FwazB's commit
|
||
at HEAD; auto-fixer creates a fix commit on top; pushes via Forgejo's branch
|
||
ref (the fork PR was mirrored as `refs/heads/gh-pr-90/contributor/...` on
|
||
Forgejo). Push works. Eval reset fires. Eval re-runs.
|
||
|
||
Wait — re-reading the existing fixer, it's actually already append-only. Good.
|
||
The cherry-pick at merge time was the only thing rewriting SHAs. So **no fixer
|
||
change required** for Option 2. The fixer commit is already at HEAD~1 from
|
||
FwazB's commit on Forgejo. When we `git merge --no-ff` instead of cherry-pick,
|
||
the merge commit's parent chain includes BOTH the fix commit AND FwazB's
|
||
original commit. GitHub sees FwazB's SHA in ancestry → "merged" badge.
|
||
|
||
**Cross-check:** Verified PR 4066's existing branch state.
|
||
|
||
```
|
||
$ git log refs/heads/gh-pr-90/contributor/arcium-confidential-computing-challenge --oneline
|
||
d7916d65 auto-fix: strip 2 broken wiki links ← fixer's commit
|
||
f6a59d7d claim: confidential computing reshapes... ← FwazB's commit
|
||
```
|
||
|
||
Both commits already on Forgejo. When merge.py cherry-picked, it picked both
|
||
commits onto main as new SHAs. With merge --no-ff, both stay intact, the merge
|
||
commit references them, GitHub sees `f6a59d7d` (FwazB's HEAD on his fork) in
|
||
main's ancestry, marks merged.
|
||
|
||
**This means scope shrinks:** the design is purely a merge.py change + a
|
||
sync-mirror Step 2.1/4.5 plumbing tweak. No fixer module changes.
|
||
|
||
### Edge case: contributor rebases their fork after fixer appended
|
||
|
||
**Scenario:** FwazB's PR is in eval. Pipeline auto-fixer pushes a fix commit to
|
||
Forgejo `gh-pr-90/...`. FwazB notices the original wiki-link issue, fixes it
|
||
locally, force-pushes to his fork. sync-mirror's next cycle fetches his new
|
||
SHA → tries to update the Forgejo branch → push from sync-mirror is regular
|
||
(not --force) → fails because Forgejo branch has diverged from FwazB's fork.
|
||
|
||
**Today's behavior:** sync-mirror logs a warning. Forgejo branch keeps the
|
||
appended-fix state. Eval continues against that state. FwazB's most recent
|
||
fork state is silently ignored.
|
||
|
||
**Acceptable risk for hackathon.** The eval-reset-on-tip-change gate will
|
||
re-trigger eval if anyone force-pushes the Forgejo branch later (e.g., manual
|
||
re-sync). Documented; not fixing in this PR.
|
||
|
||
**Followup:** sync-mirror could detect the divergence, log a structured alert,
|
||
and post a comment on the GitHub PR ("we detected your force-push but our
|
||
appended fix has diverged; please rebase against `<sha>` or we'll close").
|
||
Out of scope for this branch.
|
||
|
||
## Test plan
|
||
|
||
| # | Scenario | Expected | Validation |
|
||
|---|---|---|---|
|
||
| 1 | External PR, clean (no auto-fix needed) | Merged with `--no-ff`, contributor SHA in main, GitHub badge `merged: true`, on_merged comment + close | curl GitHub API for PR state after merge |
|
||
| 2 | External PR with broken wiki links | auto-fixer appends commit, eval re-runs and approves, merge --no-ff brings BOTH commits in via merge commit, GitHub badge `merged: true` | log line trace + GitHub API |
|
||
| 3 | External PR rejected by eval (substantive issue) | terminate_pr fires existing path, on_closed posts rejection comment + closes GitHub PR | GitHub API after eval cycle |
|
||
| 4 | sync-mirror github_pr backfill on fork PR | `prs.github_pr` populated within one cron cycle (≤2 min) of mirror PR creation | sqlite3 SELECT after sync |
|
||
| 5 | Contributor force-pushes fork mid-eval | sync-mirror logs warning, eval continues against pre-rebase state (documented behavior, not regression) | journalctl |
|
||
| 6 | Re-running merge on already-merged PR | --no-ff fails cleanly (already up to date), caller handles as no-op | manual replay |
|
||
|
||
Test 1 and 2 are the critical-path tests. 3 verifies the rejection path didn't
|
||
regress. 4 isolates the github_pr backfill fix. 5 is acceptance criteria for
|
||
the documented edge case. 6 is idempotency.
|
||
|
||
**Production smoke test:** after deploy, manually create a tiny test PR from a
|
||
secondary GitHub account. Walk it through the full lifecycle. Tear down before
|
||
hackathon. Cost: 5 minutes, catches integration-level issues that unit tests
|
||
miss.
|
||
|
||
## Backout procedure
|
||
|
||
If Option 2 misbehaves in production, revert path is one config-line toggle:
|
||
|
||
```python
|
||
# lib/merge.py — gate on a feature flag for fast disable
|
||
if branch.startswith("gh-pr-") and config.EXTERNAL_PR_NO_FF_MERGE:
|
||
merge_fn = _merge_no_ff_external(branch)
|
||
elif ...
|
||
```
|
||
|
||
Default flag value: `True` after deploy. Set to `False` via env var on VPS to
|
||
fall back to cherry-pick path immediately if anything breaks. No code revert
|
||
required for a fast cutout.
|
||
|
||
Forgejo→GitHub merge commits already in main when the flag flips can't be
|
||
un-merged (they're real commits), but the failure mode is the same as today:
|
||
GitHub PR shows merged because the SHA is in history. No worse than current.
|
||
|
||
## Migration / cleanup
|
||
|
||
**FwazB's PR #90:** existing artifact, can't retroactively un-cherry-pick.
|
||
Manual close with explanatory comment (Ship's option b). Cory-approved.
|
||
|
||
```
|
||
We've merged your claim into the knowledge base via cherry-pick (commit f6a59d7d
|
||
on main). Future external PRs will use `git merge --no-ff` so the GitHub merge
|
||
badge fires correctly. This PR is being closed manually as the one historical
|
||
case before the fix lands. Thanks for the contribution!
|
||
|
||
— LivingIP pipeline
|
||
```
|
||
|
||
Posted via `_post_comment` + `_close_github_pr` from a one-off script
|
||
(`scripts/close-fwazb-pr-90.py`).
|
||
|
||
## Implementation order
|
||
|
||
### Phase 1 — self-healing sweep in sync-mirror.sh (Bug #2, ~10 lines)
|
||
|
||
**Single block in `deploy/sync-mirror.sh`:**
|
||
|
||
- Inserted as **Step 0** — runs after the initial Forgejo+GitHub fetch but
|
||
**before** the branch-mirror loop and the auto-create-PR block (line ~250)
|
||
- SELECT `prs` rows where `branch LIKE 'gh-pr-%' AND github_pr IS NULL`
|
||
- Parse PR number from branch name (regex on `gh-pr-{N}/...`)
|
||
- UPDATE `github_pr` and `source_channel='github'`
|
||
- Audit-friendly log line per healed row
|
||
|
||
**Properties:**
|
||
- Idempotent (SELECT empty when clean)
|
||
- Zero-cost path when no rows match
|
||
- No API calls (branch name is the source of truth)
|
||
- Self-healing: same SELECT/UPDATE pattern recovers from race AND backfills
|
||
historical orphans (PR 4066 picked up on first cron tick post-deploy)
|
||
- Same-cycle convergence: a freshly-mirrored PR whose Step 4.5 link UPDATE
|
||
fails gets healed on the next cron tick's sweep, not delayed multiple cycles
|
||
|
||
**Branch:** `epimetheus/external-merge-flow-bug2` (revision in progress on
|
||
existing `epimetheus/external-merge-flow-design`).
|
||
|
||
**Deploys independently** of Phase 2 — no dependency on Bug #1 fix. Once Phase 1
|
||
is live, comments + close fire correctly even on cherry-pick-merged PRs (the
|
||
half-fix state). Phase 2 then layers in the "merged" badge.
|
||
|
||
**Smoke test (cost: 1 cron tick = 2 min):** after deploy, verify FwazB's
|
||
`prs.github_pr` populates from NULL → 90, then `_get_github_pr` resolves on
|
||
next merge action.
|
||
|
||
### Phase 2 — merge.py --no-ff for gh-pr-* (Bug #1, ~120 lines)
|
||
|
||
- Add `_merge_no_ff_external` function (architecture unchanged from v1)
|
||
- Add branch-prefix dispatch case in `_merge_domain_queue`
|
||
- Add config flag `EXTERNAL_PR_NO_FF_MERGE` for backout
|
||
- Branch: `epimetheus/external-merge-flow-bug1`
|
||
- Depends on Phase 1 being live (otherwise `on_merged` still no-ops on fork PRs)
|
||
- Smoke: end-to-end test PR
|
||
|
||
### Phase 3 — FwazB cleanup (~10 lines)
|
||
|
||
- Manual one-off script for PR #90 (option b: explanatory comment + close)
|
||
- Independent of Phase 1/2 deploy
|
||
- Can run any time after Phase 1 lands (Phase 1 populates github_pr=90 on PR 4066,
|
||
enabling `_get_github_pr` to resolve for the comment script)
|
||
|
||
**Two phases, separately reviewable, separately deployable.** Phase 1 alone gives
|
||
us contributor comments on closed-via-cherry-pick PRs (already partially solves
|
||
the UX). Phase 2 adds the "merged" badge. Phase 3 is post-Phase-1 cleanup.
|
||
|
||
## Open questions for Ship
|
||
|
||
**Resolved in Msg 2:**
|
||
- ~~`.fork-pr-map` file location~~ → moot, sweep replaces map approach
|
||
- ~~Phase 1 vs Phase 2 sequencing~~ → separate deploys, sweep-only Phase 1 confirmed
|
||
- ~~One-shot backfill script~~ → dropped, sweep IS the backfill
|
||
- ~~Sweep placement~~ → confirmed: first action after initial fetch, before
|
||
branch-mirror loop AND auto-create-PR block
|
||
|
||
**Still open (Phase 2 architecture):**
|
||
|
||
1. **Backout flag necessary?** I'm including `EXTERNAL_PR_NO_FF_MERGE` config
|
||
flag for fast cutout. Overkill given low-traffic external-PR path? Or
|
||
prudent given Accelerate Solana pressure?
|
||
|
||
2. **Test plan #5 (rebase-after-fix) scope.** Documenting the silent-ignore as
|
||
acceptable for now. Sufficient for hackathon, or do you want me to add the
|
||
structured-alert path before May 5?
|
||
|
||
3. **`merge --no-ff` commit message format.** I have `"Merge PR: {branch}"`.
|
||
Better signal to derive the GitHub PR number from the branch and write
|
||
`"Merge external GitHub PR #{N}: {claim_title_or_branch_slug}"`? Trades
|
||
verbosity for searchability in `git log --merges`.
|
||
|
||
Sending revised doc directly via Pentagon. Ganymede gets line-level review of
|
||
the Phase 1 code once Ship signs off on the revised design.
|