Compare commits


2 commits

b57cb41e31 fix(mirror): Step 0b self-heals stuck submitted_by via cron retry
Step 4.5's submitted_by write inherits the same one-shot failure mode
that Phase 1's Step 0 sweep was designed to retire. On transient
GitHub API failure, link-only fallback writes github_pr and permanently
closes the retry window — submitted_by stays stuck on 'm3taversal'
(bot identity), and contribution_events never gets the author event.

Step 0b adds a second sweep alongside Step 0:
  SELECT ... WHERE github_pr IS NOT NULL
              AND (submitted_by IS NULL OR submitted_by = 'm3taversal')

Same idempotent cron-retry shape: SELECT empty when clean, per-row
GitHub API call + UPDATE only when stuck. Targets bidirectional
repos only (gh-pr-* branches don't exist for main_only mirrors).

Derives bidirectional GitHub repo from MIRROR_REPOS at sweep time
since Step 0 runs before sync_repo() sets GITHUB_REPO scope.

Doubles as future-proof safety net: any external PR landing during
the deploy window with submitted_by stuck on bot identity gets
self-healed on the next cron tick. No backfill script needed.
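A minimal Python sketch of the sweep shape described above, for reference (assumptions: the `prs` columns named in this message; `fetch_github_login` stands in for the per-row curl GET /pulls/{N} call — it is not a real helper in this repo):

```python
import re
import sqlite3

# GitHub username spec (same regex as Step 4.5): 1-39 chars, alnum + hyphen,
# can't start or end with a hyphen.
USERNAME_RE = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$")

def sweep_stuck_submitted_by(db_path, fetch_github_login):
    """Idempotent Step 0b sweep: SELECT is empty when clean; per-row
    API call + UPDATE only when a row is stuck on bot identity."""
    conn = sqlite3.connect(db_path)
    stuck = conn.execute(
        "SELECT number, branch FROM prs "
        "WHERE branch LIKE 'gh-pr-%' AND github_pr IS NOT NULL "
        "AND (submitted_by IS NULL OR submitted_by = 'm3taversal')"
    ).fetchall()
    healed = 0
    for pr_num, branch in stuck:
        m = re.match(r"^gh-pr-(\d+)/", branch)
        if not m:
            continue
        # Caller-supplied lookup: GET /pulls/{N} -> user.login (or None).
        login = fetch_github_login(int(m.group(1)))
        if login and USERNAME_RE.match(login):
            conn.execute(
                "UPDATE prs SET submitted_by = ? WHERE number = ?",
                (login.lower(), pr_num),
            )
            healed += 1
    conn.commit()
    conn.close()
    return healed
```

Re-running the sweep on a clean table is a no-op, which is what makes the cron-retry shape safe.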

Per Ganymede line-level review of 1decf09.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:21:04 +01:00
1decf09598 fix(mirror): Step 4.5 writes submitted_by from GitHub user.login
Closes the systematic external-PR attribution gap diagnosed on FwazB PR #4066.
The Forgejo PR was being created via admin-token (m3taversal), so
prs.submitted_by ended up as the bot identity. record_contributor_attribution
treats m3taversal as a bot and finds no other signal for fork PRs (commit
authors are bot-rewritten by sync-mirror), so zero author events emit.

Mechanism — for gh-pr-* branches in the auto-create path:
1. After deriving GH_PR_NUM from branch name, GET /pulls/{N} from GitHub API
2. Extract user.login (e.g. "FwazB"), the PR's author, from the API response
3. Validate against GitHub username regex: ^[a-zA-Z0-9]([a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$
4. Lowercase to match contributor.py canonical handle storage
5. Include submitted_by in the same UPDATE that sets github_pr + source_channel

The regex doubles as the SQL-injection safety boundary (no parametric binding
in bash sqlite3). 14 cases tested locally, including SQL-injection probes:
all probes rejected, all real handles pass.
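The validate + normalize steps (3-4 above) can be sketched in Python; the regex is quoted verbatim from this message, while `canonical_handle` is a name chosen here for illustration:

```python
import re

# GitHub username spec: 1-39 chars, alphanumeric + hyphen, can't start or
# end with a hyphen. The character class excludes quotes, semicolons, and
# backslashes, which is what makes a single-quoted SQL literal safe without
# parametric binding.
GITHUB_LOGIN_RE = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$")

def canonical_handle(login: str):
    """Return the lowercased canonical handle, or None if rejected."""
    if GITHUB_LOGIN_RE.match(login):
        return login.lower()
    return None
```

Anything the regex rejects never reaches the UPDATE, so injection probes and malformed logins fall through to the link-only path.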

Failure modes:
- API call fails or returns no user.login → fall back to link-only UPDATE
  (existing behavior). Better than failing the whole step.
- Regex rejects malformed login → same fall-back. Preserves audit trail.

Smoke-tested against real FwazB PR #90 on VPS: extracts "FwazB", lowercases
to "fwazb", regex passes, would write submitted_by='fwazb'. Once deployed,
record_contributor_attribution will trust submitted_by as fallback when no
agent trailer found, emit author event with weight 0.30 → FwazB-class
contributors auto-attributed end-to-end with zero manual backfill.
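The fallback decision described above could look like the following (illustrative only: `BOT_AUTHORS`, the trailer weight of 1.0, and the function shape are inferred from this message, not the real record_contributor_attribution API; only the 0.30 fallback weight is stated in the source):

```python
BOT_AUTHORS = {"m3taversal"}
AUTHOR_FALLBACK_WEIGHT = 0.30

def resolve_author(agent_trailer, submitted_by):
    """Prefer an agent trailer; fall back to submitted_by unless it
    is a bot identity, in which case no author event emits."""
    if agent_trailer:
        return agent_trailer, 1.0
    if submitted_by and submitted_by not in BOT_AUTHORS:
        return submitted_by, AUTHOR_FALLBACK_WEIGHT
    return None, 0.0
```

With submitted_by correctly populated, external PRs hit the second branch instead of returning nothing.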

Architectural decisions per Ship's Apr 28 sign-off:
- (a) Sweep stays zero-API: no per-row API calls in the Step 0 self-heal.
  This Step 4.5 fix is create-time only. Existing rows (FwazB) were already
  manually backfilled, so no further backfill is needed.
- (b) Skip _BOT_AUTHORS exception (#2 in original ticket): once submitted_by
  is correct, the bot filter doesn't fire on external PRs anymore.
- (c) Defer frontmatter rewriting: convenience, not load-bearing.
2026-04-28 18:15:13 +01:00
2 changed files with 80 additions and 80 deletions


@@ -204,41 +204,7 @@ sync_github_to_forgejo_with_prs() {
local FORGEJO_TOKEN
FORGEJO_TOKEN=$(cat /opt/teleo-eval/secrets/forgejo-admin-token 2>/dev/null)
# Lazy schema for sync-mirror's auto-create tracker. Records (branch, sha)
# pairs we've already auto-created PRs for, so the loop below can skip
# redundant creates after pipeline merge → _delete_remote_branch →
# GitHub-only re-discovery → re-push. Cheap CREATE IF NOT EXISTS on each
# cycle; no migration needed because this table is private to sync-mirror.
sqlite3 "$PIPELINE_DB" "CREATE TABLE IF NOT EXISTS sync_autocreate_tracker (branch TEXT NOT NULL, sha TEXT NOT NULL, pr_number INTEGER, created_at TEXT DEFAULT (datetime('now')), PRIMARY KEY (branch, sha));" 2>/dev/null || true
for branch in $GITHUB_ONLY; do
# Already-tracked gate: if we've previously auto-created a PR for
# this exact (branch, sha), skip the entire push+create sequence.
# Closes the empty-PR loop (research and reweave both observed):
# pipeline merges PR → _delete_remote_branch on Forgejo → next sync
# sees branch GitHub-only (origin still has it) → re-pushes to
# Forgejo → HAS_PR misses (Forgejo ?head= broken; closed PRs scroll
# past 50-item paginated window) → auto-creates fresh PR → pipeline
# merges (empty no-op via cherry-pick / reweave union) → repeat.
# Tracker keys on SHA, so legitimate new commits on the same branch
# produce a new SHA → tracker miss → auto-create proceeds normally.
local BRANCH_SHA TRACKED_PR
if [[ "$branch" == gh-pr-* ]]; then
BRANCH_SHA=$(git rev-parse "refs/heads/$branch" 2>/dev/null || true)
else
BRANCH_SHA=$(git rev-parse "refs/remotes/origin/$branch" 2>/dev/null || true)
fi
if [ -n "$BRANCH_SHA" ]; then
# stderr → $LOG so sustained sqlite3 contention surfaces in ops logs
# rather than silently falling through to a redundant auto-create.
TRACKED_PR=$(sqlite3 "$PIPELINE_DB" "SELECT pr_number FROM sync_autocreate_tracker WHERE branch=$(printf "'%s'" "${branch//\'/\'\'}") AND sha=$(printf "'%s'" "$BRANCH_SHA") LIMIT 1;" 2>>"$LOG" || echo "")
if [ -n "$TRACKED_PR" ]; then
log "Skip auto-create: $branch SHA $BRANCH_SHA already tracked (PR #$TRACKED_PR)"
continue
fi
fi
log "New from GitHub: $branch -> Forgejo"
# Fork PR branches live as local refs (from Step 2.1), not on origin remote
if [[ "$branch" == gh-pr-* ]]; then
@@ -309,21 +275,26 @@ print('no')
fi
log "Auto-created PR #$PR_NUM on Forgejo for $branch"
# Record (branch, sha, pr_number) so the tracker gate above can short-
# circuit the next time we see this exact (branch, sha) combination.
# INSERT OR IGNORE: idempotent if a concurrent run already inserted.
# WARN log on failure: silent INSERT failure under sustained sqlite3
# contention would mask the loop reappearing on the next cycle (HAS_PR
# only saves us while the closed PR is in the 50-item pagination window).
if [ -n "$BRANCH_SHA" ] && [[ "$PR_NUM" =~ ^[0-9]+$ ]]; then
if ! sqlite3 "$PIPELINE_DB" "INSERT OR IGNORE INTO sync_autocreate_tracker (branch, sha, pr_number) VALUES ($(printf "'%s'" "${branch//\'/\'\'}"), $(printf "'%s'" "$BRANCH_SHA"), $PR_NUM);" 2>>"$LOG"; then
log "WARN: tracker insert failed for $branch SHA $BRANCH_SHA (PR #$PR_NUM) — duplicate auto-create possible next cycle"
fi
fi
# Step 4.5: Link GitHub PR to Forgejo PR in pipeline DB
# Step 4.5: Link GitHub PR to Forgejo PR + populate submitted_by from GitHub user.login.
#
# Why submitted_by here: sync-mirror creates the Forgejo PR using the admin
# token (user=m3taversal), so the Forgejo PR's submitter is bot-identity, not
# the GitHub author (FwazB, etc). Without this fix, contribution_events emits
# zero author events for external PRs because record_contributor_attribution
# treats m3taversal as a bot and finds no other signal. Architectural fix per
# Ship + Cory after the FwazB PR #4066 attribution gap (Apr 28).
local GH_USER
GH_USER=""
if [[ "$branch" == gh-pr-* ]]; then
GH_PR_NUM=$(echo "$branch" | sed 's|gh-pr-\([0-9]*\)/.*|\1|')
# Look up GitHub PR author for submitted_by attribution.
local PAT_GHU
PAT_GHU=$(cat "$GITHUB_PAT_FILE" 2>/dev/null | tr -d '[:space:]')
if [ -n "$PAT_GHU" ] && [[ "$GH_PR_NUM" =~ ^[0-9]+$ ]]; then
GH_USER=$(curl -sf "https://api.github.com/repos/$GITHUB_REPO/pulls/$GH_PR_NUM" \
-H "Authorization: token $PAT_GHU" 2>/dev/null | \
python3 -c "import sys,json; print((json.load(sys.stdin).get('user') or {}).get('login',''))" 2>/dev/null || true)
fi
else
local PAT
PAT=$(cat "$GITHUB_PAT_FILE" 2>/dev/null | tr -d '[:space:]')
@@ -335,9 +306,24 @@ print('no')
fi
fi
if [[ "$GH_PR_NUM" =~ ^[0-9]+$ ]] && [[ "$PR_NUM" =~ ^[0-9]+$ ]]; then
sqlite3 "$PIPELINE_DB" "UPDATE prs SET github_pr = $GH_PR_NUM, source_channel = 'github' WHERE number = $PR_NUM;" 2>/dev/null && \
log "Linked GitHub PR #$GH_PR_NUM -> Forgejo PR #$PR_NUM" || \
# GitHub usernames: 1-39 chars, alphanumeric + hyphen, can't start/end with hyphen.
# Strict regex is the SQL-injection safety boundary (no bash sqlite3 parametric binding).
if [[ "$GH_USER" =~ ^[a-zA-Z0-9]([a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$ ]]; then
# Lowercase for canonical handle storage (matches contributor.py normalize)
local GH_USER_LC
GH_USER_LC=$(echo "$GH_USER" | tr '[:upper:]' '[:lower:]')
sqlite3 "$PIPELINE_DB" "UPDATE prs SET github_pr = $GH_PR_NUM, source_channel = 'github', submitted_by = '$GH_USER_LC' WHERE number = $PR_NUM;" 2>/dev/null && \
log "Linked GitHub PR #$GH_PR_NUM -> Forgejo PR #$PR_NUM (author: $GH_USER_LC)" || \
log "WARN: Failed to link GitHub PR #$GH_PR_NUM to Forgejo PR #$PR_NUM in DB"
else
# Fall back to link-only when user.login lookup failed or was malformed.
# Still better than no link; record_contributor_attribution can use
# other signals (commit author trailer, etc) — author event may still
# not emit, but submitted_by stays NULL/whatever discovery set.
sqlite3 "$PIPELINE_DB" "UPDATE prs SET github_pr = $GH_PR_NUM, source_channel = 'github' WHERE number = $PR_NUM;" 2>/dev/null && \
log "Linked GitHub PR #$GH_PR_NUM -> Forgejo PR #$PR_NUM (author lookup failed)" || \
log "WARN: Failed to link GitHub PR #$GH_PR_NUM to Forgejo PR #$PR_NUM in DB"
fi
fi
done
}
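The (branch, sha) tracker gate in the hunk above reduces to the following shape — a Python sketch under the schema shown, not the actual bash implementation:

```python
import sqlite3

# Mirrors the lazy CREATE IF NOT EXISTS in sync-mirror: table is private
# to the sync, keyed on (branch, sha), so no migration is needed.
SCHEMA = """CREATE TABLE IF NOT EXISTS sync_autocreate_tracker (
    branch TEXT NOT NULL, sha TEXT NOT NULL, pr_number INTEGER,
    created_at TEXT DEFAULT (datetime('now')), PRIMARY KEY (branch, sha))"""

def should_autocreate(conn, branch, sha):
    """Tracker hit means a PR was already auto-created for this exact
    (branch, sha); a new commit changes the SHA and misses the tracker."""
    row = conn.execute(
        "SELECT pr_number FROM sync_autocreate_tracker WHERE branch = ? AND sha = ?",
        (branch, sha),
    ).fetchone()
    return row is None

def record_autocreate(conn, branch, sha, pr_number):
    # INSERT OR IGNORE keeps this idempotent if a concurrent run already inserted.
    conn.execute(
        "INSERT OR IGNORE INTO sync_autocreate_tracker (branch, sha, pr_number) "
        "VALUES (?, ?, ?)",
        (branch, sha, pr_number),
    )
```

SHA keying is the point: a merged-then-rediscovered branch re-presents the same SHA and is skipped, while a genuinely new commit produces a new SHA and auto-create proceeds.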
@@ -439,6 +425,50 @@ if [ -f "$PIPELINE_DB" ]; then
log "self-heal: linked Forgejo PR #$pr_num -> GitHub PR #$gh_pr_num"
fi
done
# Step 0b: self-heal stuck submitted_by on linked gh-pr-* PRs.
# Step 4.5 fall-through writes github_pr but leaves submitted_by as 'm3taversal'
# (bot identity) on transient GitHub API failures. Without retry, contribution_events
# never gets the author event — same one-shot failure mode Step 0 was designed to
# retire. Same idempotent cron-retry shape: SELECT empty when clean, per-row API
# call only when stuck. Targets bidirectional repos only (gh-pr-* branches don't
# exist for main_only mirrors).
GH_REPO_BIDIR=""
for entry in "${MIRROR_REPOS[@]}"; do
read -r _f gh_repo _bare mode <<< "$entry"
if [ "$mode" = "bidirectional" ]; then
GH_REPO_BIDIR="$gh_repo"
break
fi
done
if [ -n "$GH_REPO_BIDIR" ] && [ -f "$GITHUB_PAT_FILE" ]; then
sqlite3 -separator '|' "$PIPELINE_DB" \
"SELECT number, branch FROM prs
WHERE branch LIKE 'gh-pr-%'
AND github_pr IS NOT NULL
AND (submitted_by IS NULL OR submitted_by = 'm3taversal');" \
2>/dev/null | while IFS='|' read -r pr_num branch; do
gh_pr_num=$(echo "$branch" | sed -n 's|^gh-pr-\([0-9][0-9]*\)/.*|\1|p')
[ -z "$gh_pr_num" ] && continue
PAT_S0=$(cat "$GITHUB_PAT_FILE" 2>/dev/null | tr -d '[:space:]')
[ -z "$PAT_S0" ] && continue
gh_user=$(curl -sf "https://api.github.com/repos/$GH_REPO_BIDIR/pulls/$gh_pr_num" \
-H "Authorization: token $PAT_S0" 2>/dev/null | \
python3 -c "import sys,json; print((json.load(sys.stdin).get('user') or {}).get('login',''))" 2>/dev/null || true)
# Regex matches Step 4.5: GitHub username spec (anchored, alnum + hyphen,
# no consecutive-hyphen check). SQL-injection boundary: char class excludes
# quotes/semicolons/backslashes, so single-quoted literal is safe.
if [[ "$gh_user" =~ ^[a-zA-Z0-9]([a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$ ]]; then
gh_user_lc=$(echo "$gh_user" | tr '[:upper:]' '[:lower:]')
# SQL-integer-safe: pr_num from INTEGER column, gh_user_lc regex-validated.
if sqlite3 "$PIPELINE_DB" \
"UPDATE prs SET submitted_by = '$gh_user_lc' WHERE number = $pr_num;" \
2>/dev/null; then
log "self-heal: set submitted_by=$gh_user_lc on Forgejo PR #$pr_num (GitHub PR #$gh_pr_num)"
fi
fi
done
fi
fi
for entry in "${MIRROR_REPOS[@]}"; do


@@ -923,36 +923,6 @@ async def extract_cycle(conn, max_workers=None) -> tuple[int, int]:
except Exception:
logger.debug("Failed to read source %s", f, exc_info=True)
# Archive-basename filter: skip queue files whose basename already exists in
# inbox/archive/. Research-session commits on agent branches occasionally
# re-introduce already-archived queue files when the branch is re-merged,
# producing same-source re-extractions every cooldown cycle. The archive
# copy is the source of truth — if a file with this basename is in archive,
# the source is processed regardless of queue state. Single archive scan
# per cycle, cheap (~1k files).
#
# Assumes basename uniqueness across queue+archive — current naming
# convention (date-prefix + topic-slug) makes collisions vanishingly
# rare. If short generic names like "notes.md" enter the queue, this
# filter silently false-positives.
if unprocessed:
archive_dir = main / "inbox" / "archive"
archived_basenames: set[str] = set()
if archive_dir.exists():
for af in archive_dir.rglob("*.md"):
if af.name.startswith("_"):
continue
archived_basenames.add(af.name)
if archived_basenames:
before = len(unprocessed)
unprocessed = [
(sp, c, f) for sp, c, f in unprocessed
if Path(sp).name not in archived_basenames
]
skipped = before - len(unprocessed)
if skipped:
logger.info("Skipped %d queue source(s) — basename already in inbox/archive/", skipped)
# Don't early-return here — re-extraction sources may exist even when queue is empty
# (the re-extraction check runs after open-PR filtering below)