fix(backfill): don't regress terminal source statuses to unprocessed

backfill-sources.py runs every 15 minutes and derives sources.status
purely from directory location. If a source file is in inbox/queue/,
it blindly overwrites the DB status to 'unprocessed' — even when the
DB already had 'extracted' or 'null_result'.

This is why the 43 zombies kept coming back after manual backfill:
cron re-reset them every 15 minutes, then each 4h cooldown expiry
re-triggered runaway extraction on the same source.

Fix: never regress from a terminal status (extracted, null_result,
error, ghost_no_file) to 'unprocessed'. File location is ambiguous
(legitimately new vs. zombie from failed archive); DB is authoritative.
Legitimate re-extraction still works — it goes through the needs_reextraction
path which is unaffected by this gate.
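
The gate described above can be sketched as a small standalone function. This is a minimal illustration, not the code in backfill-sources.py: the helper names `should_update_status` and `apply_backfill`, the reduced `sources` schema, the example paths, and the non-terminal status `processing` are all assumptions made for the demo. Only the four terminal statuses and the SQL statements come from the commit.

```python
import sqlite3

# Statuses the commit message names as terminal.
TERMINAL_STATUSES = frozenset({"extracted", "null_result", "error", "ghost_no_file"})


def should_update_status(current_status, derived_status):
    """Return True if the directory-derived status may overwrite the DB status."""
    if current_status == derived_status:
        return False  # nothing to change
    if current_status in TERMINAL_STATUSES and derived_status == "unprocessed":
        return False  # DB wins: never regress terminal -> unprocessed
    return True


def apply_backfill(conn, rel_path, derived_status):
    """Apply the gate to one row (hypothetical helper); True if the row changed."""
    row = conn.execute(
        "SELECT status FROM sources WHERE path = ?", (rel_path,)
    ).fetchone()
    if row is None or not should_update_status(row[0], derived_status):
        return False
    conn.execute(
        "UPDATE sources SET status = ?, updated_at = datetime('now') WHERE path = ?",
        (derived_status, rel_path),
    )
    return True


def demo():
    """Run the gate against an in-memory table with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sources (path TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO sources (path, status) VALUES (?, ?)",
        [
            ("inbox/queue/a.md", "extracted"),   # zombie: terminal in the DB
            ("inbox/queue/b.md", "processing"),  # hypothetical non-terminal status
        ],
    )
    results = {
        "a.md": apply_backfill(conn, "inbox/queue/a.md", "unprocessed"),
        "b.md": apply_backfill(conn, "inbox/queue/b.md", "unprocessed"),
    }
    statuses = dict(conn.execute("SELECT path, status FROM sources"))
    conn.close()
    return results, statuses
```

In this sketch the zombie row keeps its 'extracted' status, while the row with a non-terminal status is still reset. Keeping the decision in a pure function makes the rule easy to unit-test without a database.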

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
m3taversal 2026-04-22 21:29:33 +01:00
parent 97b590acd6
commit a053a8ebf9

@@ -104,14 +104,22 @@ def main():
         claims_count = 0
         if rel_path in existing:
-            # Update status if different
+            # Update status if different — but never regress from terminal states.
+            # If DB says 'extracted' or 'null_result' and file happens to be in queue/
+            # (e.g., failed archive push, zombie file), the DB is authoritative.
+            # Downgrading to 'unprocessed' triggers the runaway re-extraction loop.
             current = conn.execute("SELECT status FROM sources WHERE path = ?", (rel_path,)).fetchone()
+            TERMINAL_STATUSES = {"extracted", "null_result", "error", "ghost_no_file"}
             if current and current["status"] != status:
-                conn.execute(
-                    "UPDATE sources SET status = ?, updated_at = datetime('now') WHERE path = ?",
-                    (status, rel_path),
-                )
-                updated += 1
+                if current["status"] in TERMINAL_STATUSES and status == "unprocessed":
+                    # Don't regress terminal → unprocessed. DB wins.
+                    pass
+                else:
+                    conn.execute(
+                        "UPDATE sources SET status = ?, updated_at = datetime('now') WHERE path = ?",
+                        (status, rel_path),
+                    )
+                    updated += 1
         else:
             conn.execute(
                 """INSERT INTO sources (path, status, priority, claims_count, created_at, updated_at)