extract: document basename-uniqueness invariant + skip _-prefixed archive files

Two nits from Ganymede review of ed4af4d:

1. Archive-basename filter depends on basename-uniqueness across
   queue+archive. Current naming (date-prefix + topic-slug) makes
   collisions rare, but if short generic names like "notes.md" enter
   the queue, the filter silently false-positives. Comment block names
   the assumption.

2. Archive walk now skips _-prefixed files, matching the standing
   convention everywhere else (search.py STRUCTURAL_FILES, reweave
   wiki-link skip, Layer 0 entity exclusion). Defensive: no _*.md
   exists under inbox/archive/ today, but consistent with codebase
   convention if a future operator drops _README.md to document the
   directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:09:19 +01:00
commit 923454c9ea
parent ed4af4d72e
1 changed file with 7 additions and 0 deletions
--- a/lib/extract.py
+++ b/lib/extract.py
@@ -930,11 +930,18 @@ async def extract_cycle(conn, max_workers=None) -> tuple[int, int]:
    # copy is the source of truth — if a file with this basename is in archive,
    # the source is processed regardless of queue state. Single archive scan
    # per cycle, cheap (~1k files).
+    #
+    # Assumes basename uniqueness across queue+archive — current naming
+    # convention (date-prefix + topic-slug) makes collisions vanishingly
+    # rare. If short generic names like "notes.md" enter the queue, this
+    # filter silently false-positives.
    if unprocessed:
        archive_dir = main / "inbox" / "archive"
        archived_basenames: set[str] = set()
        if archive_dir.exists():
            for af in archive_dir.rglob("*.md"):
+                if af.name.startswith("_"):
+                    continue
                archived_basenames.add(af.name)
        if archived_basenames:
            before = len(unprocessed)
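
The two behaviors described in the commit message can be reproduced in isolation. The sketch below is illustrative only and is not code from lib/extract.py: the helper names (`collect_archived_basenames`, `drop_archived`, `demo`), the `inbox/queue` directory, and the sample file names are all hypothetical. It assumes the filter simply drops any queue entry whose basename appears in the archive set, which is what the hunk's comment block suggests.

```python
import tempfile
from pathlib import Path


def collect_archived_basenames(archive_dir: Path) -> set[str]:
    """Collect basenames of *.md files under archive_dir, skipping _-prefixed files."""
    basenames: set[str] = set()
    if archive_dir.exists():
        for af in archive_dir.rglob("*.md"):
            # Structural files such as _README.md are not archived sources.
            if af.name.startswith("_"):
                continue
            basenames.add(af.name)
    return basenames


def drop_archived(queue: list[Path], archived: set[str]) -> list[Path]:
    """Keep only queue entries whose basename is absent from the archive set."""
    return [p for p in queue if p.name not in archived]


def demo() -> tuple[set[str], list[str]]:
    """Build a throwaway tree and run the filter against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        archive = root / "inbox" / "archive"
        queue = root / "inbox" / "queue"  # hypothetical queue location
        (archive / "topic-a").mkdir(parents=True)
        queue.mkdir(parents=True)

        # Archived sources, one of them with a short generic name.
        (archive / "topic-a" / "2026-04-01-foo.md").write_text("done")
        (archive / "topic-a" / "notes.md").write_text("old notes")
        # Directory documentation: must not count as an archived source.
        (archive / "_README.md").write_text("docs")

        # Queue: one fresh source, plus an unrelated file also named notes.md.
        pending = [queue / "2026-04-29-bar.md", queue / "notes.md"]
        for p in pending:
            p.write_text("new")

        archived = collect_archived_basenames(archive)
        remaining = drop_archived(pending, archived)
        return archived, [p.name for p in remaining]
```

In this sketch the queued `notes.md` is dropped even though it is a different document from the archived one, which is the silent false positive the new comment block warns about. `_README.md` never enters the basename set, so it cannot mask a queue entry.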