extract: document basename-uniqueness invariant + skip _-prefixed archive files

Two nits from Ganymede review of ed4af4d:

1. Archive-basename filter depends on basename uniqueness across
   queue+archive. Current naming (date-prefix + topic-slug) makes collisions
   rare, but if short generic names like "notes.md" enter the queue, the
   filter silently produces false positives: an unprocessed queue file is
   treated as already archived and dropped. The new comment block states
   the assumption.

2. Archive walk now skips _-prefixed files, matching the standing convention
   everywhere else (search.py STRUCTURAL_FILES, reweave wiki-link skip, Layer
   0 entity exclusion). Defensive: no _*.md file exists under inbox/archive/
   today, but the skip stays consistent with codebase convention if a future
   operator drops in a _README.md to document the directory.
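
For illustration, a minimal standalone sketch of both points, not the code
in extract_cycle itself. The helper names (archived_basenames,
filter_unprocessed, demo), the inbox/queue directory, and all file names are
hypothetical; only the walk (rglob("*.md"), skip on a leading "_", collect
af.name) mirrors the diff.

```python
import tempfile
from pathlib import Path


def archived_basenames(archive_dir: Path) -> set[str]:
    """Collect basenames of archived .md files, skipping _-prefixed ones."""
    names: set[str] = set()
    if archive_dir.exists():
        for af in archive_dir.rglob("*.md"):
            if af.name.startswith("_"):
                continue  # structural file (e.g. _README.md), not a source
            names.add(af.name)
    return names


def filter_unprocessed(queue_files: list[Path], archive_dir: Path) -> list[Path]:
    """Drop queue files whose basename already appears in the archive."""
    archived = archived_basenames(archive_dir)
    return [f for f in queue_files if f.name not in archived]


def demo() -> list[str]:
    """Build a throwaway tree and return the basenames that survive the filter."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        queue = root / "inbox" / "queue"      # hypothetical queue location
        archive = root / "inbox" / "archive"
        (archive / "2026-04").mkdir(parents=True)
        queue.mkdir(parents=True)

        # Archive: one processed source, one unrelated generic note, one README.
        (archive / "2026-04" / "2026-04-01-topic-a.md").write_text("processed")
        (archive / "2026-04" / "notes.md").write_text("old, unrelated notes")
        (archive / "_README.md").write_text("describes the directory")

        # Queue: a duplicate of the processed source, a new source, a new
        # file that happens to be called notes.md, and a _-prefixed file.
        for name in ("2026-04-01-topic-a.md", "2026-04-30-topic-b.md",
                     "notes.md", "_README.md"):
            (queue / name).write_text("queued: " + name)

        kept = filter_unprocessed(sorted(queue.glob("*.md")), archive)
        return sorted(f.name for f in kept)
```

In this sketch the queued notes.md is dropped even though its content is
new, which is the silent false positive from item 1. The queued _README.md
is kept because the archive walk never records _-prefixed basenames.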

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
m3taversal 2026-04-30 11:09:19 +01:00
parent ed4af4d72e
commit 923454c9ea

@@ -930,11 +930,18 @@ async def extract_cycle(conn, max_workers=None) -> tuple[int, int]:
    # copy is the source of truth — if a file with this basename is in archive,
    # the source is processed regardless of queue state. Single archive scan
    # per cycle, cheap (~1k files).
    #
    # Assumes basename uniqueness across queue+archive — current naming
    # convention (date-prefix + topic-slug) makes collisions vanishingly
    # rare. If short generic names like "notes.md" enter the queue, this
    # filter silently false-positives.
    if unprocessed:
        archive_dir = main / "inbox" / "archive"
        archived_basenames: set[str] = set()
        if archive_dir.exists():
            for af in archive_dir.rglob("*.md"):
                if af.name.startswith("_"):
                    continue
                archived_basenames.add(af.name)
        if archived_basenames:
            before = len(unprocessed)