extract: document basename-uniqueness invariant + skip _-prefixed archive files
Two nits from Ganymede review of ed4af4d:
1. Archive-basename filter depends on basename uniqueness across queue+archive.
Current naming (date-prefix + topic-slug) makes collisions rare, but if
short generic names like "notes.md" enter the queue, the filter silently
treats an unprocessed file as already archived (a false positive). The
comment block now states the assumption.
2. Archive walk now skips _-prefixed files, matching the standing convention
everywhere else (search.py STRUCTURAL_FILES, reweave wiki-link skip, Layer
0 entity exclusion). This is defensive: no _*.md exists under inbox/archive/
today, but the skip stays consistent with codebase convention if a future
operator drops a _README.md there to document the directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent ed4af4d72e
commit 923454c9ea
1 changed files with 7 additions and 0 deletions
@@ -930,11 +930,18 @@ async def extract_cycle(conn, max_workers=None) -> tuple[int, int]:
     # copy is the source of truth — if a file with this basename is in archive,
     # the source is processed regardless of queue state. Single archive scan
     # per cycle, cheap (~1k files).
+    #
+    # Assumes basename uniqueness across queue+archive — current naming
+    # convention (date-prefix + topic-slug) makes collisions vanishingly
+    # rare. If short generic names like "notes.md" enter the queue, this
+    # filter silently false-positives.
     if unprocessed:
         archive_dir = main / "inbox" / "archive"
         archived_basenames: set[str] = set()
         if archive_dir.exists():
             for af in archive_dir.rglob("*.md"):
+                if af.name.startswith("_"):
+                    continue
                 archived_basenames.add(af.name)
         if archived_basenames:
             before = len(unprocessed)