Three targeted fixes from Ganymede's review of commit 469cb7f:
BUG #1 — Success path now updates sources.status='extracting' before PR
creation, so queue scan's DB-authoritative filter catches sources between
PR creation and merge. Previously the cooldown gate was load-bearing for
this window, not belt-and-suspenders as claimed.
BUG #2 — Second null-result path (line 573, triggered when enrichments
existed but all targets were missing in worktree) now updates DB. Without
this, that path created no PR, no DB mark, and would have re-entered the
runaway loop 4h later when the cooldown window expired.
NIT #6 — 4h cooldown moved to config.EXTRACTION_COOLDOWN_HOURS. Tunable
without code change. Log format now shows the configured hours.
Also backfilled 59 pre-existing zombie queue-path rows where the file
was already archived but DB status said 'unprocessed' — these would have
leaked past the DB filter once the 4h cooldown expired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes reduce extraction cost and duplicate PR flood:
1. 4-hour cooldown gate — skip sources with ANY PR (merged/closed/open)
created in the last 4h. Prevents same source re-extracting every 60s
while archive step lags behind merge.
2. DB-authoritative status — sources.status is now updated in the pipeline DB
at each extraction terminal point (null_result, success). Queue scan checks
DB first so sources with failed archives (e.g., root-owned worktree files
blocking git pull --rebase) don't get re-extracted forever. Also moves
archival into the extraction branch so it goes through PR merge instead
of a fragile separate main-worktree push.
3. source_channel wiring — extract.py PR INSERT now sets source_channel from
classify_source_channel(branch). Previously daemon-created PRs had NULL
source_channel, breaking Argus dashboard filters. Combined with Ship's
in-branch archive refactor.
Root incident: blockworks-metadao-strategic-reset.md extracted 31 times in
12 hours. Nine other sources hit 10-22 extractions each. Near-duplicate
rejection rate jumped to 94%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forward link: claims get `sourced_from: {domain}/{filename}` at extraction time.
Reverse link: after merge, backlink_source_claims() updates source files with
`claims_extracted:` list. All disk writes happen under async_main_worktree_lock.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Near-duplicate (159+ rejections):
- Add extract-time dedup gate: SequenceMatcher check before file write ($0)
- Strengthen extraction prompt: high-similarity matches (>=0.75) get explicit
"DO NOT extract, use enrichment instead" warning
- Strip [[wiki link]] brackets from related_claims field
Frontmatter schema (129+ rejections):
- Normalize LLM confidence aliases (high→likely, medium→experimental, etc.)
in both _build_claim_content and validate_schema
- Strip code fences (```markdown/```yaml) from entity content in extract.py
and from diff content in validate.py tier0.5 check
- Code fences were root cause of "no_frontmatter" failures: parser sees
```markdown as first line, not ---
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes to address the #1 rejection reason:
1. extraction_prompt.py: Explicitly tell LLM NOT to use [[wiki links]]
in body text — use connections/related_claims JSON fields instead.
Remove misleading "post-processor handles wiki links" language.
2. extract.py _get_kb_index(): Expand KB index to include entity stems
from entities/{domain}/ so the LLM knows what entities exist when
building connections. Previously only showed domain claims.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There were TWO `if not unprocessed: return 0, 0` gates. The previous
fix (c763c99) only addressed the second one. The first at line 746
fires before the re-extraction query even runs. Replace with a comment
explaining why we don't early-return there.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The re-extraction check was below an early return that fires when
unprocessed queue is empty. Sources in needs_reextraction state were
never picked up unless new sources happened to arrive simultaneously.
Move re-extraction query above the gate so both paths run independently.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Path(target).name strips directory components from LLM-generated
target filenames, preventing path traversal via ../. Same pattern
already applied to claim filenames (line 404) and entity filenames
(line 416). Ganymede-approved.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for conversation-sourced claim quality:
1. Trust hierarchy in extraction prompt: bot-generated numbers are
flagged as unverified context, not evidence. Directional claims
are extractable but specific figures require external verification.
Prevents laundering bot guesses into the KB as evidence.
2. Conversation-sourced claims tagged with verified: false and
source_type: conversation in frontmatter. Downstream consumers
(Leo, dashboard) can filter/flag these for verification.
3. GET /api/telegram-extractions endpoint for daily spot-checking.
Shows recent Telegram-sourced PRs with claim titles, status,
merge rate, and eval issues. Quick review surface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes:
1. extract.py: Enrichments now modify existing claim files by appending
evidence sections. Previously enrichment-only extractions were
discarded as null-result even when they contained valuable challenges.
2. extraction_prompt.py: Corrections should produce BOTH a claim (the
corrected knowledge) AND an enrichment (linking to what it corrects).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When source format is "conversation", inject specialized extraction
rules that prioritize human corrections/pushback as highest-value
content. Fixes null-result on short but high-signal correction
messages. Maps corrections to existing KB claims as challenges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>