fix: skip format: conversation in extraction — archive directly instead

Conversation archives produce low-quality claims (26x schema failures,
22x near-duplicates in 24h). Valuable content from conversations now
enters through three other paths:
1. Standalone sources (URLs shared → x-article/x-tweet files)
2. Inline tags (SOURCE:/CLAIM: → curated source files)
3. Transcript review (1-hour JSONL dumps → periodic safety net)

Conversations are now moved to inbox/archive/telegram/, which preserves
provenance without burning extraction cycles.
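
The skip-and-archive gate described above can be sketched in isolation as follows. This is a minimal illustration, not the script's actual code: the function name `archive_if_conversation`, the temporary directory layout, and the `mkdir -p` step are assumptions made for the example. Only the `grep` pattern and the `inbox/archive/telegram/` destination come from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): archive_if_conversation SOURCE ARCHIVE_DIR
# Returns 0 if SOURCE is a conversation file and was moved into ARCHIVE_DIR,
# non-zero otherwise (not a conversation, or the move failed).
archive_if_conversation() {
  local source="$1" archive_dir="$2"
  # Same check as the commit: a line beginning with "format: conversation".
  if grep -q "^format: conversation" "$source" 2>/dev/null; then
    # Assumption: create the archive directory if it does not exist yet.
    mkdir -p "$archive_dir" || return 1
    # Moving the file out of the queue prevents it from being re-processed.
    mv -- "$source" "$archive_dir/" || return 1
    return 0
  fi
  return 1
}

# Demo in a throwaway directory with one conversation and one article.
tmp=$(mktemp -d)
printf 'format: conversation\n' > "$tmp/chat.md"
printf 'format: article\n' > "$tmp/post.md"

for f in "$tmp"/*.md; do
  name=$(basename "$f" .md)
  if archive_if_conversation "$f" "$tmp/inbox/archive/telegram"; then
    echo "ARCHIVE $name (conversation, skipped extraction)"
  else
    echo "EXTRACT $name"
  fi
done
```

Unlike the committed gate, this sketch reports a failed `mv` to the caller instead of silencing it, so a file that cannot be archived is not counted as skipped. Whether that trade-off suits the real pipeline is a judgment call.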

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
m3taversal 2026-03-26 12:02:57 +00:00
parent 1019602eec
commit 0854375fd0

@@ -97,6 +97,17 @@ for SOURCE in $SOURCES; do
BASENAME=$(basename "$SOURCE" .md)
BRANCH="extract/$BASENAME"
# Skip conversation archives — valuable content enters through standalone sources,
# inline tags (SOURCE:/CLAIM:), and transcript review. Raw conversations produce
# low-quality claims with schema failures. (Epimetheus session 4)
if grep -q "^format: conversation" "$SOURCE" 2>/dev/null; then
# Move to archive instead of leaving in queue (prevents re-processing)
mv "$SOURCE" "$MAIN_REPO/inbox/archive/telegram/" 2>/dev/null
echo "[$(date)] [$COUNT/$MAX] ARCHIVE $BASENAME (conversation — skipped extraction)" >> $LOG
SKIPPED=$((SKIPPED + 1))
continue
fi
# Gate 1: Already in archive? Source was already processed — dedup (Ganymede)
if find "$MAIN_REPO/inbox/archive" -name "$BASENAME.md" 2>/dev/null | grep -q .; then
echo "[$(date)] [$COUNT/$MAX] SKIP $BASENAME (already in archive)" >> $LOG