fix: skip format: conversation in extraction — archive directly instead

Conversation archives produce low-quality claims (26x schema failures, 22x near-duplicates in 24h). Valuable content from conversations now enters through three other paths:

1. Standalone sources (URLs shared → x-article/x-tweet files)
2. Inline tags (SOURCE:/CLAIM: → curated source files)
3. Transcript review (1-hour JSONL dumps → periodic safety net)

Conversations moved to inbox/archive/telegram/ for provenance without burning extraction cycles.

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-26 12:02:57 +00:00 · 2026-03-26 12:02:57 +00:00 · 0854375fd0
commit 0854375fd0
parent 1019602eec
1 changed file with 11 additions and 0 deletions
--- a/batch-extract-50.sh
+++ b/batch-extract-50.sh
@@ -97,6 +97,17 @@ for SOURCE in $SOURCES; do
    BASENAME=$(basename "$SOURCE" .md)
    BRANCH="extract/$BASENAME"

+    # Skip conversation archives — valuable content enters through standalone sources,
+    # inline tags (SOURCE:/CLAIM:), and transcript review. Raw conversations produce
+    # low-quality claims with schema failures. (Epimetheus session 4)
+    if grep -q "^format: conversation" "$SOURCE" 2>/dev/null; then
+        # Move to archive instead of leaving in queue (prevents re-processing)
+        mv "$SOURCE" "$MAIN_REPO/inbox/archive/telegram/" 2>/dev/null
+        echo "[$(date)] [$COUNT/$MAX] ARCHIVE $BASENAME (conversation — skipped extraction)" >> $LOG
+        SKIPPED=$((SKIPPED + 1))
+        continue
+    fi
+
    # Gate 1: Already in archive? Source was already processed — dedup (Ganymede)
    if find "$MAIN_REPO/inbox/archive" -name "$BASENAME.md" 2>/dev/null | grep -q .; then
        echo "[$(date)] [$COUNT/$MAX] SKIP $BASENAME (already in archive)" >> $LOG
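The new gate matches `format: conversation` anywhere in the file and discards any error from `mv`. A stricter variant could be sketched as below. This is an illustration, not part of the commit: `is_conversation` and `archive_conversation` are hypothetical helper names, and the sketch assumes source files begin with front matter delimited by `---` lines.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and do not exist in batch-extract-50.sh.

# Succeed only if the front matter declares exactly "format: conversation".
# Assumption: front matter is the block between the first two "---" lines.
is_conversation() {
    awk '
        NR == 1 { if ($0 != "---") exit; next }   # no front matter: stop scanning
        $0 == "---" { exit }                      # end of front matter reached
        /^format:[ \t]*conversation[ \t]*$/ { found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$1" 2>/dev/null
}

# Move a source file into the archive directory and report failure to the caller
# instead of discarding it, so the loop can log the error.
archive_conversation() {
    local src=$1 dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    mv -- "$src" "$dest_dir/" || return 1
}
```

Inside the loop, the gate would call `is_conversation "$SOURCE"` and then `archive_conversation "$SOURCE" "$MAIN_REPO/inbox/archive/telegram"`, writing the ARCHIVE log line only when the move succeeds. A file whose body merely quotes the text `format: conversation` would then stay in the extraction queue.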