fix: skip format: conversation in extraction — archive directly instead
Conversation archives produce low-quality claims (26x schema failures, 22x near-duplicates in 24h). Valuable content from conversations now enters through three other paths:

1. Standalone sources (URLs shared → x-article/x-tweet files)
2. Inline tags (SOURCE:/CLAIM: → curated source files)
3. Transcript review (1-hour JSONL dumps → periodic safety net)

Conversations moved to inbox/archive/telegram/ for provenance without burning extraction cycles.

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
parent 1019602eec
commit 0854375fd0
1 changed file with 11 additions and 0 deletions
@@ -97,6 +97,17 @@ for SOURCE in $SOURCES; do
BASENAME=$(basename "$SOURCE" .md)
BRANCH="extract/$BASENAME"
# Skip conversation archives — valuable content enters through standalone sources,
# inline tags (SOURCE:/CLAIM:), and transcript review. Raw conversations produce
# low-quality claims with schema failures. (Epimetheus session 4)
if grep -q "^format: conversation" "$SOURCE" 2>/dev/null; then
  # Move to archive instead of leaving in queue (prevents re-processing)
  mv "$SOURCE" "$MAIN_REPO/inbox/archive/telegram/" 2>/dev/null
  echo "[$(date)] [$COUNT/$MAX] ARCHIVE $BASENAME (conversation — skipped extraction)" >> $LOG
  SKIPPED=$((SKIPPED + 1))
  continue
fi
# Gate 1: Already in archive? Source was already processed — dedup (Ganymede)
if find "$MAIN_REPO/inbox/archive" -name "$BASENAME.md" 2>/dev/null | grep -q .; then
echo "[$(date)] [$COUNT/$MAX] SKIP $BASENAME (already in archive)" >> $LOG
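
The conversation gate in this hunk can be exercised in isolation. The sketch below is not part of the commit: `archive_if_conversation` and its return codes are hypothetical names chosen for illustration. It keeps the same `grep` match as the gate but assumes two hardening steps the diff does not show, creating the archive directory first and reporting a failed `mv` instead of discarding the error.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the commit): archive a source file when its
# metadata declares "format: conversation".
# Return codes (illustrative): 0 = archived, 1 = not a conversation,
# 2 = conversation, but the move failed.
archive_if_conversation() {
  local source="$1" archive_dir="$2"

  # Same match as the gate: any line that starts with "format: conversation".
  grep -q '^format: conversation' "$source" 2>/dev/null || return 1

  # Assumption: create the target directory so mv cannot fail on a missing path.
  mkdir -p "$archive_dir" || return 2

  # Assumption: surface a failed move so the caller can log it accurately.
  mv -- "$source" "$archive_dir/" || return 2
  return 0
}
```

Inside the loop, a caller could branch on the return code and write the ARCHIVE log line only when the function returns 0, so a file that could not be moved is not logged as archived. Note that the pattern is a prefix match, so a value such as `format: conversational` would also be archived.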