leo: split ingestion — agents research + archive, VPS extracts headlessly

- What: Rewrote skills/ingest.md to be research-only (find sources, archive with notes)
- Added ops/extract-cron.sh, a VPS cron job that picks up unprocessed sources, runs Claude headless to extract claims, and opens PRs
- Why: Separates high-judgment work (research) from mechanical work (extraction). Agents spend session time finding sources, not grinding through extraction. Archive everything regardless of whether claims come out.
- Architecture: Agents archive → VPS extracts → VPS eval reviews → auto-merge

Pentagon-Agent: Leo <14FF9C29-CABF-40C8-8808-B0B495D03FF8>
Auto: ops/extract-cron.sh | 1 file changed, 167 insertions(+)
2026-03-10 10:31:49 +00:00 · 2026-03-10 10:31:39 +00:00 · 2026-03-10 10:31:02 +00:00
2 changed files with 250 additions and 109 deletions
--- a/ops/extract-cron.sh
+++ b/ops/extract-cron.sh
@@ -0,0 +1,167 @@
+#!/bin/bash
+# Extract claims from unprocessed sources in inbox/archive/
+# Runs via cron on VPS every 15 minutes.
+#
+# Flow:
+#   1. Pull latest main
+#   2. Find sources with status: unprocessed
+#   3. For each: run Claude headless to extract claims
+#   4. Commit extractions, push, open PR
+#   5. Update source status to processed
+#
+# The eval pipeline (webhook.py) handles review and merge separately.
+
+set -euo pipefail
+
+REPO_DIR="/opt/teleo-eval/workspaces/extract"
+REPO_URL="http://m3taversal:$(cat /opt/teleo-eval/secrets/forgejo-admin-token)@localhost:3000/teleo/teleo-codex.git"
+CLAUDE_BIN="/home/teleo/.local/bin/claude"
+LOG_DIR="/opt/teleo-eval/logs"
+LOG="$LOG_DIR/extract-cron.log"
+LOCKFILE="/tmp/extract-cron.lock"
+MAX_SOURCES=5  # Process at most 5 sources per run to limit cost
+
+log() { echo "[$(date -Iseconds)] $*" >> "$LOG"; }
+
+# --- Lock ---
+if [ -f "$LOCKFILE" ]; then
+    pid=$(cat "$LOCKFILE" 2>/dev/null)
+    if kill -0 "$pid" 2>/dev/null; then
+        log "SKIP: already running (pid $pid)"
+        exit 0
+    fi
+    log "WARN: stale lockfile, removing"
+    rm -f "$LOCKFILE"
+fi
+echo $$ > "$LOCKFILE"
+trap 'rm -f "$LOCKFILE"' EXIT
+
+# --- Ensure repo clone ---
+if [ ! -d "$REPO_DIR/.git" ]; then
+    log "Cloning repo..."
+    git clone "$REPO_URL" "$REPO_DIR" >> "$LOG" 2>&1
+fi
+
+cd "$REPO_DIR"
+
+# --- Pull latest main ---
+git checkout main >> "$LOG" 2>&1
+git pull --rebase >> "$LOG" 2>&1
+
+# --- Find unprocessed sources ---
+UNPROCESSED=$(grep -rl '^status: unprocessed' inbox/archive/ 2>/dev/null | head -n "$MAX_SOURCES" || true)
+
+if [ -z "$UNPROCESSED" ]; then
+    log "No unprocessed sources found"
+    exit 0
+fi
+
+COUNT=$(echo "$UNPROCESSED" | wc -l | tr -d ' ')
+log "Found $COUNT unprocessed source(s)"
+
+# --- Process each source ---
+for SOURCE_FILE in $UNPROCESSED; do
+    SLUG=$(basename "$SOURCE_FILE" .md)
+    BRANCH="extract/$SLUG"
+
+    log "Processing: $SOURCE_FILE → branch $BRANCH"
+
+    # Create branch from main
+    git checkout main >> "$LOG" 2>&1
+    git branch -D "$BRANCH" 2>/dev/null || true
+    git checkout -b "$BRANCH" >> "$LOG" 2>&1
+
+    # Read domain from frontmatter (empty if the field is missing; `|| true` stops set -e/pipefail from aborting the run)
+    DOMAIN=$(grep '^domain:' "$SOURCE_FILE" | head -1 | sed 's/domain: *//' | tr -d '"' | tr -d "'" | xargs || true)
+
+    # Map domain to agent
+    case "$DOMAIN" in
+        internet-finance) AGENT="rio" ;;
+        entertainment) AGENT="clay" ;;
+        ai-alignment) AGENT="theseus" ;;
+        health) AGENT="vida" ;;
+        space-development) AGENT="astra" ;;
+        *) AGENT="leo" ;;
+    esac
+
+    AGENT_TOKEN=$(cat "/opt/teleo-eval/secrets/forgejo-${AGENT}-token" 2>/dev/null || cat /opt/teleo-eval/secrets/forgejo-leo-token)
+
+    log "Domain: $DOMAIN, Agent: $AGENT"
+
+    # Run Claude headless to extract claims
+    EXTRACT_PROMPT="You are $AGENT, a Teleo knowledge base agent. Extract claims from this source.
+
+READ these files first:
+- skills/extract.md (extraction process)
+- schemas/claim.md (claim format)
+- $SOURCE_FILE (the source to extract from)
+
+Then scan domains/$DOMAIN/ to check for duplicate claims.
+
+EXTRACT claims following the process in skills/extract.md:
+1. Read the source completely
+2. Separate evidence from interpretation
+3. Extract candidate claims (specific, disagreeable, evidence-backed)
+4. Check for duplicates against existing claims in domains/$DOMAIN/
+5. Write claim files to domains/$DOMAIN/ with proper YAML frontmatter
+6. Update $SOURCE_FILE: set status to 'processed', add processed_by: $AGENT, processed_date: $(date +%Y-%m-%d), and claims_extracted list
+
+If no claims can be extracted, update $SOURCE_FILE: set status to 'null-result' and add notes explaining why.
+
+IMPORTANT: Use the Edit tool to update the source file status. Use the Write tool to create new claim files. Do not create claims that duplicate existing ones."
+
+    # Run extraction with timeout (10 minutes)
+    timeout 600 "$CLAUDE_BIN" -p "$EXTRACT_PROMPT" \
+        --allowedTools 'Read,Write,Edit,Glob,Grep' \
+        --model sonnet \
+        >> "$LOG" 2>&1 || {
+        log "WARN: Claude extraction failed or timed out for $SOURCE_FILE"
+        git checkout main >> "$LOG" 2>&1
+        continue
+    }
+
+    # Check if any files were created/modified
+    CHANGES=$(git status --porcelain | wc -l | tr -d ' ')
+    if [ "$CHANGES" -eq 0 ]; then
+        log "No changes produced for $SOURCE_FILE"
+        git checkout main >> "$LOG" 2>&1
+        continue
+    fi
+
+    # Stage and commit
+    git add inbox/archive/ "domains/$DOMAIN/" >> "$LOG" 2>&1
+    git commit -m "$AGENT: extract claims from $(basename "$SOURCE_FILE")
+
+- Source: $SOURCE_FILE
+- Domain: $DOMAIN
+- Extracted by: headless extraction cron
+
+Pentagon-Agent: $(echo "$AGENT" | sed 's/./\U&/') <HEADLESS>" >> "$LOG" 2>&1
+
+    # Push branch
+    git push -u "$REPO_URL" "$BRANCH" --force >> "$LOG" 2>&1
+
+    # Open PR
+    PR_TITLE="$AGENT: extract claims from $(basename "$SOURCE_FILE" .md)"
+    PR_BODY="## Automated Extraction\n\nSource: \`$SOURCE_FILE\`\nDomain: $DOMAIN\nExtracted by: headless cron on VPS\n\nThis PR was created automatically by the extraction cron job. Claims were extracted using \`skills/extract.md\` process via Claude headless."
+
+    curl -s -X POST "http://localhost:3000/api/v1/repos/teleo/teleo-codex/pulls" \
+        -H "Authorization: token $AGENT_TOKEN" \
+        -H "Content-Type: application/json" \
+        -d "{
+            \"title\": \"$PR_TITLE\",
+            \"body\": \"$PR_BODY\",
+            \"base\": \"main\",
+            \"head\": \"$BRANCH\"
+        }" >> "$LOG" 2>&1
+
+    log "PR opened for $SOURCE_FILE"
+
+    # Back to main for next source
+    git checkout main >> "$LOG" 2>&1
+
+    # Brief pause between extractions
+    sleep 5
+done
+
+log "Extraction run complete: processed $COUNT source(s)"
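
The PR-creation step in `ops/extract-cron.sh` interpolates `$PR_TITLE` and `$PR_BODY` directly into a JSON string, so a double quote or backslash in a source filename would produce an invalid payload. A minimal sketch of an escaping helper follows, assuming pure bash with no `jq` on the VPS. `json_escape` is a hypothetical name and is not part of this commit.

```shell
#!/bin/bash
set -euo pipefail

# json_escape: hypothetical helper (not part of the commit) that escapes a
# string so it can sit inside a double-quoted JSON value.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}      # backslash -> \\ (must run before the other rules)
    s=${s//\"/\\\"}      # double quote -> \"
    s=${s//$'\n'/\\n}    # newline -> \n
    s=${s//$'\r'/\\r}    # carriage return -> \r
    s=${s//$'\t'/\\t}    # tab -> \t
    printf '%s' "$s"
}

# Example: build a payload from escaped pieces. The title is illustrative.
title='leo: extract claims from a "quoted" source'
payload=$(printf '{"title": "%s", "base": "main"}' "$(json_escape "$title")")
echo "$payload"
```

If a helper like this were adopted, the PR body would need to be written with real newlines rather than literal `\n` sequences, since the helper would otherwise double-escape them.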
--- a/skills/ingest.md
+++ b/skills/ingest.md
@@ -1,14 +1,16 @@
 # Skill: Ingest

-Pull tweets from your domain network, triage for signal, archive sources, extract claims, and open a PR. This is the full ingestion loop — from raw X data to knowledge base contribution.
+Research your domain, find source material, and archive it in inbox/ with context notes. Extraction happens separately on the VPS — your job is to find and archive good sources, not to extract claims.
+
+**Archive everything.** The inbox is a library, not a filter. If it's relevant to any Teleo domain, archive it. Null-result sources (no extractable claims) are still valuable — they prevent duplicate work and build domain context.

 ## Usage

 ```
-/ingest                    # Run full loop: pull → triage → archive → extract → PR
-/ingest pull-only          # Just pull fresh tweets, don't extract yet
-/ingest from-cache         # Skip pulling, extract from already-cached tweets
-/ingest @username          # Ingest a specific account (pull + extract)
+/ingest                    # Research loop: pull tweets, find sources, archive with notes
+/ingest @username          # Pull and archive a specific X account's content
+/ingest url <url>          # Archive a paper, article, or thread from URL
+/ingest scan               # Scan your network for new content since last pull
 ```

 ## Prerequisites
@@ -19,108 +21,84 @@ Pull tweets from your domain network, triage for signal, archive sources, extrac

 ## The Loop

-### Step 1: Pull fresh tweets
+### Step 1: Research

-For each account in your network file (or the specified account):
+Find source material relevant to your domain. Sources include:
+- **X/Twitter** — tweets, threads, debates from your network accounts
+- **Papers** — academic papers, preprints, whitepapers
+- **Articles** — blog posts, newsletters, news coverage
+- **Reports** — industry reports, data releases, government filings
+- **Conversations** — podcast transcripts, interview notes, voicenote transcripts

-1. **Check cache** — read `~/.pentagon/workspace/collective/x-ingestion/raw/{username}.json`. If `pulled_at` is <24h old, skip.
-2. **Pull** — use `/x-research pull @{username}` or the API directly:
-   ```bash
-   API_KEY=$(cat ~/.pentagon/secrets/twitterapi-io-key)
-   curl -s -H "X-API-Key: $API_KEY" \
-     "https://api.twitterapi.io/twitter/user/last_tweets?userName={username}&count=100"
-   ```
-3. **Save** to `~/.pentagon/workspace/collective/x-ingestion/raw/{username}.json`
-4. **Log** the pull to `~/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl`
+For X accounts, use `/x-research pull @{username}` to pull tweets, then scan for anything worth archiving. Don't just archive the "best" tweets — archive anything substantive. A thread arguing a wrong position is as valuable as one arguing a right one.

-Rate limit: 2-second delay between accounts. Start with core tier accounts, then extended.
+### Step 2: Archive with notes

-### Step 2: Triage for signal
+For each source, create an archive file on your branch:

-Not every tweet is worth extracting. For each account's tweets, scan for:
-
-**High signal (extract):**
- Original analysis or arguments (not just links or reactions)
- Threads with evidence chains
- Data, statistics, study citations
- Novel claims that challenge or extend KB knowledge
- Cross-domain connections
-
-**Low signal (skip):**
- Pure engagement farming ("gm", memes, one-liners)
- Retweets without commentary
- Personal updates unrelated to domain
- Duplicate arguments already in the KB
-
-For each high-signal tweet or thread, note:
- Username, tweet URL, date
- Why it's high signal (1 sentence)
- Which domain it maps to
- Whether it's a new claim, counter-evidence, or enrichment to existing claims
-
-### Step 3: Archive sources
-
-For each high-signal item, create a source archive file on your branch:
-
-**Filename:** `inbox/archive/YYYY-MM-DD-{username}-{brief-slug}.md`
+**Filename:** `inbox/archive/YYYY-MM-DD-{author-handle}-{brief-slug}.md`

 ```yaml
 ---
 type: source
-title: "Brief description of the tweet/thread"
-author: "Display Name (@username)"
-twitter_id: "numeric_id_from_author_object"
-url: https://x.com/{username}/status/{tweet_id}
+title: "Descriptive title of the content"
+author: "Display Name (@handle)"
+twitter_id: "numeric_id_from_author_object"  # X sources only
+url: https://original-url
 date: YYYY-MM-DD
-domain: {primary-domain}
-format: tweet | thread
-status: processing
-tags: [relevant, topics]
+domain: internet-finance | entertainment | ai-alignment | health | space-development | grand-strategy
+secondary_domains: [other-domain]  # if cross-domain
+format: tweet | thread | essay | paper | whitepaper | report | newsletter | news | transcript
+status: unprocessed
+priority: high | medium | low
+tags: [topic1, topic2]
+flagged_for_rio: ["reason"]  # if relevant to another agent's domain
 ---
 ```

-**Body:** Include the full tweet text (or thread text concatenated). For threads, preserve the order and note which tweets are replies to which.
+**Body:** Include the full source text, then your research notes.

-### Step 4: Extract claims
+```markdown
+## Content

-Follow `skills/extract.md` for each archived source:
+[Full text of tweet/thread/article. For long papers, include abstract + key sections.]

-1. Read the source completely
-2. Separate evidence from interpretation
-3. Extract candidate claims (specific, disagreeable, evidence-backed)
-4. Check for duplicates against existing KB
-5. Classify by domain
-6. Identify enrichments to existing claims
+## Agent Notes

-Write claim files to `domains/{your-domain}/` with proper frontmatter.
+**Why this matters:** [1-2 sentences — what makes this worth archiving]

-After extraction, update the source archive:
-```yaml
-status: processed
-processed_by: {your-name}
-processed_date: YYYY-MM-DD
-claims_extracted:
-  - "claim title 1"
-  - "claim title 2"
-enrichments:
-  - "existing claim that was enriched"
+**KB connections:** [Which existing claims does this relate to, support, or challenge?]
+
+**Extraction hints:** [What claims might the extractor pull from this? Flag specific passages.]
+
+**Context:** [Anything the extractor needs to know — who the author is, what debate this is part of, etc.]
 ```

-### Step 5: Branch, commit, PR
+The "Agent Notes" section is where you add value. The VPS extractor is good at mechanical extraction but lacks your domain context. Your notes guide it.
+
+### Step 3: Cross-domain flagging
+
+When you find sources outside your domain:
+- Archive them anyway (you're already reading them)
+- Set the `domain` field to the correct domain, not yours
+- Add `flagged_for_{agent}: ["brief reason"]` to frontmatter
+- Set `priority: high` if it's urgent or challenges existing claims
+
+### Step 4: Branch, commit, push

 ```bash
 # Branch
-git checkout -b {your-name}/ingest-{date}-{brief-slug}
+git checkout -b {your-name}/sources-{date}-{brief-slug}

-# Stage
-git add inbox/archive/*.md domains/{your-domain}/*.md
+# Stage all archive files
+git add inbox/archive/*.md

 # Commit
-git commit -m "{your-name}: ingest {N} claims from {source description}
+git commit -m "{your-name}: archive {N} sources — {brief description}

- What: {N} claims from {M} tweets/threads by {accounts}
- Why: {brief rationale — what KB gap this fills}
- Connections: {key links to existing claims}
+- What: {N} sources from {list of authors/accounts}
+- Domains: {which domains these cover}
+- Priority: {any high-priority items flagged}

 Pentagon-Agent: {Name} <{UUID}>"

@@ -129,49 +107,37 @@ FORGEJO_TOKEN=$(cat ~/.pentagon/secrets/forgejo-{your-name}-token)
 git push -u https://{your-name}:${FORGEJO_TOKEN}@git.livingip.xyz/teleo/teleo-codex.git {branch-name}
 ```

-Then open a PR on Forgejo:
+Open a PR:
 ```bash
 curl -s -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
  -H "Authorization: token ${FORGEJO_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
-    "title": "{your-name}: ingest {N} claims — {brief description}",
-    "body": "## Source\n{tweet URLs and account names}\n\n## Claims\n{numbered list of claim titles}\n\n## Why\n{what KB gap this fills, connections to existing claims}\n\n## Enrichments\n{any existing claims updated with new evidence}",
+    "title": "{your-name}: archive {N} sources — {brief description}",
+    "body": "## Sources archived\n{numbered list with titles and domains}\n\n## High priority\n{any flagged items}\n\n## Cross-domain flags\n{any items flagged for other agents}",
    "base": "main",
    "head": "{branch-name}"
  }'
 ```

-The eval pipeline handles review and auto-merge from here.
+Source-only PRs should merge fast — they don't change claims, just add to the library.

-## Batch Ingestion
+## What Happens After You Archive

-When running the full loop across your network:
+A cron job on the VPS checks `inbox/archive/` for `status: unprocessed` sources every 15 minutes and processes at most 5 per run. For each one it:

-1. Pull all accounts (Step 1)
-2. Triage across all pulled tweets (Step 2) — batch the triage so you can see patterns
-3. Group high-signal items by topic, not by account
-4. Create one PR per topic cluster (3-8 claims per PR is ideal)
-5. Don't create mega-PRs with 20+ claims — they're harder to review
+1. Reads the source + your agent notes
+2. Runs extraction (skills/extract.md) via Claude headless
+3. Creates claim files in the correct domain
+4. Opens a PR with the extracted claims
+5. Updates the source to `status: processed`, or `null-result` if no claims can be extracted
+6. Leaves the extraction PR for the eval pipeline to review

-## Cross-Domain Routing
-
-If you find high-signal content outside your domain during triage:
- Archive the source in `inbox/archive/` with `status: unprocessed`
- Add `flagged_for_{agent}: ["brief reason"]` to the frontmatter
- Message the relevant agent: "New source archived for your domain: {filename}"
- Don't extract claims outside your territory — let the domain agent do it
-
-## Quality Controls
-
- **Source diversity:** If you're extracting 5+ claims from one account in one batch, flag it. Monoculture risk.
- **Freshness:** Don't re-extract tweets that are already archived. Check `inbox/archive/` first.
- **Signal ratio:** Aim for ≥50% of triaged tweets yielding at least one claim. If your ratio is lower, raise your triage bar.
- **Cost tracking:** Log every API call. The pull log tracks spend across agents.
+**You don't need to wait for this.** Archive and move on. The VPS handles the rest.

 ## Network Management

-Your network file (`{your-name}-network.json`) lists accounts to monitor. Update it as you discover new high-signal accounts in your domain:
+Your network file (`{your-name}-network.json`) lists X accounts to monitor:

 ```json
 {
@@ -185,8 +151,16 @@ Your network file (`{your-name}-network.json`) lists accounts to monitor. Update
 ```

 **Tiers:**
- `core` — Pull every ingestion cycle. High signal-to-noise ratio.
+- `core` — Pull every session. High signal-to-noise.
 - `extended` — Pull weekly or when specifically relevant.
- `watch` — Discovered but not yet confirmed as useful. Pull once to evaluate.
+- `watch` — Pull once to evaluate, then promote or drop.

-Agents without a network file yet should create one as their first ingestion task. Start with 5-10 seed accounts, pull them, evaluate signal quality, then expand.
+Agents without a network file should create one as their first task. Start with 5-10 seed accounts.
+
+## Quality Controls
+
+- **Archive everything substantive.** Don't self-censor. The extractor decides what yields claims.
+- **Write good notes.** Your domain context is the difference between a useful source and a pile of text.
+- **Check for duplicates.** Don't re-archive sources already in `inbox/archive/`.
+- **Flag cross-domain.** If you see something relevant to another agent, flag it — don't assume they'll find it.
+- **Log API costs.** Every X pull gets logged to `~/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl`.
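
The filename convention and frontmatter from Step 2 of `skills/ingest.md` can be sketched as a small shell helper. This is an illustration only: `slugify` and `new_archive_file` are hypothetical names, the helper writes a subset of the frontmatter fields, and it assumes it is run from the repo root.

```shell
#!/bin/bash
set -euo pipefail

# slugify: lowercase the input, collapse runs of non-alphanumerics into a
# single hyphen, and trim leading/trailing hyphens.
slugify() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# new_archive_file DATE HANDLE SLUG_TEXT URL DOMAIN FORMAT
# Writes inbox/archive/DATE-HANDLE-SLUG.md with status: unprocessed and
# prints the path. Refuses to overwrite an existing file, mirroring the
# "check for duplicates" rule.
new_archive_file() {
    local date="$1" handle="$2" slug_text="$3" url="$4" domain="$5" format="$6"
    local path
    path="inbox/archive/${date}-$(slugify "$handle")-$(slugify "$slug_text").md"
    if [ -e "$path" ]; then
        echo "already archived: $path" >&2
        return 1
    fi
    mkdir -p inbox/archive
    cat > "$path" <<EOF
---
type: source
title: "$slug_text"
author: "@$handle"
url: $url
date: $date
domain: $domain
format: $format
status: unprocessed
---

## Content

## Agent Notes
EOF
    printf '%s\n' "$path"
}
```

Calling `new_archive_file 2026-03-10 somehandle "Brief slug" https://original-url health essay` would create the file and print its path. The `## Content` and `## Agent Notes` sections are left empty for the agent to fill in, since the notes are where the domain context goes.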