diff --git a/skills/ingest.md b/skills/ingest.md
index 623dd82..f55dd59 100644
--- a/skills/ingest.md
+++ b/skills/ingest.md
@@ -1,14 +1,16 @@
 # Skill: Ingest

-Pull tweets from your domain network, triage for signal, archive sources, extract claims, and open a PR. This is the full ingestion loop — from raw X data to knowledge base contribution.
+Research your domain, find source material, and archive it in inbox/ with context notes. Extraction happens separately on the VPS — your job is to find and archive good sources, not to extract claims.
+
+**Archive everything.** The inbox is a library, not a filter. If it's relevant to any Teleo domain, archive it. Null-result sources (no extractable claims) are still valuable — they prevent duplicate work and build domain context.

 ## Usage

 ```
-/ingest # Run full loop: pull → triage → archive → extract → PR
-/ingest pull-only # Just pull fresh tweets, don't extract yet
-/ingest from-cache # Skip pulling, extract from already-cached tweets
-/ingest @username # Ingest a specific account (pull + extract)
+/ingest # Research loop: pull tweets, find sources, archive with notes
+/ingest @username # Pull and archive a specific X account's content
+/ingest url # Archive a paper, article, or thread from URL
+/ingest scan # Scan your network for new content since last pull
 ```

 ## Prerequisites
@@ -19,108 +21,84 @@ Pull tweets from your domain network, triage for signal, archive sources, extrac

 ## The Loop

-### Step 1: Pull fresh tweets
+### Step 1: Research

-For each account in your network file (or the specified account):
+Find source material relevant to your domain. Sources include:
+- **X/Twitter** — tweets, threads, debates from your network accounts
+- **Papers** — academic papers, preprints, whitepapers
+- **Articles** — blog posts, newsletters, news coverage
+- **Reports** — industry reports, data releases, government filings
+- **Conversations** — podcast transcripts, interview notes, voicenote transcripts

-1. **Check cache** — read `~/.pentagon/workspace/collective/x-ingestion/raw/{username}.json`. If `pulled_at` is <24h old, skip.
-2. **Pull** — use `/x-research pull @{username}` or the API directly:
-   ```bash
-   API_KEY=$(cat ~/.pentagon/secrets/twitterapi-io-key)
-   curl -s -H "X-API-Key: $API_KEY" \
-     "https://api.twitterapi.io/twitter/user/last_tweets?userName={username}&count=100"
-   ```
-3. **Save** to `~/.pentagon/workspace/collective/x-ingestion/raw/{username}.json`
-4. **Log** the pull to `~/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl`
+For X accounts, use `/x-research pull @{username}` to pull tweets, then scan for anything worth archiving. Don't just archive the "best" tweets — archive anything substantive. A thread arguing a wrong position is as valuable as one arguing a right one.

-Rate limit: 2-second delay between accounts. Start with core tier accounts, then extended.
+### Step 2: Archive with notes

-### Step 2: Triage for signal
+For each source, create an archive file on your branch:

-Not every tweet is worth extracting. For each account's tweets, scan for:
-
-**High signal (extract):**
-- Original analysis or arguments (not just links or reactions)
-- Threads with evidence chains
-- Data, statistics, study citations
-- Novel claims that challenge or extend KB knowledge
-- Cross-domain connections
-
-**Low signal (skip):**
-- Pure engagement farming ("gm", memes, one-liners)
-- Retweets without commentary
-- Personal updates unrelated to domain
-- Duplicate arguments already in the KB
-
-For each high-signal tweet or thread, note:
-- Username, tweet URL, date
-- Why it's high signal (1 sentence)
-- Which domain it maps to
-- Whether it's a new claim, counter-evidence, or enrichment to existing claims
-
-### Step 3: Archive sources
-
-For each high-signal item, create a source archive file on your branch:
-
-**Filename:** `inbox/archive/YYYY-MM-DD-{username}-{brief-slug}.md`
+**Filename:** `inbox/archive/YYYY-MM-DD-{author-handle}-{brief-slug}.md`

 ```yaml
 ---
 type: source
-title: "Brief description of the tweet/thread"
-author: "Display Name (@username)"
-twitter_id: "numeric_id_from_author_object"
-url: https://x.com/{username}/status/{tweet_id}
+title: "Descriptive title of the content"
+author: "Display Name (@handle)"
+twitter_id: "numeric_id_from_author_object" # X sources only
+url: https://original-url
 date: YYYY-MM-DD
-domain: {primary-domain}
-format: tweet | thread
-status: processing
-tags: [relevant, topics]
+domain: internet-finance | entertainment | ai-alignment | health | space-development | grand-strategy
+secondary_domains: [other-domain] # if cross-domain
+format: tweet | thread | essay | paper | whitepaper | report | newsletter | news | transcript
+status: unprocessed
+priority: high | medium | low
+tags: [topic1, topic2]
+flagged_for_rio: ["reason"] # if relevant to another agent's domain
 ---
 ```

-**Body:** Include the full tweet text (or thread text concatenated). For threads, preserve the order and note which tweets are replies to which.
+**Body:** Include the full source text, then your research notes.

-### Step 4: Extract claims
+```markdown
+## Content

-Follow `skills/extract.md` for each archived source:
+[Full text of tweet/thread/article. For long papers, include abstract + key sections.]

-1. Read the source completely
-2. Separate evidence from interpretation
-3. Extract candidate claims (specific, disagreeable, evidence-backed)
-4. Check for duplicates against existing KB
-5. Classify by domain
-6. Identify enrichments to existing claims
+## Agent Notes

-Write claim files to `domains/{your-domain}/` with proper frontmatter.
+**Why this matters:** [1-2 sentences — what makes this worth archiving]

-After extraction, update the source archive:
-```yaml
-status: processed
-processed_by: {your-name}
-processed_date: YYYY-MM-DD
-claims_extracted:
-  - "claim title 1"
-  - "claim title 2"
-enrichments:
-  - "existing claim that was enriched"
+**KB connections:** [Which existing claims does this relate to, support, or challenge?]
+
+**Extraction hints:** [What claims might the extractor pull from this? Flag specific passages.]
+
+**Context:** [Anything the extractor needs to know — who the author is, what debate this is part of, etc.]
 ```

-### Step 5: Branch, commit, PR
+The "Agent Notes" section is where you add value. The VPS extractor is good at mechanical extraction but lacks your domain context. Your notes guide it.
+
+### Step 3: Cross-domain flagging
+
+When you find sources outside your domain:
+- Archive them anyway (you're already reading them)
+- Set the `domain` field to the correct domain, not yours
+- Add `flagged_for_{agent}: ["brief reason"]` to frontmatter
+- Set `priority: high` if it's urgent or challenges existing claims
+
+### Step 4: Branch, commit, push

 ```bash
 # Branch
-git checkout -b {your-name}/ingest-{date}-{brief-slug}
+git checkout -b {your-name}/sources-{date}-{brief-slug}

-# Stage
-git add inbox/archive/*.md domains/{your-domain}/*.md
+# Stage all archive files
+git add inbox/archive/*.md

 # Commit
-git commit -m "{your-name}: ingest {N} claims from {source description}
+git commit -m "{your-name}: archive {N} sources — {brief description}

-- What: {N} claims from {M} tweets/threads by {accounts}
-- Why: {brief rationale — what KB gap this fills}
-- Connections: {key links to existing claims}
+- What: {N} sources from {list of authors/accounts}
+- Domains: {which domains these cover}
+- Priority: {any high-priority items flagged}

 Pentagon-Agent: {Name} <{UUID}>"

@@ -129,49 +107,37 @@ FORGEJO_TOKEN=$(cat ~/.pentagon/secrets/forgejo-{your-name}-token)
 git push -u https://{your-name}:${FORGEJO_TOKEN}@git.livingip.xyz/teleo/teleo-codex.git {branch-name}
 ```

-Then open a PR on Forgejo:
+Open a PR:
 ```bash
 curl -s -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
   -H "Authorization: token ${FORGEJO_TOKEN}" \
   -H "Content-Type: application/json" \
   -d '{
-    "title": "{your-name}: ingest {N} claims — {brief description}",
-    "body": "## Source\n{tweet URLs and account names}\n\n## Claims\n{numbered list of claim titles}\n\n## Why\n{what KB gap this fills, connections to existing claims}\n\n## Enrichments\n{any existing claims updated with new evidence}",
+    "title": "{your-name}: archive {N} sources — {brief description}",
+    "body": "## Sources archived\n{numbered list with titles and domains}\n\n## High priority\n{any flagged items}\n\n## Cross-domain flags\n{any items flagged for other agents}",
     "base": "main",
     "head": "{branch-name}"
   }'
 ```

-The eval pipeline handles review and auto-merge from here.
+Source-only PRs should merge fast — they don't change claims, just add to the library.

-## Batch Ingestion
+## What Happens After You Archive

-When running the full loop across your network:
+A cron job on the VPS checks inbox/ for `status: unprocessed` sources every 15 minutes. For each one it:

-1. Pull all accounts (Step 1)
-2. Triage across all pulled tweets (Step 2) — batch the triage so you can see patterns
-3. Group high-signal items by topic, not by account
-4. Create one PR per topic cluster (3-8 claims per PR is ideal)
-5. Don't create mega-PRs with 20+ claims — they're harder to review
+1. Reads the source + your agent notes
+2. Runs extraction (skills/extract.md) via Claude headless
+3. Creates claim files in the correct domain
+4. Opens a PR with the extracted claims
+5. Updates the source to `status: processed`
+6. The eval pipeline reviews the extraction PR

-## Cross-Domain Routing
-
-If you find high-signal content outside your domain during triage:
-- Archive the source in `inbox/archive/` with `status: unprocessed`
-- Add `flagged_for_{agent}: ["brief reason"]` to the frontmatter
-- Message the relevant agent: "New source archived for your domain: {filename}"
-- Don't extract claims outside your territory — let the domain agent do it
-
-## Quality Controls
-
-- **Source diversity:** If you're extracting 5+ claims from one account in one batch, flag it. Monoculture risk.
-- **Freshness:** Don't re-extract tweets that are already archived. Check `inbox/archive/` first.
-- **Signal ratio:** Aim for ≥50% of triaged tweets yielding at least one claim. If your ratio is lower, raise your triage bar.
-- **Cost tracking:** Log every API call. The pull log tracks spend across agents.
+**You don't need to wait for this.** Archive and move on. The VPS handles the rest.

 ## Network Management

-Your network file (`{your-name}-network.json`) lists accounts to monitor. Update it as you discover new high-signal accounts in your domain:
+Your network file (`{your-name}-network.json`) lists X accounts to monitor:

 ```json
 {
@@ -185,8 +151,16 @@ Your network file (`{your-name}-network.json`) lists accounts to monitor. Update
 ```

 **Tiers:**
-- `core` — Pull every ingestion cycle. High signal-to-noise ratio.
+- `core` — Pull every session. High signal-to-noise.
 - `extended` — Pull weekly or when specifically relevant.
-- `watch` — Discovered but not yet confirmed as useful. Pull once to evaluate.
+- `watch` — Pull once to evaluate, then promote or drop.

-Agents without a network file yet should create one as their first ingestion task. Start with 5-10 seed accounts, pull them, evaluate signal quality, then expand.
+Agents without a network file should create one as their first task. Start with 5-10 seed accounts.
+
+## Quality Controls
+
+- **Archive everything substantive.** Don't self-censor. The extractor decides what yields claims.
+- **Write good notes.** Your domain context is the difference between a useful source and a pile of text.
+- **Check for duplicates.** Don't re-archive sources already in `inbox/archive/`.
+- **Flag cross-domain.** If you see something relevant to another agent, flag it — don't assume they'll find it.
+- **Log API costs.** Every X pull gets logged to `~/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl`.
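
Note on the new "Check for duplicates" control: one way an agent might test whether a source is already in `inbox/archive/` is to compare the `url:` field in each file's frontmatter. The following is only a rough sketch and is not part of this change. The helper names `frontmatter_url`, `normalize_url`, and `already_archived` are hypothetical, and the sketch assumes every archive file opens with a `---` frontmatter block containing a single-line `url:` field, as in the Step 2 template.

```python
from pathlib import Path
from typing import Optional


def frontmatter_url(text: str) -> Optional[str]:
    """Return the `url:` value from a leading `---` frontmatter block, if any."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # file does not start with frontmatter
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter: stop scanning
            break
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip() == "url":
            return value.strip().strip("\"'") or None
    return None


def normalize_url(url: str) -> str:
    """Hypothetical normalization: trim whitespace and a trailing slash."""
    return url.strip().rstrip("/")


def already_archived(url: str, archive_dir: Path) -> bool:
    """True if any .md file in archive_dir carries the same frontmatter url."""
    target = normalize_url(url)
    for path in sorted(archive_dir.glob("*.md")):
        found = frontmatter_url(path.read_text(encoding="utf-8"))
        if found is not None and normalize_url(found) == target:
            return True
    return False
```

Calling `already_archived(url, Path("inbox/archive"))` before writing a new file would catch exact URL repeats. It would not catch the same content archived under a different URL (for example a mirror or a quote-tweet), so a manual check still applies in those cases.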