leo: add ingest skill — full X-to-claims pipeline #103
1 changed file with 83 additions and 109 deletions
skills/ingest.md

# Skill: Ingest

Research your domain, find source material, and archive it in inbox/ with context notes. Extraction happens separately on the VPS — your job is to find and archive good sources, not to extract claims.

**Archive everything.** The inbox is a library, not a filter. If it's relevant to any Teleo domain, archive it. Null-result sources (no extractable claims) are still valuable — they prevent duplicate work and build domain context.
## Usage

```
/ingest            # Research loop: pull tweets, find sources, archive with notes
/ingest @username  # Pull and archive a specific X account's content
/ingest url <url>  # Archive a paper, article, or thread from URL
/ingest scan       # Scan your network for new content since last pull
```

## Prerequisites

…

## The Loop

### Step 1: Research

Find source material relevant to your domain. Sources include:

- **X/Twitter** — tweets, threads, debates from your network accounts
- **Papers** — academic papers, preprints, whitepapers
- **Articles** — blog posts, newsletters, news coverage
- **Reports** — industry reports, data releases, government filings
- **Conversations** — podcast transcripts, interview notes, voicenote transcripts

For X accounts, use `/x-research pull @{username}` to pull tweets, then scan for anything worth archiving. Don't just archive the "best" tweets — archive anything substantive. A thread arguing a wrong position is as valuable as one arguing a right one.
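A minimal sketch of what a pull behind `/x-research pull` can look like, assuming the conventions this skill references elsewhere (the `raw/{username}.json` cache, a 24-hour freshness window, the shared `pull-log.jsonl`, and the twitterapi.io endpoint with a 2-second delay between accounts). Nothing here runs automatically; it is an illustration, not the command's actual implementation:

```shell
RAW_DIR="$HOME/.pentagon/workspace/collective/x-ingestion/raw"
PULL_LOG="$HOME/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl"

# is_fresh FILE MINUTES: succeed if FILE exists and was modified
# within the last MINUTES minutes.
is_fresh() {
  [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# pull_user USERNAME: skip if the cache is <24h old, otherwise
# fetch the account's recent tweets, save them, and log the pull.
pull_user() {
  user="$1"
  cache="$RAW_DIR/$user.json"
  if is_fresh "$cache" 1440; then
    echo "cache fresh, skipping $user"
    return 0
  fi
  key=$(cat "$HOME/.pentagon/secrets/twitterapi-io-key")
  mkdir -p "$RAW_DIR"
  curl -s -H "X-API-Key: $key" \
    "https://api.twitterapi.io/twitter/user/last_tweets?userName=$user&count=100" \
    > "$cache"
  printf '{"account":"%s","pulled_at":"%s"}\n' \
    "$user" "$(date -u +%FT%TZ)" >> "$PULL_LOG"
  sleep 2  # rate limit between accounts
}
```

The freshness check is the part worth keeping even if your pull mechanics differ: it prevents re-spending API budget on an account pulled earlier in the same day.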

### Step 2: Archive with notes

For each source, create an archive file on your branch:

**Filename:** `inbox/archive/YYYY-MM-DD-{author-handle}-{brief-slug}.md`

```yaml
---
type: source
title: "Descriptive title of the content"
author: "Display Name (@handle)"
twitter_id: "numeric_id_from_author_object"  # X sources only
url: https://original-url
date: YYYY-MM-DD
domain: internet-finance | entertainment | ai-alignment | health | space-development | grand-strategy
secondary_domains: [other-domain]  # if cross-domain
format: tweet | thread | essay | paper | whitepaper | report | newsletter | news | transcript
status: unprocessed
priority: high | medium | low
tags: [topic1, topic2]
flagged_for_rio: ["reason"]  # if relevant to another agent's domain
---
```
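The date-stamped filename can be built mechanically. A small helper, assuming `date +%F` for the YYYY-MM-DD prefix (the handle and slug are whatever you pass in):

```shell
# archive_path HANDLE SLUG: print today's inbox/archive filename
# following the YYYY-MM-DD-{author-handle}-{brief-slug}.md convention.
archive_path() {
  printf 'inbox/archive/%s-%s-%s.md' "$(date +%F)" "$1" "$2"
}
```

Called with an illustrative handle and slug, `archive_path somehandle fed-balance-sheet` yields a path of the form `inbox/archive/YYYY-MM-DD-somehandle-fed-balance-sheet.md`.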

**Body:** Include the full source text, then your research notes.

```markdown
## Content

[Full text of tweet/thread/article. For long papers, include abstract + key sections.]

## Agent Notes

**Why this matters:** [1-2 sentences — what makes this worth archiving]

**KB connections:** [Which existing claims does this relate to, support, or challenge?]

**Extraction hints:** [What claims might the extractor pull from this? Flag specific passages.]

**Context:** [Anything the extractor needs to know — who the author is, what debate this is part of, etc.]
```

The "Agent Notes" section is where you add value. The VPS extractor is good at mechanical extraction but lacks your domain context. Your notes guide it.

### Step 3: Cross-domain flagging

When you find sources outside your domain:

- Archive them anyway (you're already reading them)
- Set the `domain` field to the correct domain, not yours
- Add `flagged_for_{agent}: ["brief reason"]` to frontmatter
- Set `priority: high` if it's urgent or challenges existing claims
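Put together, the frontmatter of a flagged source might look like this (the reason string is illustrative, and `rio` stands in for whichever agent owns the domain):

```yaml
domain: health                 # the correct domain, not the finder's
priority: high                 # urgent, or challenges an existing claim
flagged_for_rio: ["counter-evidence to an existing health claim"]
```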

### Step 4: Branch, commit, push

```bash
# Branch
git checkout -b {your-name}/sources-{date}-{brief-slug}

# Stage all archive files
git add inbox/archive/*.md

# Commit
git commit -m "{your-name}: archive {N} sources — {brief description}

- What: {N} sources from {list of authors/accounts}
- Domains: {which domains these cover}
- Priority: {any high-priority items flagged}

Pentagon-Agent: {Name} <{UUID}>"

# Push
FORGEJO_TOKEN=$(cat ~/.pentagon/secrets/forgejo-{your-name}-token)
git push -u https://{your-name}:${FORGEJO_TOKEN}@git.livingip.xyz/teleo/teleo-codex.git {branch-name}
```

Open a PR:

```bash
curl -s -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
  -H "Authorization: token ${FORGEJO_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "{your-name}: archive {N} sources — {brief description}",
    "body": "## Sources archived\n{numbered list with titles and domains}\n\n## High priority\n{any flagged items}\n\n## Cross-domain flags\n{any items flagged for other agents}",
    "base": "main",
    "head": "{branch-name}"
  }'
```

Source-only PRs should merge fast — they don't change claims, just add to the library.

## What Happens After You Archive

A cron job on the VPS checks inbox/ for `status: unprocessed` sources every 15 minutes. For each one it:

1. Reads the source + your agent notes
2. Runs extraction (skills/extract.md) via Claude headless
3. Creates claim files in the correct domain
4. Opens a PR with the extracted claims
5. Updates the source to `status: processed`
6. The eval pipeline reviews the extraction PR

**You don't need to wait for this.** Archive and move on. The VPS handles the rest.
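The scan step of that cron job amounts to a frontmatter grep. A sketch (the real job's implementation is not shown in this skill) of listing sources still awaiting extraction:

```shell
# find_unprocessed DIR: print paths of archived sources whose
# frontmatter still says "status: unprocessed" — what the
# 15-minute VPS scan looks for.
find_unprocessed() {
  grep -rl '^status: unprocessed' "$1" 2>/dev/null || true
}
```

Anchoring the pattern to the start of the line keeps a mention of the phrase in body text from triggering a false match on the status field.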

## Network Management

Your network file (`{your-name}-network.json`) lists X accounts to monitor:

```json
{
  …
}
```

**Tiers:**
- `core` — Pull every session. High signal-to-noise.
- `extended` — Pull weekly or when specifically relevant.
- `watch` — Pull once to evaluate, then promote or drop.

Agents without a network file should create one as their first task. Start with 5-10 seed accounts.

## Quality Controls

- **Archive everything substantive.** Don't self-censor. The extractor decides what yields claims.
- **Write good notes.** Your domain context is the difference between a useful source and a pile of text.
- **Check for duplicates.** Don't re-archive sources already in `inbox/archive/`.
- **Flag cross-domain.** If you see something relevant to another agent, flag it — don't assume they'll find it.
- **Log API costs.** Every X pull gets logged to `~/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl`.
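Appending to that log is one `printf` per pull. A sketch in which the log path is the one this skill names but the JSONL field names are illustrative, not a documented schema:

```shell
# log_pull ACCOUNT COUNT: append one JSONL record to the shared
# cost log. PULL_LOG can be overridden for testing; field names
# here are illustrative.
PULL_LOG="${PULL_LOG:-$HOME/.pentagon/workspace/collective/x-ingestion/pull-log.jsonl}"
log_pull() {
  printf '{"account":"%s","tweets":%d,"pulled_at":"%s"}\n' \
    "$1" "$2" "$(date -u +%FT%TZ)" >> "$PULL_LOG"
}
```

One line per record keeps the file greppable per account and trivially summable across agents.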