From 5db0c660b27ec1077c46e8bb74e751cddd4579d1 Mon Sep 17 00:00:00 2001 From: m3taversal Date: Mon, 9 Mar 2026 19:12:22 +0000 Subject: [PATCH] Auto: docs/ingestion-daemon-onboarding.md | 1 file changed, 203 insertions(+), 77 deletions(-) --- docs/ingestion-daemon-onboarding.md | 282 ++++++++++++++++++++-------- 1 file changed, 204 insertions(+), 78 deletions(-) diff --git a/docs/ingestion-daemon-onboarding.md b/docs/ingestion-daemon-onboarding.md index 713d039..fea52e2 100644 --- a/docs/ingestion-daemon-onboarding.md +++ b/docs/ingestion-daemon-onboarding.md @@ -1,24 +1,103 @@ # Ingestion Daemon Onboarding -How to build an ingestion daemon for the Teleo collective knowledge base. This doc covers the **futardio daemon** as the first example, but the pattern generalizes to any data source (X feeds, RSS, on-chain data, arxiv, etc.). +How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base. ## Architecture ``` -Data source (futard.io, X, RSS, on-chain...) 
- ↓ -Ingestion daemon (your script, runs on VPS cron) - ↓ -inbox/archive/*.md (source archive files with YAML frontmatter) - ↓ -Git branch → push → PR on Forgejo - ↓ -Webhook triggers headless domain agent (extraction) - ↓ -Agent opens claims PR → eval pipeline reviews → merge +┌─────────────────────────────────────────────┐ +│ Ingestion Daemon (1 service) │ +│ │ +│ ┌──────────┐ ┌────────┐ ┌──────┐ ┌──────┐ │ +│ │ futardio │ │ x-feed │ │ rss │ │onchain│ │ +│ │ adapter │ │ adapter│ │adapter│ │adapter│ │ +│ └────┬─────┘ └───┬────┘ └──┬───┘ └──┬───┘ │ +│ └────────┬───┴────┬────┘ │ │ +│ ▼ ▼ ▼ │ +│ ┌─────────────────────────┐ │ +│ │ Shared pipeline: │ │ +│ │ dedup → format → git │ │ +│ └───────────┬─────────────┘ │ +└─────────────────────┼───────────────────────┘ + ▼ + inbox/archive/*.md on Forgejo branch + ▼ + PR opened on Forgejo + ▼ + Webhook → headless domain agent (extraction) + ▼ + Agent opens claims PR → eval pipeline → merge ``` -**Your daemon is responsible for steps 1-4 only.** You pull data, format it, and push it. Agents handle everything downstream. +**The daemon handles ingestion only.** It pulls data, deduplicates, formats as source archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge). + +## Single daemon, pluggable adapters + +One codebase, one container, one scheduler. Each data source is an adapter: a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter. 
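The adapter-plus-shared-pipeline shape described above could be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the `ADAPTERS` registry, the `register` decorator, `run_source`, the trimmed `SourceItem` fields, and the in-memory `seen` set (standing in for the SQLite dedup table) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SourceItem:
    source_type: str  # e.g. "rss"
    source_id: str    # dedup key, e.g. the article URL
    title: str
    url: str


# Hypothetical registry: adapter name -> pull function.
Adapter = Callable[[dict], list[SourceItem]]
ADAPTERS: dict[str, Adapter] = {}


def register(name: str) -> Callable[[Adapter], Adapter]:
    """Decorator that records an adapter function under its config name."""
    def wrap(fn: Adapter) -> Adapter:
        ADAPTERS[name] = fn
        return fn
    return wrap


@register("rss")
def pull_rss(config: dict) -> list[SourceItem]:
    # Stub: a real adapter would fetch and parse the configured feeds.
    return [
        SourceItem("rss", url, f"Article at {url}", url)
        for url in config.get("feeds", [])
    ]


def run_source(config: dict, seen: set[tuple[str, str]]) -> list[SourceItem]:
    """Shared pipeline entry: pull via the named adapter, drop seen items."""
    items = ADAPTERS[config["adapter"]](config)
    fresh = [i for i in items if (i.source_type, i.source_id) not in seen]
    seen.update((i.source_type, i.source_id) for i in fresh)
    return fresh  # formatting, git, and PR steps would follow here
```

Because every adapter returns the same item type, the pipeline never needs to know which source it is handling; running the same config twice against the same `seen` set yields an empty second batch.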
+ +### Configuration + +```yaml +# ingestion-config.yaml + +daemon: + dedup_db: /data/ingestion.db # Shared SQLite for dedup + repo_dir: /workspace/teleo-codex # Local clone + forgejo_url: https://git.livingip.xyz + forgejo_token: ${FORGEJO_TOKEN} # From env/secrets + batch_branch_prefix: ingestion + +sources: + futardio: + adapter: futardio + interval: 15m + domain: internet-finance + significance_filter: true # Only new launches, threshold events, refunds + tags: [futardio, metadao, solana, permissionless-launches] + + x-ai: + adapter: twitter + interval: 30m + domain: ai-alignment + network: theseus-network.json # Account list + tiers + api: twitterapi.io + engagement_threshold: 50 # Min likes/RTs to archive + + x-finance: + adapter: twitter + interval: 30m + domain: internet-finance + network: rio-network.json + api: twitterapi.io + engagement_threshold: 50 + + rss: + adapter: rss + interval: 15m + feeds: + - url: https://noahpinion.substack.com/feed + domain: grand-strategy + - url: https://citriniresearch.substack.com/feed + domain: internet-finance + # Add feeds here — no code changes needed + + onchain: + adapter: solana + interval: 5m + domain: internet-finance + programs: + - metadao_autocrat # Futarchy governance events + - metadao_conditional_vault # Conditional token markets + significance_filter: true # Only governance events, not routine txs +``` + +### Adding a new source + +1. Write an adapter function: `pull_{source}(config) → list[SourceItem]` +2. Add an entry to `ingestion-config.yaml` +3. Restart daemon (or it hot-reloads config) + +No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part. 
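Step 1 above might look something like the sketch below for a made-up source called `example`. The `SourceItem` fields are assumed to mirror the columns of the `staged` dedup table; `fetch_raw`, the `sample_records` config key, and the raw record keys (`id`, `body`, `published`) are hypothetical stand-ins for whatever the real source returns.

```python
from dataclasses import dataclass


@dataclass
class SourceItem:
    # Fields assumed to mirror the columns of the `staged` dedup table.
    source_type: str
    source_id: str
    url: str
    title: str
    author: str
    content: str
    domain: str
    published_date: str


def fetch_raw(config: dict) -> list[dict]:
    """Placeholder for the source-specific HTTP or RPC call (hypothetical)."""
    return config.get("sample_records", [])


def pull_example(config: dict) -> list[SourceItem]:
    """Adapter for a hypothetical 'example' source: pull, then normalize."""
    items = []
    for raw in fetch_raw(config):
        if not raw.get("id") or not raw.get("url"):
            continue  # skip records that cannot be deduplicated or linked
        items.append(SourceItem(
            source_type="example",
            source_id=str(raw["id"]),
            url=raw["url"],
            title=raw.get("title", "").strip(),
            author=raw.get("author", "unknown"),
            content=raw.get("body", ""),
            domain=config["domain"],  # routing comes from the config entry
            published_date=raw.get("published", ""),
        ))
    return items
```

The adapter only normalizes; it does not touch SQLite, git, or Forgejo, which is what keeps steps 2 and 3 free of code changes.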
## What the daemon produces @@ -58,7 +137,7 @@ linked_set: "futardio-launches-march-2026" # Group related items cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains extraction_hints: "Focus on governance mechanism data" priority: low | medium | high # Signal urgency to agents -contributor: "Ben Harper" # Who ran the daemon +contributor: "ingestion-daemon" # Attribution ``` ### Body @@ -93,76 +172,95 @@ Route each source to the primary domain that should process it: If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`. -## Git workflow +## Shared pipeline -### Branch convention +### Deduplication (SQLite) -``` -ingestion/{daemon-name}-{timestamp} +Every source item passes through dedup before archiving: + +```sql +CREATE TABLE staged ( + source_type TEXT, -- 'futardio', 'twitter', 'rss', 'solana' + source_id TEXT UNIQUE, -- Launch ID, tweet ID, article URL, tx sig + url TEXT, + title TEXT, + author TEXT, + content TEXT, + domain TEXT, + published_date TEXT, + staged_at TEXT DEFAULT CURRENT_TIMESTAMP +); ``` -Example: `ingestion/futardio-20260309-1700` +Dedup key varies by adapter: +| Adapter | Dedup key | +|---------|-----------| +| futardio | launch ID | +| twitter | tweet ID | +| rss | article URL | +| solana | tx signature | -### Commit format +### Git workflow -``` -ingestion: {N} sources from {daemon-name} batch {timestamp} - -- Sources: [brief list] -- Domains: [which domains routed to] - -Pentagon-Agent: {daemon-name} <{daemon-uuid-if-applicable}> -``` - -### PR creation +All adapters share the same git workflow: ```bash -git checkout -b ingestion/futardio-$(date +%Y%m%d-%H%M) +# 1. Branch +git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M) + +# 2. Stage files git add inbox/archive/*.md -git commit -m "ingestion: N sources from futardio batch $(date +%Y%m%d-%H%M)" + +# 3. 
Commit +git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M) + +- Sources: [brief list] +- Domains: [which domains routed to]" + +# 4. Push git push -u origin HEAD -# Open PR on Forgejo + +# 5. Open PR on Forgejo curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \ - -H "Authorization: token YOUR_TOKEN" \ + -H "Authorization: token $FORGEJO_TOKEN" \ -H "Content-Type: application/json" \ -d '{ - "title": "ingestion: N sources from futardio batch TIMESTAMP", - "body": "## Batch summary\n- N source files\n- Domain: internet-finance\n- Source: futard.io\n\nAutomated ingestion daemon.", - "head": "ingestion/futardio-TIMESTAMP", + "title": "ingestion: N sources from {source} batch TIMESTAMP", + "body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.", + "head": "ingestion/{source}-TIMESTAMP", "base": "main" }' ``` -After PR is created, the Forgejo webhook triggers the eval pipeline which routes to the appropriate domain agent for extraction. +After PR creation, the Forgejo webhook triggers the eval pipeline which routes to the appropriate domain agent for extraction. -## Futardio Daemon — Specific Implementation +### Batching -### What to pull +Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline. -futard.io is a permissionless launchpad on Solana (MetaDAO ecosystem). Key data: +## Adapter specifications -1. **New project launches** — name, description, funding target, FDV, status (LIVE/REFUNDING/COMPLETE) -2. **Funding progress** — committed amounts, funder counts, threshold status -3. **Transaction feed** — individual contributions with amounts and timestamps -4. 
**Platform metrics** — total committed ($17.8M+), total funders (1k+), active launches (44+) +### futardio adapter -### Poll interval +**Source:** futard.io — permissionless launchpad on Solana (MetaDAO ecosystem) -Every 15 minutes. futard.io data changes frequently (live fundraising), but most changes are incremental transaction data. New project launches are the high-signal events. +**What to pull:** +1. New project launches — name, description, funding target, FDV, status +2. Funding threshold events — project reaches funding threshold, triggers refund +3. Platform metrics snapshots — total committed, funder count, active launches -### Deduplication +**Significance filter:** Skip routine transaction updates. Archive only: +- New launch listed +- Funding threshold reached (project funded) +- Refund triggered +- Platform milestone (e.g., total committed crosses round number) -Before creating a source file, check: -1. **Filename dedup** — does `inbox/archive/` already have a file for this source? -2. **Content dedup** — SQLite staging table with `source_id` unique constraint -3. **Significance filter** — skip trivial transaction updates; archive meaningful state changes (new launch, funding threshold reached, refund triggered) - -### Example output +**Example output:** ```markdown --- type: source -title: "Futardio launch: SolForge reaches 80% funding threshold" +title: "Futardio launch: SolForge reaches funding threshold" author: "futard.io" url: "https://futard.io/launches/solforge" date: 2026-03-09 @@ -172,48 +270,64 @@ status: unprocessed tags: [futardio, metadao, solana, permissionless-launches, capital-formation] linked_set: futardio-launches-march-2026 priority: medium -contributor: "Ben Harper (ingestion daemon)" +contributor: "ingestion-daemon" --- ## Summary -SolForge project on futard.io reached 80% of its funding threshold, with $X committed from N funders. +SolForge reached its funding threshold on futard.io with $X committed from N funders. 
## Content - Project: SolForge -- Description: [from futard.io listing] +- Description: [from listing] - FDV: [value] -- Funding committed: [amount] / [target] ([percentage]%) -- Funder count: [N] -- Status: LIVE +- Funding: [amount] / [target] ([percentage]%) +- Funders: [N] +- Status: COMPLETE - Launch date: 2026-03-09 -- Key milestones: [any threshold events] +- Use of funds: [from listing] ## Context -Part of the futard.io permissionless launch platform (MetaDAO ecosystem). Relevant to existing claims on permissionless capital formation and futarchy-governed launches. +Part of the futard.io permissionless launch platform (MetaDAO ecosystem). ``` -## Generalizing to other daemons +### twitter adapter -The pattern is identical for any data source. Only these things change: +**Source:** X/Twitter via twitterapi.io -| Parameter | Futardio | X feeds | RSS | On-chain | -|-----------|----------|---------|-----|----------| -| Data source | futard.io web/API | twitterapi.io | feedparser | Solana RPC | -| Poll interval | 15 min | 15-30 min | 15 min | 5 min | -| Domain routing | internet-finance | per-account | per-feed | internet-finance | -| Dedup key | launch ID | tweet ID | article URL | tx signature | -| Format field | data | tweet/thread | essay/news | data | -| Significance filter | new launch, threshold event | engagement threshold | always archive | governance events | +**Config:** Takes a network JSON file (e.g., `theseus-network.json`, `rio-network.json`) that defines accounts and tiers. -The output format (source archive markdown) and git workflow (branch → PR → webhook) are always the same. +**What to pull:** Recent tweets from network accounts, filtered by engagement threshold. + +**Dedup:** Tweet ID. Skip retweets without commentary. Quote tweets are separate items. + +### rss adapter + +**Source:** RSS/Atom feeds via feedparser + +**Config:** List of feed URLs with domain routing. + +**What to pull:** New articles since last poll. 
Full text via Crawl4AI (JS-rendered) or trafilatura (fallback). + +**Dedup:** Article URL. + +### solana adapter + +**Source:** Solana RPC / program event logs + +**Config:** List of program addresses to monitor. + +**What to pull:** Governance events (new proposals, vote results, treasury operations). Not routine transfers. + +**Significance filter:** Only events that change governance state. ## Setup checklist - [ ] Forgejo account with API token (write access to teleo-codex) -- [ ] SSH key or HTTPS token for git push -- [ ] SQLite database for dedup staging -- [ ] Cron job on VPS (every 15 min) -- [ ] Test: create one source file manually, push, verify PR triggers eval pipeline +- [ ] SSH key or HTTPS token for git push to Forgejo +- [ ] SQLite database file for dedup staging +- [ ] `ingestion-config.yaml` with source definitions +- [ ] Cron or systemd timer on VPS +- [ ] Test: single adapter → one source file → push → PR → verify webhook triggers eval ## Files to read @@ -225,3 +339,15 @@ The output format (source archive markdown) and git workflow (branch → PR → | `CONTRIBUTING.md` | Human contributor workflow (similar pattern) | | `CLAUDE.md` | Full collective operating manual | | `inbox/archive/*.md` | Real examples of archived sources | + +## Cost model + +| Component | Cost | +|-----------|------| +| VPS (Hetzner CAX31) | ~$15/mo | +| X API (twitterapi.io) | ~$100/mo | +| Daemon compute | Negligible (polling + formatting) | +| Agent extraction (downstream) | Covered by Claude Max subscription on VPS | +| Total ingestion | ~$115/mo fixed | + +The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.