Auto: docs/ingestion-daemon-onboarding.md | 1 file changed, 203 insertions(+), 77 deletions(-)

m3taversal 2026-03-09 19:12:22 +00:00
parent ec1da89f1f
commit 5db0c660b2


# Ingestion Daemon Onboarding
How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│            Ingestion Daemon (1 service)             │
│                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ futardio │ │  x-feed  │ │   rss    │ │ onchain  │ │
│ │ adapter  │ │ adapter  │ │ adapter  │ │ adapter  │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│      └────────────┴─────┬──────┴────────────┘       │
│                         ▼                           │
│             ┌─────────────────────────┐             │
│             │ Shared pipeline:        │             │
│             │ dedup → format → git    │             │
│             └───────────┬─────────────┘             │
└─────────────────────────┼───────────────────────────┘
                          │
                          ▼
        inbox/archive/*.md on Forgejo branch
                          │
                          ▼
                PR opened on Forgejo
                          │
                          ▼
     Webhook → headless domain agent (extraction)
                          │
                          ▼
       Agent claims PR → eval pipeline → merge
```
**The daemon handles ingestion only.** It pulls data, deduplicates, formats as source archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge).
## Single daemon, pluggable adapters
One codebase, one container, one scheduler. Each data source is an adapter — a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter.
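A minimal sketch of that scheduler-plus-registry shape in Python. The registry layout, function names, and item dicts here are illustrative assumptions, not the actual codebase:

```python
# Hypothetical adapter registry: source name -> (pull function, poll interval in
# seconds). The real daemon would build this from ingestion-config.yaml; the
# lambdas stand in for real pull_* adapter functions.
ADAPTERS = {
    "futardio": (lambda cfg: [{"source_id": "launch-123"}], 15 * 60),
    "rss":      (lambda cfg: [], 15 * 60),
}

def run_once(now, last_run, config):
    """One scheduler tick: run every adapter whose interval has elapsed.

    Returns {source_name: items} for sources that produced new items;
    a source with no new items produces no batch (and later, no branch/PR).
    """
    batches = {}
    for name, (pull, interval) in ADAPTERS.items():
        if now - last_run.get(name, 0) >= interval:
            last_run[name] = now
            items = pull(config.get(name, {}))
            if items:
                batches[name] = items
    return batches
```

Each adapter only knows how to pull and normalize; the returned batches flow into the shared pipeline unchanged.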
### Configuration
```yaml
# ingestion-config.yaml
daemon:
  dedup_db: /data/ingestion.db          # Shared SQLite for dedup
  repo_dir: /workspace/teleo-codex      # Local clone
  forgejo_url: https://git.livingip.xyz
  forgejo_token: ${FORGEJO_TOKEN}       # From env/secrets
  batch_branch_prefix: ingestion

sources:
  futardio:
    adapter: futardio
    interval: 15m
    domain: internet-finance
    significance_filter: true           # Only new launches, threshold events, refunds
    tags: [futardio, metadao, solana, permissionless-launches]

  x-ai:
    adapter: twitter
    interval: 30m
    domain: ai-alignment
    network: theseus-network.json       # Account list + tiers
    api: twitterapi.io
    engagement_threshold: 50            # Min likes/RTs to archive

  x-finance:
    adapter: twitter
    interval: 30m
    domain: internet-finance
    network: rio-network.json
    api: twitterapi.io
    engagement_threshold: 50

  rss:
    adapter: rss
    interval: 15m
    feeds:
      - url: https://noahpinion.substack.com/feed
        domain: grand-strategy
      - url: https://citriniresearch.substack.com/feed
        domain: internet-finance
      # Add feeds here — no code changes needed

  onchain:
    adapter: solana
    interval: 5m
    domain: internet-finance
    programs:
      - metadao_autocrat                # Futarchy governance events
      - metadao_conditional_vault       # Conditional token markets
    significance_filter: true           # Only governance events, not routine txs
```
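The `interval` values (`15m`, `30m`, `5m`) need converting to seconds before a scheduler can use them. One illustrative way to parse them (the helper name is not from the codebase):

```python
import re

# Suffixes accepted in config interval strings.
_UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_interval(spec: str) -> int:
    """Convert a config interval like '15m' or '1h' to seconds."""
    m = re.fullmatch(r"(\d+)([smh])", spec.strip())
    if not m:
        raise ValueError(f"bad interval: {spec!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]
```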
### Adding a new source
1. Write an adapter function: `pull_{source}(config) → list[SourceItem]`
2. Add an entry to `ingestion-config.yaml`
3. Restart daemon (or it hot-reloads config)
No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part.
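A sketch of step 1 in Python. `SourceItem` is named in this doc, but its exact fields below are an assumption (mirroring the columns of the SQLite staging table), and `pull_example` is a hypothetical adapter:

```python
from dataclasses import dataclass, field

@dataclass
class SourceItem:
    source_type: str        # 'futardio', 'twitter', 'rss', 'solana'
    source_id: str          # dedup key: launch ID, tweet ID, article URL, tx sig
    url: str
    title: str
    author: str
    content: str
    domain: str
    published_date: str
    tags: list = field(default_factory=list)

def pull_example(config) -> list[SourceItem]:
    """Hypothetical adapter: fetch raw records and normalize to SourceItem."""
    records = []  # a real adapter would fetch from the source's API or feed here
    return [
        SourceItem(
            source_type="example",
            source_id=r["id"],
            url=r["url"],
            title=r["title"],
            author=r.get("author", "unknown"),
            content=r["body"],
            domain=config.get("domain", "internet-finance"),
            published_date=r["date"],
        )
        for r in records
    ]
```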
## What the daemon produces
```yaml
linked_set: "futardio-launches-march-2026"    # Group related items
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
extraction_hints: "Focus on governance mechanism data"
priority: low | medium | high # Signal urgency to agents
contributor: "ingestion-daemon" # Attribution
```
### Body
Route each source to the primary domain that should process it:
If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`.
## Shared pipeline
### Deduplication (SQLite)
Every source item passes through dedup before archiving:
```sql
CREATE TABLE staged (
    source_type    TEXT,          -- 'futardio', 'twitter', 'rss', 'solana'
    source_id      TEXT UNIQUE,   -- Launch ID, tweet ID, article URL, tx sig
    url            TEXT,
    title          TEXT,
    author         TEXT,
    content        TEXT,
    domain         TEXT,
    published_date TEXT,
    staged_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
```
Dedup key varies by adapter:
| Adapter | Dedup key |
|---------|-----------|
| futardio | launch ID |
| twitter | tweet ID |
| rss | article URL |
| solana | tx signature |
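Because `source_id` carries a UNIQUE constraint, the dedup check can be a single `INSERT OR IGNORE`. A sketch against the `staged` table (the helper name is illustrative):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS staged (
    source_type TEXT,
    source_id   TEXT UNIQUE,
    url TEXT, title TEXT, author TEXT, content TEXT,
    domain TEXT, published_date TEXT,
    staged_at   TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def is_new(db: sqlite3.Connection, item: dict) -> bool:
    """Stage an item; returns False if its source_id was already seen."""
    cur = db.execute(
        "INSERT OR IGNORE INTO staged "
        "(source_type, source_id, url, title, author, content, domain, published_date) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (item["source_type"], item["source_id"], item["url"], item["title"],
         item["author"], item["content"], item["domain"], item["published_date"]),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows changed means the UNIQUE constraint fired
```

Items that come back `False` are dropped before formatting, so a re-poll never produces duplicate source files.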
### Git workflow
All adapters share the same git workflow:
```bash
# 1. Branch
git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M)
# 2. Stage files
git add inbox/archive/*.md
# 3. Commit
git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M)
- Sources: [brief list]
- Domains: [which domains routed to]"
# 4. Push
git push -u origin HEAD
# 5. Open PR on Forgejo
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
-H "Authorization: token $FORGEJO_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"title": "ingestion: N sources from {source} batch TIMESTAMP",
"body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.",
"head": "ingestion/{source}-TIMESTAMP",
"base": "main"
}'
```
After PR creation, the Forgejo webhook triggers the eval pipeline which routes to the appropriate domain agent for extraction.
### Batching
Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline.
## Adapter specifications
### futardio adapter
**Source:** futard.io — permissionless launchpad on Solana (MetaDAO ecosystem)
**What to pull:**
1. New project launches — name, description, funding target, FDV, status
2. Funding threshold events — project reaches funding threshold, triggers refund
3. Platform metrics snapshots — total committed, funder count, active launches
**Significance filter:** Skip routine transaction updates. Archive only:
- New launch listed
- Funding threshold reached (project funded)
- Refund triggered
- Platform milestone (e.g., total committed crosses round number)
**Example output:**
```markdown
---
type: source
title: "Futardio launch: SolForge reaches funding threshold"
author: "futard.io"
url: "https://futard.io/launches/solforge"
date: 2026-03-09
status: unprocessed
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
linked_set: futardio-launches-march-2026
priority: medium
contributor: "ingestion-daemon"
---
## Summary
SolForge reached its funding threshold on futard.io with $X committed from N funders.
## Content
- Project: SolForge
- Description: [from listing]
- FDV: [value]
- Funding: [amount] / [target] ([percentage]%)
- Funders: [N]
- Status: COMPLETE
- Launch date: 2026-03-09
- Key milestones: [any threshold events]
- Use of funds: [from listing]
## Context
Part of the futard.io permissionless launch platform (MetaDAO ecosystem).
```
### twitter adapter
**Source:** X/Twitter via twitterapi.io
**Config:** Takes a network JSON file (e.g., `theseus-network.json`, `rio-network.json`) that defines accounts and tiers.
**What to pull:** Recent tweets from network accounts, filtered by engagement threshold.
**Dedup:** Tweet ID. Skip retweets without commentary. Quote tweets are separate items.
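The engagement and retweet rules above might look like this (field names are assumptions about the twitterapi.io payload, not its documented schema):

```python
def keep_tweet(tweet: dict, engagement_threshold: int = 50) -> bool:
    """Archive only tweets that clear the engagement bar; skip bare retweets."""
    if tweet.get("is_retweet") and not tweet.get("added_text"):  # RT without commentary
        return False
    return tweet.get("likes", 0) + tweet.get("retweets", 0) >= engagement_threshold
```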
### rss adapter
**Source:** RSS/Atom feeds via feedparser
**Config:** List of feed URLs with domain routing.
**What to pull:** New articles since last poll. Full text via Crawl4AI (JS-rendered) or trafilatura (fallback).
**Dedup:** Article URL.
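Raw URLs make fragile dedup keys when feeds append tracking parameters. Stripping `utm_*` params and fragments first (a suggestion, not something this doc specifies) keeps one article to one `source_id`:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url: str) -> str:
    """Drop utm_* tracking params and the fragment before deduping on URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```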
### solana adapter
**Source:** Solana RPC / program event logs
**Config:** List of program addresses to monitor.
**What to pull:** Governance events (new proposals, vote results, treasury operations). Not routine transfers.
**Significance filter:** Only events that change governance state.
## Setup checklist
- [ ] Forgejo account with API token (write access to teleo-codex)
- [ ] SSH key or HTTPS token for git push to Forgejo
- [ ] SQLite database file for dedup staging
- [ ] `ingestion-config.yaml` with source definitions
- [ ] Cron or systemd timer on VPS
- [ ] Test: single adapter → one source file → push → PR → verify webhook triggers eval
## Files to read
| File | Purpose |
|------|---------|
| `CONTRIBUTING.md` | Human contributor workflow (similar pattern) |
| `CLAUDE.md` | Full collective operating manual |
| `inbox/archive/*.md` | Real examples of archived sources |
## Cost model
| Component | Cost |
|-----------|------|
| VPS (Hetzner CAX31) | ~$15/mo |
| X API (twitterapi.io) | ~$100/mo |
| Daemon compute | Negligible (polling + formatting) |
| Agent extraction (downstream) | Covered by Claude Max subscription on VPS |
| Total ingestion | ~$115/mo fixed |
The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.