# Ingestion Daemon Onboarding
How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│             Ingestion Daemon (1 service)            │
│                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ futardio │ │  x-feed  │ │   rss    │ │ onchain  │ │
│ │ adapter  │ │ adapter  │ │ adapter  │ │ adapter  │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│      └────────────┴─────┬──────┴────────────┘       │
│                         ▼                           │
│            ┌─────────────────────────┐              │
│            │ Shared pipeline:        │              │
│            │ dedup → format → git    │              │
│            └────────────┬────────────┘              │
└─────────────────────────┼───────────────────────────┘
                          ▼
        inbox/archive/*.md on Forgejo branch
                          ▼
                PR opened on Forgejo
                          ▼
    Webhook → headless domain agent (extraction)
                          ▼
       Agent claims PR → eval pipeline → merge
```
**The daemon handles ingestion only.** It pulls data, deduplicates, formats as source archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge).
## Single daemon, pluggable adapters
One codebase, one container, one scheduler. Each data source is an adapter — a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter.
### Configuration
```yaml
# ingestion-config.yaml
daemon:
  dedup_db: /data/ingestion.db        # Shared SQLite for dedup
  repo_dir: /workspace/teleo-codex    # Local clone
  forgejo_url: https://git.livingip.xyz
  forgejo_token: ${FORGEJO_TOKEN}     # From env/secrets
  batch_branch_prefix: ingestion

sources:
  futardio:
    adapter: futardio
    interval: 15m
    domain: internet-finance
    significance_filter: true         # Only new launches, threshold events, refunds
    tags: [futardio, metadao, solana, permissionless-launches]

  x-ai:
    adapter: twitter
    interval: 30m
    domain: ai-alignment
    network: theseus-network.json     # Account list + tiers
    api: twitterapi.io
    engagement_threshold: 50          # Min likes/RTs to archive

  x-finance:
    adapter: twitter
    interval: 30m
    domain: internet-finance
    network: rio-network.json
    api: twitterapi.io
    engagement_threshold: 50

  rss:
    adapter: rss
    interval: 15m
    feeds:
      - url: https://noahpinion.substack.com/feed
        domain: grand-strategy
      - url: https://citriniresearch.substack.com/feed
        domain: internet-finance
      # Add feeds here — no code changes needed

  onchain:
    adapter: solana
    interval: 5m
    domain: internet-finance
    programs:
      - metadao_autocrat            # Futarchy governance events
      - metadao_conditional_vault   # Conditional token markets
    significance_filter: true       # Only governance events, not routine txs
```
### Adding a new source

1. Write an adapter function: `pull_{source}(config) → list[SourceItem]`
2. Add an entry to `ingestion-config.yaml`
3. Restart the daemon (or let it hot-reload the config)

No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part.
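The adapter contract can be sketched in Python. The `SourceItem` shape and the registry decorator below are illustrative, not the daemon's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class SourceItem:
    """Normalized unit of content every adapter returns (illustrative shape)."""
    source_type: str   # adapter name: 'futardio', 'twitter', 'rss', 'solana'
    source_id: str     # dedup key: launch ID, tweet ID, article URL, tx sig
    title: str
    author: str
    url: str
    domain: str
    content: str
    tags: list = field(default_factory=list)

# Maps the `adapter:` key in ingestion-config.yaml to its pull function.
ADAPTERS = {}

def adapter(name):
    """Decorator registering a pull_{source} function under a config name."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register

@adapter("rss")
def pull_rss(config):
    # Fetch each feed in config["feeds"] and normalize entries into
    # SourceItems. (Network fetch elided in this sketch.)
    return []
```

The scheduler then only needs the config: for each source entry, look up `ADAPTERS[entry["adapter"]]` and call it on its interval.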
## What the daemon produces
One markdown file per source item in `inbox/archive/`. Each file has YAML frontmatter + body content.
### Filename convention
```
YYYY-MM-DD-{author-or-source-handle}-{brief-slug}.md
```
Examples:
- `2026-03-09-futardio-project-launch-solforge.md`
- `2026-03-09-metaproph3t-futarchy-governance-update.md`
- `2026-03-09-pineanalytics-futardio-launch-metrics.md`
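A slug helper matching this convention might look like the following (the function name is illustrative):

```python
import re
from datetime import date

def archive_filename(published: date, handle: str, slug: str) -> str:
    """Build 'YYYY-MM-DD-{handle}-{slug}.md' with lowercased, dash-safe parts."""
    def clean(text: str) -> str:
        # Collapse anything that is not a-z or 0-9 into single dashes.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{published.isoformat()}-{clean(handle)}-{clean(slug)}.md"
```

For example, `archive_filename(date(2026, 3, 9), "futardio", "Project Launch: SolForge")` yields the first filename above.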
### Frontmatter (required fields)
```yaml
---
type: source
title: "Human-readable title of the source"
author: "Author name (@handle if applicable)"
url: "https://original-url.com"
date: 2026-03-09
domain: internet-finance
format: report | essay | tweet | thread | whitepaper | paper | news | data
status: unprocessed
tags: [futarchy, metadao, futardio, solana, permissionless-launches]
---
```
### Frontmatter (optional fields)
```yaml
linked_set: "futardio-launches-march-2026" # Group related items
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
extraction_hints: "Focus on governance mechanism data"
priority: low | medium | high # Signal urgency to agents
contributor: "ingestion-daemon" # Attribution
```
### Body
Full content text after the frontmatter. This is what agents read to extract claims. Include everything — agents need the raw material.
```markdown
## Summary
[Brief description of what this source contains]
## Content
[Full text, data, or structured content from the source]
## Context
[Optional: why this matters, what it connects to]
```
**Important:** The body is reference material, not argumentative. Don't write claims — just stage the raw content faithfully. Agents handle interpretation.
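Putting frontmatter and body together, a minimal serializer could look like this. It is a sketch: a production daemon would use a YAML library to handle quoting and escaping correctly:

```python
def render_source_file(meta: dict, body: str) -> str:
    """Serialize frontmatter keys in insertion order, then the markdown body."""
    lines = ["---"]
    for key, value in meta.items():
        if isinstance(value, list):
            # Inline list style, matching the tags examples above.
            lines.append(f"{key}: [{', '.join(value)}]")
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body
```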
### Valid domains
Route each source to the primary domain that should process it:
| Domain | Agent | What goes here |
|--------|-------|----------------|
| `internet-finance` | Rio | Futarchy, MetaDAO, tokens, DeFi, capital formation |
| `entertainment` | Clay | Creator economy, IP, media, gaming, cultural dynamics |
| `ai-alignment` | Theseus | AI safety, capability, alignment, multi-agent, governance |
| `health` | Vida | Healthcare, biotech, longevity, wellness, diagnostics |
| `space-development` | Astra | Launch, orbital, cislunar, governance, manufacturing |
| `grand-strategy` | Leo | Cross-domain, macro, geopolitics, coordination |
If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`.
## Shared pipeline
### Deduplication (SQLite)
Every source item passes through dedup before archiving:
```sql
CREATE TABLE staged (
    source_type    TEXT,          -- 'futardio', 'twitter', 'rss', 'solana'
    source_id      TEXT UNIQUE,   -- Launch ID, tweet ID, article URL, tx sig
    url            TEXT,
    title          TEXT,
    author         TEXT,
    content        TEXT,
    domain         TEXT,
    published_date TEXT,
    staged_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
```
Dedup key varies by adapter:
| Adapter | Dedup key |
|---------|-----------|
| futardio | launch ID |
| twitter | tweet ID |
| rss | article URL |
| solana | tx signature |
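The dedup gate can lean on the `UNIQUE` constraint directly: `INSERT OR IGNORE` makes the check and the staging write one atomic step. A sketch against an in-memory database with an abridged version of the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged ("
    " source_type TEXT, source_id TEXT UNIQUE, url TEXT, title TEXT)"
)

def is_new(conn: sqlite3.Connection, item: dict) -> bool:
    """True if the item was staged just now; False if source_id was already seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO staged (source_type, source_id, url, title) "
        "VALUES (:source_type, :source_id, :url, :title)",
        item,
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the UNIQUE
    # constraint caused the insert to be ignored (duplicate).
    return cur.rowcount == 1
```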
### Git workflow
All adapters share the same git workflow:
```bash
# 1. Branch
git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M)

# 2. Stage files
git add inbox/archive/*.md

# 3. Commit
git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M)

- Sources: [brief list]
- Domains: [which domains routed to]"

# 4. Push
git push -u origin HEAD

# 5. Open PR on Forgejo
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
  -H "Authorization: token $FORGEJO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "ingestion: N sources from {source} batch TIMESTAMP",
    "body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.",
    "head": "ingestion/{source}-TIMESTAMP",
    "base": "main"
  }'
```
After PR creation, the Forgejo webhook triggers the eval pipeline, which routes the batch to the appropriate domain agent for extraction.
### Batching
Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline.
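That batching rule is small enough to state as code (names are illustrative):

```python
from datetime import datetime

def plan_batches(items_by_source: dict, now: datetime) -> dict:
    """One branch per adapter per run; adapters with zero new items get none."""
    return {
        f"ingestion/{source}-{now:%Y%m%d-%H%M}": items
        for source, items in items_by_source.items()
        if items  # no items -> no branch, no PR
    }
```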
## Adapter specifications
### futardio adapter
**Source:** futard.io — permissionless launchpad on Solana (MetaDAO ecosystem)

**What to pull:**

1. New project launches — name, description, funding target, FDV, status
2. Funding threshold events — project reaches funding threshold, triggers refund
3. Platform metrics snapshots — total committed, funder count, active launches

**Significance filter:** Skip routine transaction updates. Archive only:

- New launch listed
- Funding threshold reached (project funded)
- Refund triggered
- Platform milestone (e.g., total committed crosses round number)

**Example output:**
```markdown
---
type: source
title: "Futardio launch: SolForge reaches funding threshold"
author: "futard.io"
url: "https://futard.io/launches/solforge"
date: 2026-03-09
domain: internet-finance
format: data
status: unprocessed
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
linked_set: futardio-launches-march-2026
priority: medium
contributor: "ingestion-daemon"
---
## Summary
SolForge reached its funding threshold on futard.io with $X committed from N funders.
## Content
- Project: SolForge
- Description: [from listing]
- FDV: [value]
- Funding: [amount] / [target] ([percentage]%)
- Funders: [N]
- Status: COMPLETE
- Launch date: 2026-03-09
- Use of funds: [from listing]
## Context
Part of the futard.io permissionless launch platform (MetaDAO ecosystem).
```
### twitter adapter
**Source:** X/Twitter via twitterapi.io

**Config:** Takes a network JSON file (e.g., `theseus-network.json`, `rio-network.json`) that defines accounts and tiers.

**What to pull:** Recent tweets from network accounts, filtered by engagement threshold.

**Dedup:** Tweet ID. Skip retweets without commentary. Quote tweets are separate items.
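These rules can be sketched as a single predicate. The field names (`is_retweet`, `commentary`, `likes`, `retweets`) are placeholders, not twitterapi.io's actual response schema:

```python
def should_archive(tweet: dict, threshold: int = 50) -> bool:
    """Engagement gate plus the retweet rule from the adapter spec."""
    # Plain retweets with no added commentary are skipped outright.
    if tweet.get("is_retweet") and not tweet.get("commentary"):
        return False
    # Quote tweets arrive as their own items and pass through this same gate.
    return tweet.get("likes", 0) + tweet.get("retweets", 0) >= threshold
```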
### rss adapter
**Source:** RSS/Atom feeds via feedparser

**Config:** List of feed URLs with domain routing.

**What to pull:** New articles since last poll. Full text via Crawl4AI (JS-rendered) or trafilatura (fallback).

**Dedup:** Article URL.
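A stripped-down sketch of the item extraction using only the standard library; the real adapter uses feedparser for parsing and Crawl4AI/trafilatura for full text:

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml: str) -> list:
    """Pull title/link pairs from an RSS 2.0 document; the link is the dedup key."""
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "url": item.findtext("link")}
        for item in root.iter("item")
    ]
```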
### solana adapter
**Source:** Solana RPC / program event logs

**Config:** List of program addresses to monitor.

**What to pull:** Governance events (new proposals, vote results, treasury operations). Not routine transfers.

**Significance filter:** Only events that change governance state.
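The significance filter reduces to membership in an allowlist of governance event names. The names below are hypothetical placeholders, not the monitored programs' actual log strings; check the program IDLs for the real ones:

```python
# Hypothetical event names -- replace with the actual events emitted
# by the monitored programs.
GOVERNANCE_EVENTS = {
    "ProposalCreated",
    "ProposalFinalized",
    "VoteRecorded",
    "TreasuryWithdrawal",
}

def is_significant(event_name: str) -> bool:
    """Archive only events that change governance state; drop routine transfers."""
    return event_name in GOVERNANCE_EVENTS
```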
## Setup checklist
- [ ] Forgejo account with API token (write access to teleo-codex)
- [ ] SSH key or HTTPS token for git push to Forgejo
- [ ] SQLite database file for dedup staging
- [ ] `ingestion-config.yaml` with source definitions
- [ ] Cron or systemd timer on VPS
- [ ] Test: single adapter → one source file → push → PR → verify webhook triggers eval
## Files to read
| File | What it tells you |
|------|-------------------|
| `schemas/source.md` | Canonical source archive schema |
| `schemas/claim.md` | What agents produce from your sources (downstream) |
| `skills/extract.md` | The extraction process agents run on your files |
| `CONTRIBUTING.md` | Human contributor workflow (similar pattern) |
| `CLAUDE.md` | Full collective operating manual |
| `inbox/archive/*.md` | Real examples of archived sources |
## Cost model
| Component | Cost |
|-----------|------|
| VPS (Hetzner CAX31) | ~$15/mo |
| X API (twitterapi.io) | ~$100/mo |
| Daemon compute | Negligible (polling + formatting) |
| Agent extraction (downstream) | Covered by Claude Max subscription on VPS |
| Total ingestion | ~$115/mo fixed |
The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.