Auto: docs/ingestion-daemon-onboarding.md | 1 file changed, 203 insertions(+), 77 deletions(-)
parent
ec1da89f1f
commit
5db0c660b2
1 changed files with 204 additions and 78 deletions
|
|
@ -1,24 +1,103 @@
|
||||||
# Ingestion Daemon Onboarding
|
# Ingestion Daemon Onboarding
|
||||||
|
|
||||||
How to build an ingestion daemon for the Teleo collective knowledge base. This doc covers the **futardio daemon** as the first example, but the pattern generalizes to any data source (X feeds, RSS, on-chain data, arxiv, etc.).
|
How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base.
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
```
|
```
|
||||||
Data source (futard.io, X, RSS, on-chain...)
|
┌─────────────────────────────────────────────┐
|
||||||
↓
|
│ Ingestion Daemon (1 service) │
|
||||||
Ingestion daemon (your script, runs on VPS cron)
|
│ │
|
||||||
↓
|
│ ┌──────────┐ ┌────────┐ ┌──────┐ ┌──────┐ │
|
||||||
inbox/archive/*.md (source archive files with YAML frontmatter)
|
│ │ futardio │ │ x-feed │ │ rss │ │onchain│ │
|
||||||
↓
|
│ │ adapter │ │ adapter│ │adapter│ │adapter│ │
|
||||||
Git branch → push → PR on Forgejo
|
│ └────┬─────┘ └───┬────┘ └──┬───┘ └──┬───┘ │
|
||||||
↓
|
│ └────────┬───┴────┬────┘ │ │
|
||||||
Webhook triggers headless domain agent (extraction)
|
│ ▼ ▼ ▼ │
|
||||||
↓
|
│ ┌─────────────────────────┐ │
|
||||||
Agent opens claims PR → eval pipeline reviews → merge
|
│ │ Shared pipeline: │ │
|
||||||
|
│ │ dedup → format → git │ │
|
||||||
|
│ └───────────┬─────────────┘ │
|
||||||
|
└─────────────────────┼───────────────────────┘
|
||||||
|
▼
|
||||||
|
inbox/archive/*.md on Forgejo branch
|
||||||
|
▼
|
||||||
|
PR opened on Forgejo
|
||||||
|
▼
|
||||||
|
Webhook → headless domain agent (extraction)
|
||||||
|
▼
|
||||||
|
Agent claims PR → eval pipeline → merge
|
||||||
```
|
```
|
||||||
|
|
||||||
**Your daemon is responsible for steps 1-4 only.** You pull data, format it, and push it. Agents handle everything downstream.
|
**The daemon handles ingestion only.** It pulls data, deduplicates, formats as source archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge).
|
||||||
|
|
||||||
|
## Single daemon, pluggable adapters
|
||||||
|
|
||||||
|
One codebase, one container, one scheduler. Each data source is an adapter — a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# ingestion-config.yaml
|
||||||
|
|
||||||
|
daemon:
|
||||||
|
dedup_db: /data/ingestion.db # Shared SQLite for dedup
|
||||||
|
repo_dir: /workspace/teleo-codex # Local clone
|
||||||
|
forgejo_url: https://git.livingip.xyz
|
||||||
|
forgejo_token: ${FORGEJO_TOKEN} # From env/secrets
|
||||||
|
batch_branch_prefix: ingestion
|
||||||
|
|
||||||
|
sources:
|
||||||
|
futardio:
|
||||||
|
adapter: futardio
|
||||||
|
interval: 15m
|
||||||
|
domain: internet-finance
|
||||||
|
significance_filter: true # Only new launches, threshold events, refunds
|
||||||
|
tags: [futardio, metadao, solana, permissionless-launches]
|
||||||
|
|
||||||
|
x-ai:
|
||||||
|
adapter: twitter
|
||||||
|
interval: 30m
|
||||||
|
domain: ai-alignment
|
||||||
|
network: theseus-network.json # Account list + tiers
|
||||||
|
api: twitterapi.io
|
||||||
|
engagement_threshold: 50 # Min likes/RTs to archive
|
||||||
|
|
||||||
|
x-finance:
|
||||||
|
adapter: twitter
|
||||||
|
interval: 30m
|
||||||
|
domain: internet-finance
|
||||||
|
network: rio-network.json
|
||||||
|
api: twitterapi.io
|
||||||
|
engagement_threshold: 50
|
||||||
|
|
||||||
|
rss:
|
||||||
|
adapter: rss
|
||||||
|
interval: 15m
|
||||||
|
feeds:
|
||||||
|
- url: https://noahpinion.substack.com/feed
|
||||||
|
domain: grand-strategy
|
||||||
|
- url: https://citriniresearch.substack.com/feed
|
||||||
|
domain: internet-finance
|
||||||
|
# Add feeds here — no code changes needed
|
||||||
|
|
||||||
|
onchain:
|
||||||
|
adapter: solana
|
||||||
|
interval: 5m
|
||||||
|
domain: internet-finance
|
||||||
|
programs:
|
||||||
|
- metadao_autocrat # Futarchy governance events
|
||||||
|
- metadao_conditional_vault # Conditional token markets
|
||||||
|
significance_filter: true # Only governance events, not routine txs
|
||||||
|
```
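
The `interval` values in the config are short strings such as `15m`. A minimal sketch of how a scheduler might interpret them, assuming a single number followed by a unit suffix; the helper names `parse_interval` and `is_due` are hypothetical, and only the `m` suffix actually appears in the config:

```python
import re
from typing import Optional

# Seconds per unit suffix. Only "m" appears in ingestion-config.yaml;
# "s" and "h" are assumptions added for completeness.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text: str) -> int:
    """Convert an interval string such as "15m" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def is_due(last_run: Optional[float], now: float, interval: str) -> bool:
    """True when a source has never run or its interval has elapsed."""
    return last_run is None or now - last_run >= parse_interval(interval)
```

With this shape, one scheduler loop can serve every source: it only needs each source's last run time and its `interval` string.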
|
||||||
|
|
||||||
|
### Adding a new source
|
||||||
|
|
||||||
|
1. Write an adapter function: `pull_{source}(config) → list[SourceItem]`
|
||||||
|
2. Add an entry to `ingestion-config.yaml`
|
||||||
|
3. Restart the daemon (or let it hot-reload the config, if hot reload is supported)
|
||||||
|
|
||||||
|
No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part.
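
One way the adapter contract could look, as a sketch rather than the daemon's actual code. The `SourceItem` fields are assumed to mirror the columns of the `staged` dedup table, and the registry (`ADAPTERS`, `register`, `run_source`) and the `example` adapter are hypothetical names used only for illustration:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SourceItem:
    """Normalized item returned by every adapter (fields are assumptions)."""
    source_type: str
    source_id: str
    url: str
    title: str
    author: str
    content: str
    domain: str
    published_date: str


Adapter = Callable[[dict], list[SourceItem]]
ADAPTERS: dict[str, Adapter] = {}


def register(name: str):
    """Decorator that makes an adapter available under its config name."""
    def wrap(fn: Adapter) -> Adapter:
        ADAPTERS[name] = fn
        return fn
    return wrap


@register("example")
def pull_example(config: dict) -> list[SourceItem]:
    # Hypothetical adapter: a real one would call the source's API here.
    return [SourceItem(
        source_type="example",
        source_id="demo-1",
        url="https://example.com/demo-1",
        title="Demo item",
        author="example",
        content="...",
        domain=config["domain"],
        published_date="2026-03-09",
    )]


def run_source(config: dict) -> list[SourceItem]:
    """Look up the adapter named in a source's config entry and run it."""
    return ADAPTERS[config["adapter"]](config)
```

Because the pipeline only ever sees `SourceItem` objects, a new source needs nothing beyond a registered function and a config entry.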
|
||||||
|
|
||||||
## What the daemon produces
|
## What the daemon produces
|
||||||
|
|
||||||
|
|
@ -58,7 +137,7 @@ linked_set: "futardio-launches-march-2026" # Group related items
|
||||||
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
|
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
|
||||||
extraction_hints: "Focus on governance mechanism data"
|
extraction_hints: "Focus on governance mechanism data"
|
||||||
priority: low | medium | high # Signal urgency to agents
|
priority: low | medium | high # Signal urgency to agents
|
||||||
contributor: "Ben Harper" # Who ran the daemon
|
contributor: "ingestion-daemon" # Attribution
|
||||||
```
|
```
|
||||||
|
|
||||||
### Body
|
### Body
|
||||||
|
|
@ -93,76 +172,95 @@ Route each source to the primary domain that should process it:
|
||||||
|
|
||||||
If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`.
|
If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`.
|
||||||
|
|
||||||
## Git workflow
|
## Shared pipeline
|
||||||
|
|
||||||
### Branch convention
|
### Deduplication (SQLite)
|
||||||
|
|
||||||
```
|
Every source item passes through dedup before archiving:
|
||||||
ingestion/{daemon-name}-{timestamp}
|
|
||||||
|
```sql
|
||||||
|
CREATE TABLE staged (
|
||||||
|
source_type TEXT, -- 'futardio', 'twitter', 'rss', 'solana'
|
||||||
|
source_id TEXT UNIQUE, -- Launch ID, tweet ID, article URL, tx sig
|
||||||
|
url TEXT,
|
||||||
|
title TEXT,
|
||||||
|
author TEXT,
|
||||||
|
content TEXT,
|
||||||
|
domain TEXT,
|
||||||
|
published_date TEXT,
|
||||||
|
staged_at TEXT DEFAULT CURRENT_TIMESTAMP
|
||||||
|
);
|
||||||
```
|
```
|
||||||
|
|
||||||
Example: `ingestion/futardio-20260309-1700`
|
Dedup key varies by adapter:
|
||||||
|
| Adapter | Dedup key |
|
||||||
|
|---------|-----------|
|
||||||
|
| futardio | launch ID |
|
||||||
|
| twitter | tweet ID |
|
||||||
|
| rss | article URL |
|
||||||
|
| solana | tx signature |
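
A minimal sketch of the dedup check against the `staged` table, using Python's standard `sqlite3` module. The helper names `open_db` and `stage_if_new` are hypothetical. It relies on the `UNIQUE` constraint on `source_id`: `INSERT OR IGNORE` skips a row whose key already exists, and the cursor's `rowcount` shows whether anything was inserted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS staged (
    source_type TEXT,
    source_id TEXT UNIQUE,
    url TEXT,
    title TEXT,
    author TEXT,
    content TEXT,
    domain TEXT,
    published_date TEXT,
    staged_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open the shared dedup database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def stage_if_new(conn: sqlite3.Connection, item: dict) -> bool:
    """Insert the item; return True only if its source_id was not seen before."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO staged "
        "(source_type, source_id, url, title, author, content, domain, published_date) "
        "VALUES (:source_type, :source_id, :url, :title, :author, :content, "
        ":domain, :published_date)",
        item,
    )
    conn.commit()
    return cursor.rowcount == 1
```

Since `source_id` alone is unique in this schema, IDs from different adapters share one namespace. If collisions between, say, a launch ID and a tweet ID are a concern, prefixing the ID with the source type is one option.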
|
||||||
|
|
||||||
### Commit format
|
### Git workflow
|
||||||
|
|
||||||
```
|
All adapters share the same git workflow:
|
||||||
ingestion: {N} sources from {daemon-name} batch {timestamp}
|
|
||||||
|
|
||||||
- Sources: [brief list]
|
|
||||||
- Domains: [which domains routed to]
|
|
||||||
|
|
||||||
Pentagon-Agent: {daemon-name} <{daemon-uuid-if-applicable}>
|
|
||||||
```
|
|
||||||
|
|
||||||
### PR creation
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git checkout -b ingestion/futardio-$(date +%Y%m%d-%H%M)
|
# 1. Branch
|
||||||
|
git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M)
|
||||||
|
|
||||||
|
# 2. Stage files
|
||||||
git add inbox/archive/*.md
|
git add inbox/archive/*.md
|
||||||
git commit -m "ingestion: N sources from futardio batch $(date +%Y%m%d-%H%M)"
|
|
||||||
|
# 3. Commit
|
||||||
|
git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M)
|
||||||
|
|
||||||
|
- Sources: [brief list]
|
||||||
|
- Domains: [which domains routed to]"
|
||||||
|
|
||||||
|
# 4. Push
|
||||||
git push -u origin HEAD
|
git push -u origin HEAD
|
||||||
# Open PR on Forgejo
|
|
||||||
|
# 5. Open PR on Forgejo
|
||||||
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
|
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
|
||||||
-H "Authorization: token YOUR_TOKEN" \
|
-H "Authorization: token $FORGEJO_TOKEN" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
"title": "ingestion: N sources from futardio batch TIMESTAMP",
|
"title": "ingestion: N sources from {source} batch TIMESTAMP",
|
||||||
"body": "## Batch summary\n- N source files\n- Domain: internet-finance\n- Source: futard.io\n\nAutomated ingestion daemon.",
|
"body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.",
|
||||||
"head": "ingestion/futardio-TIMESTAMP",
|
"head": "ingestion/{source}-TIMESTAMP",
|
||||||
"base": "main"
|
"base": "main"
|
||||||
}'
|
}'
|
||||||
```
|
```
|
||||||
|
|
||||||
After PR is created, the Forgejo webhook triggers the eval pipeline which routes to the appropriate domain agent for extraction.
|
After PR creation, the Forgejo webhook triggers the eval pipeline, which routes the PR to the appropriate domain agent for extraction.
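
The same PR request can be built from Python with only the standard library. This is a sketch under assumptions: the function name `build_pr_request` and its parameters are hypothetical, while the endpoint, headers, and JSON fields are copied from the `curl` call in the git workflow. The function builds the request without sending it, so it can be inspected or tested offline.

```python
import json
import urllib.request

API_URL = "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls"


def build_pr_request(token: str, source: str, domain: str,
                     count: int, stamp: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that opens the batch PR."""
    payload = {
        "title": f"ingestion: {count} sources from {source} batch {stamp}",
        "body": (
            "## Batch summary\n"
            f"- {count} source files\n"
            f"- Domain: {domain}\n"
            f"- Source: {source}\n\n"
            "Automated ingestion daemon."
        ),
        "head": f"ingestion/{source}-{stamp}",
        "base": "main",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. The token would come from the `FORGEJO_TOKEN` environment variable, never from a hard-coded string.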
|
||||||
|
|
||||||
## Futardio Daemon — Specific Implementation
|
### Batching
|
||||||
|
|
||||||
### What to pull
|
Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline.
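
The batching rule can be sketched as a small planning function, assuming the branch and commit conventions from the git workflow. `plan_batch` and its return shape are hypothetical. It returns `None` for an empty poll so that no branch or PR is created.

```python
from datetime import datetime
from typing import Optional


def plan_batch(source: str, titles: list[str], domains: list[str],
               now: datetime) -> Optional[dict]:
    """Return branch name and commit message for one adapter run, or None."""
    if not titles:
        return None  # Nothing new this cycle: no branch, no PR.
    stamp = now.strftime("%Y%m%d-%H%M")
    message = (
        f"ingestion: {len(titles)} sources from {source} batch {stamp}\n\n"
        f"- Sources: {', '.join(titles)}\n"
        f"- Domains: {', '.join(sorted(set(domains)))}"
    )
    return {"branch": f"ingestion/{source}-{stamp}", "message": message}
```

For example, three new futardio launches found at 17:00 on 2026-03-09 would all land on the single branch `ingestion/futardio-20260309-1700`.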
|
||||||
|
|
||||||
futard.io is a permissionless launchpad on Solana (MetaDAO ecosystem). Key data:
|
## Adapter specifications
|
||||||
|
|
||||||
1. **New project launches** — name, description, funding target, FDV, status (LIVE/REFUNDING/COMPLETE)
|
### futardio adapter
|
||||||
2. **Funding progress** — committed amounts, funder counts, threshold status
|
|
||||||
3. **Transaction feed** — individual contributions with amounts and timestamps
|
|
||||||
4. **Platform metrics** — total committed ($17.8M+), total funders (1k+), active launches (44+)
|
|
||||||
|
|
||||||
### Poll interval
|
**Source:** futard.io — permissionless launchpad on Solana (MetaDAO ecosystem)
|
||||||
|
|
||||||
Every 15 minutes. futard.io data changes frequently (live fundraising), but most changes are incremental transaction data. New project launches are the high-signal events.
|
**What to pull:**
|
||||||
|
1. New project launches — name, description, funding target, FDV, status
|
||||||
|
2. Funding threshold events (a project reaches its funding threshold, or a refund is triggered)
|
||||||
|
3. Platform metrics snapshots — total committed, funder count, active launches
|
||||||
|
|
||||||
### Deduplication
|
**Significance filter:** Skip routine transaction updates. Archive only:
|
||||||
|
- New launch listed
|
||||||
|
- Funding threshold reached (project funded)
|
||||||
|
- Refund triggered
|
||||||
|
- Platform milestone (e.g., total committed crosses round number)
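
A sketch of how this filter might be expressed. The event `kind` labels and both function names are hypothetical and are not taken from futard.io. The milestone check treats "crosses a round number" as moving into a new multiple of a chosen step.

```python
# Hypothetical labels for the four event types the filter keeps.
SIGNIFICANT_KINDS = {
    "new_launch",
    "threshold_reached",
    "refund_triggered",
    "platform_milestone",
}


def is_significant(event: dict) -> bool:
    """Keep only the event types listed in the significance filter."""
    return event.get("kind") in SIGNIFICANT_KINDS


def crossed_milestone(previous_total: float, current_total: float,
                      step: float) -> bool:
    """True when the total moved past a multiple of `step` since the last poll."""
    return current_total // step > previous_total // step
```

With a step of 1,000,000, a move from 17.8M to 18.1M total committed counts as a milestone, while a move from 17.1M to 17.8M does not.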
|
||||||
|
|
||||||
Before creating a source file, check:
|
**Example output:**
|
||||||
1. **Filename dedup** — does `inbox/archive/` already have a file for this source?
|
|
||||||
2. **Content dedup** — SQLite staging table with `source_id` unique constraint
|
|
||||||
3. **Significance filter** — skip trivial transaction updates; archive meaningful state changes (new launch, funding threshold reached, refund triggered)
|
|
||||||
|
|
||||||
### Example output
|
|
||||||
|
|
||||||
```markdown
|
```markdown
|
||||||
---
|
---
|
||||||
type: source
|
type: source
|
||||||
title: "Futardio launch: SolForge reaches 80% funding threshold"
|
title: "Futardio launch: SolForge reaches funding threshold"
|
||||||
author: "futard.io"
|
author: "futard.io"
|
||||||
url: "https://futard.io/launches/solforge"
|
url: "https://futard.io/launches/solforge"
|
||||||
date: 2026-03-09
|
date: 2026-03-09
|
||||||
|
|
@ -172,48 +270,64 @@ status: unprocessed
|
||||||
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
|
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
|
||||||
linked_set: futardio-launches-march-2026
|
linked_set: futardio-launches-march-2026
|
||||||
priority: medium
|
priority: medium
|
||||||
contributor: "Ben Harper (ingestion daemon)"
|
contributor: "ingestion-daemon"
|
||||||
---
|
---
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
SolForge project on futard.io reached 80% of its funding threshold, with $X committed from N funders.
|
SolForge reached its funding threshold on futard.io with $X committed from N funders.
|
||||||
|
|
||||||
## Content
|
## Content
|
||||||
- Project: SolForge
|
- Project: SolForge
|
||||||
- Description: [from futard.io listing]
|
- Description: [from listing]
|
||||||
- FDV: [value]
|
- FDV: [value]
|
||||||
- Funding committed: [amount] / [target] ([percentage]%)
|
- Funding: [amount] / [target] ([percentage]%)
|
||||||
- Funder count: [N]
|
- Funders: [N]
|
||||||
- Status: LIVE
|
- Status: COMPLETE
|
||||||
- Launch date: 2026-03-09
|
- Launch date: 2026-03-09
|
||||||
- Key milestones: [any threshold events]
|
- Use of funds: [from listing]
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
Part of the futard.io permissionless launch platform (MetaDAO ecosystem). Relevant to existing claims on permissionless capital formation and futarchy-governed launches.
|
Part of the futard.io permissionless launch platform (MetaDAO ecosystem).
|
||||||
```
|
```
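
A sketch of how the shared pipeline might render such a file. It emits only the frontmatter fields visible in this example and leaves out any the diff does not show. The function name `render_source` and the input dictionary keys are hypothetical. String values are quoted with `json.dumps`, whose output is also a valid double-quoted YAML scalar.

```python
import json


def render_source(item: dict) -> str:
    """Render a source archive file: YAML frontmatter followed by the body."""
    def quoted(value: str) -> str:
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "---",
        "type: source",
        f"title: {quoted(item['title'])}",
        f"author: {quoted(item['author'])}",
        f"url: {quoted(item['url'])}",
        f"date: {item['date']}",
        "status: unprocessed",
        f"tags: [{', '.join(item['tags'])}]",
        f"priority: {item['priority']}",
        'contributor: "ingestion-daemon"',
        "---",
        "",
        "## Summary",
        item["summary"],
        "",
        "## Content",
        item["content"],
        "",
    ]
    return "\n".join(lines)
```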
|
||||||
|
|
||||||
## Generalizing to other daemons
|
### twitter adapter
|
||||||
|
|
||||||
The pattern is identical for any data source. Only these things change:
|
**Source:** X/Twitter via twitterapi.io
|
||||||
|
|
||||||
| Parameter | Futardio | X feeds | RSS | On-chain |
|
**Config:** Takes a network JSON file (e.g., `theseus-network.json`, `rio-network.json`) that defines accounts and tiers.
|
||||||
|-----------|----------|---------|-----|----------|
|
|
||||||
| Data source | futard.io web/API | twitterapi.io | feedparser | Solana RPC |
|
|
||||||
| Poll interval | 15 min | 15-30 min | 15 min | 5 min |
|
|
||||||
| Domain routing | internet-finance | per-account | per-feed | internet-finance |
|
|
||||||
| Dedup key | launch ID | tweet ID | article URL | tx signature |
|
|
||||||
| Format field | data | tweet/thread | essay/news | data |
|
|
||||||
| Significance filter | new launch, threshold event | engagement threshold | always archive | governance events |
|
|
||||||
|
|
||||||
The output format (source archive markdown) and git workflow (branch → PR → webhook) are always the same.
|
**What to pull:** Recent tweets from network accounts, filtered by engagement threshold.
|
||||||
|
|
||||||
|
**Dedup:** Tweet ID. Skip retweets without commentary. Quote tweets are separate items.
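
A sketch of these rules as a single predicate. The tweet field names (`is_retweet`, `commentary`, `likes`, `retweets`) are hypothetical and are not the twitterapi.io response schema. Summing likes and retweets is an assumption: the config only says "Min likes/RTs", so the real rule may compare each count separately.

```python
def should_archive(tweet: dict, threshold: int = 50) -> bool:
    """Apply the retweet rule and the engagement threshold to one tweet."""
    # Plain retweets add no new content; quote tweets carry commentary
    # and are treated as separate items.
    if tweet.get("is_retweet") and not tweet.get("commentary"):
        return False
    # Assumption: engagement is likes plus retweets.
    engagement = tweet.get("likes", 0) + tweet.get("retweets", 0)
    return engagement >= threshold
```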
|
||||||
|
|
||||||
|
### rss adapter
|
||||||
|
|
||||||
|
**Source:** RSS/Atom feeds via feedparser
|
||||||
|
|
||||||
|
**Config:** List of feed URLs with domain routing.
|
||||||
|
|
||||||
|
**What to pull:** New articles since last poll. Full text via Crawl4AI (JS-rendered) or trafilatura (fallback).
|
||||||
|
|
||||||
|
**Dedup:** Article URL.
|
||||||
|
|
||||||
|
### solana adapter
|
||||||
|
|
||||||
|
**Source:** Solana RPC / program event logs
|
||||||
|
|
||||||
|
**Config:** List of programs to monitor (referenced by name in `ingestion-config.yaml`, e.g., `metadao_autocrat`).
|
||||||
|
|
||||||
|
**What to pull:** Governance events (new proposals, vote results, treasury operations). Not routine transfers.
|
||||||
|
|
||||||
|
**Significance filter:** Only events that change governance state.
|
||||||
|
|
||||||
## Setup checklist
|
## Setup checklist
|
||||||
|
|
||||||
- [ ] Forgejo account with API token (write access to teleo-codex)
|
- [ ] Forgejo account with API token (write access to teleo-codex)
|
||||||
- [ ] SSH key or HTTPS token for git push
|
- [ ] SSH key or HTTPS token for git push to Forgejo
|
||||||
- [ ] SQLite database for dedup staging
|
- [ ] SQLite database file for dedup staging
|
||||||
- [ ] Cron job on VPS (every 15 min)
|
- [ ] `ingestion-config.yaml` with source definitions
|
||||||
- [ ] Test: create one source file manually, push, verify PR triggers eval pipeline
|
- [ ] Cron or systemd timer on VPS
|
||||||
|
- [ ] Test: single adapter → one source file → push → PR → verify webhook triggers eval
|
||||||
|
|
||||||
## Files to read
|
## Files to read
|
||||||
|
|
||||||
|
|
@ -225,3 +339,15 @@ The output format (source archive markdown) and git workflow (branch → PR →
|
||||||
| `CONTRIBUTING.md` | Human contributor workflow (similar pattern) |
|
| `CONTRIBUTING.md` | Human contributor workflow (similar pattern) |
|
||||||
| `CLAUDE.md` | Full collective operating manual |
|
| `CLAUDE.md` | Full collective operating manual |
|
||||||
| `inbox/archive/*.md` | Real examples of archived sources |
|
| `inbox/archive/*.md` | Real examples of archived sources |
|
||||||
|
|
||||||
|
## Cost model
|
||||||
|
|
||||||
|
| Component | Cost |
|
||||||
|
|-----------|------|
|
||||||
|
| VPS (Hetzner CAX31) | ~$15/mo |
|
||||||
|
| X API (twitterapi.io) | ~$100/mo |
|
||||||
|
| Daemon compute | Negligible (polling + formatting) |
|
||||||
|
| Agent extraction (downstream) | Covered by Claude Max subscription on VPS |
|
||||||
|
| Total ingestion | ~$115/mo fixed |
|
||||||
|
|
||||||
|
The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.
|
||||||
|
|
|
||||||