# Ingestion Daemon Onboarding
How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base.
## Architecture
```
┌─────────────────────────────────────────────────────┐
│             Ingestion Daemon (1 service)            │
│                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ futardio │ │  x-feed  │ │   rss    │ │ onchain  │ │
│ │ adapter  │ │ adapter  │ │ adapter  │ │ adapter  │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│      └────────────┴─────┬──────┴────────────┘       │
│                         ▼                           │
│            ┌─────────────────────────┐              │
│            │ Shared pipeline:        │              │
│            │ dedup → format → git    │              │
│            └────────────┬────────────┘              │
└─────────────────────────┼───────────────────────────┘
                          ▼
        inbox/archive/*.md on Forgejo branch
                          ▼
                PR opened on Forgejo
                          ▼
    Webhook → headless domain agent (extraction)
                          ▼
       Agent claims PR → eval pipeline → merge
```
**The daemon handles ingestion only.** It pulls data, deduplicates, formats as source archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge).
## Single daemon, pluggable adapters
One codebase, one container, one scheduler. Each data source is an adapter — a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter.
### Configuration
```yaml
# ingestion-config.yaml
daemon:
  dedup_db: /data/ingestion.db        # Shared SQLite for dedup
  repo_dir: /workspace/teleo-codex    # Local clone
  forgejo_url: https://git.livingip.xyz
  forgejo_token: ${FORGEJO_TOKEN}     # From env/secrets
  batch_branch_prefix: ingestion

sources:
  futardio:
    adapter: futardio
    interval: 15m
    domain: internet-finance
    significance_filter: true         # Only new launches, threshold events, refunds
    tags: [futardio, metadao, solana, permissionless-launches]

  x-ai:
    adapter: twitter
    interval: 30m
    domain: ai-alignment
    network: theseus-network.json     # Account list + tiers
    api: twitterapi.io
    engagement_threshold: 50          # Min likes/RTs to archive

  x-finance:
    adapter: twitter
    interval: 30m
    domain: internet-finance
    network: rio-network.json
    api: twitterapi.io
    engagement_threshold: 50

  rss:
    adapter: rss
    interval: 15m
    feeds:
      - url: https://noahpinion.substack.com/feed
        domain: grand-strategy
      - url: https://citriniresearch.substack.com/feed
        domain: internet-finance
      # Add feeds here — no code changes needed

  onchain:
    adapter: solana
    interval: 5m
    domain: internet-finance
    programs:
      - metadao_autocrat            # Futarchy governance events
      - metadao_conditional_vault   # Conditional token markets
    significance_filter: true       # Only governance events, not routine txs
```
### Adding a new source

1. Write an adapter function: `pull_{source}(config) → list[SourceItem]`
2. Add an entry to `ingestion-config.yaml`
3. Restart the daemon (or let it hot-reload the config)

No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part.
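The adapter contract can be sketched in Python. The `SourceItem` shape and the registry decorator below are illustrative, not the daemon's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class SourceItem:
    """Normalized unit of content every adapter returns (illustrative shape)."""
    source_type: str   # adapter name: 'futardio', 'twitter', 'rss', 'solana'
    source_id: str     # dedup key: launch ID, tweet ID, article URL, tx sig
    title: str
    author: str
    url: str
    domain: str
    content: str
    tags: list = field(default_factory=list)

# Maps the `adapter:` key in ingestion-config.yaml to its pull function.
ADAPTERS = {}

def adapter(name):
    """Decorator registering a pull_{source} function under a config name."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register

@adapter("rss")
def pull_rss(config):
    # Fetch each feed in config["feeds"] and normalize entries into
    # SourceItems. (Network fetch elided in this sketch.)
    return []
```

The scheduler then only needs the config: for each source entry, look up `ADAPTERS[entry["adapter"]]` and call it on its interval.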
## What the daemon produces
One markdown file per source item in `inbox/archive/`. Each file has YAML frontmatter + body content.
### Filename convention
```
YYYY-MM-DD-{author-or-source-handle}-{brief-slug}.md
```
Examples:
- `2026-03-09-futardio-project-launch-solforge.md`
- `2026-03-09-metaproph3t-futarchy-governance-update.md`
- `2026-03-09-pineanalytics-futardio-launch-metrics.md`
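A slug helper matching this convention might look like the following (the function name is illustrative):

```python
import re
from datetime import date

def archive_filename(published: date, handle: str, slug: str) -> str:
    """Build 'YYYY-MM-DD-{handle}-{slug}.md' with lowercased, dash-safe parts."""
    def clean(text: str) -> str:
        # Collapse anything that is not a-z or 0-9 into single dashes.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{published.isoformat()}-{clean(handle)}-{clean(slug)}.md"
```

For example, `archive_filename(date(2026, 3, 9), "futardio", "Project Launch: SolForge")` yields the first filename above.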
### Frontmatter (required fields)
```yaml
---
type: source
title: "Human-readable title of the source"
author: "Author name (@handle if applicable)"
url: "https://original-url.com"
date: 2026-03-09
domain: internet-finance
format: report | essay | tweet | thread | whitepaper | paper | news | data
status: unprocessed
tags: [futarchy, metadao, futardio, solana, permissionless-launches]
---
```
### Frontmatter (optional fields)
```yaml
linked_set: "futardio-launches-march-2026" # Group related items
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
extraction_hints: "Focus on governance mechanism data"
priority: low | medium | high # Signal urgency to agents
contributor: "ingestion-daemon" # Attribution
```
### Body
Full content text after the frontmatter. This is what agents read to extract claims. Include everything — agents need the raw material.
```markdown
## Summary
[Brief description of what this source contains]
## Content
[Full text, data, or structured content from the source]
## Context
[Optional: why this matters, what it connects to]
```
**Important:** The body is reference material, not argumentative. Don't write claims — just stage the raw content faithfully. Agents handle interpretation.
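Putting frontmatter and body together, a minimal serializer could look like this. It is a sketch: a production daemon would use a YAML library to handle quoting and escaping correctly:

```python
def render_source_file(meta: dict, body: str) -> str:
    """Serialize frontmatter keys in insertion order, then the markdown body."""
    lines = ["---"]
    for key, value in meta.items():
        if isinstance(value, list):
            # Inline list style, matching the tags examples above.
            lines.append(f"{key}: [{', '.join(value)}]")
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body
```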
### Valid domains
Route each source to the primary domain that should process it:
| Domain | Agent | What goes here |
|--------|-------|----------------|
| `internet-finance` | Rio | Futarchy, MetaDAO, tokens, DeFi, capital formation |
| `entertainment` | Clay | Creator economy, IP, media, gaming, cultural dynamics |
| `ai-alignment` | Theseus | AI safety, capability, alignment, multi-agent, governance |
| `health` | Vida | Healthcare, biotech, longevity, wellness, diagnostics |
| `space-development` | Astra | Launch, orbital, cislunar, governance, manufacturing |
| `grand-strategy` | Leo | Cross-domain, macro, geopolitics, coordination |
If a source touches multiple domains, pick the primary and list others in `cross_domain_flags`.
## Shared pipeline
### Deduplication (SQLite)
Every source item passes through dedup before archiving:
```sql
CREATE TABLE staged (
    source_type    TEXT,          -- 'futardio', 'twitter', 'rss', 'solana'
    source_id      TEXT UNIQUE,   -- Launch ID, tweet ID, article URL, tx sig
    url            TEXT,
    title          TEXT,
    author         TEXT,
    content        TEXT,
    domain         TEXT,
    published_date TEXT,
    staged_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
```
Dedup key varies by adapter:
| Adapter | Dedup key |
|---------|-----------|
| futardio | launch ID |
| twitter | tweet ID |
| rss | article URL |
| solana | tx signature |
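The dedup gate can lean on the `UNIQUE` constraint directly: `INSERT OR IGNORE` makes the check and the staging write one atomic step. A sketch against an in-memory database with an abridged version of the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staged ("
    " source_type TEXT, source_id TEXT UNIQUE, url TEXT, title TEXT)"
)

def is_new(conn: sqlite3.Connection, item: dict) -> bool:
    """True if the item was staged just now; False if source_id was already seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO staged (source_type, source_id, url, title) "
        "VALUES (:source_type, :source_id, :url, :title)",
        item,
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the UNIQUE
    # constraint caused the insert to be ignored (duplicate).
    return cur.rowcount == 1
```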
### Git workflow
All adapters share the same git workflow:
```bash
# 1. Branch
git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M)

# 2. Stage files
git add inbox/archive/*.md

# 3. Commit
git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M)

- Sources: [brief list]
- Domains: [which domains routed to]"

# 4. Push
git push -u origin HEAD

# 5. Open PR on Forgejo
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
  -H "Authorization: token $FORGEJO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "ingestion: N sources from {source} batch TIMESTAMP",
    "body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.",
    "head": "ingestion/{source}-TIMESTAMP",
    "base": "main"
  }'
```
After PR creation, the Forgejo webhook triggers the eval pipeline, which routes the batch to the appropriate domain agent for extraction.
### Batching
Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline.
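That batching rule is small enough to state as code (names are illustrative):

```python
from datetime import datetime

def plan_batches(items_by_source: dict, now: datetime) -> dict:
    """One branch per adapter per run; adapters with zero new items get none."""
    return {
        f"ingestion/{source}-{now:%Y%m%d-%H%M}": items
        for source, items in items_by_source.items()
        if items  # no items -> no branch, no PR
    }
```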
## Adapter specifications
### futardio adapter
**Source:** futard.io — permissionless launchpad on Solana (MetaDAO ecosystem)

**What to pull:**

1. New project launches — name, description, funding target, FDV, status
2. Funding threshold events — project reaches funding threshold, triggers refund
3. Platform metrics snapshots — total committed, funder count, active launches

**Significance filter:** Skip routine transaction updates. Archive only:

- New launch listed
- Funding threshold reached (project funded)
- Refund triggered
- Platform milestone (e.g., total committed crosses round number)

**Example output:**
```markdown
---
type: source
title: "Futardio launch: SolForge reaches funding threshold"
author: "futard.io"
url: "https://futard.io/launches/solforge"
date: 2026-03-09
domain: internet-finance
format: data
status: unprocessed
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
linked_set: futardio-launches-march-2026
priority: medium
contributor: "ingestion-daemon"
---
## Summary
SolForge reached its funding threshold on futard.io with $X committed from N funders.
## Content
- Project: SolForge
- Description: [from listing]
- FDV: [value]
- Funding: [amount] / [target] ([percentage]%)
- Funders: [N]
- Status: COMPLETE
- Launch date: 2026-03-09
- Use of funds: [from listing]
## Context
Part of the futard.io permissionless launch platform (MetaDAO ecosystem).
```
### twitter adapter
**Source:** X/Twitter via twitterapi.io

**Config:** Takes a network JSON file (e.g., `theseus-network.json`, `rio-network.json`) that defines accounts and tiers.

**What to pull:** Recent tweets from network accounts, filtered by engagement threshold.

**Dedup:** Tweet ID. Skip retweets without commentary. Quote tweets are separate items.
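These rules can be sketched as a single predicate. The field names (`is_retweet`, `commentary`, `likes`, `retweets`) are placeholders, not twitterapi.io's actual response schema:

```python
def should_archive(tweet: dict, threshold: int = 50) -> bool:
    """Engagement gate plus the retweet rule from the adapter spec."""
    # Plain retweets with no added commentary are skipped outright.
    if tweet.get("is_retweet") and not tweet.get("commentary"):
        return False
    # Quote tweets arrive as their own items and pass through this same gate.
    return tweet.get("likes", 0) + tweet.get("retweets", 0) >= threshold
```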
### rss adapter
**Source:** RSS/Atom feeds via feedparser

**Config:** List of feed URLs with domain routing.

**What to pull:** New articles since last poll. Full text via Crawl4AI (JS-rendered) or trafilatura (fallback).

**Dedup:** Article URL.
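A stripped-down sketch of the item extraction using only the standard library; the real adapter uses feedparser for parsing and Crawl4AI/trafilatura for full text:

```python
import xml.etree.ElementTree as ET

def parse_rss_items(feed_xml: str) -> list:
    """Pull title/link pairs from an RSS 2.0 document; the link is the dedup key."""
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "url": item.findtext("link")}
        for item in root.iter("item")
    ]
```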
### solana adapter
**Source:** Solana RPC / program event logs

**Config:** List of program addresses to monitor.

**What to pull:** Governance events (new proposals, vote results, treasury operations). Not routine transfers.

**Significance filter:** Only events that change governance state.
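The significance filter reduces to membership in an allowlist of governance event names. The names below are hypothetical placeholders, not the monitored programs' actual log strings; check the program IDLs for the real ones:

```python
# Hypothetical event names -- replace with the actual events emitted
# by the monitored programs.
GOVERNANCE_EVENTS = {
    "ProposalCreated",
    "ProposalFinalized",
    "VoteRecorded",
    "TreasuryWithdrawal",
}

def is_significant(event_name: str) -> bool:
    """Archive only events that change governance state; drop routine transfers."""
    return event_name in GOVERNANCE_EVENTS
```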
## Setup checklist
- [ ] Forgejo account with API token (write access to teleo-codex)
- [ ] SSH key or HTTPS token for git push to Forgejo
- [ ] SQLite database file for dedup staging
- [ ] `ingestion-config.yaml` with source definitions
- [ ] Cron or systemd timer on VPS
- [ ] Test: single adapter → one source file → push → PR → verify webhook triggers eval
## Files to read
| File | What it tells you |
|------|-------------------|
| `schemas/source.md` | Canonical source archive schema |
| `schemas/claim.md` | What agents produce from your sources (downstream) |
| `skills/extract.md` | The extraction process agents run on your files |
| `CONTRIBUTING.md` | Human contributor workflow (similar pattern) |
| `CLAUDE.md` | Full collective operating manual |
| `inbox/archive/*.md` | Real examples of archived sources |
## Cost model
| Component | Cost |
|-----------|------|
| VPS (Hetzner CAX31) | ~$15/mo |
| X API (twitterapi.io) | ~$100/mo |
| Daemon compute | Negligible (polling + formatting) |
| Agent extraction (downstream) | Covered by Claude Max subscription on VPS |
| Total ingestion | ~$115/mo fixed |
The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.