teleo-codex/docs/ingestion-daemon-onboarding.md


Ingestion Daemon Onboarding

How to build the Teleo ingestion daemon — a single service with pluggable source adapters that feeds the collective knowledge base.

Architecture

┌─────────────────────────────────────────────────────┐
│             Ingestion Daemon (1 service)            │
│                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ futardio │ │  x-feed  │ │   rss    │ │ onchain  │ │
│ │ adapter  │ │ adapter  │ │ adapter  │ │ adapter  │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│      └────────────┴─┬──────────┴────────────┘       │
│                     ▼                               │
│          ┌─────────────────────────┐                │
│          │  Shared pipeline:       │                │
│          │  dedup → format → git   │                │
│          └───────────┬─────────────┘                │
└──────────────────────┼──────────────────────────────┘
                      ▼
        inbox/archive/*.md on Forgejo branch
                      ▼
              PR opened on Forgejo
                      ▼
        Webhook → headless domain agent (extraction)
                      ▼
        Agent claims PR → eval pipeline → merge

The daemon handles ingestion only. It pulls data, deduplicates it, formats it as source-archive markdown, and opens PRs. Agents handle everything downstream (extraction, claim writing, evaluation, merge).

Single daemon, pluggable adapters

One codebase, one container, one scheduler. Each data source is an adapter — a function that knows how to pull and normalize content from one source. The shared pipeline handles dedup, formatting, git workflow, and PR creation identically for every adapter.

Configuration

# ingestion-config.yaml

daemon:
  dedup_db: /data/ingestion.db        # Shared SQLite for dedup
  repo_dir: /workspace/teleo-codex     # Local clone
  forgejo_url: https://git.livingip.xyz
  forgejo_token: ${FORGEJO_TOKEN}      # From env/secrets
  batch_branch_prefix: ingestion

sources:
  futardio:
    adapter: futardio
    interval: 15m
    domain: internet-finance
    significance_filter: true          # Only new launches, threshold events, refunds
    tags: [futardio, metadao, solana, permissionless-launches]

  x-ai:
    adapter: twitter
    interval: 30m
    domain: ai-alignment
    network: theseus-network.json      # Account list + tiers
    api: twitterapi.io
    engagement_threshold: 50           # Min likes/RTs to archive

  x-finance:
    adapter: twitter
    interval: 30m
    domain: internet-finance
    network: rio-network.json
    api: twitterapi.io
    engagement_threshold: 50

  rss:
    adapter: rss
    interval: 15m
    feeds:
      - url: https://noahpinion.substack.com/feed
        domain: grand-strategy
      - url: https://citriniresearch.substack.com/feed
        domain: internet-finance
    # Add feeds here — no code changes needed

  onchain:
    adapter: solana
    interval: 5m
    domain: internet-finance
    programs:
      - metadao_autocrat           # Futarchy governance events
      - metadao_conditional_vault  # Conditional token markets
    significance_filter: true      # Only governance events, not routine txs
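The `${FORGEJO_TOKEN}` reference above implies the loader expands environment variables after parsing the YAML. A minimal sketch of that expansion step (function names are illustrative; the real loader would first parse the file, e.g. with PyYAML):

```python
import os
import re

# Matches ${NAME} references such as ${FORGEJO_TOKEN}
_VAR = re.compile(r"\$\{([A-Z0-9_]+)\}")

def expand_env(value: str) -> str:
    """Replace ${NAME} with os.environ[NAME]; leave unknown names as-is."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)

def expand_tree(node):
    """Recursively expand env references in a parsed config tree."""
    if isinstance(node, dict):
        return {k: expand_tree(v) for k, v in node.items()}
    if isinstance(node, list):
        return [expand_tree(v) for v in node]
    if isinstance(node, str):
        return expand_env(node)
    return node
```

Leaving unknown names untouched (rather than failing) is a choice; failing fast on a missing `FORGEJO_TOKEN` is equally defensible.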

Adding a new source

  1. Write an adapter function: pull_{source}(config) → list[SourceItem]
  2. Add an entry to ingestion-config.yaml
  3. Restart daemon (or it hot-reloads config)

No changes to the pipeline, git workflow, or PR creation. The adapter is the only custom part.
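The adapter contract can be sketched as a dataclass plus a registry. All names here are illustrative; the `SourceItem` fields simply mirror the `staged` table columns used by the shared pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class SourceItem:
    """Normalized item every adapter returns (fields mirror the staged table)."""
    source_type: str   # 'futardio', 'twitter', 'rss', 'solana'
    source_id: str     # dedup key: launch ID, tweet ID, URL, tx sig
    title: str
    author: str
    url: str
    content: str
    domain: str
    tags: list[str] = field(default_factory=list)

# Maps the `adapter:` key in ingestion-config.yaml to a pull function.
ADAPTERS: dict = {}

def adapter(name):
    """Decorator registering pull_{source}(config) -> list[SourceItem]."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register

@adapter("rss")
def pull_rss(config) -> list[SourceItem]:
    ...  # fetch feeds listed in config, return new items
```

With this shape, step 1 of "Adding a new source" is one decorated function and step 2 is one config entry.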

What the daemon produces

One markdown file per source item in inbox/archive/. Each file has YAML frontmatter + body content.

Filename convention

YYYY-MM-DD-{author-or-source-handle}-{brief-slug}.md

Examples:

  • 2026-03-09-futardio-project-launch-solforge.md
  • 2026-03-09-metaproph3t-futarchy-governance-update.md
  • 2026-03-09-pineanalytics-futardio-launch-metrics.md
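A hypothetical helper producing filenames in this convention (the slug-truncation length is an assumption, not part of the spec):

```python
import datetime
import re

def archive_filename(date: datetime.date, handle: str, title: str) -> str:
    """Build an inbox/archive filename: YYYY-MM-DD-{handle}-{brief-slug}.md."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    slug = "-".join(slug.split("-")[:5])  # keep the slug brief (assumed cap)
    return f"{date.isoformat()}-{handle}-{slug}.md"
```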

Frontmatter (required fields)

---
type: source
title: "Human-readable title of the source"
author: "Author name (@handle if applicable)"
url: "https://original-url.com"
date: 2026-03-09
domain: internet-finance
format: report | essay | tweet | thread | whitepaper | paper | news | data
status: unprocessed
tags: [futarchy, metadao, futardio, solana, permissionless-launches]
---

Frontmatter (optional fields)

linked_set: "futardio-launches-march-2026"    # Group related items
cross_domain_flags: [ai-alignment, mechanisms] # Flag other relevant domains
extraction_hints: "Focus on governance mechanism data"
priority: low | medium | high                  # Signal urgency to agents
contributor: "ingestion-daemon"                # Attribution

Body

Full content text after the frontmatter. This is what agents read to extract claims. Include everything — agents need the raw material.

## Summary
[Brief description of what this source contains]

## Content
[Full text, data, or structured content from the source]

## Context
[Optional: why this matters, what it connects to]

Important: The body is reference material, not argumentative. Don't write claims — just stage the raw content faithfully. Agents handle interpretation.
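One way to render an item into this frontmatter + body shape, as a sketch: field names are assumptions, and a real implementation should use a YAML emitter to handle quoting and escaping.

```python
def render_archive(item: dict) -> str:
    """Render one source item as frontmatter + body markdown.
    `item` carries the required frontmatter fields plus 'summary'
    and 'content' strings (names illustrative)."""
    tags = ", ".join(item["tags"])
    front = "\n".join([
        "---",
        "type: source",
        f'title: "{item["title"]}"',
        f'author: "{item["author"]}"',
        f'url: "{item["url"]}"',
        f'date: {item["date"]}',
        f'domain: {item["domain"]}',
        f'format: {item["format"]}',
        "status: unprocessed",
        f"tags: [{tags}]",
        "---",
    ])
    return (
        front
        + f"\n\n## Summary\n{item['summary']}\n"
        + f"\n## Content\n{item['content']}\n"
    )
```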

Valid domains

Route each source to the primary domain that should process it:

| Domain | Agent | What goes here |
| --- | --- | --- |
| internet-finance | Rio | Futarchy, MetaDAO, tokens, DeFi, capital formation |
| entertainment | Clay | Creator economy, IP, media, gaming, cultural dynamics |
| ai-alignment | Theseus | AI safety, capability, alignment, multi-agent, governance |
| health | Vida | Healthcare, biotech, longevity, wellness, diagnostics |
| space-development | Astra | Launch, orbital, cislunar, governance, manufacturing |
| grand-strategy | Leo | Cross-domain, macro, geopolitics, coordination |

If a source touches multiple domains, pick the primary and list others in cross_domain_flags.

Shared pipeline

Deduplication (SQLite)

Every source item passes through dedup before archiving:

CREATE TABLE staged (
    source_type TEXT,       -- 'futardio', 'twitter', 'rss', 'solana'
    source_id TEXT UNIQUE,  -- Launch ID, tweet ID, article URL, tx sig
    url TEXT,
    title TEXT,
    author TEXT,
    content TEXT,
    domain TEXT,
    published_date TEXT,
    staged_at TEXT DEFAULT CURRENT_TIMESTAMP
);

Dedup key varies by adapter:

| Adapter | Dedup key |
| --- | --- |
| futardio | launch ID |
| twitter | tweet ID |
| rss | article URL |
| solana | tx signature |
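The check itself can lean on the `UNIQUE` constraint on `source_id`: a sketch using `INSERT OR IGNORE`, shown with only the two key columns, so repeat polls become no-ops.

```python
import sqlite3

def is_new(db: sqlite3.Connection, source_type: str, source_id: str) -> bool:
    """Atomically claim a dedup key. INSERT OR IGNORE leaves rowcount at 0
    when the item was already staged."""
    cur = db.execute(
        "INSERT OR IGNORE INTO staged (source_type, source_id) VALUES (?, ?)",
        (source_type, source_id),
    )
    db.commit()
    return cur.rowcount == 1
```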

Git workflow

All adapters share the same git workflow:

# 1. Branch
git checkout -b ingestion/{source}-$(date +%Y%m%d-%H%M)

# 2. Stage files
git add inbox/archive/*.md

# 3. Commit
git commit -m "ingestion: N sources from {source} batch $(date +%Y%m%d-%H%M)

- Sources: [brief list]
- Domains: [which domains routed to]"

# 4. Push
git push -u origin HEAD

# 5. Open PR on Forgejo
curl -X POST "https://git.livingip.xyz/api/v1/repos/teleo/teleo-codex/pulls" \
  -H "Authorization: token $FORGEJO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "ingestion: N sources from {source} batch TIMESTAMP",
    "body": "## Batch summary\n- N source files\n- Domain: {domain}\n- Source: {source}\n\nAutomated ingestion daemon.",
    "head": "ingestion/{source}-TIMESTAMP",
    "base": "main"
  }'

After PR creation, the Forgejo webhook triggers the eval pipeline which routes to the appropriate domain agent for extraction.

Batching

Sources are batched per adapter per run. If the futardio adapter finds 3 new launches in one poll cycle, all 3 go in one branch/PR. If it finds 0, no branch is created. This keeps PR volume manageable for the review pipeline.
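That batching rule reduces to: pull, dedup, and only create a branch/PR when something survived. A sketch of one poll cycle (all names illustrative; `pull` and `dedup` stand in for the adapter function and the SQLite check):

```python
def run_adapter(name, pull, config, dedup):
    """One poll cycle for one adapter: pull, filter through dedup,
    and plan a branch/PR only if anything new remains."""
    new_items = [item for item in pull(config) if dedup(item)]
    if not new_items:
        return None  # nothing new: no branch, no PR
    return {"branch": f"ingestion/{name}", "items": new_items}
```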

Adapter specifications

futardio adapter

Source: futard.io — permissionless launchpad on Solana (MetaDAO ecosystem)

What to pull:

  1. New project launches — name, description, funding target, FDV, status
  2. Funding threshold events — project reaches funding threshold, triggers refund
  3. Platform metrics snapshots — total committed, funder count, active launches

Significance filter: Skip routine transaction updates. Archive only:

  • New launch listed
  • Funding threshold reached (project funded)
  • Refund triggered
  • Platform milestone (e.g., total committed crosses round number)
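The filter can be a plain allowlist over event kinds; the kind names below are assumptions standing in for whatever the futard.io API actually reports:

```python
# Event kinds that pass the significance filter (per the list above);
# everything else is dropped before staging. Names are hypothetical.
SIGNIFICANT = {
    "launch_listed",
    "threshold_reached",
    "refund_triggered",
    "platform_milestone",
}

def is_significant(event_kind: str) -> bool:
    return event_kind in SIGNIFICANT
```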

Example output:

---
type: source
title: "Futardio launch: SolForge reaches funding threshold"
author: "futard.io"
url: "https://futard.io/launches/solforge"
date: 2026-03-09
domain: internet-finance
format: data
status: unprocessed
tags: [futardio, metadao, solana, permissionless-launches, capital-formation]
linked_set: futardio-launches-march-2026
priority: medium
contributor: "ingestion-daemon"
---

## Summary
SolForge reached its funding threshold on futard.io with $X committed from N funders.

## Content
- Project: SolForge
- Description: [from listing]
- FDV: [value]
- Funding: [amount] / [target] ([percentage]%)
- Funders: [N]
- Status: COMPLETE
- Launch date: 2026-03-09
- Use of funds: [from listing]

## Context
Part of the futard.io permissionless launch platform (MetaDAO ecosystem).

twitter adapter

Source: X/Twitter via twitterapi.io

Config: Takes a network JSON file (e.g., theseus-network.json, rio-network.json) that defines accounts and tiers.

What to pull: Recent tweets from network accounts, filtered by engagement threshold.

Dedup: Tweet ID. Skip retweets without commentary. Quote tweets are separate items.
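A sketch of those filter rules; the tweet payload field names here are assumptions, not the twitterapi.io schema:

```python
def keep_tweet(tweet: dict, threshold: int = 50) -> bool:
    """Drop plain retweets, keep quote tweets (they carry their own text),
    and require combined likes + retweets to clear the threshold."""
    if tweet.get("is_retweet") and not tweet.get("text"):
        return False  # retweet without commentary
    engagement = tweet.get("likes", 0) + tweet.get("retweets", 0)
    return engagement >= threshold
```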

rss adapter

Source: RSS/Atom feeds via feedparser

Config: List of feed URLs with domain routing.

What to pull: New articles since last poll. Full text via Crawl4AI (JS-rendered) or trafilatura (fallback).

Dedup: Article URL.
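The new-since-last-poll logic with URL dedup can be sketched with the standard library, independent of feedparser (which the real adapter uses):

```python
import xml.etree.ElementTree as ET

def new_articles(rss_xml: str, seen_urls: set) -> list:
    """Parse an RSS 2.0 payload and return items whose <link> has not
    been staged yet (the rss adapter's dedup key is the article URL)."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        url = item.findtext("link", default="")
        if url and url not in seen_urls:
            fresh.append({"title": item.findtext("title", default=""), "url": url})
            seen_urls.add(url)
    return fresh
```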

solana adapter

Source: Solana RPC / program event logs

Config: List of program addresses to monitor.

What to pull: Governance events (new proposals, vote results, treasury operations). Not routine transfers.

Significance filter: Only events that change governance state.

Setup checklist

  • Forgejo account with API token (write access to teleo-codex)
  • SSH key or HTTPS token for git push to Forgejo
  • SQLite database file for dedup staging
  • ingestion-config.yaml with source definitions
  • Cron or systemd timer on VPS
  • Test: single adapter → one source file → push → PR → verify webhook triggers eval

Files to read

| File | What it tells you |
| --- | --- |
| schemas/source.md | Canonical source archive schema |
| schemas/claim.md | What agents produce from your sources (downstream) |
| skills/extract.md | The extraction process agents run on your files |
| CONTRIBUTING.md | Human contributor workflow (similar pattern) |
| CLAUDE.md | Full collective operating manual |
| inbox/archive/*.md | Real examples of archived sources |

Cost model

| Component | Cost |
| --- | --- |
| VPS (Hetzner CAX31) | ~$15/mo |
| X API (twitterapi.io) | ~$100/mo |
| Daemon compute | Negligible (polling + formatting) |
| Agent extraction (downstream) | Covered by Claude Max subscription on VPS |
| Total ingestion | ~$115/mo fixed |

The expensive part (LLM calls for extraction and evaluation) happens downstream in the agent pipeline, not in the daemon. The daemon itself is cheap — it's just HTTP requests, text formatting, and git operations.