teleo-infrastructure/INFRASTRUCTURE.md
m3taversal 799249d470 Initial commit: Pipeline v2 daemon + infrastructure docs
- teleo-pipeline.py: async daemon with 4 stage loops (ingest/validate/evaluate/merge)
- lib/: config, db, evaluate, validate, merge, breaker, costs, health, log modules
- INFRASTRUCTURE.md: comprehensive deep-dive for onboarding
- teleo-pipeline.service: systemd unit file

Pentagon-Agent: Leo <294C3CA1-0205-4668-82FA-B984D54F48AD>
2026-03-12 14:11:18 +00:00

Teleo Infrastructure Deep Dive

Overview

Teleo runs a knowledge extraction and evaluation pipeline on a single VPS. Six AI domain agents (Rio, Clay, Theseus, Vida, Astra, Leo) continuously extract claims from source material, evaluate them through a multi-stage review process, and merge approved claims into a shared knowledge base.

The system is mid-migration from 7 bash cron scripts (v1) to a single Python async daemon (v2). Pipeline v2 handles validate, evaluate, and merge. Extraction still runs on v1 cron. Ingest (Phase 4) will complete the migration.

Source Material → Ingest → Validate → Evaluate → Merge → Knowledge Base
     (cron v1)     (stub)   (v2)       (v2)       (v2)     (git repo)

VPS

  • Host: 77.42.65.182 (Hetzner, Debian)
  • SSH: root@77.42.65.182 (key auth)
  • Disk: 150GB, 19GB used (13%)
  • User: teleo (pipeline runs as this user)
  • Base dir: /opt/teleo-eval/

Directory Layout

/opt/teleo-eval/
├── pipeline/                    # Pipeline v2 daemon
│   ├── teleo-pipeline.py        # Main entry point (4 async stage loops)
│   ├── pipeline.db              # SQLite WAL state store (160KB)
│   ├── .venv/                   # Python virtualenv (aiohttp)
│   └── lib/
│       ├── config.py            # All constants, model assignments, overflow policies
│       ├── db.py                # Schema, migrations, connection management
│       ├── validate.py          # Tier 0 validation (schema, links, duplicates)
│       ├── evaluate.py          # Triage + domain review + Leo review
│       ├── merge.py             # Domain-serialized rebase + Forgejo API merge
│       ├── health.py            # HTTP health API (localhost:8080)
│       ├── breaker.py           # Circuit breaker per stage
│       ├── costs.py             # API cost tracking with daily budgets
│       └── log.py               # JSON structured logging
├── workspaces/
│   ├── teleo-codex.git/         # Bare repo (49MB) — pipeline's git backend
│   └── main/                    # Main branch worktree (for validation checks)
├── mirror/
│   └── teleo-codex.git/         # Separate bare repo for GitHub↔Forgejo sync
├── secrets/
│   ├── forgejo-admin-token      # Admin Forgejo API token
│   ├── forgejo-{agent}-token    # Per-agent tokens (rio, clay, theseus, vida, astra, leo)
│   ├── github-pat               # GitHub mirror push token
│   ├── openrouter-key           # OpenRouter API key
│   ├── twitterapi-io-key        # X/Twitter API key
│   └── x-bearer-token           # X bearer token
├── logs/                        # Log files for cron scripts and pipeline
├── *.sh                         # Legacy cron scripts (being replaced)
└── eval/                        # Legacy eval scripts

Services

Forgejo (Git Forge)

  • Runs in: Docker container (codeberg.org/forgejo/forgejo:9)
  • Port: 3000 (HTTP), 2222 (SSH)
  • Public URL: https://git.livingip.xyz
  • Repo: teleo/teleo-codex
  • Purpose: Hosts the knowledge base repo, manages PRs, stores review comments
  • Users: Per-agent Forgejo accounts (rio, clay, theseus, vida, astra, leo, teleo)

Pipeline v2 Daemon

  • Service: teleo-pipeline.service (systemd)
  • Commands: systemctl {start|stop|restart|status} teleo-pipeline
  • Logs: journalctl -u teleo-pipeline -f
  • Health: curl localhost:8080/health
  • Shutdown: SIGTERM → 60s drain → force-cancel → kill subprocesses (180s total)
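
The drain-then-force-cancel sequence above can be sketched as follows. This is a minimal illustration, not the daemon's actual code: `graceful_shutdown` and `install_sigterm_handler` are hypothetical names, and the real shutdown additionally kills child subprocesses within the 180s cap.

```python
import asyncio
import signal

DRAIN_TIMEOUT = 60  # seconds of grace before force-cancel

async def graceful_shutdown(tasks, drain_timeout=DRAIN_TIMEOUT):
    """Let in-flight stage tasks drain, then force-cancel stragglers."""
    done, pending = await asyncio.wait(tasks, timeout=drain_timeout)
    for t in pending:
        t.cancel()
    # Let cancelled tasks run their cleanup handlers before returning
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)

def install_sigterm_handler(shutdown_event):
    """Wire SIGTERM to the shared shutdown event used by the stage loops."""
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown_event.set)
```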

Active Cron Jobs (teleo user)

Schedule     Script                    Purpose
*/3 * * * *  extract-cron.sh           Source extraction (v1, still active)
*/2 * * * *  sync-mirror.sh            Forgejo↔GitHub bidirectional sync
*/2 * * * *  fetch-bare.sh             Fetch latest into bare repo
0 0 * * *    pipeline-health-check.sh  Daily health metrics
0 */2 * * *  pipeline-health-check.py  2-hourly health report

Disabled Cron Jobs (replaced by Pipeline v2)

  • fix-extraction-prs.py — replaced by validate.py
  • eval-dispatcher.sh — replaced by evaluate.py
  • merge-retry.sh — replaced by merge.py
  • Research sessions (rio, clay, theseus, vida, astra) — disabled during pipeline migration

GitHub Mirror

  • Repo: github.com/user/teleo-codex (public mirror)
  • Sync: Bidirectional, Forgejo authoritative on conflict
  • Frequency: Every 2 minutes via sync-mirror.sh
  • Security: GitHub→Forgejo path never auto-processes branches. Only PRs trigger pipeline work.

Pipeline v2 Architecture

Stage Loop

Each stage runs as an async task with its own interval, circuit breaker, and shutdown check:

async def stage_loop(name, interval, func, conn, breaker):
    while not shutdown_event.is_set():
        if breaker.allow_request():
            succeeded, failed = await func(conn, max_workers=breaker.max_workers())
            # Record success/failure for breaker
        try:
            # Sleep up to `interval`, but wake immediately on shutdown
            await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # normal case: interval elapsed, run the next cycle

Stage     Interval  Function          Status
Ingest    60s       ingest_cycle()    Stub — Phase 4
Validate  30s       validate_cycle()  Live
Evaluate  30s       evaluate_cycle()  Live
Merge     30s       merge_cycle()     Live

Crash Recovery

On startup, the daemon recovers interrupted state from prior crashes:

  1. Sources stuck in extracting → increment retry counter → unprocessed (or error if budget exhausted)
  2. PRs stuck in merging → approved (re-enter merge queue)
  3. PRs stuck in reviewing → open (re-enter eval queue)
  4. Orphan git worktrees (/tmp/teleo-extract-*, /tmp/teleo-merge-*) cleaned up
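
Steps 2 and 3 amount to status-reverting UPDATEs at startup. A minimal sketch against the prs table, using the status values from the lifecycle section (`recover_stuck_prs` is an illustrative name):

```python
import sqlite3

def recover_stuck_prs(conn):
    """Revert PRs interrupted mid-stage so they re-enter their queues.
    Statuses follow the PR lifecycle section of this doc."""
    cur = conn.cursor()
    cur.execute("UPDATE prs SET status='approved' WHERE status='merging'")
    merging = cur.rowcount
    cur.execute("UPDATE prs SET status='open' WHERE status='reviewing'")
    reviewing = cur.rowcount
    conn.commit()
    return merging, reviewing
```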

Stage 1: Validate (lib/validate.py)

Runs Tier 0 structural validation on PRs with status='open' and tier0_pass IS NULL.

Checks

  1. Schema validation — YAML frontmatter has required fields (type, domain, description, confidence, source, created)
  2. Date format — created field is valid YYYY-MM-DD
  3. Title format — Prose proposition, not a label (heuristic: 8+ words, no bare noun phrases)
  4. Wiki link validity — [[links]] resolve to real files in the repo
  5. Universal quantifier check — Flags claims using "all", "always", "never", "every" without scoping
  6. Domain-directory match — Claim's domain field matches its file path
  7. Description quality — Description adds info beyond the title (not a substring)
  8. Near-duplicate detection — Trigram similarity against existing claims
  9. Proposition heuristic — Title passes the claim test ("This note argues that [title]" works)
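
As an example, check 5 (universal quantifier) can be approximated with two regexes. This is a sketch; the actual scoping heuristic in validate.py may use different hint words:

```python
import re

# Universal quantifiers that require scoping (per check 5 above)
UNIVERSALS = re.compile(r"\b(all|always|never|every)\b", re.IGNORECASE)
# Words we treat as evidence the claim is scoped (assumed list)
SCOPE_HINTS = re.compile(r"\b(almost|nearly|in practice|observed|so far|to date)\b",
                         re.IGNORECASE)

def flag_universal_quantifier(title: str) -> bool:
    """True if the title uses a universal quantifier without any scoping hint."""
    return bool(UNIVERSALS.search(title)) and not SCOPE_HINTS.search(title)
```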

Output

  • Posts Tier 0 validation comment on Forgejo PR (with SHA-based idempotency marker)
  • Sets tier0_pass = 1 (pass) or tier0_pass = 0 (fail)
  • Failing PRs remain status='open' but are excluded from eval queue
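
The SHA-based idempotency marker can work roughly like this; the exact marker format is an assumption:

```python
def tier0_marker(head_sha: str) -> str:
    """Idempotency marker embedded in the Tier 0 comment (format is a guess)."""
    return f"<!-- TIER0:{head_sha} -->"

def already_validated(comments, head_sha) -> bool:
    """True if a Tier 0 comment for this exact head SHA was already posted,
    so re-runs on the same commit do not double-comment."""
    marker = tier0_marker(head_sha)
    return any(marker in body for body in comments)
```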

Stage 2: Evaluate (lib/evaluate.py)

The core intelligence stage. Domain-first, Leo-last architecture.

PR Flow

PR (open, tier0_pass=1)
  │
  ├─ Triage (Haiku/OpenRouter) → DEEP / STANDARD / LIGHT
  │
  ├─ Domain Review (Sonnet/Claude Max → overflow GPT-4o/OpenRouter)
  │    ├─ REJECT → status='open', feedback stored, Leo skipped
  │    └─ APPROVE → continue to Leo
  │
  ├─ Leo Review (Opus/Claude Max → overflow: queue only)
  │    ├─ REJECT → status='open', feedback stored
  │    └─ APPROVE → continue
  │
  ├─ LIGHT tier: Leo skipped, domain-only gate
  │
  ├─ Both approve → formal Forgejo approvals (2 agent tokens) → status='approved'
  │
  └─ Musings bypass: PRs touching only agents/*/musings/ auto-approve

Model Routing

Stage              Primary              Overflow Policy
Triage             Haiku (OpenRouter)   Always API
Domain review      Sonnet (Claude Max)  GPT-4o (OpenRouter) overflow
Leo review         Opus (Claude Max)    queue (never overflow)
DEEP cross-family  GPT-4o (OpenRouter)  Always API (not yet implemented)

Claude Max is a subscription — free but rate-limited. When rate-limited, the CLI returns "You've hit your limit" on stdout (not stderr) with exit code 1. The pipeline detects this and applies the overflow policy.
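
Distinguishing that rate-limit signal from ordinary failures can be sketched as a small classifier (function name is illustrative; the real detection lives in the evaluate stage):

```python
LIMIT_MARKER = "You've hit your limit"

def classify_cli_result(returncode: int, stdout: str, stderr: str) -> str:
    """Classify a Claude Max CLI invocation. The rate-limit message arrives
    on stdout (not stderr) with exit code 1, per the behavior described above."""
    if returncode == 1 and LIMIT_MARKER in stdout:
        return "rate_limited"  # caller applies the stage's overflow policy
    if returncode != 0:
        return "error"
    return "ok"
```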

Key design principle: Opus is the scarce resource. Domain review (Sonnet) filters first — high volume, catches most issues. Leo review (Opus) only sees pre-filtered PRs. This maximizes value per scarce Opus call.

Domain Routing

Domain detection reads diff file paths (domains/, entities/, core/, foundations/) and maps to the responsible agent:

Domain                                                                  Agent
internet-finance, mechanisms, living-capital, teleological-economics    Rio
entertainment, cultural-dynamics                                        Clay
ai-alignment, living-agents, critical-systems, collective-intelligence  Theseus
health                                                                  Vida
space-development                                                       Astra
teleohumanity, grand-strategy                                           Leo
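
A minimal sketch of this routing (the mapping mirrors the table above; the path-parsing details are an assumption):

```python
DOMAIN_AGENT = {
    "internet-finance": "rio", "mechanisms": "rio", "living-capital": "rio",
    "teleological-economics": "rio",
    "entertainment": "clay", "cultural-dynamics": "clay",
    "ai-alignment": "theseus", "living-agents": "theseus",
    "critical-systems": "theseus", "collective-intelligence": "theseus",
    "health": "vida",
    "space-development": "astra",
    "teleohumanity": "leo", "grand-strategy": "leo",
}

def detect_domain(paths):
    """Pick (domain, agent) from diff file paths, assuming layouts like
    domains/<domain>/... Returns (None, None) if nothing matches."""
    for p in paths:
        parts = p.split("/")
        if parts[0] in ("domains", "entities", "core", "foundations") and len(parts) > 1:
            if parts[1] in DOMAIN_AGENT:
                return parts[1], DOMAIN_AGENT[parts[1]]
    return None, None
```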

Backoff and Resume

  • 10-minute backoff: PRs attempted within the last 10 minutes are skipped (prevents retry storms during rate limits)
  • Domain review resume: If domain review completed but Leo review was rate-limited, domain review is skipped on retry (no wasted OpenRouter calls)
  • last_attempt tracking: Set at the start of evaluate_pr, persists through status revert
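
The backoff filter reduces to one WHERE clause. A sketch, assuming last_attempt is stored as a Unix timestamp:

```python
import sqlite3
import time

BACKOFF_S = 600  # 10-minute retry backoff

def eligible_prs(conn, now=None):
    """PRs ready for evaluation, skipping anything attempted in the
    last 10 minutes (prevents retry storms during rate limits)."""
    if now is None:
        now = time.time()
    return conn.execute(
        "SELECT number FROM prs"
        " WHERE status='open' AND tier0_pass=1"
        " AND (last_attempt IS NULL OR last_attempt < ?)",
        (now - BACKOFF_S,)).fetchall()
```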

Review Attribution

  • Domain review comments post from the domain agent's Forgejo account (e.g., Rio posts Rio's review)
  • Leo review comments post from Leo's Forgejo account
  • Formal approvals come from 2 agent tokens (not the PR author)

Verdict Parsing

Reviews end with HTML comment tags:

<!-- VERDICT:RIO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->
<!-- ISSUES: broken_wiki_links, confidence_miscalibration -->
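
A sketch of the verdict parser (the regex details are assumptions; the tag grammar follows the examples above):

```python
import re

VERDICT_RE = re.compile(r"<!--\s*VERDICT:(\w+):(\w+)\s*-->")
ISSUES_RE = re.compile(r"<!--\s*ISSUES:\s*([\w, ]+)\s*-->")

def parse_verdicts(body: str):
    """Extract per-agent verdicts and issue tags from a review comment."""
    verdicts = {agent.lower(): verdict for agent, verdict in VERDICT_RE.findall(body)}
    m = ISSUES_RE.search(body)
    issues = [tag.strip() for tag in m.group(1).split(",")] if m else []
    return verdicts, issues
```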

Stage 3: Merge (lib/merge.py)

Domain-serialized priority queue with rebase-before-merge.

Design

  • Domain serialization: Same-domain merges are serial (prevents _map.md conflicts). Cross-domain merges are parallel.
  • Two-layer locking: asyncio.Lock per domain (fast path, lost on crash) + prs.status='merging' in SQLite (durable, crash recovery)
  • NOT EXISTS subquery: SQL defense-in-depth prevents two PRs in the same domain from merging simultaneously
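
The same-domain guard can be sketched as follows. The real code uses a single atomic UPDATE...RETURNING (see the merge flow below); this two-statement version just illustrates the NOT EXISTS logic:

```python
import sqlite3

def claim_next(conn, domain):
    """Claim the next approved PR in a domain. The NOT EXISTS guard keeps
    at most one PR per domain in 'merging' at any time."""
    row = conn.execute(
        """SELECT p.number FROM prs p
           WHERE p.status='approved' AND p.domain=?
             AND NOT EXISTS (SELECT 1 FROM prs q
                             WHERE q.domain = p.domain AND q.status='merging')
           ORDER BY p.number LIMIT 1""", (domain,)).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE prs SET status='merging' WHERE number=?", (row[0],))
    conn.commit()
    return row[0]
```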

Merge Flow

1. Discover external PRs (pagination over Forgejo API)
   - Detect origin: pipeline vs human (by author login)
   - Human PRs: priority='high', ack comment posted

2. For each domain with approved PRs:
   a. Claim next PR (atomic UPDATE...RETURNING with priority queue)
   b. Create git worktree at /tmp/teleo-merge-{branch}
   c. Capture expected SHA (pin for force-with-lease)
   d. Fetch origin/main, check if rebase needed
   e. Rebase onto main (abort on conflict → status='conflict')
   f. Force-push with --force-with-lease={branch}:{expected_sha}
   g. Merge via Forgejo API
   h. Delete remote branch
   i. Cleanup worktree

Priority Queue

COALESCE(p.priority, s.priority, 'medium')
-- PR-level priority > source-level priority > default 'medium'
-- NULL falls to ELSE 4 (intentionally below explicit medium)

Priority  Value  Use
critical  0      Reserved for explicit human override
high      1      Human-submitted PRs
medium    2      Standard pipeline PRs
low       3      Explicitly deprioritized
NULL      4      Unclassified (below medium)
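
As a plain-Python sketch of the resolution rule (helper name is ours; per the table, NULL sorts below explicit 'medium'):

```python
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def priority_rank(pr_priority, source_priority):
    """PR-level priority wins over source-level; NULL or any unrecognized
    value falls to rank 4, deliberately below explicit 'medium'."""
    effective = pr_priority or source_priority
    return PRIORITY_RANK.get(effective, 4)
```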

Timeouts

  • Merge timeout: 5 minutes per PR. Exceeding → status='conflict'
  • Rebase timeout: 2 minutes
  • Push timeout: 30 seconds
  • API merge failure: Sets status='conflict' (not approved — prevents infinite retry)

Database Schema

SQLite WAL mode. Schema version 2.

Tables

sources — Source material pipeline

  • path (PK), status, priority, extraction_model, claims_count, pr_number
  • transient_retries, substantive_retries, last_error, feedback

prs — Pull request lifecycle

  • number (PK), source_path, branch, status, domain, tier
  • tier0_pass, leo_verdict, domain_verdict, domain_agent, domain_model
  • priority, origin (pipeline/human), last_attempt

costs — API spend tracking

  • (date, model, stage) (composite PK), calls, input_tokens, output_tokens, cost_usd

circuit_breakers — Per-stage health

  • name (PK), state (closed/open/halfopen), failures, successes, last_success_at

audit_log — Event log

  • id, timestamp, stage, event, detail (JSON)

PR Status Lifecycle

open → validating → open (tier0_pass set)
                  → reviewing → approved → merging → merged
                              → open (rejected, feedback stored)
                  → conflict (rebase/merge failed)
                  → zombie (stuck, manual intervention)

Health API

GET localhost:8080/health returns:

{
  "status": "healthy|degraded|stalled",
  "breakers": {
    "ingest": {"state": "closed", "failures": 0},
    "validate": {"state": "closed", "failures": 0, "last_success_age_s": 30, "stalled": false},
    "evaluate": {"state": "closed", "failures": 0, "last_success_age_s": 45, "stalled": false},
    "merge": {"state": "closed", "failures": 0}
  },
  "sources": {"unprocessed": 10, "extracting": 2},
  "prs": {"open": 117, "approved": 5, "merging": 1},
  "merge_queue_by_domain": {"internet-finance": 3, "health": 2},
  "budget": {"ok": true, "spend": 1.23, "budget": 20.0, "pct": 6.2},
  "metabolic": {
    "null_result_rate_24h": 0.05,
    "domain_approval_rate_24h": 0.96,
    "leo_approval_rate_24h": 0.85
  }
}

Stall detection: If now() - last_success_at > 2 * interval, the stage is stalled.
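
The stall rule is a one-liner (sketch; `is_stalled` is an illustrative name):

```python
def is_stalled(last_success_at, interval, now):
    """Stalled = no successful cycle within 2x the stage interval."""
    if last_success_at is None:
        return True
    return (now - last_success_at) > 2 * interval
```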


Circuit Breakers

Each stage has an independent circuit breaker:

  • Closed (normal): All requests pass
  • Open (tripped): Requests blocked for BREAKER_COOLDOWN (15 min)
  • Half-open: One test request allowed; success → closed, failure → open

Triggers: 5 consecutive failures trip the breaker. Worker count reduces under pressure.
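
A minimal version of that state machine, using the thresholds above (the real lib/breaker.py also scales worker counts and tracks success history):

```python
import time

FAILURE_THRESHOLD = 5      # consecutive failures that trip the breaker
COOLDOWN_S = 15 * 60       # BREAKER_COOLDOWN

class Breaker:
    def __init__(self):
        self.state, self.failures, self.opened_at = "closed", 0, 0.0

    def allow_request(self, now=None):
        now = now if now is not None else time.time()
        if self.state == "open" and now - self.opened_at >= COOLDOWN_S:
            self.state = "halfopen"  # permit one probe request
        return self.state in ("closed", "halfopen")

    def record(self, success, now=None):
        now = now if now is not None else time.time()
        if success:
            self.state, self.failures = "closed", 0
        else:
            self.failures += 1
            if self.state == "halfopen" or self.failures >= FAILURE_THRESHOLD:
                self.state, self.opened_at = "open", now
```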


Cost Management

  • Daily budget: $20 USD (OpenRouter)
  • Warning threshold: 80% of budget
  • Claude Max: Free (tracked for volume, cost = $0)
  • Budget check: Health API reports spend, pipeline can pause extraction when budget exhausted
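
The budget thresholds translate to a small classifier (sketch; the function name and return labels are illustrative):

```python
DAILY_BUDGET_USD = 20.0
WARN_PCT = 0.80  # warning threshold: 80% of budget

def budget_status(spend_usd: float) -> str:
    """Classify daily OpenRouter spend against the $20 budget."""
    pct = spend_usd / DAILY_BUDGET_USD
    if pct >= 1.0:
        return "exhausted"   # pipeline can pause extraction here
    if pct >= WARN_PCT:
        return "warning"
    return "ok"
```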

Known Issues and Deferred Work

Active Issues

  1. PR #702 in conflict: Archive-only PR, Forgejo returned 500 on merge API. Likely needs manual merge or close.
  2. 36 PRs failed Tier 0: Will not enter eval. Need either re-extraction or closure.
  3. Domain-rejected PR limbo (Ganymede warning #4): PRs rejected by domain review have status='open' but exit the eval queue. No path to re-extraction or closure. Needs domain_rejected status or auto-close mechanism.
  4. DEEP cross-family review not implemented (Ganymede warning #5): Docstring promises GPT-4o adversarial review for DEEP PRs after both domain and Leo approve. Not in code.
  5. Sonnet leniency tracking: 96% domain approval rate. Need to measure Opus disagreement rate when it comes online (Mar 13, 5pm UTC). If Opus rejects >15% of domain-approved PRs, domain prompt needs tightening.

Deferred Nits

  • entity_diff from _filter_diff() is returned but unused
  • Formal approvals use hardcoded agent order instead of actual reviewers
  • aiohttp.ClientSession created per API call (should be one per cycle)

Phase 4: Ingest Module (lib/ingest.py)

Not yet built. Will port extract-cron.sh + extract-worker.sh. When complete, the remaining v1 cron scripts can be disabled.

Phase 5: Integration + Cutover

Full pipeline test with all 4 stages. Disable remaining cron scripts. Re-enable research sessions.


Operational Runbook

Check pipeline health

ssh root@77.42.65.182 'curl -s localhost:8080/health | python3 -m json.tool'

View logs

ssh root@77.42.65.182 'journalctl -u teleo-pipeline -f'           # live
ssh root@77.42.65.182 'journalctl -u teleo-pipeline -n 50'        # recent
ssh root@77.42.65.182 'journalctl -u teleo-pipeline --since "1 hour ago"'

Restart pipeline

ssh root@77.42.65.182 'systemctl restart teleo-pipeline'

Query database

ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "SELECT status, count(*) FROM prs GROUP BY status"'

Deploy code changes

scp lib/evaluate.py root@77.42.65.182:/opt/teleo-eval/pipeline/lib/evaluate.py
ssh root@77.42.65.182 'chown teleo:teleo /opt/teleo-eval/pipeline/lib/evaluate.py && systemctl restart teleo-pipeline'

Reset a stuck PR

ssh root@77.42.65.182 "sqlite3 /opt/teleo-eval/pipeline/pipeline.db \"UPDATE prs SET status='open', leo_verdict='pending', domain_verdict='pending' WHERE number = 702\""

Check circuit breakers

ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "SELECT * FROM circuit_breakers"'

View cost breakdown

ssh root@77.42.65.182 "sqlite3 /opt/teleo-eval/pipeline/pipeline.db \"SELECT model, stage, calls, cost_usd FROM costs WHERE date = date('now') ORDER BY cost_usd DESC\""