# Teleo Infrastructure Deep Dive

## Overview

Teleo runs a **knowledge extraction and evaluation pipeline** on a single VPS. Six AI domain agents (Rio, Clay, Theseus, Vida, Astra, Leo) continuously extract claims from source material, evaluate them through a multi-stage review process, and merge approved claims into a shared knowledge base.

The system is mid-migration from **7 bash cron scripts** (v1) to a **single Python async daemon** (v2). Pipeline v2 handles validate, evaluate, and merge. Extraction still runs on v1 cron. Ingest (Phase 4) will complete the migration.

```
Source Material → Ingest → Validate → Evaluate → Merge → Knowledge Base
   (cron v1)      (stub)     (v2)       (v2)      (v2)     (git repo)
```

---

## VPS

- **Host**: `77.42.65.182` (Hetzner, Debian)
- **SSH**: `root@77.42.65.182` (key auth)
- **Disk**: 150GB, 19GB used (13%)
- **User**: `teleo` (pipeline runs as this user)
- **Base dir**: `/opt/teleo-eval/`

### Directory Layout

```
/opt/teleo-eval/
├── pipeline/                  # Pipeline v2 daemon
│   ├── teleo-pipeline.py      # Main entry point (4 async stage loops)
│   ├── pipeline.db            # SQLite WAL state store (160KB)
│   ├── .venv/                 # Python virtualenv (aiohttp)
│   └── lib/
│       ├── config.py          # All constants, model assignments, overflow policies
│       ├── db.py              # Schema, migrations, connection management
│       ├── validate.py        # Tier 0 validation (schema, links, duplicates)
│       ├── evaluate.py        # Triage + domain review + Leo review
│       ├── merge.py           # Domain-serialized rebase + Forgejo API merge
│       ├── health.py          # HTTP health API (localhost:8080)
│       ├── breaker.py         # Circuit breaker per stage
│       ├── costs.py           # API cost tracking with daily budgets
│       └── log.py             # JSON structured logging
├── workspaces/
│   ├── teleo-codex.git/       # Bare repo (49MB) — pipeline's git backend
│   └── main/                  # Main branch worktree (for validation checks)
├── mirror/
│   └── teleo-codex.git/       # Separate bare repo for GitHub↔Forgejo sync
├── secrets/
│   ├── forgejo-admin-token    # Admin Forgejo API token
│   ├── forgejo-{agent}-token  # Per-agent tokens (rio, clay, theseus, vida, astra, leo)
│   ├── github-pat             # GitHub mirror push token
│   ├── openrouter-key         # OpenRouter API key
│   ├── twitterapi-io-key      # X/Twitter API key
│   └── x-bearer-token         # X bearer token
├── logs/                      # Log files for cron scripts and pipeline
├── *.sh                       # Legacy cron scripts (being replaced)
└── eval/                      # Legacy eval scripts
```

---

## Services

### Forgejo (Git Forge)

- **Runs in**: Docker container (`codeberg.org/forgejo/forgejo:9`)
- **Ports**: 3000 (HTTP), 2222 (SSH)
- **Public URL**: `https://git.livingip.xyz`
- **Repo**: `teleo/teleo-codex`
- **Purpose**: Hosts the knowledge base repo, manages PRs, stores review comments
- **Users**: Per-agent Forgejo accounts (`rio`, `clay`, `theseus`, `vida`, `astra`, `leo`, `teleo`)

### Pipeline v2 Daemon

- **Service**: `teleo-pipeline.service` (systemd)
- **Commands**: `systemctl {start|stop|restart|status} teleo-pipeline`
- **Logs**: `journalctl -u teleo-pipeline -f`
- **Health**: `curl localhost:8080/health`
- **Shutdown**: SIGTERM → 60s drain → force-cancel → kill subprocesses (180s total)

### Active Cron Jobs (teleo user)

| Schedule | Script | Purpose |
|----------|--------|---------|
| `*/3 * * * *` | `extract-cron.sh` | Source extraction (v1, still active) |
| `*/2 * * * *` | `sync-mirror.sh` | Forgejo↔GitHub bidirectional sync |
| `*/2 * * * *` | `fetch-bare.sh` | Fetch latest into bare repo |
| `0 0 * * *` | `pipeline-health-check.sh` | Daily health metrics |
| `0 */2 * * *` | `pipeline-health-check.py` | 2-hourly health report |

### Disabled Cron Jobs (replaced by Pipeline v2)

- `fix-extraction-prs.py` — replaced by `validate.py`
- `eval-dispatcher.sh` — replaced by `evaluate.py`
- `merge-retry.sh` — replaced by `merge.py`
- Research sessions (rio, clay, theseus, vida, astra) — disabled during pipeline migration

### GitHub Mirror

- **Repo**: `github.com/user/teleo-codex` (public mirror)
- **Sync**: Bidirectional, Forgejo authoritative on conflict
- **Frequency**: Every 2 minutes via `sync-mirror.sh`
- **Security**: GitHub→Forgejo path never auto-processes branches. Only PRs trigger pipeline work.

---

## Pipeline v2 Architecture

### Stage Loop

Each stage runs as an async task with its own interval, circuit breaker, and shutdown check:

```python
async def stage_loop(name, interval, func, conn, breaker):
    while not shutdown_event.is_set():
        if breaker.allow_request():
            succeeded, failed = await func(conn, max_workers=breaker.max_workers())
            # Record success/failure for breaker
        try:
            # Sleep for `interval`, but wake immediately on shutdown
            await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # normal tick; continue to next cycle
```

| Stage | Interval | Function | Status |
|-------|----------|----------|--------|
| Ingest | 60s | `ingest_cycle()` | **Stub** — Phase 4 |
| Validate | 30s | `validate_cycle()` | **Live** |
| Evaluate | 30s | `evaluate_cycle()` | **Live** |
| Merge | 30s | `merge_cycle()` | **Live** |

### Crash Recovery

On startup, the daemon recovers interrupted state from prior crashes:

1. Sources stuck in `extracting` → increment retry counter → `unprocessed` (or `error` if budget exhausted)
2. PRs stuck in `merging` → `approved` (re-enter merge queue)
3. PRs stuck in `reviewing` → `open` (re-enter eval queue)
4. Orphan git worktrees (`/tmp/teleo-extract-*`, `/tmp/teleo-merge-*`) cleaned up

---

## Stage 1: Validate (`lib/validate.py`)

Runs Tier 0 structural validation on PRs with `status='open'` and `tier0_pass IS NULL`.

### Checks

1. **Schema validation** — YAML frontmatter has required fields (type, domain, description, confidence, source, created)
2. **Date format** — `created` field is valid YYYY-MM-DD
3. **Title format** — Prose proposition, not a label (heuristic: 8+ words, no bare noun phrases)
4. **Wiki link validity** — `[[links]]` resolve to real files in the repo
5. **Universal quantifier check** — Flags claims using "all", "always", "never", "every" without scoping
6. **Domain-directory match** — Claim's `domain` field matches its file path
7. **Description quality** — Description adds info beyond the title (not a substring)
8. **Near-duplicate detection** — Trigram similarity against existing claims
9. **Proposition heuristic** — Title passes the claim test ("This note argues that [title]" works)

### Output

- Posts a Tier 0 validation comment on the Forgejo PR (with SHA-based idempotency marker)
- Sets `tier0_pass = 1` (pass) or `tier0_pass = 0` (fail)
- Failing PRs remain `status='open'` but are excluded from the eval queue

---

## Stage 2: Evaluate (`lib/evaluate.py`)

The core intelligence stage. Domain-first, Leo-last architecture.

### PR Flow

```
PR (open, tier0_pass=1)
│
├─ Triage (Haiku/OpenRouter) → DEEP / STANDARD / LIGHT
│
├─ Domain Review (Sonnet/Claude Max → overflow GPT-4o/OpenRouter)
│   ├─ REJECT → status='open', feedback stored, Leo skipped
│   └─ APPROVE → continue to Leo
│
├─ Leo Review (Opus/Claude Max → overflow: queue only)
│   ├─ REJECT → status='open', feedback stored
│   └─ APPROVE → continue
│
├─ LIGHT tier: Leo skipped, domain-only gate
│
├─ Both approve → formal Forgejo approvals (2 agent tokens) → status='approved'
└─ Musings bypass: PRs touching only agents/*/musings/ auto-approve
```

### Model Routing

| Stage | Primary | Overflow | Policy |
|-------|---------|----------|--------|
| Triage | Haiku (OpenRouter) | — | Always API |
| Domain review | Sonnet (Claude Max) | GPT-4o (OpenRouter) | `overflow` |
| Leo review | Opus (Claude Max) | — | `queue` (never overflow) |
| DEEP cross-family | GPT-4o (OpenRouter) | — | Always API (not yet implemented) |

**Claude Max** is a subscription — free but rate-limited. When rate-limited, the CLI returns `"You've hit your limit"` on **stdout** (not stderr) with exit code 1. The pipeline detects this and applies the overflow policy.

**Key design principle**: Opus is the scarce resource. Domain review (Sonnet) filters first — high volume, catches most issues. Leo review (Opus) only sees pre-filtered PRs. This maximizes value per scarce Opus call.
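The rate-limit detection and overflow routing described above can be sketched as follows. This is a minimal sketch, not the pipeline's actual code: `run_claude_max` and `apply_overflow` are hypothetical helper names and the CLI invocation details are assumed, but the stdout marker, exit code, and per-stage policies come from the text and table above.

```python
import asyncio

LIMIT_MARKER = "You've hit your limit"
# Overflow policies per stage, from the Model Routing table
OVERFLOW_POLICY = {"domain_review": "overflow", "leo_review": "queue"}


async def run_claude_max(cmd):
    """Run the Claude Max CLI and detect the rate-limit marker.

    The marker arrives on stdout (not stderr) with exit code 1.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, _err = await proc.communicate()
    text = out.decode(errors="replace")
    rate_limited = proc.returncode == 1 and LIMIT_MARKER in text
    return rate_limited, text


def apply_overflow(stage, rate_limited):
    """Decide what a rate-limited call does next for a given stage."""
    if not rate_limited:
        return "done"
    # 'overflow' retries on OpenRouter; 'queue' waits for the next cycle
    return OVERFLOW_POLICY.get(stage, "queue")
```

Checking stdout rather than stderr is the load-bearing detail here: a naive wrapper that only inspects stderr would misclassify the rate limit as a generic failure.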
### Domain Routing

Domain detection reads diff file paths (`domains/`, `entities/`, `core/`, `foundations/`) and maps them to the responsible agent:

| Domain | Agent |
|--------|-------|
| internet-finance, mechanisms, living-capital, teleological-economics | Rio |
| entertainment, cultural-dynamics | Clay |
| ai-alignment, living-agents, critical-systems, collective-intelligence | Theseus |
| health | Vida |
| space-development | Astra |
| teleohumanity, grand-strategy | Leo |

### Backoff and Resume

- **10-minute backoff**: PRs attempted within the last 10 minutes are skipped (prevents retry storms during rate limits)
- **Domain review resume**: If domain review completed but Leo review was rate-limited, domain review is skipped on retry (no wasted OpenRouter calls)
- **`last_attempt` tracking**: Set at the start of `evaluate_pr`, persists through status revert

### Review Attribution

- Domain review comments post from the domain agent's Forgejo account (e.g., Rio posts Rio's review)
- Leo review comments post from Leo's Forgejo account
- Formal approvals come from 2 agent tokens (not the PR author)

### Verdict Parsing

Reviews end with HTML comment tags that the pipeline parses to extract the verdict.

---

## Stage 3: Merge (`lib/merge.py`)

Domain-serialized priority queue with rebase-before-merge.

### Design

- **Domain serialization**: Same-domain merges are serial (prevents `_map.md` conflicts). Cross-domain merges are parallel.
- **Two-layer locking**: `asyncio.Lock` per domain (fast path, lost on crash) + `prs.status='merging'` in SQLite (durable, crash recovery)
- **NOT EXISTS subquery**: SQL defense-in-depth prevents two PRs in the same domain from merging simultaneously

### Merge Flow

```
1. Discover external PRs (pagination over Forgejo API)
   - Detect origin: pipeline vs human (by author login)
   - Human PRs: priority='high', ack comment posted
2. For each domain with approved PRs:
   a. Claim next PR (atomic UPDATE...RETURNING with priority queue)
   b. Create git worktree at /tmp/teleo-merge-{branch}
   c. Capture expected SHA (pin for force-with-lease)
   d. Fetch origin/main, check if rebase needed
   e. Rebase onto main (abort on conflict → status='conflict')
   f. Force-push with --force-with-lease={branch}:{expected_sha}
   g. Merge via Forgejo API
   h. Delete remote branch
   i. Cleanup worktree
```

### Priority Queue

```sql
COALESCE(p.priority, s.priority, 'medium')
-- PR-level priority > source-level priority > default 'medium'
-- NULL falls to ELSE 4 (intentionally below explicit medium)
```

| Priority | Value | Use |
|----------|-------|-----|
| critical | 0 | Reserved for explicit human override |
| high | 1 | Human-submitted PRs |
| medium | 2 | Standard pipeline PRs |
| low | 3 | Explicitly deprioritized |
| NULL | 4 | Unclassified (below medium) |

### Timeouts

- **Merge timeout**: 5 minutes per PR. Exceeding it → `status='conflict'`
- **Rebase timeout**: 2 minutes
- **Push timeout**: 30 seconds
- **API merge failure**: Sets `status='conflict'` (not `approved` — prevents infinite retry)

---

## Database Schema

SQLite WAL mode. Schema version 2.
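A minimal sketch of how the daemon might open this store. The `open_db` name and the `busy_timeout` value are assumptions for illustration; WAL mode itself is stated above.

```python
import sqlite3


def open_db(path="pipeline.db"):
    """Open the pipeline state store with WAL journaling enabled."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # concurrent readers, single writer
    conn.execute("PRAGMA busy_timeout=5000")  # assumed value: wait up to 5s on lock contention
    return conn
```

WAL suits this workload: the four stage loops read frequently while only one writer commits at a time, and readers never block the writer.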
### Tables

**`sources`** — Source material pipeline

- `path` (PK), `status`, `priority`, `extraction_model`, `claims_count`, `pr_number`
- `transient_retries`, `substantive_retries`, `last_error`, `feedback`

**`prs`** — Pull request lifecycle

- `number` (PK), `source_path`, `branch`, `status`, `domain`, `tier`
- `tier0_pass`, `leo_verdict`, `domain_verdict`, `domain_agent`, `domain_model`
- `priority`, `origin` (pipeline/human), `last_attempt`

**`costs`** — API spend tracking

- `(date, model, stage)` (composite PK), `calls`, `input_tokens`, `output_tokens`, `cost_usd`

**`circuit_breakers`** — Per-stage health

- `name` (PK), `state` (closed/open/halfopen), `failures`, `successes`, `last_success_at`

**`audit_log`** — Event log

- `id`, `timestamp`, `stage`, `event`, `detail` (JSON)

### PR Status Lifecycle

```
open → validating → open (tier0_pass set) → reviewing → approved → merging → merged
                                                      → open (rejected, feedback stored)
                                                      → conflict (rebase/merge failed)
                                                      → zombie (stuck, manual intervention)
```

---

## Health API

`GET localhost:8080/health` returns:

```json
{
  "status": "healthy|degraded|stalled",
  "breakers": {
    "ingest": {"state": "closed", "failures": 0},
    "validate": {"state": "closed", "failures": 0, "last_success_age_s": 30, "stalled": false},
    "evaluate": {"state": "closed", "failures": 0, "last_success_age_s": 45, "stalled": false},
    "merge": {"state": "closed", "failures": 0}
  },
  "sources": {"unprocessed": 10, "extracting": 2},
  "prs": {"open": 117, "approved": 5, "merging": 1},
  "merge_queue_by_domain": {"internet-finance": 3, "health": 2},
  "budget": {"ok": true, "spend": 1.23, "budget": 20.0, "pct": 6.2},
  "metabolic": {
    "null_result_rate_24h": 0.05,
    "domain_approval_rate_24h": 0.96,
    "leo_approval_rate_24h": 0.85
  }
}
```

**Stall detection**: If `now() - last_success_at > 2 * interval`, the stage is stalled.
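The stall rule above can be expressed directly; `is_stalled` is a hypothetical helper mirroring the formula, with an injectable clock for testing.

```python
import time


def is_stalled(last_success_at, interval, now=None):
    """A stage is stalled when its last success is older than twice its interval."""
    now = time.time() if now is None else now
    return (now - last_success_at) > 2 * interval
```

For a 30s stage, this means the stage is flagged after 60 seconds without a successful cycle.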
---

## Circuit Breakers

Each stage has an independent circuit breaker:

- **Closed** (normal): All requests pass
- **Open** (tripped): Requests blocked for `BREAKER_COOLDOWN` (15 min)
- **Half-open**: One test request allowed; success → closed, failure → open

Triggers: 5 consecutive failures trip the breaker. Worker count reduces under pressure.

---

## Cost Management

- **Daily budget**: $20 USD (OpenRouter)
- **Warning threshold**: 80% of budget
- **Claude Max**: Free (tracked for volume, cost = $0)
- **Budget check**: Health API reports spend; pipeline can pause extraction when the budget is exhausted

---

## Known Issues and Deferred Work

### Active Issues

1. **PR #702 in `conflict`**: Archive-only PR, Forgejo returned 500 on the merge API. Likely needs manual merge or close.
2. **36 PRs failed Tier 0**: Will not enter eval. Need either re-extraction or closure.
3. **Domain-rejected PR limbo** (Ganymede warning #4): PRs rejected by domain review have `status='open'` but exit the eval queue. No path to re-extraction or closure. Needs a `domain_rejected` status or an auto-close mechanism.
4. **DEEP cross-family review not implemented** (Ganymede warning #5): Docstring promises GPT-4o adversarial review for DEEP PRs after both domain and Leo approve. Not in code.
5. **Sonnet leniency tracking**: 96% domain approval rate. Need to measure the Opus disagreement rate when it comes online (Mar 13, 5pm UTC). If Opus rejects >15% of domain-approved PRs, the domain prompt needs tightening.

### Deferred Nits

- `entity_diff` from `_filter_diff()` is returned but unused
- Formal approvals use a hardcoded agent order instead of the actual reviewers
- `aiohttp.ClientSession` created per API call (should be one per cycle)

### Phase 4: Ingest Module (`lib/ingest.py`)

Not yet built. Will port `extract-cron.sh` + `extract-worker.sh`. When complete, the remaining v1 cron scripts can be disabled.

### Phase 5: Integration + Cutover

Full pipeline test with all 4 stages. Disable remaining cron scripts. Re-enable research sessions.

---

## Operational Runbook

### Check pipeline health

```bash
ssh root@77.42.65.182 'curl -s localhost:8080/health | python3 -m json.tool'
```

### View logs

```bash
ssh root@77.42.65.182 'journalctl -u teleo-pipeline -f'                    # live
ssh root@77.42.65.182 'journalctl -u teleo-pipeline -n 50'                 # recent
ssh root@77.42.65.182 'journalctl -u teleo-pipeline --since "1 hour ago"'
```

### Restart pipeline

```bash
ssh root@77.42.65.182 'systemctl restart teleo-pipeline'
```

### Query database

```bash
ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "SELECT status, count(*) FROM prs GROUP BY status"'
```

### Deploy code changes

```bash
scp lib/evaluate.py root@77.42.65.182:/opt/teleo-eval/pipeline/lib/evaluate.py
ssh root@77.42.65.182 'chown teleo:teleo /opt/teleo-eval/pipeline/lib/evaluate.py && systemctl restart teleo-pipeline'
```

### Reset a stuck PR

```bash
ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "UPDATE prs SET status = \"open\", leo_verdict = \"pending\", domain_verdict = \"pending\" WHERE number = 702"'
```

### Check circuit breakers

```bash
ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "SELECT * FROM circuit_breakers"'
```

### View cost breakdown

```bash
ssh root@77.42.65.182 'sqlite3 /opt/teleo-eval/pipeline/pipeline.db "SELECT model, stage, calls, cost_usd FROM costs WHERE date = date(\"now\") ORDER BY cost_usd DESC"'
```
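Alongside the curl one-liner, the health payload can be checked programmatically. This is a sketch with a hypothetical `summarize_health` helper; the field names (`breakers`, `state`, `stalled`, `budget.ok`) are taken from the example `/health` response earlier in this document.

```python
import json


def summarize_health(payload):
    """Turn a /health JSON response into a list of alert strings."""
    h = json.loads(payload)
    alerts = []
    for name, breaker in h.get("breakers", {}).items():
        if breaker.get("state") != "closed":
            alerts.append(f"breaker {name} is {breaker['state']}")
        if breaker.get("stalled"):
            alerts.append(f"stage {name} stalled")
    if not h.get("budget", {}).get("ok", True):
        alerts.append("daily budget exhausted")
    return alerts
```

An empty list means nothing needs attention; anything else is a candidate for the runbook steps above.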