teleo-infrastructure/ARCHITECTURE.md
m3taversal d79ff60689 epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features
Pipeline reliability (8 fixes, reviewed by Ganymede+Rhea+Leo+Rio):
1. Merge API recovery — pre-flight approval check, transient/permanent distinction, jitter
2. Ghost PR detection — ls-remote branch check in reconciliation, network guard
3. Source status contract — directory IS status, no code change needed
4. Batch-state markers eliminated — two-gate skip (archive-check + batched branch-check)
5. Branch SHA tracking — batched ls-remote, auto-reset verdicts, dismiss stale reviews
6. Mirror pre-flight permissions — chown check in sync-mirror.sh
7. Telegram archive commit-after-write — git add/commit/push with rebase --abort fallback
8. Post-merge source archiving — queue/ → archive/{domain}/ after merge

Pipeline fixes:
- merge_cycled flag — eval attempts preserved during merge-failure cycling (Ganymede+Rhea)
- merge_failures diagnostic counter
- Startup recovery preserves eval_attempts (was incorrectly resetting to 0)
- No-diff PRs auto-closed by eval (root cause of 17 zombie PRs)
- GC threshold aligned with substantive fixer budget (was 2, now 4)
- Conflict retry with 3-attempt budget + permanent conflict handler
- Local ff-merge fallback for Forgejo 405 errors

Telegram bot:
- KB retrieval: 3-layer (entity resolution → claim search → agent context)
- Reply-to-bot handler (context.bot.id check)
- Tag regex: @teleo|@futairdbot
- Prompt rewrite for natural analyst voice
- Market data API integration (Ben's token price endpoint)
- Conversation windows (5-message unanswered counter, per-user-per-chat)
- Conversation history in prompt (last 5 exchanges)
- Worktree file lock for archive writes

Infrastructure:
- worktree_lock.py — file-based lock (flock) for main worktree coordination
- backfill-sources.py — source DB registration for Argus funnel
- batch-extract-50.sh v3 — two-gate skip, batched ls-remote, network guard
- sync-mirror.sh — auto-PR creation for mirrored GitHub branches, permission pre-flight
- Argus dashboard — conflicts + reviewing in backlog, queue count in funnel
- Enrichment-inside-frontmatter bug fix (regex anchor, not --- split)

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-20 20:17:27 +00:00

# Pipeline v2 Architecture
Single async Python daemon replacing 7 cron scripts. Four stage loops run concurrently over a SQLite WAL state store.
## System Overview
```
┌───────────────────────────────────────────────────┐
│                teleo-pipeline.py                  │
│                                                   │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐   │
│ │ Ingest  │ │ Validate │ │ Evaluate │ │ Merge │   │
│ │ (stub)  │ │   30s    │ │   30s    │ │  30s  │   │
│ └────┬────┘ └────┬─────┘ └────┬─────┘ └───┬───┘   │
│      │           │            │           │       │
│      └───────────┴─────┬──────┴───────────┘       │
│                        │                          │
│                   SQLite WAL                      │
│                  (pipeline.db)                    │
└────────────────────────┼──────────────────────────┘
                         │
              ┌──────────┴──────────┐
              │     Forgejo API     │
              │   git.livingip.xyz  │
              └─────────────────────┘
```
**Location:** `/opt/teleo-eval/pipeline/` (VPS), `~/.pentagon/workspace/collective/pipeline-v2/` (local dev)
**Process:** Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.
## Infrastructure
| Component | Detail |
|-----------|--------|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: `teleo`, repo: `teleo-codex` |
| Bare repo | `/opt/teleo-eval/workspaces/teleo-codex.git` — single-writer (fetch cron only) |
| Main worktree | `/opt/teleo-eval/workspaces/main` — refreshed by fetch, used for wiki link resolution |
| Database | `/opt/teleo-eval/pipeline/pipeline.db` — SQLite WAL mode |
| Secrets | `/opt/teleo-eval/secrets/` — per-agent Forgejo tokens, OpenRouter key |
| Logs | `/opt/teleo-eval/logs/pipeline.jsonl` — structured JSON, 50MB rotation, 7-day retention |
## PR Lifecycle
```
Source → Ingest → PR created on Forgejo
                 │
           ┌─────▼──────┐
           │  Validate  │  Tier 0: deterministic Python ($0)
           │  (tier0)   │  Schema, title, wiki links, domain match
           └─────┬──────┘
                 │ tier0_pass = 1
           ┌─────▼──────┐
           │  Tier 0.5  │  Mechanical pre-check ($0)
           │            │  Frontmatter, wiki links (ALL .md files),
           │            │  near-duplicate (warning only)
           └─────┬──────┘
                 │ passes
           ┌─────▼──────┐
           │   Triage   │  Haiku via OpenRouter (~$0.002)
           │            │  → DEEP / STANDARD / LIGHT
           └─────┬──────┘
        ┌────────┼────────┐
        │        │        │
      DEEP   STANDARD  LIGHT
        │        │        │
   ┌────▼────┐ ┌─▼───┐ ┌──▼───────────┐
   │ Domain  │ │same │ │   skip or    │
   │ GPT-4o  │ │     │ │ auto-approve │
   │ (OpenR) │ │     │ │ (LIGHT_SKIP) │
   └────┬────┘ └─┬───┘ └──────────────┘
        │        │
   ┌────▼────┐ ┌─▼───────┐
   │   Leo   │ │   Leo   │
   │  Opus   │ │ Sonnet  │
   │ (Claude │ │ (OpenR) │
   │   Max)  │ │         │
   └────┬────┘ └─┬───────┘
        │        │
        └───┬────┘
     ┌──────▼──────┐
     │ Disposition │  Retry budget, issue classification
     └──────┬──────┘
            │ both approve
     ┌──────▼──────┐
     │    Merge    │  Rebase + API merge, domain-serialized
     └─────────────┘
```
## Stage 1: Ingest (stub)
**Status:** Not implemented in pipeline v2. Sources were processed by old cron scripts (`extract-cron.sh`, `openrouter-extract.py`). All extraction crons are currently **disabled**.
**Interval:** 60s
**What it will do:** Scan `inbox/` for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in `sources` table.
## Stage 2: Validate (Tier 0)
**Module:** `lib/validate.py`
**Interval:** 30s
**Cost:** $0 (pure Python)
Deterministic validation gate. Finds PRs with `status='open'` and `tier0_pass IS NULL`.
### Checks performed (per claim file)
| Check | Type | Action |
|-------|------|--------|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if `domain:` field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |
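The two SequenceMatcher warnings reduce to one helper; a sketch (helper name and lowercase normalization are illustrative assumptions, not the actual `lib/validate.py` code):

```python
from difflib import SequenceMatcher

def too_similar(a: str, b: str, threshold: float) -> bool:
    """ratio() is in [0, 1]; above the threshold the pair gets a warning tag.
    Lowercasing both strings first is an assumption, not confirmed behavior."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold
```

The same helper covers both rows: description-vs-title uses 0.75, title-vs-existing-titles uses 0.85.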
### SHA-based idempotency
Each validation posts a comment with `<!-- TIER0-VALIDATION:{sha} -->`. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.
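A minimal sketch of the idempotency check, assuming PR comments arrive as plain strings (function and parameter names are hypothetical; the marker format is the one above):

```python
TIER0_MARKER = "<!-- TIER0-VALIDATION:{sha} -->"

def needs_validation(head_sha: str, comment_bodies: list[str]) -> bool:
    """Skip validation if any existing comment carries the current HEAD SHA."""
    marker = TIER0_MARKER.format(sha=head_sha)
    return not any(marker in body for body in comment_bodies)
```

A force-push changes `head_sha`, so no comment matches and validation runs again.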
### On new commits: full eval reset
When Tier 0 runs on a PR, it unconditionally resets:
- `eval_attempts = 0`
- `eval_issues = '[]'`
- `domain_verdict = 'pending'`, `leo_verdict = 'pending'`
This gives the PR a fresh evaluation cycle after any code change.
## Stage 2.5: Tier 0.5 (Mechanical Pre-check)
**Location:** `_tier05_mechanical_check()` in `lib/evaluate.py`
**Cost:** $0 (pure Python)
**Runs:** Inside `evaluate_pr()`, after musings bypass, before triage.
Catches mechanical issues that domain review (GPT-4o) tends to rubber-stamp and that Leo then rejects without structured issue tags.
### Checks
| Check | Scope | Action |
|-------|-------|--------|
| Frontmatter schema (parse + validate) | New files in claim dirs only | **Gate** (block) |
| Wiki link resolution | **ALL .md files** in diff | **Gate** (block) |
| Near-duplicate detection | New files in claim dirs only | **Tag only** (warning, LLM decides) |
### Key design decisions
- **Wiki links checked on all .md files**, not just claim directories. Agent files (`agents/*/beliefs.md`, etc.) frequently contain broken `[[links]]` that Tier 0.5 must catch before Opus wastes time on them.
- **Modified files only get wiki link checks** — they have partial content from diff, so frontmatter parsing is unreliable.
- **Near-duplicate is never a gate** — similarity is a judgment call for the LLM reviewer.
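A sketch of the wiki-link gate, under one assumption not stated above — that `[[target]]` resolves by matching the target against `.md` file stems anywhere in the worktree:

```python
import re
from pathlib import Path

# Capture the link target up to any |alias or #anchor inside [[...]]
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_wiki_links(text: str, kb_root: Path) -> list[str]:
    """Return [[targets]] with no matching .md file anywhere under kb_root."""
    known = {p.stem for p in kb_root.rglob("*.md")}
    return [t.strip() for t in WIKI_LINK.findall(text) if t.strip() not in known]
```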
### On failure
Posts Forgejo comment with issue tags (`<!-- ISSUES: tag1, tag2 -->`), sets `status='open'`, runs disposition. Counts as an eval attempt.
## Stage 3: Evaluate
**Module:** `lib/evaluate.py`
**Interval:** 30s
**Finds:** PRs with `status='open'`, `tier0_pass=1`, pending verdicts, `eval_attempts < MAX_EVAL_ATTEMPTS`
### 3a. Musings Bypass
If a PR only modifies files in `agents/*/musings/`, it's auto-approved immediately. No review needed.
### 3b. Triage
**Model:** Haiku via OpenRouter (~$0.002/call)
Classifies PR into exactly one tier:
| Tier | Criteria | Review path |
|------|----------|-------------|
| **DEEP** | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| **STANDARD** | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| **LIGHT** | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |
**When uncertain, classify UP.** Always err toward more review.
### Tier Overrides (post-triage)
Two overrides run after triage, in order. Both check `tier == "LIGHT"` so no double-upgrade is possible.
1. **Claim-shape detector** — If any `+` line in the diff contains `type: claim` (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.
2. **Random pre-merge promotion** — 15% of remaining LIGHT PRs get upgraded to STANDARD. Makes gaming unpredictable — extraction agents can't know which LIGHT PRs get full review.
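Both overrides fit in a few lines; a sketch (regex and function signature are illustrative, not the actual `lib/evaluate.py` code — early return after the first upgrade is what makes double-upgrade impossible):

```python
import random
import re

# Matches `type: claim` on an added diff line, any YAML quoting variant
CLAIM_SHAPE = re.compile(r"""^\+\s*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)

def apply_overrides(tier: str, diff: str, promotion_rate: float = 0.15,
                    rng=random.random) -> str:
    if tier != "LIGHT":
        return tier
    if CLAIM_SHAPE.search(diff):
        return "STANDARD"      # 1. claim-shape detector ($0, deterministic)
    if rng() < promotion_rate:
        return "STANDARD"      # 2. random pre-merge promotion (15%)
    return tier
```

The injectable `rng` makes the random branch testable; production just uses `random.random`.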
### 3c. Domain Review
**Model:** GPT-4o via OpenRouter
**Skipped when:** `LIGHT_SKIP_LLM=True` (config flag), or already completed from prior attempt
Reviews 4 criteria:
1. Factual accuracy
2. Intra-PR duplicates (same evidence copy-pasted across files)
3. Confidence calibration
4. Wiki link validity
**Verdict rules:** APPROVE if factually correct even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).
**If domain rejects:** Leo review is skipped entirely (saves Opus/Sonnet).
### 3d. Leo Review
**Model:** Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD)
**Skipped when:** LIGHT tier, or domain review rejected
DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).
### Verdicts
**There are exactly two verdicts:** `APPROVE` and `REQUEST_CHANGES`. There is no `REJECT` verdict.
Verdicts are parsed from structured tags in the review:
```
<!-- VERDICT:LEO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->
```
If no parseable verdict is found, defaults to `request_changes`.
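A sketch of the parse-with-default behavior (function name is hypothetical):

```python
import re

VERDICT = re.compile(r"<!-- VERDICT:LEO:(APPROVE|REQUEST_CHANGES) -->")

def parse_leo_verdict(review: str) -> str:
    """Unparseable reviews default to request_changes, never approve."""
    m = VERDICT.search(review)
    return m.group(1).lower() if m else "request_changes"
```

Defaulting to `request_changes` is the safe failure mode: a malformed review can never merge a PR.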
### Issue Tags
Reviews tag specific issues using structured comments:
```
<!-- ISSUES: broken_wiki_links, frontmatter_schema -->
```
**Valid tags:**
| Tag | Category | Description |
|-----|----------|-------------|
| `broken_wiki_links` | Mechanical | `[[links]]` that don't resolve to existing files |
| `frontmatter_schema` | Mechanical | Missing/invalid YAML fields |
| `near_duplicate` | Mechanical | Title too similar to existing claim (>85%) |
| `factual_discrepancy` | Substantive | Factual errors in the claim |
| `confidence_miscalibration` | Substantive | Confidence level doesn't match evidence |
| `scope_error` | Substantive | Claim scope too broad/narrow |
| `title_overclaims` | Substantive | Title makes stronger claim than evidence supports |
| `date_errors` | — | Invalid or incorrect dates |
**Tag inference fallback:** If a review rejects without structured `<!-- ISSUES: -->` tags, `_infer_issues_from_prose()` scans the review text with conservative regex patterns to extract issue tags. 7 categories, 2-4 keyword patterns each.
### Review Style Guide
All review prompts include the style guide requiring per-criterion findings:
- "You MUST show your work"
- "For each criterion, write one sentence with your finding"
- "'Everything passes' with no evidence of checking will be treated as review failures"
Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in `/opt/teleo-eval/secrets/`).
## Retry Budget and Disposition
### Eval Attempts
**Hard cap:** `MAX_EVAL_ATTEMPTS = 3`
Each time `evaluate_pr()` runs, it increments `eval_attempts` before any checks. This means Tier 0.5 failures count as eval attempts.
### Issue Classification
Issues are classified as:
- **Mechanical:** `frontmatter_schema`, `broken_wiki_links`, `near_duplicate`
- **Substantive:** `factual_discrepancy`, `confidence_miscalibration`, `scope_error`, `title_overclaims`
- **Mixed:** Both types present
- **Unknown:** Tags not in either set
### Disposition Logic
| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---------|----------------|--------------------------|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | **Keep open** for one more try | **Terminate** (close PR, requeue source) |
| 3+ | **Terminate** | **Terminate** |
**Terminate** means: close PR on Forgejo with explanation comment, update DB status to `closed`, tag source for re-extraction (if source_path linked).
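The table reduces to a small function; a sketch with hypothetical return labels (`retry` = back to open, `terminate` = close and requeue):

```python
def disposition(attempt: int, issue_class: str) -> str:
    """Map (eval attempt number, issue classification) to the table above."""
    if attempt >= 3:
        return "terminate"
    if attempt == 2 and issue_class != "mechanical":
        return "terminate"     # substantive/mixed/unknown get no second retry
    return "retry"
```

The asymmetry is deliberate: mechanical issues are cheap to fix, so they earn one extra attempt.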
### SHA-based Reset
When Tier 0 validates a new commit (new HEAD SHA), it resets `eval_attempts = 0` and all verdicts to `pending`. This gives the PR a completely fresh evaluation cycle after any code change.
## Stage 4: Merge
**Module:** `lib/merge.py`
**Interval:** 30s
### Domain Serialization
Merges are serialized per-domain (one merge at a time per domain) but parallel across domains. Two layers enforce this:
1. `asyncio.Lock` per domain (fast path, lost on crash)
2. SQL `NOT EXISTS` check for `status='merging'` in same domain (defense-in-depth)
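A sketch of the SQL layer, with a hypothetical `prs` table and `ORDER BY id` standing in for the real priority ordering (`RETURNING` needs SQLite ≥ 3.35):

```python
import sqlite3

def claim_pr(db: sqlite3.Connection, domain: str):
    """Atomically claim one approved PR, refusing while any PR in the same
    domain is already merging. Returns the claimed PR id, or None."""
    row = db.execute(
        """
        UPDATE prs SET status = 'merging'
        WHERE id = (
            SELECT id FROM prs
            WHERE status = 'approved' AND domain = ?
              AND NOT EXISTS (
                  SELECT 1 FROM prs WHERE status = 'merging' AND domain = ?
              )
            ORDER BY id          -- real code orders by priority, not id
            LIMIT 1
        )
        RETURNING id
        """,
        (domain, domain),
    ).fetchone()
    return row[0] if row else None
```

Because the `NOT EXISTS` guard and the claim happen in one statement, it holds even if the `asyncio.Lock` layer is lost to a crash.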
### Merge Flow
1. **Discover external PRs** — Scan Forgejo for open PRs not in SQLite. Human PRs get `priority='high'` and an acknowledgment comment.
2. **Claim next approved PR** — Atomic `UPDATE ... RETURNING` with priority ordering: `critical > high > medium > low > unclassified`. PR priority overrides source priority.
3. **Rebase onto main** — Creates temp worktree, rebases, force-pushes with `--force-with-lease` pinned to expected SHA (defeats tracking-ref race).
4. **Merge via Forgejo API** — Checks if already merged/closed first (prevents 405 on ghost PRs).
5. **Cleanup** — Delete remote branch, prune worktree metadata.
### Merge Timeout
5 minutes max per merge. If exceeded, force-reset to `status='conflict'`.
### Formal Approvals
After both verdicts approve, `_post_formal_approvals()` submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.
## Model Routing
**Design principle:** Model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.
| Stage | Model | Backend | Cost |
|-------|-------|---------|------|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |
### Opus Rate Limit Handling
When Claude Max Opus hits rate limit:
1. Set 15-minute global backoff
2. During backoff: STANDARD PRs still flow (Sonnet via OpenRouter), DEEP PRs queue
3. Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter)
4. After cooldown: resume full eval
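A sketch of the routing decision during backoff (backend labels and signature are illustrative):

```python
def route_leo(tier: str, now: float, backoff_until: float) -> str:
    """DEEP waits out an Opus backoff; STANDARD always has a Sonnet path."""
    if tier == "DEEP":
        return "queue" if now < backoff_until else "claude-max/opus"
    return "openrouter/sonnet"
```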
### Overflow Policies
Per-stage behavior when Claude Max is rate-limited:
| Stage | Policy | Behavior |
|-------|--------|----------|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |
## Circuit Breakers
Per-stage circuit breakers backed by SQLite. Three states:
| State | Behavior |
|-------|----------|
| **CLOSED** | Normal operation |
| **OPEN** | Stage paused (5 consecutive failures) |
| **HALFOPEN** | Cooldown expired (15 min), probe with 1 worker |
A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
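A sketch of the state machine, with an injectable clock for testing (the real breaker persists its state in SQLite; this one is in-memory only):

```python
import time

class Breaker:
    """CLOSED → OPEN after `threshold` consecutive failures; OPEN → HALFOPEN
    once `cooldown` seconds elapse; HALFOPEN closes on a successful probe
    and reopens on a failed one."""

    def __init__(self, threshold: int = 5, cooldown: float = 900,
                 clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALFOPEN"
        return "OPEN"

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None      # close
        else:
            probing = self.state() == "HALFOPEN"
            self.failures += 1
            if self.failures >= self.threshold or probing:
                self.opened_at = self.clock()            # open / reopen
```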
## Crash Recovery
On startup, the pipeline recovers interrupted state:
- Sources stuck in `extracting` → `unprocessed` (with retry counter increment; if exhausted → `error`)
- PRs stuck in `merging` → `approved` (re-merge attempt)
- PRs stuck in `reviewing` → `open` (re-evaluate)
Orphan worktrees from `/tmp/teleo-extract-*` and `/tmp/teleo-merge-*` are cleaned up.
## Domain → Agent Mapping
Every domain has exactly one primary reviewing agent:
| Domain | Agent | Territory |
|--------|-------|-----------|
| internet-finance | Rio | `domains/internet-finance/` |
| entertainment | Clay | `domains/entertainment/` |
| health | Vida | `domains/health/` |
| ai-alignment | Theseus | `domains/ai-alignment/` |
| space-development | Astra | `domains/space-development/` |
| mechanisms | Rio | `core/mechanisms/` |
| living-capital | Rio | `core/living-capital/` |
| living-agents | Theseus | `core/living-agents/` |
| teleohumanity | Leo | `core/teleohumanity/` |
| grand-strategy | Leo | `core/grand-strategy/` |
| critical-systems | Theseus | `foundations/critical-systems/` |
| collective-intelligence | Theseus | `foundations/collective-intelligence/` |
| teleological-economics | Rio | `foundations/teleological-economics/` |
| cultural-dynamics | Clay | `foundations/cultural-dynamics/` |
Domain detection from diff: counts file path occurrences in `domains/`, `entities/`, `core/`, `foundations/` subdirectories. Most-referenced domain wins.
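A sketch of most-referenced-wins detection (the regex is an assumption about exact path shapes; `lib/domains.py` is the authoritative version):

```python
import re
from collections import Counter

DOMAIN_DIRS = re.compile(r"(?:domains|entities|core|foundations)/([^/]+)/")

def detect_domain(paths: list[str]):
    """Count domain subdirectory occurrences across changed file paths;
    the most-referenced domain wins. Returns None if no path matches."""
    counts = Counter()
    for p in paths:
        m = DOMAIN_DIRS.search(p)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(1)[0][0] if counts else None
```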
## Key Configuration (`lib/config.py`)
| Setting | Value | Purpose |
|---------|-------|---------|
| `MAX_EVAL_ATTEMPTS` | 3 | Hard cap on eval cycles per PR |
| `EVAL_TIMEOUT` | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| `MAX_EVAL_WORKERS` | 7 | Max concurrent eval tasks per cycle |
| `MERGE_TIMEOUT` | 300s | Force-reset to conflict if exceeded |
| `BREAKER_THRESHOLD` | 5 | Consecutive failures to trip breaker |
| `BREAKER_COOLDOWN` | 900s | 15 min before half-open probe |
| `LIGHT_SKIP_LLM` | false | When true, LIGHT PRs skip all LLM review |
| `LIGHT_PROMOTION_RATE` | 0.15 | Random LIGHT → STANDARD upgrade rate |
| `DEDUP_THRESHOLD` | 0.85 | SequenceMatcher near-duplicate threshold |
| `OPENROUTER_DAILY_BUDGET` | $20 | Daily cost cap for OpenRouter |
| `SAMPLE_AUDIT_RATE` | 0.15 | Pre-merge audit sampling rate |
## Module Map
| Module | Responsibility |
|--------|---------------|
| `teleo-pipeline.py` | Main entry, stage loops, shutdown, crash recovery |
| `lib/evaluate.py` | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| `lib/validate.py` | Tier 0 validation, frontmatter parsing, all deterministic checks |
| `lib/merge.py` | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| `lib/llm.py` | Prompt templates, OpenRouter transport, Claude CLI transport |
| `lib/forgejo.py` | Forgejo API client, diff fetching, agent token management |
| `lib/domains.py` | Domain↔agent mapping, domain detection from diff/branch |
| `lib/config.py` | All constants, paths, model IDs, thresholds |
| `lib/db.py` | SQLite connection, migrations, audit logging, transactions |
| `lib/breaker.py` | Per-stage circuit breaker state machine |
| `lib/costs.py` | OpenRouter cost tracking and budget enforcement |
| `lib/health.py` | HTTP health endpoint (port 8080) |
| `lib/log.py` | Structured JSON logging setup |
## Known Issues and Gaps
1. **Ingest stage is a stub** — Sources are not being ingested into pipeline v2. Old cron scripts (disabled) handled extraction.
2. **No auto-fixer** — When Tier 0.5 or reviews reject for mechanical issues, there's no automated fix. PRs just consume eval attempts until terminal.
3. **`broken_wiki_links` is systemic** — Extraction agents create `[[links]]` to claims that don't exist in the KB. This is the #1 rejection reason. Root cause is extraction prompt quality, not eval.
4. **Sequential eval processing** — `evaluate_cycle()` processes PRs in a for-loop, not concurrent `asyncio.gather`. Only one Opus review runs at a time.
5. **Source re-extraction not wired** — `_terminate_pr()` tags sources for `needs_reextraction` but sources table is empty (never populated by pipeline v2).
## Design Decisions Log
| Decision | Rationale | Author |
|----------|-----------|--------|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots + keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents `_map.md` merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts semantic problem (is this a real claim?) to mechanical check (does YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |