# Pipeline v2 Architecture

Single async Python daemon replacing 7 cron scripts. Four stage loops run concurrently, sharing a SQLite WAL state store.

## System Overview

```
┌─────────────────────────────────────────────────────┐
│                  teleo-pipeline.py                  │
│                                                     │
│  ┌────────┐  ┌──────────┐  ┌──────────┐  ┌───────┐  │
│  │ Ingest │  │ Validate │  │ Evaluate │  │ Merge │  │
│  │ (stub) │  │   30s    │  │   30s    │  │  30s  │  │
│  └───┬────┘  └────┬─────┘  └────┬─────┘  └───┬───┘  │
│      │            │             │            │      │
│      └────────────┴──────┬──────┴────────────┘      │
│                          ▼                          │
│                     SQLite WAL                      │
│                    (pipeline.db)                    │
└──────────────────────────┬──────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │     Forgejo API     │
                │   git.livingip.xyz  │
                └─────────────────────┘
```

**Location:** `/opt/teleo-eval/pipeline/` (VPS), `~/.pentagon/workspace/collective/pipeline-v2/` (local dev)

**Process:** Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.

## Infrastructure

| Component | Detail |
|-----------|--------|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: `teleo`, repo: `teleo-codex` |
| Bare repo | `/opt/teleo-eval/workspaces/teleo-codex.git` — single-writer (fetch cron only) |
| Main worktree | `/opt/teleo-eval/workspaces/main` — refreshed by fetch, used for wiki link resolution |
| Database | `/opt/teleo-eval/pipeline/pipeline.db` — SQLite WAL mode |
| Secrets | `/opt/teleo-eval/secrets/` — per-agent Forgejo tokens, OpenRouter key |
| Logs | `/opt/teleo-eval/logs/pipeline.jsonl` — structured JSON, 50MB rotation, 7-day retention |

## PR Lifecycle

```
Source → Ingest → PR created on Forgejo
                     │
              ┌──────▼──────┐
              │  Validate   │  Tier 0: deterministic Python ($0)
              │  (tier0)    │  Schema, title, wiki links, domain match
              └──────┬──────┘
                     │ tier0_pass = 1
              ┌──────▼──────┐
              │  Tier 0.5   │  Mechanical pre-check ($0)
              │             │  Frontmatter, wiki links (ALL .md files),
              │             │  near-duplicate (warning only)
              └──────┬──────┘
                     │ passes
              ┌──────▼──────┐
              │   Triage    │  Haiku via OpenRouter (~$0.002)
              │             │  → DEEP / STANDARD / LIGHT
              └──────┬──────┘
                     │
          ┌──────────┼──────────┐
        DEEP      STANDARD    LIGHT
          │          │          │
     ┌────▼────┐  ┌──▼──┐  ┌────▼─────────┐
     │ Domain  │  │same │  │ skip or      │
     │ GPT-4o  │  │     │  │ auto-approve │
     │ (OpenR) │  │     │  │ (LIGHT_SKIP) │
     └────┬────┘  └──┬──┘  └──────────────┘
          │          │
     ┌────▼────┐  ┌──▼──────┐
     │  Leo    │  │  Leo    │
     │  Opus   │  │ Sonnet  │
     │ (Claude │  │ (OpenR) │
     │  Max)   │  │         │
     └────┬────┘  └──┬──────┘
          │          │
          └────┬─────┘
               │
        ┌──────▼──────┐
        │ Disposition │  Retry budget, issue classification
        └──────┬──────┘
               │ both approve
        ┌──────▼──────┐
        │    Merge    │  Rebase + API merge, domain-serialized
        └─────────────┘
```

## Stage 1: Ingest (stub)

**Status:** Not implemented in pipeline v2. Sources were processed by old cron scripts (`extract-cron.sh`, `openrouter-extract.py`). All extraction crons are currently **disabled**.

**Interval:** 60s

**What it will do:** Scan `inbox/` for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in `sources` table.

## Stage 2: Validate (Tier 0)

**Module:** `lib/validate.py`
**Interval:** 30s
**Cost:** $0 (pure Python)

Deterministic validation gate. Finds PRs with `status='open'` and `tier0_pass IS NULL`.
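The Tier 0 selection step can be sketched as follows. Only `status` and `tier0_pass` come from this document; the `prs` table name and `id` column are assumptions for illustration.

```python
import sqlite3

def pending_tier0_prs(db_path="pipeline.db"):
    """Find PRs awaiting Tier 0 validation: open, never validated."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # matches the pipeline's WAL mode
    rows = conn.execute(
        "SELECT id FROM prs WHERE status = 'open' AND tier0_pass IS NULL"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

The `IS NULL` check (rather than `= 0`) is what distinguishes "never validated" from "validated and failed".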
### Checks performed (per claim file)

| Check | Type | Action |
|-------|------|--------|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if `domain:` field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |

### SHA-based idempotency

Each validation posts a comment with ``. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.

### On new commits: full eval reset

When Tier 0 runs on a PR, it unconditionally resets:

- `eval_attempts = 0`
- `eval_issues = '[]'`
- `domain_verdict = 'pending'`, `leo_verdict = 'pending'`

This gives the PR a fresh evaluation cycle after any code change.

## Stage 2.5: Tier 0.5 (Mechanical Pre-check)

**Location:** `_tier05_mechanical_check()` in `lib/evaluate.py`
**Cost:** $0 (pure Python)
**Runs:** Inside `evaluate_pr()`, after musings bypass, before triage.

Catches mechanical issues that domain review (GPT-4o) rubber-stamps and Leo rejects without structured issue tags.
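The idempotency check and the eval reset can be sketched together. The `tier0:<sha>` marker format is a placeholder assumption (the real comment format is elided above); the dict-based PR record is illustrative, since the real state lives in SQLite.

```python
def should_validate(head_sha: str, existing_comments: list[str]) -> bool:
    """Skip re-validation when a marker comment for this HEAD SHA exists."""
    marker = f"tier0:{head_sha}"  # hypothetical marker format
    return not any(marker in c for c in existing_comments)

def reset_eval_state(pr: dict) -> None:
    """Unconditional reset on a new commit: fresh evaluation cycle."""
    pr["eval_attempts"] = 0
    pr["eval_issues"] = "[]"
    pr["domain_verdict"] = "pending"
    pr["leo_verdict"] = "pending"
```

A force-push produces a new HEAD SHA, so `should_validate` returns `True` again and the reset gives the PR a clean slate.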
### Checks

| Check | Scope | Action |
|-------|-------|--------|
| Frontmatter schema (parse + validate) | New files in claim dirs only | **Gate** (block) |
| Wiki link resolution | **ALL .md files** in diff | **Gate** (block) |
| Near-duplicate detection | New files in claim dirs only | **Tag only** (warning, LLM decides) |

### Key design decisions

- **Wiki links checked on all .md files**, not just claim directories. Agent files (`agents/*/beliefs.md`, etc.) frequently contain broken `[[links]]` that Tier 0.5 must catch before Opus wastes time on them.
- **Modified files only get wiki link checks** — they have partial content from the diff, so frontmatter parsing is unreliable.
- **Near-duplicate is never a gate** — similarity is a judgment call for the LLM reviewer.

### On failure

Posts a Forgejo comment with issue tags (``), sets `status='open'`, runs disposition. Counts as an eval attempt.

## Stage 3: Evaluate

**Module:** `lib/evaluate.py`
**Interval:** 30s
**Finds:** PRs with `status='open'`, `tier0_pass=1`, pending verdicts, `eval_attempts < MAX_EVAL_ATTEMPTS`

### 3a. Musings Bypass

If a PR only modifies files in `agents/*/musings/`, it's auto-approved immediately. No review needed.

### 3b. Triage

**Model:** Haiku via OpenRouter (~$0.002/call)

Classifies each PR into exactly one tier:

| Tier | Criteria | Review path |
|------|----------|-------------|
| **DEEP** | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| **STANDARD** | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| **LIGHT** | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |

**When uncertain, classify UP.** Always err toward more review.

### Tier Overrides (post-triage)

Two overrides run after triage, in order. Both check `tier == "LIGHT"`, so no double upgrade is possible.

1. **Claim-shape detector** — If any `+` line in the diff contains `type: claim` (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.
2. **Random pre-merge promotion** — 15% of remaining LIGHT PRs are upgraded to STANDARD. Makes gaming unpredictable — extraction agents can't know which LIGHT PRs get full review.

### 3c. Domain Review

**Model:** GPT-4o via OpenRouter
**Skipped when:** `LIGHT_SKIP_LLM=True` (config flag), or already completed in a prior attempt

Reviews 4 criteria:

1. Factual accuracy
2. Intra-PR duplicates (same evidence copy-pasted across files)
3. Confidence calibration
4. Wiki link validity

**Verdict rules:** APPROVE if factually correct, even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).

**If domain rejects:** Leo review is skipped entirely (saves Opus/Sonnet).

### 3d. Leo Review

**Model:** Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD)
**Skipped when:** LIGHT tier, or domain review rejected

DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).

### Verdicts

**There are exactly two verdicts:** `APPROVE` and `REQUEST_CHANGES`. There is no `REJECT` verdict.

Verdicts are parsed from structured tags in the review. If no parseable verdict is found, the verdict defaults to `request_changes`.
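The fail-closed default can be sketched as below. The `VERDICT: X` tag syntax is a placeholder assumption; the real structured tag format is defined in the review prompt templates and is not reproduced in this document.

```python
import re

# Placeholder tag syntax; only the two verdict names come from this document.
VERDICT_RE = re.compile(r"VERDICT:\s*(APPROVE|REQUEST_CHANGES)")

def parse_verdict(review_text: str) -> str:
    """Extract a verdict; anything unparseable fails closed."""
    m = VERDICT_RE.search(review_text)
    if m:
        return m.group(1).lower()
    return "request_changes"  # no parseable verdict → request changes
```

Defaulting to `request_changes` means a malformed or truncated review can never accidentally approve a PR.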
### Issue Tags

Reviews tag specific issues using structured comments.

**Valid tags:**

| Tag | Category | Description |
|-----|----------|-------------|
| `broken_wiki_links` | Mechanical | `[[links]]` that don't resolve to existing files |
| `frontmatter_schema` | Mechanical | Missing/invalid YAML fields |
| `near_duplicate` | Mechanical | Title too similar to an existing claim (>85%) |
| `factual_discrepancy` | Substantive | Factual errors in the claim |
| `confidence_miscalibration` | Substantive | Confidence level doesn't match evidence |
| `scope_error` | Substantive | Claim scope too broad/narrow |
| `title_overclaims` | Substantive | Title makes a stronger claim than the evidence supports |
| `date_errors` | — | Invalid or incorrect dates |

**Tag inference fallback:** If a review rejects without structured `` tags, `_infer_issues_from_prose()` scans the review text with conservative regex patterns to extract issue tags: 7 categories, with 2–4 keyword patterns each.

### Review Style Guide

All review prompts include a style guide requiring per-criterion findings:

- "You MUST show your work"
- "For each criterion, write one sentence with your finding"
- "'Everything passes' with no evidence of checking will be treated as review failures"

Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in `/opt/teleo-eval/secrets/`).

## Retry Budget and Disposition

### Eval Attempts

**Hard cap:** `MAX_EVAL_ATTEMPTS = 3`

Each time `evaluate_pr()` runs, it increments `eval_attempts` before any checks. This means Tier 0.5 failures count as eval attempts.
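A minimal sketch of the attempt budget; the dict-based PR record is illustrative, since the real counter lives in the SQLite `eval_attempts` column.

```python
MAX_EVAL_ATTEMPTS = 3  # hard cap from lib/config.py

def consume_eval_attempt(pr: dict) -> bool:
    """Increment before any checks (so a Tier 0.5 failure also burns an
    attempt); return False once the budget is exhausted."""
    if pr["eval_attempts"] >= MAX_EVAL_ATTEMPTS:
        return False  # cap reached; disposition takes over
    pr["eval_attempts"] += 1
    return True
```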
### Issue Classification

Issues are classified as:

- **Mechanical:** `frontmatter_schema`, `broken_wiki_links`, `near_duplicate`
- **Substantive:** `factual_discrepancy`, `confidence_miscalibration`, `scope_error`, `title_overclaims`
- **Mixed:** both types present
- **Unknown:** tags not in either set

### Disposition Logic

| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---------|----------------|--------------------------|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | **Keep open** for one more try | **Terminate** (close PR, requeue source) |
| 3+ | **Terminate** | **Terminate** |

**Terminate** means: close the PR on Forgejo with an explanation comment, update DB status to `closed`, and tag the source for re-extraction (if a source_path is linked).

### SHA-based Reset

When Tier 0 validates a new commit (new HEAD SHA), it resets `eval_attempts = 0` and all verdicts to `pending`. This gives the PR a completely fresh evaluation cycle after any code change.

## Stage 4: Merge

**Module:** `lib/merge.py`
**Interval:** 30s

### Domain Serialization

Merges are serialized per domain (one merge at a time per domain) but run in parallel across domains. Two layers enforce this:

1. `asyncio.Lock` per domain (fast path, lost on crash)
2. SQL `NOT EXISTS` check for `status='merging'` in the same domain (defense-in-depth)

### Merge Flow

1. **Discover external PRs** — Scan Forgejo for open PRs not in SQLite. Human PRs get `priority='high'` and an acknowledgment comment.
2. **Claim next approved PR** — Atomic `UPDATE ... RETURNING` with priority ordering: `critical > high > medium > low > unclassified`. PR priority overrides source priority.
3. **Rebase onto main** — Create a temp worktree, rebase, force-push with `--force-with-lease` pinned to the expected SHA (defeats the tracking-ref race).
4. **Merge via Forgejo API** — Check whether the PR is already merged/closed first (prevents 405 on ghost PRs).
5. **Cleanup** — Delete the remote branch, prune worktree metadata.
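The atomic claim step (step 2), combined with the SQL defense-in-depth layer of domain serialization, can be sketched as one statement. Table and column names are assumptions for illustration; `RETURNING` requires SQLite 3.35+.

```python
import sqlite3

def claim_next_approved(conn: sqlite3.Connection, domain: str):
    """Atomically claim the highest-priority approved PR in a domain,
    skipping the whole domain if a merge is already in flight there."""
    row = conn.execute(
        """
        UPDATE prs SET status = 'merging'
        WHERE id = (
            SELECT p.id FROM prs p
            WHERE p.status = 'approved' AND p.domain = :domain
              AND NOT EXISTS (          -- same-domain merge in flight?
                  SELECT 1 FROM prs m
                  WHERE m.domain = :domain AND m.status = 'merging'
              )
            ORDER BY CASE p.priority    -- critical > high > medium > low
                WHEN 'critical' THEN 0 WHEN 'high' THEN 1
                WHEN 'medium' THEN 2 WHEN 'low' THEN 3 ELSE 4 END
            LIMIT 1
        )
        RETURNING id
        """,
        {"domain": domain},
    ).fetchone()
    conn.commit()
    return row[0] if row else None
```

Because the `NOT EXISTS` guard sits inside the same statement as the `UPDATE`, a crashed `asyncio.Lock` cannot let two merges into one domain.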
### Merge Timeout

5 minutes max per merge. If exceeded, force-reset to `status='conflict'`.

### Formal Approvals

After both verdicts approve, `_post_formal_approvals()` submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.

## Model Routing

**Design principle:** model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.

| Stage | Model | Backend | Cost |
|-------|-------|---------|------|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |

### Opus Rate Limit Handling

When Claude Max Opus hits its rate limit:

1. Set a 15-minute global backoff.
2. During backoff: STANDARD PRs still flow (Sonnet via OpenRouter); DEEP PRs queue.
3. Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter).
4. After cooldown: resume full eval.

### Overflow Policies

Per-stage behavior when Claude Max is rate-limited:

| Stage | Policy | Behavior |
|-------|--------|----------|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |

## Circuit Breakers

Per-stage circuit breakers backed by SQLite. Three states:

| State | Behavior |
|-------|----------|
| **CLOSED** | Normal operation |
| **OPEN** | Stage paused (5 consecutive failures) |
| **HALFOPEN** | Cooldown expired (15 min), probe with 1 worker |

A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
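The state machine above can be sketched in memory (the real breaker in `lib/breaker.py` persists its state to SQLite; the injectable `clock` is just for testability):

```python
import time

class Breaker:
    """CLOSED → OPEN after `threshold` consecutive failures;
    OPEN → HALFOPEN once `cooldown` seconds elapse;
    HALFOPEN closes on a successful probe, reopens on a failed one."""

    def __init__(self, threshold=5, cooldown=900, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None  # None ⇒ not tripped

    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALFOPEN"
        return "OPEN"

    def record(self, success: bool) -> None:
        prior = self.state()
        if success:
            self.failures, self.opened_at = 0, None  # close (incl. HALFOPEN probe)
        else:
            self.failures += 1
            if prior == "HALFOPEN" or self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open, restart cooldown
```

Note that a failure during HALFOPEN reopens immediately, regardless of the consecutive-failure count.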
## Crash Recovery

On startup, the pipeline recovers interrupted state:

- Sources stuck in `extracting` → `unprocessed` (with retry counter increment; if exhausted → `error`)
- PRs stuck in `merging` → `approved` (re-merge attempt)
- PRs stuck in `reviewing` → `open` (re-evaluate)

Orphan worktrees under `/tmp/teleo-extract-*` and `/tmp/teleo-merge-*` are cleaned up.

## Domain → Agent Mapping

Every domain has exactly one primary reviewing agent:

| Domain | Agent | Territory |
|--------|-------|-----------|
| internet-finance | Rio | `domains/internet-finance/` |
| entertainment | Clay | `domains/entertainment/` |
| health | Vida | `domains/health/` |
| ai-alignment | Theseus | `domains/ai-alignment/` |
| space-development | Astra | `domains/space-development/` |
| mechanisms | Rio | `core/mechanisms/` |
| living-capital | Rio | `core/living-capital/` |
| living-agents | Theseus | `core/living-agents/` |
| teleohumanity | Leo | `core/teleohumanity/` |
| grand-strategy | Leo | `core/grand-strategy/` |
| critical-systems | Theseus | `foundations/critical-systems/` |
| collective-intelligence | Theseus | `foundations/collective-intelligence/` |
| teleological-economics | Rio | `foundations/teleological-economics/` |
| cultural-dynamics | Clay | `foundations/cultural-dynamics/` |

Domain detection from a diff: count file path occurrences under the `domains/`, `entities/`, `core/`, and `foundations/` subdirectories. The most-referenced domain wins.
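The detection rule above (most-referenced subdirectory wins) can be sketched as a counter over diff file paths; ties fall to whichever domain `Counter` saw first, which is an assumption about the real tie-break in `lib/domains.py`.

```python
from collections import Counter

ROOTS = ("domains/", "entities/", "core/", "foundations/")

def detect_domain(paths: list[str]):
    """Return the most-referenced domain subdirectory, or None."""
    counts = Counter()
    for p in paths:
        for root in ROOTS:
            if p.startswith(root):
                parts = p.split("/")
                if len(parts) >= 2:
                    counts[parts[1]] += 1  # e.g. 'health' from 'domains/health/...'
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```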
## Key Configuration (`lib/config.py`)

| Setting | Value | Purpose |
|---------|-------|---------|
| `MAX_EVAL_ATTEMPTS` | 3 | Hard cap on eval cycles per PR |
| `EVAL_TIMEOUT` | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| `MAX_EVAL_WORKERS` | 7 | Max concurrent eval tasks per cycle |
| `MERGE_TIMEOUT` | 300s | Force-reset to conflict if exceeded |
| `BREAKER_THRESHOLD` | 5 | Consecutive failures to trip breaker |
| `BREAKER_COOLDOWN` | 900s | 15 min before half-open probe |
| `LIGHT_SKIP_LLM` | false | When true, LIGHT PRs skip all LLM review |
| `LIGHT_PROMOTION_RATE` | 0.15 | Random LIGHT → STANDARD upgrade rate |
| `DEDUP_THRESHOLD` | 0.85 | SequenceMatcher near-duplicate threshold |
| `OPENROUTER_DAILY_BUDGET` | $20 | Daily cost cap for OpenRouter |
| `SAMPLE_AUDIT_RATE` | 0.15 | Pre-merge audit sampling rate |

## Module Map

| Module | Responsibility |
|--------|---------------|
| `teleo-pipeline.py` | Main entry, stage loops, shutdown, crash recovery |
| `lib/evaluate.py` | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| `lib/validate.py` | Tier 0 validation, frontmatter parsing, all deterministic checks |
| `lib/merge.py` | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| `lib/llm.py` | Prompt templates, OpenRouter transport, Claude CLI transport |
| `lib/forgejo.py` | Forgejo API client, diff fetching, agent token management |
| `lib/domains.py` | Domain↔agent mapping, domain detection from diff/branch |
| `lib/config.py` | All constants, paths, model IDs, thresholds |
| `lib/db.py` | SQLite connection, migrations, audit logging, transactions |
| `lib/breaker.py` | Per-stage circuit breaker state machine |
| `lib/costs.py` | OpenRouter cost tracking and budget enforcement |
| `lib/health.py` | HTTP health endpoint (port 8080) |
| `lib/log.py` | Structured JSON logging setup |

## Known Issues and Gaps

1. **Ingest stage is a stub** — Sources are not being ingested into pipeline v2. The old cron scripts (now disabled) handled extraction.
2. **No auto-fixer** — When Tier 0.5 or reviews reject for mechanical issues, there is no automated fix. PRs just consume eval attempts until terminal.
3. **`broken_wiki_links` is systemic** — Extraction agents create `[[links]]` to claims that don't exist in the KB. This is the #1 rejection reason. The root cause is extraction prompt quality, not eval.
4. **Sequential eval processing** — `evaluate_cycle()` processes PRs in a for-loop, not a concurrent `asyncio.gather`. Only one Opus review runs at a time.
5. **Source re-extraction not wired** — `_terminate_pr()` tags sources as `needs_reextraction`, but the sources table is empty (never populated by pipeline v2).

## Design Decisions Log

| Decision | Rationale | Author |
|----------|-----------|--------|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots; keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents `_map.md` merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats the tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on the pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts a semantic problem (is this a real claim?) into a mechanical check (does the YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |
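The last two decisions in the log, the claim-shape detector and random promotion, compose into the post-triage override pass. A minimal sketch, with the regex as an assumed reading of "any YAML quoting variant" and an injectable `rng` for testability:

```python
import random
import re

LIGHT_PROMOTION_RATE = 0.15  # from lib/config.py

# An added diff line declaring `type: claim`, with optional YAML quoting.
CLAIM_LINE = re.compile(r"""^\+\s*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)

def apply_overrides(tier: str, diff: str, rng=random.random) -> str:
    """Post-triage overrides, in order; both check LIGHT, so no double upgrade."""
    if tier == "LIGHT" and CLAIM_LINE.search(diff):
        return "STANDARD"  # claim-shape detector: $0, deterministic
    if tier == "LIGHT" and rng() < LIGHT_PROMOTION_RATE:
        return "STANDARD"  # random promotion: makes gaming unpredictable
    return tier
```

Ordering matters only for attribution: a claim-shaped diff is always upgraded deterministically before the coin flip ever runs.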