

Pipeline v2 Architecture

A single async Python daemon replacing 7 cron scripts. Four stage loops run concurrently over a SQLite WAL state store.

System Overview

    ┌───────────────────────────────────────────────────────┐
    │                   teleo-pipeline.py                   │
    │                                                       │
    │  ┌────────┐  ┌──────────┐  ┌──────────┐  ┌───────┐    │
    │  │ Ingest │  │ Validate │  │ Evaluate │  │ Merge │    │
    │  │ (stub) │  │   30s    │  │   30s    │  │  30s  │    │
    │  └───┬────┘  └────┬─────┘  └────┬─────┘  └───┬───┘    │
    │      │            │             │            │        │
    │      └────────────┴──────┬──────┴────────────┘        │
    │                          │                            │
    │                     SQLite WAL                        │
    │                   (pipeline.db)                       │
    └──────────────────────────┬────────────────────────────┘
                               │
                    ┌──────────┴──────────┐
                    │     Forgejo API     │
                    │  git.livingip.xyz   │
                    └─────────────────────┘

Location: /opt/teleo-eval/pipeline/ (VPS), ~/.pentagon/workspace/collective/pipeline-v2/ (local dev)

Process: Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.
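The graceful-shutdown behavior can be sketched as below. This is a minimal illustration, not the daemon's actual API: the `graceful_shutdown` helper and its signature are hypothetical, and the real handler would additionally kill lingering Claude CLI subprocesses.

```python
import asyncio

async def graceful_shutdown(tasks: set, timeout: float = 60.0) -> int:
    """Wait up to `timeout` seconds for stage tasks, then cancel stragglers.

    Returns the number of tasks that had to be cancelled. Registered from a
    SIGTERM/SIGINT handler via loop.add_signal_handler in the real daemon.
    """
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # lingering stages (and their subprocesses) get killed
    return len(pending)
```

A handler would set a stop event on SIGTERM/SIGINT, then await this helper over the four stage tasks.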

Infrastructure

| Component | Detail |
|---|---|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: teleo, repo: teleo-codex |
| Bare repo | /opt/teleo-eval/workspaces/teleo-codex.git — single-writer (fetch cron only) |
| Main worktree | /opt/teleo-eval/workspaces/main — refreshed by fetch, used for wiki link resolution |
| Database | /opt/teleo-eval/pipeline/pipeline.db — SQLite WAL mode |
| Secrets | /opt/teleo-eval/secrets/ — per-agent Forgejo tokens, OpenRouter key |
| Logs | /opt/teleo-eval/logs/pipeline.jsonl — structured JSON, 50MB rotation, 7-day retention |

PR Lifecycle

Source → Ingest → PR created on Forgejo
                     │
               ┌─────▼──────┐
               │  Validate  │  Tier 0: deterministic Python ($0)
               │  (tier0)   │  Schema, title, wiki links, domain match
               └─────┬──────┘
                     │ tier0_pass = 1
               ┌─────▼──────┐
               │  Tier 0.5  │  Mechanical pre-check ($0)
               │            │  Frontmatter, wiki links (ALL .md files),
               │            │  near-duplicate (warning only)
               └─────┬──────┘
                     │ passes
               ┌─────▼──────┐
               │   Triage   │  Haiku via OpenRouter (~$0.002)
               │            │  → DEEP / STANDARD / LIGHT
               └─────┬──────┘
                     │
           ┌─────────┼─────────┐
           │         │         │
         DEEP     STANDARD   LIGHT
           │         │         │
      ┌────▼────┐ ┌──▼──┐ ┌────▼────────┐
      │ Domain  │ │same │ │ skip or     │
      │ GPT-4o  │ │     │ │ auto-approve│
      │ (OpenR) │ │     │ │ (LIGHT_SKIP)│
      └────┬────┘ └──┬──┘ └─────────────┘
           │         │
      ┌────▼────┐ ┌──▼──────┐
      │   Leo   │ │   Leo   │
      │  Opus   │ │ Sonnet  │
      │ (Claude │ │ (OpenR) │
      │   Max)  │ │         │
      └────┬────┘ └──┬──────┘
           │         │
           └────┬────┘
                │
         ┌──────▼──────┐
         │ Disposition │  Retry budget, issue classification
         └──────┬──────┘
                │ both approve
         ┌──────▼──────┐
         │    Merge    │  Rebase + API merge, domain-serialized
         └─────────────┘

Stage 1: Ingest (stub)

Status: Not implemented in pipeline v2. Sources were processed by old cron scripts (extract-cron.sh, openrouter-extract.py). All extraction crons are currently disabled.

Interval: 60s

What it will do: Scan inbox/ for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in sources table.

Stage 2: Validate (Tier 0)

Module: lib/validate.py Interval: 30s Cost: $0 (pure Python)

Deterministic validation gate. Finds PRs with status='open' and tier0_pass IS NULL.

Checks performed (per claim file)

| Check | Type | Action |
|---|---|---|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if domain: field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |
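The SequenceMatcher-based similarity warnings can be sketched as follows. `is_near_duplicate` is a hypothetical helper, not the actual function in lib/validate.py; only the 0.85 threshold and use of difflib.SequenceMatcher come from the document.

```python
from difflib import SequenceMatcher

def is_near_duplicate(title: str, existing_titles: list, threshold: float = 0.85) -> bool:
    """Warning-level check: flag a title >85% similar to any existing claim title."""
    t = title.lower().strip()
    return any(
        SequenceMatcher(None, t, other.lower().strip()).ratio() > threshold
        for other in existing_titles
    )
```

The same machinery with a 0.75 threshold covers the description-vs-title check.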

SHA-based idempotency

Each validation posts a comment with <!-- TIER0-VALIDATION:{sha} -->. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.
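A minimal sketch of the idempotency check (the helper name and comment-list shape are illustrative; the marker format is the one above):

```python
VALIDATION_MARKER = "<!-- TIER0-VALIDATION:{sha} -->"

def needs_validation(head_sha: str, comment_bodies: list) -> bool:
    """Validate only if no marker comment exists for the current HEAD SHA."""
    marker = VALIDATION_MARKER.format(sha=head_sha)
    return not any(marker in body for body in comment_bodies)
```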

On new commits: full eval reset

When Tier 0 runs on a PR, it unconditionally resets:

  • eval_attempts = 0
  • eval_issues = '[]'
  • domain_verdict = 'pending', leo_verdict = 'pending'

This gives the PR a fresh evaluation cycle after any code change.
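The reset amounts to one UPDATE against the state store. A sketch, assuming a `prs` table whose column names match the fields listed above (the real schema in lib/db.py may differ):

```python
import sqlite3

def reset_eval_state(con: sqlite3.Connection, pr_id: int) -> None:
    """Unconditional eval reset when Tier 0 validates a new commit."""
    con.execute(
        """UPDATE prs
           SET eval_attempts = 0,
               eval_issues = '[]',
               domain_verdict = 'pending',
               leo_verdict = 'pending'
           WHERE id = ?""",
        (pr_id,),
    )
    con.commit()
```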

Stage 2.5: Tier 0.5 (Mechanical Pre-check)

Location: _tier05_mechanical_check() in lib/evaluate.py Cost: $0 (pure Python) Runs: Inside evaluate_pr(), after musings bypass, before triage.

Catches mechanical issues that domain review (GPT-4o) rubber-stamps and Leo rejects without structured issue tags.

Checks

| Check | Scope | Action |
|---|---|---|
| Frontmatter schema (parse + validate) | New files in claim dirs only | Gate (block) |
| Wiki link resolution | ALL .md files in diff | Gate (block) |
| Near-duplicate detection | New files in claim dirs only | Tag only (warning, LLM decides) |

Key design decisions

  • Wiki links checked on all .md files, not just claim directories. Agent files (agents/*/beliefs.md, etc.) frequently contain broken [[links]] that Tier 0.5 must catch before Opus wastes time on them.
  • Modified files only get wiki link checks — they have partial content from diff, so frontmatter parsing is unreliable.
  • Near-duplicate is never a gate — similarity is a judgment call for the LLM reviewer.

On failure

Posts Forgejo comment with issue tags (<!-- ISSUES: tag1, tag2 -->), sets status='open', runs disposition. Counts as an eval attempt.

Stage 3: Evaluate

Module: lib/evaluate.py Interval: 30s Finds: PRs with status='open', tier0_pass=1, pending verdicts, eval_attempts < MAX_EVAL_ATTEMPTS

3a. Musings Bypass

If a PR only modifies files in agents/*/musings/, it's auto-approved immediately. No review needed.

3b. Triage

Model: Haiku via OpenRouter (~$0.002/call)

Classifies PR into exactly one tier:

| Tier | Criteria | Review path |
|---|---|---|
| DEEP | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| STANDARD | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| LIGHT | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |

When uncertain, classify UP. Always err toward more review.

Tier Overrides (post-triage)

Two overrides run after triage, in order. Both check tier == "LIGHT" so no double-upgrade is possible.

  1. Claim-shape detector — If any + line in the diff contains type: claim (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.

  2. Random pre-merge promotion — 15% of remaining LIGHT PRs get upgraded to STANDARD. Makes gaming unpredictable — extraction agents can't know which LIGHT PRs get full review.
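Both overrides fit in a few lines. A sketch (function name, regex, and the `rng` parameter are illustrative; the diff-line heuristic and 15% rate come from the text above):

```python
import random
import re

# Any added diff line declaring `type: claim`, in any YAML quoting variant.
CLAIM_SHAPE = re.compile(r"""^\+.*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)
LIGHT_PROMOTION_RATE = 0.15

def apply_tier_overrides(tier, diff, rng=random):
    """Post-triage overrides. Both branches test tier == "LIGHT",
    so a PR can be upgraded at most once."""
    if tier == "LIGHT" and CLAIM_SHAPE.search(diff):
        return "STANDARD"  # claim-shape detector: $0, deterministic
    if tier == "LIGHT" and rng.random() < LIGHT_PROMOTION_RATE:
        return "STANDARD"  # random pre-merge promotion (15%)
    return tier
```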

3c. Domain Review

Model: GPT-4o via OpenRouter Skipped when: LIGHT_SKIP_LLM=True (config flag), or already completed from prior attempt

Reviews 4 criteria:

  1. Factual accuracy
  2. Intra-PR duplicates (same evidence copy-pasted across files)
  3. Confidence calibration
  4. Wiki link validity

Verdict rules: APPROVE if factually correct even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).

If domain rejects: Leo review is skipped entirely (saves Opus/Sonnet).

3d. Leo Review

Model: Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD) Skipped when: LIGHT tier, or domain review rejected

DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).

Verdicts

There are exactly two verdicts: APPROVE and REQUEST_CHANGES. There is no REJECT verdict.

Verdicts are parsed from structured tags in the review:

<!-- VERDICT:LEO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->

If no parseable verdict is found, defaults to request_changes.
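Parsing reduces to a single regex with a fail-closed default. A sketch (the helper name is illustrative; the tag format and default come from the text above):

```python
import re

VERDICT_RE = re.compile(r"<!--\s*VERDICT:LEO:(APPROVE|REQUEST_CHANGES)\s*-->")

def parse_leo_verdict(review_text: str) -> str:
    """Extract the structured verdict; default to request_changes if absent."""
    m = VERDICT_RE.search(review_text)
    return m.group(1).lower() if m else "request_changes"
```

Defaulting to request_changes means a malformed review can never silently approve a PR.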

Issue Tags

Reviews tag specific issues using structured comments:

<!-- ISSUES: broken_wiki_links, frontmatter_schema -->

Valid tags:

| Tag | Category | Description |
|---|---|---|
| broken_wiki_links | Mechanical | [[links]] that don't resolve to existing files |
| frontmatter_schema | Mechanical | Missing/invalid YAML fields |
| near_duplicate | Mechanical | Title too similar to existing claim (>85%) |
| factual_discrepancy | Substantive | Factual errors in the claim |
| confidence_miscalibration | Substantive | Confidence level doesn't match evidence |
| scope_error | Substantive | Claim scope too broad/narrow |
| title_overclaims | Substantive | Title makes stronger claim than evidence supports |
| date_errors | | Invalid or incorrect dates |

Tag inference fallback: If a review rejects without structured <!-- ISSUES: --> tags, _infer_issues_from_prose() scans the review text with conservative regex patterns to extract issue tags. 7 categories, 2-4 keyword patterns each.

Review Style Guide

All review prompts include the style guide requiring per-criterion findings:

  • "You MUST show your work"
  • "For each criterion, write one sentence with your finding"
  • "'Everything passes' with no evidence of checking will be treated as a review failure"

Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in /opt/teleo-eval/secrets/).

Retry Budget and Disposition

Eval Attempts

Hard cap: MAX_EVAL_ATTEMPTS = 3

Each time evaluate_pr() runs, it increments eval_attempts before any checks. This means Tier 0.5 failures count as eval attempts.

Issue Classification

Issues are classified as:

  • Mechanical: frontmatter_schema, broken_wiki_links, near_duplicate
  • Substantive: factual_discrepancy, confidence_miscalibration, scope_error, title_overclaims
  • Mixed: Both types present
  • Unknown: Tags not in either set

Disposition Logic

| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---|---|---|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | Keep open for one more try | Terminate (close PR, requeue source) |
| 3+ | Terminate | Terminate |

Terminate means: close PR on Forgejo with explanation comment, update DB status to closed, tag source for re-extraction (if source_path linked).
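The decision table reduces to a small function. A sketch (names are illustrative; the classification sets and the table rows are from the sections above):

```python
MECHANICAL = {"frontmatter_schema", "broken_wiki_links", "near_duplicate"}

def disposition(attempt: int, issues: set) -> str:
    """Retry/terminate decision per the disposition table."""
    mechanical_only = bool(issues) and issues <= MECHANICAL
    if attempt >= 3:
        return "terminate"
    if attempt == 2:
        # Mechanical-only issues earn one more try; anything else terminates.
        return "retry" if mechanical_only else "terminate"
    return "retry"  # attempt 1: always back to open
```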

SHA-based Reset

When Tier 0 validates a new commit (new HEAD SHA), it resets eval_attempts = 0 and all verdicts to pending. This gives the PR a completely fresh evaluation cycle after any code change.

Stage 4: Merge

Module: lib/merge.py Interval: 30s

Domain Serialization

Merges are serialized per-domain (one merge at a time per domain) but parallel across domains. Two layers enforce this:

  1. asyncio.Lock per domain (fast path, lost on crash)
  2. SQL NOT EXISTS check for status='merging' in same domain (defense-in-depth)
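The SQL layer can be sketched as one atomic claim query. This is illustrative: the table/column names are assumed, the priority ordering is omitted, and `UPDATE ... RETURNING` needs SQLite ≥ 3.35.

```python
import sqlite3

def claim_pr_for_merge(con: sqlite3.Connection, domain: str):
    """Atomically claim one approved PR, refusing if the domain is already merging."""
    row = con.execute(
        """UPDATE prs SET status = 'merging'
           WHERE id = (
             SELECT id FROM prs
             WHERE status = 'approved' AND domain = :d
               AND NOT EXISTS (
                 SELECT 1 FROM prs
                 WHERE status = 'merging' AND domain = :d)
             LIMIT 1)
           RETURNING id""",
        {"d": domain},
    ).fetchone()
    con.commit()
    return row[0] if row else None
```

The NOT EXISTS clause is what survives a crash: even with the in-memory asyncio.Lock gone, a row stuck in 'merging' blocks further claims in that domain.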

Merge Flow

  1. Discover external PRs — Scan Forgejo for open PRs not in SQLite. Human PRs get priority='high' and an acknowledgment comment.

  2. Claim next approved PR — Atomic UPDATE ... RETURNING with priority ordering: critical > high > medium > low > unclassified. PR priority overrides source priority.

  3. Rebase onto main — Creates temp worktree, rebases, force-pushes with --force-with-lease pinned to expected SHA (defeats tracking-ref race).

  4. Merge via Forgejo API — Checks if already merged/closed first (prevents 405 on ghost PRs).

  5. Cleanup — Delete remote branch, prune worktree metadata.

Merge Timeout

5 minutes max per merge. If exceeded, force-reset to status='conflict'.

Formal Approvals

After both verdicts approve, _post_formal_approvals() submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.

Model Routing

Design principle: Model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.

| Stage | Model | Backend | Cost |
|---|---|---|---|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |

Opus Rate Limit Handling

When Claude Max Opus hits rate limit:

  1. Set 15-minute global backoff
  2. During backoff: STANDARD PRs still flow (Sonnet via OpenRouter), DEEP PRs queue
  3. Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter)
  4. After cooldown: resume full eval
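The backoff gate can be sketched as a tiny clock-based guard (class and method names are hypothetical; only the 15-minute window and the DEEP-queues-while-STANDARD-flows behavior come from the text):

```python
import time

class OpusBackoff:
    """Global 15-minute backoff after a Claude Max rate-limit response."""
    def __init__(self, cooldown: float = 900.0, clock=time.monotonic):
        self.cooldown, self.clock = cooldown, clock
        self.until = 0.0

    def trip(self):
        """Called when Opus returns a rate-limit error."""
        self.until = self.clock() + self.cooldown

    def can_run_deep(self) -> bool:
        # DEEP (Opus) queues during backoff; STANDARD, triage, and
        # domain review keep flowing via OpenRouter regardless.
        return self.clock() >= self.until
```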

Overflow Policies

Per-stage behavior when Claude Max is rate-limited:

| Stage | Policy | Behavior |
|---|---|---|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |

Circuit Breakers

Per-stage circuit breakers backed by SQLite. Three states:

| State | Behavior |
|---|---|
| CLOSED | Normal operation |
| OPEN | Stage paused (5 consecutive failures) |
| HALFOPEN | Cooldown expired (15 min), probe with 1 worker |

A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
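The state machine fits in a few lines. A sketch with an injectable clock for testability (lib/breaker.py's actual interface, and its SQLite persistence, are not shown):

```python
import time

class Breaker:
    """CLOSED → OPEN after `threshold` consecutive failures;
    OPEN → HALFOPEN after `cooldown` seconds; probe result decides the rest."""
    def __init__(self, threshold: int = 5, cooldown: float = 900.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALFOPEN"
        return "OPEN"

    def record_failure(self):
        self.failures += 1
        if self.state() == "HALFOPEN" or self.failures >= self.threshold:
            self.opened_at = self.clock()  # trip, or re-open after a failed probe

    def record_success(self):
        self.failures, self.opened_at = 0, None  # successful probe closes the breaker
```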

Crash Recovery

On startup, the pipeline recovers interrupted state:

  • Sources stuck in extracting → unprocessed (with retry counter increment; if exhausted → error)
  • PRs stuck in merging → approved (re-merge attempt)
  • PRs stuck in reviewing → open (re-evaluate)

Orphan worktrees from /tmp/teleo-extract-* and /tmp/teleo-merge-* are cleaned up.

Domain → Agent Mapping

Every domain has exactly one primary reviewing agent:

| Domain | Agent | Territory |
|---|---|---|
| internet-finance | Rio | domains/internet-finance/ |
| entertainment | Clay | domains/entertainment/ |
| health | Vida | domains/health/ |
| ai-alignment | Theseus | domains/ai-alignment/ |
| space-development | Astra | domains/space-development/ |
| mechanisms | Rio | core/mechanisms/ |
| living-capital | Rio | core/living-capital/ |
| living-agents | Theseus | core/living-agents/ |
| teleohumanity | Leo | core/teleohumanity/ |
| grand-strategy | Leo | core/grand-strategy/ |
| critical-systems | Theseus | foundations/critical-systems/ |
| collective-intelligence | Theseus | foundations/collective-intelligence/ |
| teleological-economics | Rio | foundations/teleological-economics/ |
| cultural-dynamics | Clay | foundations/cultural-dynamics/ |

Domain detection from diff: counts file path occurrences in domains/, entities/, core/, foundations/ subdirectories. Most-referenced domain wins.
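A sketch of the counting rule (the helper name is hypothetical; lib/domains.py may also consult the branch name, which is omitted here):

```python
from collections import Counter

# Top-level directories whose second path component names a domain
# (assumed from the mapping table above).
DOMAIN_DIRS = {"domains", "entities", "core", "foundations"}

def detect_domain(diff_paths):
    """Most-referenced domain subdirectory across the diff's file paths wins."""
    counts = Counter()
    for path in diff_paths:
        parts = path.split("/")
        if len(parts) >= 3 and parts[0] in DOMAIN_DIRS:
            counts[parts[1]] += 1
    return counts.most_common(1)[0][0] if counts else None
```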

Key Configuration (lib/config.py)

| Setting | Value | Purpose |
|---|---|---|
| MAX_EVAL_ATTEMPTS | 3 | Hard cap on eval cycles per PR |
| EVAL_TIMEOUT | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| MAX_EVAL_WORKERS | 7 | Max concurrent eval tasks per cycle |
| MERGE_TIMEOUT | 300s | Force-reset to conflict if exceeded |
| BREAKER_THRESHOLD | 5 | Consecutive failures to trip breaker |
| BREAKER_COOLDOWN | 900s | 15 min before half-open probe |
| LIGHT_SKIP_LLM | false | When true, LIGHT PRs skip all LLM review |
| LIGHT_PROMOTION_RATE | 0.15 | Random LIGHT → STANDARD upgrade rate |
| DEDUP_THRESHOLD | 0.85 | SequenceMatcher near-duplicate threshold |
| OPENROUTER_DAILY_BUDGET | $20 | Daily cost cap for OpenRouter |
| SAMPLE_AUDIT_RATE | 0.15 | Pre-merge audit sampling rate |

Module Map

| Module | Responsibility |
|---|---|
| teleo-pipeline.py | Main entry, stage loops, shutdown, crash recovery |
| lib/evaluate.py | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| lib/validate.py | Tier 0 validation, frontmatter parsing, all deterministic checks |
| lib/merge.py | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| lib/llm.py | Prompt templates, OpenRouter transport, Claude CLI transport |
| lib/forgejo.py | Forgejo API client, diff fetching, agent token management |
| lib/domains.py | Domain↔agent mapping, domain detection from diff/branch |
| lib/config.py | All constants, paths, model IDs, thresholds |
| lib/db.py | SQLite connection, migrations, audit logging, transactions |
| lib/breaker.py | Per-stage circuit breaker state machine |
| lib/costs.py | OpenRouter cost tracking and budget enforcement |
| lib/health.py | HTTP health endpoint (port 8080) |
| lib/log.py | Structured JSON logging setup |

Known Issues and Gaps

  1. Ingest stage is a stub — Sources are not being ingested into pipeline v2. Old cron scripts (disabled) handled extraction.
  2. No auto-fixer — When Tier 0.5 or reviews reject for mechanical issues, there's no automated fix. PRs just consume eval attempts until terminal.
  3. broken_wiki_links is systemic — Extraction agents create [[links]] to claims that don't exist in the KB. This is the #1 rejection reason. Root cause is extraction prompt quality, not eval.
  4. Sequential eval processing — evaluate_cycle() processes PRs in a for-loop, not a concurrent asyncio.gather. Only one Opus review runs at a time.
  5. Source re-extraction not wired — _terminate_pr() tags sources as needs_reextraction, but the sources table is empty (never populated by pipeline v2).

Design Decisions Log

| Decision | Rationale | Author |
|---|---|---|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots + keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents _map.md merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts semantic problem (is this a real claim?) to mechanical check (does YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |