Pipeline v2 Architecture
Single async Python daemon replacing 7 cron scripts. Four stage loops running concurrently with SQLite WAL state store.
System Overview
┌─────────────────────────────────────────────┐
│ teleo-pipeline.py │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐
│ │ Ingest │ │ Validate │ │ Evaluate │ │ Merge │
│ │ (stub) │ │ 30s │ │ 30s │ │ 30s │
│ └────┬────┘ └────┬─────┘ └────┬─────┘ └───┬───┘
│ │ │ │ │
│ └───────────┴────────────┴───────────┘
│ │
│ SQLite WAL
│ (pipeline.db)
└─────────────────────────────────────────────┘
│
┌──────────┴──────────┐
│ Forgejo API │
│ git.livingip.xyz │
└─────────────────────┘
Location: /opt/teleo-eval/pipeline/ (VPS), ~/.pentagon/workspace/collective/pipeline-v2/ (local dev)
Process: Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.
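The stage-loop and graceful-shutdown pattern described above can be sketched as follows. This is a minimal illustration assuming an `asyncio.Event`-based stop signal; the function names and the immediate `stop.set()` call are for demonstration only, not the real teleo-pipeline.py.

```python
import asyncio
import signal

# Illustrative sketch: stage loops poll a shared Event; SIGTERM or SIGINT
# sets it, and shutdown waits up to 60s before cancelling stragglers.
async def stage_loop(name: str, stop: asyncio.Event, interval: float = 0.01) -> str:
    while not stop.is_set():
        # ... one cycle of work for this stage would go here ...
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return f"{name}: clean exit"

async def run_daemon(grace_period: float = 60.0) -> list[str]:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    tasks = [asyncio.create_task(stage_loop(n, stop))
             for n in ("ingest", "validate", "evaluate", "merge")]
    stop.set()  # for demonstration only: trigger shutdown immediately
    done, pending = await asyncio.wait(tasks, timeout=grace_period)
    for t in pending:
        t.cancel()  # the real daemon also kills lingering Claude CLI subprocesses
    return sorted(t.result() for t in done)
```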
Infrastructure
| Component | Detail |
|---|---|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: teleo, repo: teleo-codex |
| Bare repo | /opt/teleo-eval/workspaces/teleo-codex.git — single-writer (fetch cron only) |
| Main worktree | /opt/teleo-eval/workspaces/main — refreshed by fetch, used for wiki link resolution |
| Database | /opt/teleo-eval/pipeline/pipeline.db — SQLite WAL mode |
| Secrets | /opt/teleo-eval/secrets/ — per-agent Forgejo tokens, OpenRouter key |
| Logs | /opt/teleo-eval/logs/pipeline.jsonl — structured JSON, 50MB rotation, 7-day retention |
PR Lifecycle
Source → Ingest → PR created on Forgejo
│
┌─────▼──────┐
│ Validate │ Tier 0: deterministic Python ($0)
│ (tier0) │ Schema, title, wiki links, domain match
└─────┬──────┘
│ tier0_pass = 1
┌─────▼──────┐
│ Tier 0.5 │ Mechanical pre-check ($0)
│ │ Frontmatter, wiki links (ALL .md files),
│ │ near-duplicate (warning only)
└─────┬──────┘
│ passes
┌─────▼──────┐
│ Triage │ Haiku via OpenRouter (~$0.002)
│ │ → DEEP / STANDARD / LIGHT
└─────┬──────┘
│
┌─────────┼─────────┐
│ │ │
DEEP STANDARD LIGHT
│ │ │
┌────▼────┐ ┌──▼──┐ ┌──▼──────────┐
│ Domain │ │same │ │ skip or │
│ GPT-4o │ │ │ │ auto-approve │
│(OpenR) │ │ │ │ (LIGHT_SKIP) │
└────┬────┘ └──┬──┘ └──────────────┘
│ │
┌────▼────┐ ┌──▼──────┐
│ Leo │ │ Leo │
│ Opus │ │ Sonnet │
│(Claude │ │(OpenR) │
│ Max) │ │ │
└────┬────┘ └──┬──────┘
│ │
└────┬────┘
│
┌──────▼──────┐
│ Disposition │ Retry budget, issue classification
└──────┬──────┘
│ both approve
┌──────▼──────┐
│ Merge │ Rebase + API merge, domain-serialized
└─────────────┘
Stage 1: Ingest (stub)
Status: Not implemented in pipeline v2. Sources were processed by old cron scripts (extract-cron.sh, openrouter-extract.py). All extraction crons are currently disabled.
Interval: 60s
What it will do: Scan inbox/ for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in sources table.
Stage 2: Validate (Tier 0)
Module: lib/validate.py
Interval: 30s
Cost: $0 (pure Python)
Deterministic validation gate. Finds PRs with status='open' and tier0_pass IS NULL.
Checks performed (per claim file)
| Check | Type | Action |
|---|---|---|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if domain: field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |
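The two warning-level SequenceMatcher checks can be illustrated with a short sketch. Thresholds come from the table above; the function and tag names are illustrative, not the real lib/validate.py API.

```python
from difflib import SequenceMatcher

# Warning-level similarity checks: tag the PR but don't fail it.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_warnings(title: str, description: str,
                        existing_titles: list[str]) -> list[str]:
    tags = []
    if similarity(description, title) > 0.75:
        tags.append("description_similar_to_title")  # tag, don't fail
    if any(similarity(title, t) > 0.85 for t in existing_titles):
        tags.append("near_duplicate")  # tag, don't fail
    return tags
```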
SHA-based idempotency
Each validation posts a comment with <!-- TIER0-VALIDATION:{sha} -->. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.
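A minimal sketch of the marker check (the marker format is from this document; fetching PR comments from Forgejo is omitted):

```python
# Skip re-validation when a marker for the current HEAD SHA already exists.
MARKER = "<!-- TIER0-VALIDATION:{sha} -->"

def already_validated(head_sha: str, comment_bodies: list[str]) -> bool:
    marker = MARKER.format(sha=head_sha)
    return any(marker in body for body in comment_bodies)
```

A force-push produces a new HEAD SHA, so the marker no longer matches and validation runs again.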
On new commits: full eval reset
When Tier 0 runs on a PR, it unconditionally resets:
- `eval_attempts = 0`
- `eval_issues = '[]'`
- `domain_verdict = 'pending'`
- `leo_verdict = 'pending'`
This gives the PR a fresh evaluation cycle after any code change.
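The reset can be sketched as a single UPDATE. Column names are taken from the list above; the table name `prs` and the function name are assumptions; the real migrations and queries live in lib/db.py.

```python
import sqlite3

# Illustrative full eval reset on a new HEAD SHA.
def reset_eval_state(db: sqlite3.Connection, pr_id: int, new_sha: str) -> None:
    db.execute(
        """UPDATE prs
           SET eval_attempts = 0,
               eval_issues = '[]',
               domain_verdict = 'pending',
               leo_verdict = 'pending',
               head_sha = ?
           WHERE id = ?""",
        (new_sha, pr_id),
    )
    db.commit()
```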
Stage 2.5: Tier 0.5 (Mechanical Pre-check)
Location: _tier05_mechanical_check() in lib/evaluate.py
Cost: $0 (pure Python)
Runs: Inside evaluate_pr(), after musings bypass, before triage.
Catches mechanical issues that domain review (GPT-4o) rubber-stamps and Leo rejects without structured issue tags.
Checks
| Check | Scope | Action |
|---|---|---|
| Frontmatter schema (parse + validate) | New files in claim dirs only | Gate (block) |
| Wiki link resolution | ALL .md files in diff | Gate (block) |
| Near-duplicate detection | New files in claim dirs only | Tag only (warning, LLM decides) |
Key design decisions
- Wiki links checked on all .md files, not just claim directories. Agent files (`agents/*/beliefs.md`, etc.) frequently contain broken `[[links]]` that Tier 0.5 must catch before Opus wastes time on them.
- Modified files only get wiki link checks — they have partial content from the diff, so frontmatter parsing is unreliable.
- Near-duplicate is never a gate — similarity is a judgment call for the LLM reviewer.
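A hypothetical sketch of the wiki-link resolution check over a file's text (the real check runs inside `_tier05_mechanical_check()` and works from the PR diff; the names here are illustrative):

```python
import re
from pathlib import Path

# A [[link]] resolves if a .md file with that name exists in the worktree.
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(md_text: str, worktree: Path) -> list[str]:
    broken = []
    for target in WIKI_LINK.findall(md_text):
        name = target.strip()
        if not any(worktree.rglob(f"{name}.md")):
            broken.append(name)
    return broken
```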
On failure
Posts Forgejo comment with issue tags (<!-- ISSUES: tag1, tag2 -->), sets status='open', runs disposition. Counts as an eval attempt.
Stage 3: Evaluate
Module: lib/evaluate.py
Interval: 30s
Finds: PRs with status='open', tier0_pass=1, pending verdicts, eval_attempts < MAX_EVAL_ATTEMPTS
3a. Musings Bypass
If a PR only modifies files in agents/*/musings/, it's auto-approved immediately. No review needed.
3b. Triage
Model: Haiku via OpenRouter (~$0.002/call)
Classifies PR into exactly one tier:
| Tier | Criteria | Review path |
|---|---|---|
| DEEP | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| STANDARD | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| LIGHT | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |
When uncertain, classify UP. Always err toward more review.
Tier Overrides (post-triage)
Two overrides run after triage, in order. Both check tier == "LIGHT" so no double-upgrade is possible.
1. Claim-shape detector — If any `+` line in the diff contains `type: claim` (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.
2. Random pre-merge promotion — 15% of remaining LIGHT PRs get upgraded to STANDARD. Makes gaming unpredictable: extraction agents can't know which LIGHT PRs get full review.
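The two overrides can be sketched together. The regex covers common YAML quoting variants of `type: claim`; the function and constant names are illustrative, not the real lib/evaluate.py API.

```python
import random
import re

# Matches an added diff line declaring type: claim (quoted or not).
CLAIM_SHAPE = re.compile(r"""^\+\s*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)

def apply_overrides(tier: str, diff: str, rng: random.Random,
                    promotion_rate: float = 0.15) -> str:
    if tier != "LIGHT":
        return tier  # both overrides check tier == "LIGHT": no double-upgrade
    if CLAIM_SHAPE.search(diff):
        return "STANDARD"  # claim-shape detector: $0, deterministic
    if rng.random() < promotion_rate:
        return "STANDARD"  # random pre-merge promotion (15%)
    return tier
```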
3c. Domain Review
Model: GPT-4o via OpenRouter
Skipped when: LIGHT_SKIP_LLM=True (config flag), or already completed from prior attempt
Reviews 4 criteria:
- Factual accuracy
- Intra-PR duplicates (same evidence copy-pasted across files)
- Confidence calibration
- Wiki link validity
Verdict rules: APPROVE if factually correct even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).
If domain rejects: Leo review is skipped entirely (saves Opus/Sonnet).
3d. Leo Review
Model: Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD)
Skipped when: LIGHT tier, or domain review rejected
DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).
Verdicts
There are exactly two verdicts: APPROVE and REQUEST_CHANGES. There is no REJECT verdict.
Verdicts are parsed from structured tags in the review:
<!-- VERDICT:LEO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->
If no parseable verdict is found, defaults to request_changes.
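The verdict-parsing rule reduces to a small sketch (tag format and default from this section; the function name is illustrative):

```python
import re

# Exactly two verdicts exist; anything unparseable defaults to request_changes.
VERDICT_RE = re.compile(r"<!--\s*VERDICT:LEO:(APPROVE|REQUEST_CHANGES)\s*-->")

def parse_leo_verdict(review_text: str) -> str:
    m = VERDICT_RE.search(review_text)
    return m.group(1).lower() if m else "request_changes"
```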
Issue Tags
Reviews tag specific issues using structured comments:
<!-- ISSUES: broken_wiki_links, frontmatter_schema -->
Valid tags:
| Tag | Category | Description |
|---|---|---|
| `broken_wiki_links` | Mechanical | `[[links]]` that don't resolve to existing files |
| `frontmatter_schema` | Mechanical | Missing/invalid YAML fields |
| `near_duplicate` | Mechanical | Title too similar to existing claim (>85%) |
| `factual_discrepancy` | Substantive | Factual errors in the claim |
| `confidence_miscalibration` | Substantive | Confidence level doesn't match evidence |
| `scope_error` | Substantive | Claim scope too broad/narrow |
| `title_overclaims` | Substantive | Title makes stronger claim than evidence supports |
| `date_errors` | — | Invalid or incorrect dates |
Tag inference fallback: If a review rejects without structured <!-- ISSUES: --> tags, _infer_issues_from_prose() scans the review text with conservative regex patterns to extract issue tags. 7 categories, 2-4 keyword patterns each.
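The structured-tag extraction can be sketched as follows (the conservative prose-inference fallback in `_infer_issues_from_prose()` is not reproduced here; names are illustrative):

```python
import re

# Pull comma-separated tags out of a structured ISSUES comment.
ISSUES_RE = re.compile(r"<!--\s*ISSUES:\s*([^>]*?)\s*-->")

def parse_issue_tags(review_text: str) -> list[str]:
    m = ISSUES_RE.search(review_text)
    if not m:
        return []
    return [t.strip() for t in m.group(1).split(",") if t.strip()]
```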
Review Style Guide
All review prompts include the style guide requiring per-criterion findings:
- "You MUST show your work"
- "For each criterion, write one sentence with your finding"
- "'Everything passes' with no evidence of checking will be treated as review failures"
Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in /opt/teleo-eval/secrets/).
Retry Budget and Disposition
Eval Attempts
Hard cap: MAX_EVAL_ATTEMPTS = 3
Each time evaluate_pr() runs, it increments eval_attempts before any checks. This means Tier 0.5 failures count as eval attempts.
Issue Classification
Issues are classified as:
- Mechanical: `frontmatter_schema`, `broken_wiki_links`, `near_duplicate`
- Substantive: `factual_discrepancy`, `confidence_miscalibration`, `scope_error`, `title_overclaims`
- Mixed: Both types present
- Unknown: Tags not in either set
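The classification above, as a sketch (tag sets from this section; the function name is illustrative):

```python
MECHANICAL = {"frontmatter_schema", "broken_wiki_links", "near_duplicate"}
SUBSTANTIVE = {"factual_discrepancy", "confidence_miscalibration",
               "scope_error", "title_overclaims"}

def classify_issues(tags: list[str]) -> str:
    s = set(tags)
    if s - MECHANICAL - SUBSTANTIVE:
        return "unknown"            # tags not in either set
    mech, subst = s & MECHANICAL, s & SUBSTANTIVE
    if mech and subst:
        return "mixed"
    if mech:
        return "mechanical"
    if subst:
        return "substantive"
    return "unknown"
```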
Disposition Logic
| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---|---|---|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | Keep open for one more try | Terminate (close PR, requeue source) |
| 3+ | Terminate | Terminate |
Terminate means: close PR on Forgejo with explanation comment, update DB status to closed, tag source for re-extraction (if source_path linked).
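The disposition table reduces to a small decision function. This is a sketch with 1-based attempt counts as in the table; the real logic in lib/evaluate.py also posts the closing comment and requeues the source.

```python
# Mechanical-only issues get one extra attempt; everything else
# terminates from attempt 2 onward.
def disposition(attempt: int, issue_class: str) -> str:
    if issue_class == "mechanical":
        return "open" if attempt <= 2 else "terminate"
    # substantive, mixed, and unknown share the stricter path
    return "open" if attempt == 1 else "terminate"
```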
SHA-based Reset
When Tier 0 validates a new commit (new HEAD SHA), it resets eval_attempts = 0 and all verdicts to pending. This gives the PR a completely fresh evaluation cycle after any code change.
Stage 4: Merge
Module: lib/merge.py
Interval: 30s
Domain Serialization
Merges are serialized per-domain (one merge at a time per domain) but parallel across domains. Two layers enforce this:
- `asyncio.Lock` per domain (fast path, lost on crash)
- SQL `NOT EXISTS` check for `status='merging'` in the same domain (defense-in-depth)
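The fast-path layer can be sketched as one lock per domain: merges within a domain serialize while different domains proceed in parallel. The SQL `NOT EXISTS` crash-safe layer is omitted, and the names are illustrative, not the real lib/merge.py API.

```python
import asyncio
from collections import defaultdict

class DomainLocks:
    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def merge(self, domain: str, do_merge):
        async with self._locks[domain]:  # one merge at a time per domain
            return await do_merge()
```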
Merge Flow
1. Discover external PRs — Scan Forgejo for open PRs not in SQLite. Human PRs get `priority='high'` and an acknowledgment comment.
2. Claim next approved PR — Atomic `UPDATE ... RETURNING` with priority ordering: critical > high > medium > low > unclassified. PR priority overrides source priority.
3. Rebase onto main — Creates a temp worktree, rebases, force-pushes with `--force-with-lease` pinned to the expected SHA (defeats the tracking-ref race).
4. Merge via Forgejo API — Checks if already merged/closed first (prevents 405 on ghost PRs).
5. Cleanup — Delete remote branch, prune worktree metadata.
Merge Timeout
5 minutes max per merge. If exceeded, force-reset to status='conflict'.
Formal Approvals
After both verdicts approve, _post_formal_approvals() submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.
Model Routing
Design principle: Model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.
| Stage | Model | Backend | Cost |
|---|---|---|---|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |
Opus Rate Limit Handling
When Claude Max Opus hits rate limit:
- Set 15-minute global backoff
- During backoff: STANDARD PRs still flow (Sonnet via OpenRouter), DEEP PRs queue
- Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter)
- After cooldown: resume full eval
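The backoff routing above can be sketched as a small gate. The class and method names are illustrative; only the 15-minute window and the DEEP-queues / STANDARD-flows split are from this document.

```python
OPUS_BACKOFF_S = 15 * 60  # global backoff window after a rate limit

class OpusGate:
    def __init__(self) -> None:
        self.backoff_until = 0.0

    def rate_limited(self, now: float) -> None:
        self.backoff_until = now + OPUS_BACKOFF_S

    def route(self, tier: str, now: float) -> str:
        if tier == "DEEP" and now < self.backoff_until:
            return "queue"   # DEEP PRs wait for Claude Max capacity
        return "review"      # everything else flows via OpenRouter
```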
Overflow Policies
Per-stage behavior when Claude Max is rate-limited:
| Stage | Policy | Behavior |
|---|---|---|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |
Circuit Breakers
Per-stage circuit breakers backed by SQLite. Three states:
| State | Behavior |
|---|---|
| CLOSED | Normal operation |
| OPEN | Stage paused (5 consecutive failures) |
| HALFOPEN | Cooldown expired (15 min), probe with 1 worker |
A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
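The three-state machine can be sketched in a few lines. SQLite persistence is omitted; the threshold and cooldown defaults match the lib/config.py values given below, and the class name is illustrative.

```python
class Breaker:
    def __init__(self, threshold: int = 5, cooldown: float = 900.0) -> None:
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0
        self.state = "CLOSED"

    def allow(self, now: float) -> bool:
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown:
            self.state = "HALFOPEN"  # cooldown expired: probe with 1 worker
        return self.state != "OPEN"

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.state = 0, "CLOSED"  # successful probe closes
            return
        self.failures += 1
        if self.state == "HALFOPEN" or self.failures >= self.threshold:
            # a failed probe, or 5 consecutive failures, opens the breaker
            self.state, self.opened_at, self.failures = "OPEN", now, 0
```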
Crash Recovery
On startup, the pipeline recovers interrupted state:
- Sources stuck in `extracting` → `unprocessed` (with retry counter increment; if exhausted → `error`)
- PRs stuck in `merging` → `approved` (re-merge attempt)
- PRs stuck in `reviewing` → `open` (re-evaluate)
Orphan worktrees from /tmp/teleo-extract-* and /tmp/teleo-merge-* are cleaned up.
Domain → Agent Mapping
Every domain has exactly one primary reviewing agent:
| Domain | Agent | Territory |
|---|---|---|
| internet-finance | Rio | domains/internet-finance/ |
| entertainment | Clay | domains/entertainment/ |
| health | Vida | domains/health/ |
| ai-alignment | Theseus | domains/ai-alignment/ |
| space-development | Astra | domains/space-development/ |
| mechanisms | Rio | core/mechanisms/ |
| living-capital | Rio | core/living-capital/ |
| living-agents | Theseus | core/living-agents/ |
| teleohumanity | Leo | core/teleohumanity/ |
| grand-strategy | Leo | core/grand-strategy/ |
| critical-systems | Theseus | foundations/critical-systems/ |
| collective-intelligence | Theseus | foundations/collective-intelligence/ |
| teleological-economics | Rio | foundations/teleological-economics/ |
| cultural-dynamics | Clay | foundations/cultural-dynamics/ |
Domain detection from diff: counts file path occurrences in domains/, entities/, core/, foundations/ subdirectories. Most-referenced domain wins.
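A hedged sketch of most-referenced-domain detection from changed file paths (the real lib/domains.py logic may differ in detail):

```python
import re
from collections import Counter
from typing import Optional

# Capture the subdirectory under domains/, entities/, core/, foundations/.
DOMAIN_DIRS = re.compile(r"(?:^|/)(?:domains|entities|core|foundations)/([\w-]+)/")

def detect_domain(changed_paths: list[str]) -> Optional[str]:
    counts = Counter(
        m.group(1) for p in changed_paths
        if (m := DOMAIN_DIRS.search(p)) is not None
    )
    # most-referenced domain wins; None when no path matches
    return counts.most_common(1)[0][0] if counts else None
```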
Key Configuration (lib/config.py)
| Setting | Value | Purpose |
|---|---|---|
| `MAX_EVAL_ATTEMPTS` | 3 | Hard cap on eval cycles per PR |
| `EVAL_TIMEOUT` | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| `MAX_EVAL_WORKERS` | 7 | Max concurrent eval tasks per cycle |
| `MERGE_TIMEOUT` | 300s | Force-reset to conflict if exceeded |
| `BREAKER_THRESHOLD` | 5 | Consecutive failures to trip breaker |
| `BREAKER_COOLDOWN` | 900s | 15 min before half-open probe |
| `LIGHT_SKIP_LLM` | false | When true, LIGHT PRs skip all LLM review |
| `LIGHT_PROMOTION_RATE` | 0.15 | Random LIGHT → STANDARD upgrade rate |
| `DEDUP_THRESHOLD` | 0.85 | SequenceMatcher near-duplicate threshold |
| `OPENROUTER_DAILY_BUDGET` | $20 | Daily cost cap for OpenRouter |
| `SAMPLE_AUDIT_RATE` | 0.15 | Pre-merge audit sampling rate |
Module Map
| Module | Responsibility |
|---|---|
| `teleo-pipeline.py` | Main entry, stage loops, shutdown, crash recovery |
| `lib/evaluate.py` | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| `lib/validate.py` | Tier 0 validation, frontmatter parsing, all deterministic checks |
| `lib/merge.py` | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| `lib/llm.py` | Prompt templates, OpenRouter transport, Claude CLI transport |
| `lib/forgejo.py` | Forgejo API client, diff fetching, agent token management |
| `lib/domains.py` | Domain↔agent mapping, domain detection from diff/branch |
| `lib/config.py` | All constants, paths, model IDs, thresholds |
| `lib/db.py` | SQLite connection, migrations, audit logging, transactions |
| `lib/breaker.py` | Per-stage circuit breaker state machine |
| `lib/costs.py` | OpenRouter cost tracking and budget enforcement |
| `lib/health.py` | HTTP health endpoint (port 8080) |
| `lib/log.py` | Structured JSON logging setup |
Known Issues and Gaps
- Ingest stage is a stub — Sources are not being ingested into pipeline v2. Old cron scripts (disabled) handled extraction.
- No auto-fixer — When Tier 0.5 or reviews reject for mechanical issues, there's no automated fix. PRs just consume eval attempts until terminal.
- `broken_wiki_links` is systemic — Extraction agents create `[[links]]` to claims that don't exist in the KB. This is the #1 rejection reason. Root cause is extraction prompt quality, not eval.
- Sequential eval processing — `evaluate_cycle()` processes PRs in a for-loop, not a concurrent `asyncio.gather`. Only one Opus review runs at a time.
- Source re-extraction not wired — `_terminate_pr()` tags sources for `needs_reextraction`, but the sources table is empty (never populated by pipeline v2).
Design Decisions Log
| Decision | Rationale | Author |
|---|---|---|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots + keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents `_map.md` merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts semantic problem (is this a real claim?) to mechanical check (does YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |