teleo-infrastructure/ARCHITECTURE.md
m3taversal d79ff60689 epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features
Pipeline reliability (8 fixes, reviewed by Ganymede+Rhea+Leo+Rio):
1. Merge API recovery — pre-flight approval check, transient/permanent distinction, jitter
2. Ghost PR detection — ls-remote branch check in reconciliation, network guard
3. Source status contract — directory IS status, no code change needed
4. Batch-state markers eliminated — two-gate skip (archive-check + batched branch-check)
5. Branch SHA tracking — batched ls-remote, auto-reset verdicts, dismiss stale reviews
6. Mirror pre-flight permissions — chown check in sync-mirror.sh
7. Telegram archive commit-after-write — git add/commit/push with rebase --abort fallback
8. Post-merge source archiving — queue/ → archive/{domain}/ after merge

Pipeline fixes:
- merge_cycled flag — eval attempts preserved during merge-failure cycling (Ganymede+Rhea)
- merge_failures diagnostic counter
- Startup recovery preserves eval_attempts (was incorrectly resetting to 0)
- No-diff PRs auto-closed by eval (root cause of 17 zombie PRs)
- GC threshold aligned with substantive fixer budget (was 2, now 4)
- Conflict retry with 3-attempt budget + permanent conflict handler
- Local ff-merge fallback for Forgejo 405 errors

Telegram bot:
- KB retrieval: 3-layer (entity resolution → claim search → agent context)
- Reply-to-bot handler (context.bot.id check)
- Tag regex: @teleo|@futairdbot
- Prompt rewrite for natural analyst voice
- Market data API integration (Ben's token price endpoint)
- Conversation windows (5-message unanswered counter, per-user-per-chat)
- Conversation history in prompt (last 5 exchanges)
- Worktree file lock for archive writes

Infrastructure:
- worktree_lock.py — file-based lock (flock) for main worktree coordination
- backfill-sources.py — source DB registration for Argus funnel
- batch-extract-50.sh v3 — two-gate skip, batched ls-remote, network guard
- sync-mirror.sh — auto-PR creation for mirrored GitHub branches, permission pre-flight
- Argus dashboard — conflicts + reviewing in backlog, queue count in funnel
- Enrichment-inside-frontmatter bug fix (regex anchor, not --- split)

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-20 20:17:27 +00:00

# Pipeline v2 Architecture
Single async Python daemon replacing 7 cron scripts. Four stage loops run concurrently over a SQLite WAL state store.
## System Overview
```
┌───────────────────────────────────────────────────┐
│                teleo-pipeline.py                  │
│                                                   │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐   │
│ │ Ingest  │ │ Validate │ │ Evaluate │ │ Merge │   │
│ │ (stub)  │ │   30s    │ │   30s    │ │  30s  │   │
│ └────┬────┘ └────┬─────┘ └────┬─────┘ └───┬───┘   │
│      │           │            │           │       │
│      └───────────┴─────┬──────┴───────────┘       │
│                        │                          │
│                   SQLite WAL                      │
│                  (pipeline.db)                    │
└────────────────────────┼──────────────────────────┘
                         │
              ┌──────────┴──────────┐
              │     Forgejo API     │
              │   git.livingip.xyz  │
              └─────────────────────┘
```
**Location:** `/opt/teleo-eval/pipeline/` (VPS), `~/.pentagon/workspace/collective/pipeline-v2/` (local dev)
**Process:** Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.
## Infrastructure
| Component | Detail |
|-----------|--------|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: `teleo`, repo: `teleo-codex` |
| Bare repo | `/opt/teleo-eval/workspaces/teleo-codex.git` — single-writer (fetch cron only) |
| Main worktree | `/opt/teleo-eval/workspaces/main` — refreshed by fetch, used for wiki link resolution |
| Database | `/opt/teleo-eval/pipeline/pipeline.db` — SQLite WAL mode |
| Secrets | `/opt/teleo-eval/secrets/` — per-agent Forgejo tokens, OpenRouter key |
| Logs | `/opt/teleo-eval/logs/pipeline.jsonl` — structured JSON, 50MB rotation, 7-day retention |
## PR Lifecycle
```
Source → Ingest → PR created on Forgejo
                 │
           ┌─────▼──────┐
           │  Validate  │  Tier 0: deterministic Python ($0)
           │  (tier0)   │  Schema, title, wiki links, domain match
           └─────┬──────┘
                 │ tier0_pass = 1
           ┌─────▼──────┐
           │  Tier 0.5  │  Mechanical pre-check ($0)
           │            │  Frontmatter, wiki links (ALL .md files),
           │            │  near-duplicate (warning only)
           └─────┬──────┘
                 │ passes
           ┌─────▼──────┐
           │   Triage   │  Haiku via OpenRouter (~$0.002)
           │            │  → DEEP / STANDARD / LIGHT
           └─────┬──────┘
        ┌────────┼────────┐
        │        │        │
      DEEP   STANDARD  LIGHT
        │        │        │
   ┌────▼────┐ ┌─▼───┐ ┌──▼───────────┐
   │ Domain  │ │same │ │   skip or    │
   │ GPT-4o  │ │     │ │ auto-approve │
   │ (OpenR) │ │     │ │ (LIGHT_SKIP) │
   └────┬────┘ └─┬───┘ └──────────────┘
        │        │
   ┌────▼────┐ ┌─▼───────┐
   │   Leo   │ │   Leo   │
   │  Opus   │ │ Sonnet  │
   │ (Claude │ │ (OpenR) │
   │   Max)  │ │         │
   └────┬────┘ └─┬───────┘
        │        │
        └───┬────┘
     ┌──────▼──────┐
     │ Disposition │  Retry budget, issue classification
     └──────┬──────┘
            │ both approve
     ┌──────▼──────┐
     │    Merge    │  Rebase + API merge, domain-serialized
     └─────────────┘
```
## Stage 1: Ingest (stub)
**Status:** Not implemented in pipeline v2. Sources were processed by old cron scripts (`extract-cron.sh`, `openrouter-extract.py`). All extraction crons are currently **disabled**.
**Interval:** 60s
**What it will do:** Scan `inbox/` for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in `sources` table.
## Stage 2: Validate (Tier 0)
**Module:** `lib/validate.py`
**Interval:** 30s
**Cost:** $0 (pure Python)
Deterministic validation gate. Finds PRs with `status='open'` and `tier0_pass IS NULL`.
### Checks performed (per claim file)
| Check | Type | Action |
|-------|------|--------|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if `domain:` field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |
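The two SequenceMatcher warnings reduce to one helper; a sketch (helper name and lowercase normalization are illustrative assumptions, not the actual `lib/validate.py` code):

```python
from difflib import SequenceMatcher

def too_similar(a: str, b: str, threshold: float) -> bool:
    """ratio() is in [0, 1]; above the threshold the pair gets a warning tag.
    Lowercasing both strings first is an assumption, not confirmed behavior."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold
```

The same helper covers both rows: description-vs-title uses 0.75, title-vs-existing-titles uses 0.85.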
### SHA-based idempotency
Each validation posts a comment with `<!-- TIER0-VALIDATION:{sha} -->`. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.
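A minimal sketch of the idempotency check, assuming PR comments arrive as plain strings (function and parameter names are hypothetical; the marker format is the one above):

```python
TIER0_MARKER = "<!-- TIER0-VALIDATION:{sha} -->"

def needs_validation(head_sha: str, comment_bodies: list[str]) -> bool:
    """Skip validation if any existing comment carries the current HEAD SHA."""
    marker = TIER0_MARKER.format(sha=head_sha)
    return not any(marker in body for body in comment_bodies)
```

A force-push changes `head_sha`, so no comment matches and validation runs again.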
### On new commits: full eval reset
When Tier 0 runs on a PR, it unconditionally resets:
- `eval_attempts = 0`
- `eval_issues = '[]'`
- `domain_verdict = 'pending'`, `leo_verdict = 'pending'`
This gives the PR a fresh evaluation cycle after any code change.
## Stage 2.5: Tier 0.5 (Mechanical Pre-check)
**Location:** `_tier05_mechanical_check()` in `lib/evaluate.py`
**Cost:** $0 (pure Python)
**Runs:** Inside `evaluate_pr()`, after musings bypass, before triage.
Catches mechanical issues that domain review (GPT-4o) tends to rubber-stamp and that Leo then rejects without structured issue tags.
### Checks
| Check | Scope | Action |
|-------|-------|--------|
| Frontmatter schema (parse + validate) | New files in claim dirs only | **Gate** (block) |
| Wiki link resolution | **ALL .md files** in diff | **Gate** (block) |
| Near-duplicate detection | New files in claim dirs only | **Tag only** (warning, LLM decides) |
### Key design decisions
- **Wiki links checked on all .md files**, not just claim directories. Agent files (`agents/*/beliefs.md`, etc.) frequently contain broken `[[links]]` that Tier 0.5 must catch before Opus wastes time on them.
- **Modified files only get wiki link checks** — they have partial content from diff, so frontmatter parsing is unreliable.
- **Near-duplicate is never a gate** — similarity is a judgment call for the LLM reviewer.
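A sketch of the wiki-link gate, under one assumption not stated above — that `[[target]]` resolves by matching the target against `.md` file stems anywhere in the worktree:

```python
import re
from pathlib import Path

# Capture the link target up to any |alias or #anchor inside [[...]]
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_wiki_links(text: str, kb_root: Path) -> list[str]:
    """Return [[targets]] with no matching .md file anywhere under kb_root."""
    known = {p.stem for p in kb_root.rglob("*.md")}
    return [t.strip() for t in WIKI_LINK.findall(text) if t.strip() not in known]
```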
### On failure
Posts Forgejo comment with issue tags (`<!-- ISSUES: tag1, tag2 -->`), sets `status='open'`, runs disposition. Counts as an eval attempt.
## Stage 3: Evaluate
**Module:** `lib/evaluate.py`
**Interval:** 30s
**Finds:** PRs with `status='open'`, `tier0_pass=1`, pending verdicts, `eval_attempts < MAX_EVAL_ATTEMPTS`
### 3a. Musings Bypass
If a PR only modifies files in `agents/*/musings/`, it's auto-approved immediately. No review needed.
### 3b. Triage
**Model:** Haiku via OpenRouter (~$0.002/call)
Classifies PR into exactly one tier:
| Tier | Criteria | Review path |
|------|----------|-------------|
| **DEEP** | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| **STANDARD** | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| **LIGHT** | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |
**When uncertain, classify UP.** Always err toward more review.
### Tier Overrides (post-triage)
Two overrides run after triage, in order. Both check `tier == "LIGHT"` so no double-upgrade is possible.
1. **Claim-shape detector** — If any `+` line in the diff contains `type: claim` (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.
2. **Random pre-merge promotion** — 15% of remaining LIGHT PRs get upgraded to STANDARD. Makes gaming unpredictable — extraction agents can't know which LIGHT PRs get full review.
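Both overrides fit in a few lines; a sketch (regex and function signature are illustrative, not the actual `lib/evaluate.py` code — early return after the first upgrade is what makes double-upgrade impossible):

```python
import random
import re

# Matches `type: claim` on an added diff line, any YAML quoting variant
CLAIM_SHAPE = re.compile(r"""^\+\s*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)

def apply_overrides(tier: str, diff: str, promotion_rate: float = 0.15,
                    rng=random.random) -> str:
    if tier != "LIGHT":
        return tier
    if CLAIM_SHAPE.search(diff):
        return "STANDARD"      # 1. claim-shape detector ($0, deterministic)
    if rng() < promotion_rate:
        return "STANDARD"      # 2. random pre-merge promotion (15%)
    return tier
```

The injectable `rng` makes the random branch testable; production just uses `random.random`.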
### 3c. Domain Review
**Model:** GPT-4o via OpenRouter
**Skipped when:** `LIGHT_SKIP_LLM=True` (config flag), or already completed from prior attempt
Reviews 4 criteria:
1. Factual accuracy
2. Intra-PR duplicates (same evidence copy-pasted across files)
3. Confidence calibration
4. Wiki link validity
**Verdict rules:** APPROVE if factually correct even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).
**If domain rejects:** Leo review is skipped entirely (saves Opus/Sonnet).
### 3d. Leo Review
**Model:** Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD)
**Skipped when:** LIGHT tier, or domain review rejected
DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).
### Verdicts
**There are exactly two verdicts:** `APPROVE` and `REQUEST_CHANGES`. There is no `REJECT` verdict.
Verdicts are parsed from structured tags in the review:
```
<!-- VERDICT:LEO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->
```
If no parseable verdict is found, defaults to `request_changes`.
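A sketch of the parse-with-default behavior (function name is hypothetical):

```python
import re

VERDICT = re.compile(r"<!-- VERDICT:LEO:(APPROVE|REQUEST_CHANGES) -->")

def parse_leo_verdict(review: str) -> str:
    """Unparseable reviews default to request_changes, never approve."""
    m = VERDICT.search(review)
    return m.group(1).lower() if m else "request_changes"
```

Defaulting to `request_changes` is the safe failure mode: a malformed review can never merge a PR.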
### Issue Tags
Reviews tag specific issues using structured comments:
```
<!-- ISSUES: broken_wiki_links, frontmatter_schema -->
```
**Valid tags:**
| Tag | Category | Description |
|-----|----------|-------------|
| `broken_wiki_links` | Mechanical | `[[links]]` that don't resolve to existing files |
| `frontmatter_schema` | Mechanical | Missing/invalid YAML fields |
| `near_duplicate` | Mechanical | Title too similar to existing claim (>85%) |
| `factual_discrepancy` | Substantive | Factual errors in the claim |
| `confidence_miscalibration` | Substantive | Confidence level doesn't match evidence |
| `scope_error` | Substantive | Claim scope too broad/narrow |
| `title_overclaims` | Substantive | Title makes stronger claim than evidence supports |
| `date_errors` | — | Invalid or incorrect dates |
**Tag inference fallback:** If a review rejects without structured `<!-- ISSUES: -->` tags, `_infer_issues_from_prose()` scans the review text with conservative regex patterns to extract issue tags. 7 categories, 2-4 keyword patterns each.
### Review Style Guide
All review prompts include the style guide requiring per-criterion findings:
- "You MUST show your work"
- "For each criterion, write one sentence with your finding"
- "'Everything passes' with no evidence of checking will be treated as review failures"
Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in `/opt/teleo-eval/secrets/`).
## Retry Budget and Disposition
### Eval Attempts
**Hard cap:** `MAX_EVAL_ATTEMPTS = 3`
Each time `evaluate_pr()` runs, it increments `eval_attempts` before any checks. This means Tier 0.5 failures count as eval attempts.
### Issue Classification
Issues are classified as:
- **Mechanical:** `frontmatter_schema`, `broken_wiki_links`, `near_duplicate`
- **Substantive:** `factual_discrepancy`, `confidence_miscalibration`, `scope_error`, `title_overclaims`
- **Mixed:** Both types present
- **Unknown:** Tags not in either set
### Disposition Logic
| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---------|----------------|--------------------------|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | **Keep open** for one more try | **Terminate** (close PR, requeue source) |
| 3+ | **Terminate** | **Terminate** |
**Terminate** means: close PR on Forgejo with explanation comment, update DB status to `closed`, tag source for re-extraction (if source_path linked).
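The table reduces to a small function; a sketch with hypothetical return labels (`retry` = back to open, `terminate` = close and requeue):

```python
def disposition(attempt: int, issue_class: str) -> str:
    """Map (eval attempt number, issue classification) to the table above."""
    if attempt >= 3:
        return "terminate"
    if attempt == 2 and issue_class != "mechanical":
        return "terminate"     # substantive/mixed/unknown get no second retry
    return "retry"
```

The asymmetry is deliberate: mechanical issues are cheap to fix, so they earn one extra attempt.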
### SHA-based Reset
When Tier 0 validates a new commit (new HEAD SHA), it resets `eval_attempts = 0` and all verdicts to `pending`. This gives the PR a completely fresh evaluation cycle after any code change.
## Stage 4: Merge
**Module:** `lib/merge.py`
**Interval:** 30s
### Domain Serialization
Merges are serialized per-domain (one merge at a time per domain) but parallel across domains. Two layers enforce this:
1. `asyncio.Lock` per domain (fast path, lost on crash)
2. SQL `NOT EXISTS` check for `status='merging'` in same domain (defense-in-depth)
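A sketch of the SQL layer, with a hypothetical `prs` table and `ORDER BY id` standing in for the real priority ordering (`RETURNING` needs SQLite ≥ 3.35):

```python
import sqlite3

def claim_pr(db: sqlite3.Connection, domain: str):
    """Atomically claim one approved PR, refusing while any PR in the same
    domain is already merging. Returns the claimed PR id, or None."""
    row = db.execute(
        """
        UPDATE prs SET status = 'merging'
        WHERE id = (
            SELECT id FROM prs
            WHERE status = 'approved' AND domain = ?
              AND NOT EXISTS (
                  SELECT 1 FROM prs WHERE status = 'merging' AND domain = ?
              )
            ORDER BY id          -- real code orders by priority, not id
            LIMIT 1
        )
        RETURNING id
        """,
        (domain, domain),
    ).fetchone()
    return row[0] if row else None
```

Because the `NOT EXISTS` guard and the claim happen in one statement, it holds even if the `asyncio.Lock` layer is lost to a crash.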
### Merge Flow
1. **Discover external PRs** — Scan Forgejo for open PRs not in SQLite. Human PRs get `priority='high'` and an acknowledgment comment.
2. **Claim next approved PR** — Atomic `UPDATE ... RETURNING` with priority ordering: `critical > high > medium > low > unclassified`. PR priority overrides source priority.
3. **Rebase onto main** — Creates temp worktree, rebases, force-pushes with `--force-with-lease` pinned to expected SHA (defeats tracking-ref race).
4. **Merge via Forgejo API** — Checks if already merged/closed first (prevents 405 on ghost PRs).
5. **Cleanup** — Delete remote branch, prune worktree metadata.
### Merge Timeout
5 minutes max per merge. If exceeded, force-reset to `status='conflict'`.
### Formal Approvals
After both verdicts approve, `_post_formal_approvals()` submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.
## Model Routing
**Design principle:** Model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.
| Stage | Model | Backend | Cost |
|-------|-------|---------|------|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |
### Opus Rate Limit Handling
When Claude Max Opus hits rate limit:
1. Set 15-minute global backoff
2. During backoff: STANDARD PRs still flow (Sonnet via OpenRouter), DEEP PRs queue
3. Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter)
4. After cooldown: resume full eval
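A sketch of the routing decision during backoff (backend labels and signature are illustrative):

```python
def route_leo(tier: str, now: float, backoff_until: float) -> str:
    """DEEP waits out an Opus backoff; STANDARD always has a Sonnet path."""
    if tier == "DEEP":
        return "queue" if now < backoff_until else "claude-max/opus"
    return "openrouter/sonnet"
```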
### Overflow Policies
Per-stage behavior when Claude Max is rate-limited:
| Stage | Policy | Behavior |
|-------|--------|----------|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |
## Circuit Breakers
Per-stage circuit breakers backed by SQLite. Three states:
| State | Behavior |
|-------|----------|
| **CLOSED** | Normal operation |
| **OPEN** | Stage paused (5 consecutive failures) |
| **HALFOPEN** | Cooldown expired (15 min), probe with 1 worker |
A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
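A sketch of the state machine, with an injectable clock for testing (the real breaker persists its state in SQLite; this one is in-memory only):

```python
import time

class Breaker:
    """CLOSED → OPEN after `threshold` consecutive failures; OPEN → HALFOPEN
    once `cooldown` seconds elapse; HALFOPEN closes on a successful probe
    and reopens on a failed one."""

    def __init__(self, threshold: int = 5, cooldown: float = 900,
                 clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown:
            return "HALFOPEN"
        return "OPEN"

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None      # close
        else:
            probing = self.state() == "HALFOPEN"
            self.failures += 1
            if self.failures >= self.threshold or probing:
                self.opened_at = self.clock()            # open / reopen
```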
## Crash Recovery
On startup, the pipeline recovers interrupted state:
- Sources stuck in `extracting` → `unprocessed` (with retry counter increment; if exhausted → `error`)
- PRs stuck in `merging` → `approved` (re-merge attempt)
- PRs stuck in `reviewing` → `open` (re-evaluate)
Orphan worktrees from `/tmp/teleo-extract-*` and `/tmp/teleo-merge-*` are cleaned up.
## Domain → Agent Mapping
Every domain has exactly one primary reviewing agent:
| Domain | Agent | Territory |
|--------|-------|-----------|
| internet-finance | Rio | `domains/internet-finance/` |
| entertainment | Clay | `domains/entertainment/` |
| health | Vida | `domains/health/` |
| ai-alignment | Theseus | `domains/ai-alignment/` |
| space-development | Astra | `domains/space-development/` |
| mechanisms | Rio | `core/mechanisms/` |
| living-capital | Rio | `core/living-capital/` |
| living-agents | Theseus | `core/living-agents/` |
| teleohumanity | Leo | `core/teleohumanity/` |
| grand-strategy | Leo | `core/grand-strategy/` |
| critical-systems | Theseus | `foundations/critical-systems/` |
| collective-intelligence | Theseus | `foundations/collective-intelligence/` |
| teleological-economics | Rio | `foundations/teleological-economics/` |
| cultural-dynamics | Clay | `foundations/cultural-dynamics/` |
Domain detection from diff: counts file path occurrences in `domains/`, `entities/`, `core/`, `foundations/` subdirectories. Most-referenced domain wins.
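A sketch of most-referenced-wins detection (the regex is an assumption about exact path shapes; `lib/domains.py` is the authoritative version):

```python
import re
from collections import Counter

DOMAIN_DIRS = re.compile(r"(?:domains|entities|core|foundations)/([^/]+)/")

def detect_domain(paths: list[str]):
    """Count domain subdirectory occurrences across changed file paths;
    the most-referenced domain wins. Returns None if no path matches."""
    counts = Counter()
    for p in paths:
        m = DOMAIN_DIRS.search(p)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(1)[0][0] if counts else None
```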
## Key Configuration (`lib/config.py`)
| Setting | Value | Purpose |
|---------|-------|---------|
| `MAX_EVAL_ATTEMPTS` | 3 | Hard cap on eval cycles per PR |
| `EVAL_TIMEOUT` | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| `MAX_EVAL_WORKERS` | 7 | Max concurrent eval tasks per cycle |
| `MERGE_TIMEOUT` | 300s | Force-reset to conflict if exceeded |
| `BREAKER_THRESHOLD` | 5 | Consecutive failures to trip breaker |
| `BREAKER_COOLDOWN` | 900s | 15 min before half-open probe |
| `LIGHT_SKIP_LLM` | false | When true, LIGHT PRs skip all LLM review |
| `LIGHT_PROMOTION_RATE` | 0.15 | Random LIGHT → STANDARD upgrade rate |
| `DEDUP_THRESHOLD` | 0.85 | SequenceMatcher near-duplicate threshold |
| `OPENROUTER_DAILY_BUDGET` | $20 | Daily cost cap for OpenRouter |
| `SAMPLE_AUDIT_RATE` | 0.15 | Pre-merge audit sampling rate |
## Module Map
| Module | Responsibility |
|--------|---------------|
| `teleo-pipeline.py` | Main entry, stage loops, shutdown, crash recovery |
| `lib/evaluate.py` | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| `lib/validate.py` | Tier 0 validation, frontmatter parsing, all deterministic checks |
| `lib/merge.py` | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| `lib/llm.py` | Prompt templates, OpenRouter transport, Claude CLI transport |
| `lib/forgejo.py` | Forgejo API client, diff fetching, agent token management |
| `lib/domains.py` | Domain↔agent mapping, domain detection from diff/branch |
| `lib/config.py` | All constants, paths, model IDs, thresholds |
| `lib/db.py` | SQLite connection, migrations, audit logging, transactions |
| `lib/breaker.py` | Per-stage circuit breaker state machine |
| `lib/costs.py` | OpenRouter cost tracking and budget enforcement |
| `lib/health.py` | HTTP health endpoint (port 8080) |
| `lib/log.py` | Structured JSON logging setup |
## Known Issues and Gaps
1. **Ingest stage is a stub** — Sources are not being ingested into pipeline v2. Old cron scripts (disabled) handled extraction.
2. **No auto-fixer** — When Tier 0.5 or reviews reject for mechanical issues, there's no automated fix. PRs just consume eval attempts until terminal.
3. **`broken_wiki_links` is systemic** — Extraction agents create `[[links]]` to claims that don't exist in the KB. This is the #1 rejection reason. Root cause is extraction prompt quality, not eval.
4. **Sequential eval processing** — `evaluate_cycle()` processes PRs in a for-loop, not concurrent `asyncio.gather`. Only one Opus review runs at a time.
5. **Source re-extraction not wired** — `_terminate_pr()` tags sources for `needs_reextraction` but sources table is empty (never populated by pipeline v2).
## Design Decisions Log
| Decision | Rationale | Author |
|----------|-----------|--------|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots + keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents `_map.md` merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts semantic problem (is this a real claim?) to mechanical check (does YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |