Pipeline v2 Architecture
Single async Python daemon replacing 7 cron scripts. Four stage loops running concurrently with SQLite WAL state store.
System Overview
┌─────────────────────────────────────────────┐
│ teleo-pipeline.py │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐
│ │ Ingest │ │ Validate │ │ Evaluate │ │ Merge │
│ │ (stub) │ │ 30s │ │ 30s │ │ 30s │
│ └────┬────┘ └────┬─────┘ └────┬─────┘ └───┬───┘
│ │ │ │ │
│ └───────────┴────────────┴───────────┘
│ │
│ SQLite WAL
│ (pipeline.db)
└─────────────────────────────────────────────┘
│
┌──────────┴──────────┐
│ Forgejo API │
│ git.livingip.xyz │
└─────────────────────┘
Location: /opt/teleo-eval/pipeline/ (VPS), ~/.pentagon/workspace/collective/pipeline-v2/ (local dev)
Process: Single Python process, systemd-managed. PID tracked. Graceful shutdown on SIGTERM/SIGINT — waits up to 60s for stages to finish, then kills lingering Claude CLI subprocesses.
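The stage-loop and graceful-shutdown pattern described above can be sketched as follows. This is a minimal illustration assuming an `asyncio.Event`-based stop signal; the function names and the immediate `stop.set()` call are for demonstration only, not the real teleo-pipeline.py.

```python
import asyncio
import signal

# Illustrative sketch: stage loops poll a shared Event; SIGTERM or SIGINT
# sets it, and shutdown waits up to 60s before cancelling stragglers.
async def stage_loop(name: str, stop: asyncio.Event, interval: float = 0.01) -> str:
    while not stop.is_set():
        # ... one cycle of work for this stage would go here ...
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return f"{name}: clean exit"

async def run_daemon(grace_period: float = 60.0) -> list[str]:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    tasks = [asyncio.create_task(stage_loop(n, stop))
             for n in ("ingest", "validate", "evaluate", "merge")]
    stop.set()  # for demonstration only: trigger shutdown immediately
    done, pending = await asyncio.wait(tasks, timeout=grace_period)
    for t in pending:
        t.cancel()  # the real daemon also kills lingering Claude CLI subprocesses
    return sorted(t.result() for t in done)
```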
Infrastructure
| Component | Detail |
|---|---|
| VPS | Hetzner CAX31, 77.42.65.182, Ubuntu 24.04 ARM64, 16GB RAM |
| Forgejo | git.livingip.xyz, org: teleo, repo: teleo-codex |
| Bare repo | /opt/teleo-eval/workspaces/teleo-codex.git — single-writer (fetch cron only) |
| Main worktree | /opt/teleo-eval/workspaces/main — refreshed by fetch, used for wiki link resolution |
| Database | /opt/teleo-eval/pipeline/pipeline.db — SQLite WAL mode |
| Secrets | /opt/teleo-eval/secrets/ — per-agent Forgejo tokens, OpenRouter key |
| Logs | /opt/teleo-eval/logs/pipeline.jsonl — structured JSON, 50MB rotation, 7-day retention |
PR Lifecycle
Source → Ingest → PR created on Forgejo
│
┌─────▼──────┐
│ Validate │ Tier 0: deterministic Python ($0)
│ (tier0) │ Schema, title, wiki links, domain match
└─────┬──────┘
│ tier0_pass = 1
┌─────▼──────┐
│ Tier 0.5 │ Mechanical pre-check ($0)
│ │ Frontmatter, wiki links (ALL .md files),
│ │ near-duplicate (warning only)
└─────┬──────┘
│ passes
┌─────▼──────┐
│ Triage │ Haiku via OpenRouter (~$0.002)
│ │ → DEEP / STANDARD / LIGHT
└─────┬──────┘
│
┌─────────┼─────────┐
│ │ │
DEEP STANDARD LIGHT
│ │ │
┌────▼────┐ ┌──▼──┐ ┌──▼──────────┐
│ Domain │ │same │ │ skip or │
│ GPT-4o │ │ │ │ auto-approve │
│(OpenR) │ │ │ │ (LIGHT_SKIP) │
└────┬────┘ └──┬──┘ └──────────────┘
│ │
┌────▼────┐ ┌──▼──────┐
│ Leo │ │ Leo │
│ Opus │ │ Sonnet │
│(Claude │ │(OpenR) │
│ Max) │ │ │
└────┬────┘ └──┬──────┘
│ │
└────┬────┘
│
┌──────▼──────┐
│ Disposition │ Retry budget, issue classification
└──────┬──────┘
│ both approve
┌──────▼──────┐
│ Merge │ Rebase + API merge, domain-serialized
└─────────────┘
Stage 1: Ingest (stub)
Status: Not implemented in pipeline v2. Sources were processed by old cron scripts (extract-cron.sh, openrouter-extract.py). All extraction crons are currently disabled.
Interval: 60s
What it will do: Scan inbox/ for unprocessed sources, extract claims via LLM, create PRs on Forgejo, track in sources table.
Stage 2: Validate (Tier 0)
Module: lib/validate.py
Interval: 30s
Cost: $0 (pure Python)
Deterministic validation gate. Finds PRs with status='open' and tier0_pass IS NULL.
Checks performed (per claim file)
| Check | Type | Action |
|---|---|---|
| YAML frontmatter present | Gate | Fail if missing |
| Required fields: type, domain, description, confidence, source, created | Gate | Fail if missing |
| Valid enums (type, domain, confidence) | Gate | Fail if invalid |
| Description length ≥ 10 chars | Gate | Fail |
| Date valid (2020–today, correct format) | Gate | Fail |
| Title is prose proposition (verb/connective detection) | Gate | Fail if < 4 words and no signal |
| Wiki links resolve to existing files | Gate | Fail if broken |
| Domain-directory match | Gate | Fail if domain: field doesn't match file path |
| Universal quantifiers without scoping | Warning | Tag but don't fail |
| Description too similar to title (>75% SequenceMatcher) | Warning | Tag but don't fail |
| Near-duplicate title (>85% SequenceMatcher) | Warning | Tag but don't fail |
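The two warning-level SequenceMatcher checks can be illustrated with a short sketch. Thresholds come from the table above; the function and tag names are illustrative, not the real lib/validate.py API.

```python
from difflib import SequenceMatcher

# Warning-level similarity checks: tag the PR but don't fail it.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_warnings(title: str, description: str,
                        existing_titles: list[str]) -> list[str]:
    tags = []
    if similarity(description, title) > 0.75:
        tags.append("description_similar_to_title")  # tag, don't fail
    if any(similarity(title, t) > 0.85 for t in existing_titles):
        tags.append("near_duplicate")  # tag, don't fail
    return tags
```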
SHA-based idempotency
Each validation posts a comment with <!-- TIER0-VALIDATION:{sha} -->. If a comment with the current HEAD SHA already exists, validation is skipped. Force-push (new SHA) triggers re-validation.
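A minimal sketch of the marker check (the marker format is from this document; fetching PR comments from Forgejo is omitted):

```python
# Skip re-validation when a marker for the current HEAD SHA already exists.
MARKER = "<!-- TIER0-VALIDATION:{sha} -->"

def already_validated(head_sha: str, comment_bodies: list[str]) -> bool:
    marker = MARKER.format(sha=head_sha)
    return any(marker in body for body in comment_bodies)
```

A force-push produces a new HEAD SHA, so the marker no longer matches and validation runs again.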
On new commits: full eval reset
When Tier 0 runs on a PR, it unconditionally resets:
- `eval_attempts = 0`
- `eval_issues = '[]'`
- `domain_verdict = 'pending'`
- `leo_verdict = 'pending'`
This gives the PR a fresh evaluation cycle after any code change.
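The reset can be sketched as a single UPDATE. Column names are taken from the list above; the table name `prs` and the function name are assumptions; the real migrations and queries live in lib/db.py.

```python
import sqlite3

# Illustrative full eval reset on a new HEAD SHA.
def reset_eval_state(db: sqlite3.Connection, pr_id: int, new_sha: str) -> None:
    db.execute(
        """UPDATE prs
           SET eval_attempts = 0,
               eval_issues = '[]',
               domain_verdict = 'pending',
               leo_verdict = 'pending',
               head_sha = ?
           WHERE id = ?""",
        (new_sha, pr_id),
    )
    db.commit()
```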
Stage 2.5: Tier 0.5 (Mechanical Pre-check)
Location: _tier05_mechanical_check() in lib/evaluate.py
Cost: $0 (pure Python)
Runs: Inside evaluate_pr(), after musings bypass, before triage.
Catches mechanical issues that domain review (GPT-4o) rubber-stamps and Leo rejects without structured issue tags.
Checks
| Check | Scope | Action |
|---|---|---|
| Frontmatter schema (parse + validate) | New files in claim dirs only | Gate (block) |
| Wiki link resolution | ALL .md files in diff | Gate (block) |
| Near-duplicate detection | New files in claim dirs only | Tag only (warning, LLM decides) |
Key design decisions
- Wiki links checked on all .md files, not just claim directories. Agent files (`agents/*/beliefs.md`, etc.) frequently contain broken `[[links]]` that Tier 0.5 must catch before Opus wastes time on them.
- Modified files only get wiki link checks — they have partial content from the diff, so frontmatter parsing is unreliable.
- Near-duplicate is never a gate — similarity is a judgment call for the LLM reviewer.
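A hypothetical sketch of the wiki-link resolution check over a file's text (the real check runs inside `_tier05_mechanical_check()` and works from the PR diff; the names here are illustrative):

```python
import re
from pathlib import Path

# A [[link]] resolves if a .md file with that name exists in the worktree.
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(md_text: str, worktree: Path) -> list[str]:
    broken = []
    for target in WIKI_LINK.findall(md_text):
        name = target.strip()
        if not any(worktree.rglob(f"{name}.md")):
            broken.append(name)
    return broken
```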
On failure
Posts Forgejo comment with issue tags (<!-- ISSUES: tag1, tag2 -->), sets status='open', runs disposition. Counts as an eval attempt.
Stage 3: Evaluate
Module: lib/evaluate.py
Interval: 30s
Finds: PRs with status='open', tier0_pass=1, pending verdicts, eval_attempts < MAX_EVAL_ATTEMPTS
3a. Musings Bypass
If a PR only modifies files in agents/*/musings/, it's auto-approved immediately. No review needed.
3b. Triage
Model: Haiku via OpenRouter (~$0.002/call)
Classifies PR into exactly one tier:
| Tier | Criteria | Review path |
|---|---|---|
| DEEP | Likely+ confidence, cross-domain, challenges existing, axiom-level | Full: Domain (GPT-4o) + Leo (Opus) |
| STANDARD | New claims, enrichments, hypothesis beliefs | Full: Domain (GPT-4o) + Leo (Sonnet) |
| LIGHT | Entity updates, source archiving, formatting, status changes | Configurable: skip or auto-approve |
When uncertain, classify UP. Always err toward more review.
Tier Overrides (post-triage)
Two overrides run after triage, in order. Both check tier == "LIGHT" so no double-upgrade is possible.
1. Claim-shape detector — If any `+` line in the diff contains `type: claim` (any YAML quoting variant), upgrade LIGHT → STANDARD. Catches factual claims disguised as light content. $0, deterministic.
2. Random pre-merge promotion — 15% of remaining LIGHT PRs get upgraded to STANDARD. Makes gaming unpredictable: extraction agents can't know which LIGHT PRs get full review.
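The two overrides can be sketched together. The regex covers common YAML quoting variants of `type: claim`; the function and constant names are illustrative, not the real lib/evaluate.py API.

```python
import random
import re

# Matches an added diff line declaring type: claim (quoted or not).
CLAIM_SHAPE = re.compile(r"""^\+\s*type:\s*['"]?claim['"]?\s*$""", re.MULTILINE)

def apply_overrides(tier: str, diff: str, rng: random.Random,
                    promotion_rate: float = 0.15) -> str:
    if tier != "LIGHT":
        return tier  # both overrides check tier == "LIGHT": no double-upgrade
    if CLAIM_SHAPE.search(diff):
        return "STANDARD"  # claim-shape detector: $0, deterministic
    if rng.random() < promotion_rate:
        return "STANDARD"  # random pre-merge promotion (15%)
    return tier
```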
3c. Domain Review
Model: GPT-4o via OpenRouter
Skipped when: LIGHT_SKIP_LLM=True (config flag), or already completed from prior attempt
Reviews 4 criteria:
- Factual accuracy
- Intra-PR duplicates (same evidence copy-pasted across files)
- Confidence calibration
- Wiki link validity
Verdict rules: APPROVE if factually correct even with minor improvements possible. REQUEST_CHANGES only for blocking issues (factual errors, genuinely broken links, copy-pasted duplicates, clearly wrong confidence).
If domain rejects: Leo review is skipped entirely (saves Opus/Sonnet).
3d. Leo Review
Model: Opus via Claude Max (DEEP) or Sonnet via OpenRouter (STANDARD)
Skipped when: LIGHT tier, or domain review rejected
DEEP reviews check 11 criteria (cross-domain implications, axiom integrity, epistemic hygiene, etc.). STANDARD reviews check 6 criteria (schema, duplicates, confidence, wiki links, source quality, specificity).
Verdicts
There are exactly two verdicts: APPROVE and REQUEST_CHANGES. There is no REJECT verdict.
Verdicts are parsed from structured tags in the review:
<!-- VERDICT:LEO:APPROVE -->
<!-- VERDICT:LEO:REQUEST_CHANGES -->
If no parseable verdict is found, defaults to request_changes.
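The verdict-parsing rule reduces to a small sketch (tag format and default from this section; the function name is illustrative):

```python
import re

# Exactly two verdicts exist; anything unparseable defaults to request_changes.
VERDICT_RE = re.compile(r"<!--\s*VERDICT:LEO:(APPROVE|REQUEST_CHANGES)\s*-->")

def parse_leo_verdict(review_text: str) -> str:
    m = VERDICT_RE.search(review_text)
    return m.group(1).lower() if m else "request_changes"
```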
Issue Tags
Reviews tag specific issues using structured comments:
<!-- ISSUES: broken_wiki_links, frontmatter_schema -->
Valid tags:
| Tag | Category | Description |
|---|---|---|
| `broken_wiki_links` | Mechanical | `[[links]]` that don't resolve to existing files |
| `frontmatter_schema` | Mechanical | Missing/invalid YAML fields |
| `near_duplicate` | Mechanical | Title too similar to existing claim (>85%) |
| `factual_discrepancy` | Substantive | Factual errors in the claim |
| `confidence_miscalibration` | Substantive | Confidence level doesn't match evidence |
| `scope_error` | Substantive | Claim scope too broad/narrow |
| `title_overclaims` | Substantive | Title makes stronger claim than evidence supports |
| `date_errors` | — | Invalid or incorrect dates |
Tag inference fallback: If a review rejects without structured <!-- ISSUES: --> tags, _infer_issues_from_prose() scans the review text with conservative regex patterns to extract issue tags. 7 categories, 2-4 keyword patterns each.
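The structured-tag extraction can be sketched as follows (the conservative prose-inference fallback in `_infer_issues_from_prose()` is not reproduced here; names are illustrative):

```python
import re

# Pull comma-separated tags out of a structured ISSUES comment.
ISSUES_RE = re.compile(r"<!--\s*ISSUES:\s*([^>]*?)\s*-->")

def parse_issue_tags(review_text: str) -> list[str]:
    m = ISSUES_RE.search(review_text)
    if not m:
        return []
    return [t.strip() for t in m.group(1).split(",") if t.strip()]
```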
Review Style Guide
All review prompts include the style guide requiring per-criterion findings:
- "You MUST show your work"
- "For each criterion, write one sentence with your finding"
- "'Everything passes' with no evidence of checking will be treated as review failures"
Reviews are posted as Forgejo comments from the reviewing agent's own Forgejo account (per-agent tokens in /opt/teleo-eval/secrets/).
Retry Budget and Disposition
Eval Attempts
Hard cap: MAX_EVAL_ATTEMPTS = 3
Each time evaluate_pr() runs, it increments eval_attempts before any checks. This means Tier 0.5 failures count as eval attempts.
Issue Classification
Issues are classified as:
- Mechanical: `frontmatter_schema`, `broken_wiki_links`, `near_duplicate`
- Substantive: `factual_discrepancy`, `confidence_miscalibration`, `scope_error`, `title_overclaims`
- Mixed: Both types present
- Unknown: Tags not in either set
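The classification above, as a sketch (tag sets from this section; the function name is illustrative):

```python
MECHANICAL = {"frontmatter_schema", "broken_wiki_links", "near_duplicate"}
SUBSTANTIVE = {"factual_discrepancy", "confidence_miscalibration",
               "scope_error", "title_overclaims"}

def classify_issues(tags: list[str]) -> str:
    s = set(tags)
    if s - MECHANICAL - SUBSTANTIVE:
        return "unknown"            # tags not in either set
    mech, subst = s & MECHANICAL, s & SUBSTANTIVE
    if mech and subst:
        return "mixed"
    if mech:
        return "mechanical"
    if subst:
        return "substantive"
    return "unknown"
```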
Disposition Logic
| Attempt | Mechanical only | Substantive/Mixed/Unknown |
|---|---|---|
| 1 | Back to open, wait for fix | Back to open, wait for fix |
| 2 | Keep open for one more try | Terminate (close PR, requeue source) |
| 3+ | Terminate | Terminate |
Terminate means: close PR on Forgejo with explanation comment, update DB status to closed, tag source for re-extraction (if source_path linked).
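The disposition table reduces to a small decision function. This is a sketch with 1-based attempt counts as in the table; the real logic in lib/evaluate.py also posts the closing comment and requeues the source.

```python
# Mechanical-only issues get one extra attempt; everything else
# terminates from attempt 2 onward.
def disposition(attempt: int, issue_class: str) -> str:
    if issue_class == "mechanical":
        return "open" if attempt <= 2 else "terminate"
    # substantive, mixed, and unknown share the stricter path
    return "open" if attempt == 1 else "terminate"
```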
SHA-based Reset
When Tier 0 validates a new commit (new HEAD SHA), it resets eval_attempts = 0 and all verdicts to pending. This gives the PR a completely fresh evaluation cycle after any code change.
Stage 4: Merge
Module: lib/merge.py
Interval: 30s
Domain Serialization
Merges are serialized per-domain (one merge at a time per domain) but parallel across domains. Two layers enforce this:
- `asyncio.Lock` per domain (fast path, lost on crash)
- SQL `NOT EXISTS` check for `status='merging'` in the same domain (defense-in-depth)
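The fast-path layer can be sketched as one lock per domain: merges within a domain serialize while different domains proceed in parallel. The SQL `NOT EXISTS` crash-safe layer is omitted, and the names are illustrative, not the real lib/merge.py API.

```python
import asyncio
from collections import defaultdict

class DomainLocks:
    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def merge(self, domain: str, do_merge):
        async with self._locks[domain]:  # one merge at a time per domain
            return await do_merge()
```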
Merge Flow
1. Discover external PRs — Scan Forgejo for open PRs not in SQLite. Human PRs get `priority='high'` and an acknowledgment comment.
2. Claim next approved PR — Atomic `UPDATE ... RETURNING` with priority ordering: critical > high > medium > low > unclassified. PR priority overrides source priority.
3. Rebase onto main — Creates a temp worktree, rebases, force-pushes with `--force-with-lease` pinned to the expected SHA (defeats the tracking-ref race).
4. Merge via Forgejo API — Checks if already merged/closed first (prevents 405 on ghost PRs).
5. Cleanup — Delete remote branch, prune worktree metadata.
Merge Timeout
5 minutes max per merge. If exceeded, force-reset to status='conflict'.
Formal Approvals
After both verdicts approve, _post_formal_approvals() submits Forgejo review approvals from 2 agent accounts (not the PR author). Required by Forgejo's merge protection rules.
Model Routing
Design principle: Model diversity. Domain review (GPT-4o) and Leo review (Sonnet/Opus) use different model families to prevent correlated blind spots.
| Stage | Model | Backend | Cost |
|---|---|---|---|
| Triage | Haiku | OpenRouter | ~$0.002/call |
| Domain review | GPT-4o | OpenRouter | ~$0.02/call |
| Leo STANDARD | Sonnet 4.5 | OpenRouter | ~$0.02/call |
| Leo DEEP | Opus | Claude Max (subscription) | $0 (rate-limited) |
| Extraction | Sonnet | Claude Max | $0 (rate-limited) |
Opus Rate Limit Handling
When Claude Max Opus hits rate limit:
- Set 15-minute global backoff
- During backoff: STANDARD PRs still flow (Sonnet via OpenRouter), DEEP PRs queue
- Triage (Haiku) and domain review (GPT-4o) always flow (OpenRouter)
- After cooldown: resume full eval
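The backoff routing above can be sketched as a small gate. The class and method names are illustrative; only the 15-minute window and the DEEP-queues / STANDARD-flows split are from this document.

```python
OPUS_BACKOFF_S = 15 * 60  # global backoff window after a rate limit

class OpusGate:
    def __init__(self) -> None:
        self.backoff_until = 0.0

    def rate_limited(self, now: float) -> None:
        self.backoff_until = now + OPUS_BACKOFF_S

    def route(self, tier: str, now: float) -> str:
        if tier == "DEEP" and now < self.backoff_until:
            return "queue"   # DEEP PRs wait for Claude Max capacity
        return "review"      # everything else flows via OpenRouter
```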
Overflow Policies
Per-stage behavior when Claude Max is rate-limited:
| Stage | Policy | Behavior |
|---|---|---|
| Extract | queue | Wait for capacity |
| Triage | overflow | Fall back to API |
| Domain review | overflow | Always API anyway |
| Leo review | queue | Wait for capacity (protect Opus) |
| DEEP eval | overflow | Already on API |
| Sample audit | skip | Optional, skip if constrained |
Circuit Breakers
Per-stage circuit breakers backed by SQLite. Three states:
| State | Behavior |
|---|---|
| CLOSED | Normal operation |
| OPEN | Stage paused (5 consecutive failures) |
| HALFOPEN | Cooldown expired (15 min), probe with 1 worker |
A successful probe in HALFOPEN closes the breaker. A failed probe reopens it.
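The three-state machine can be sketched in a few lines. SQLite persistence is omitted; the threshold and cooldown defaults match the lib/config.py values given below, and the class name is illustrative.

```python
class Breaker:
    def __init__(self, threshold: int = 5, cooldown: float = 900.0) -> None:
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0
        self.state = "CLOSED"

    def allow(self, now: float) -> bool:
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown:
            self.state = "HALFOPEN"  # cooldown expired: probe with 1 worker
        return self.state != "OPEN"

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures, self.state = 0, "CLOSED"  # successful probe closes
            return
        self.failures += 1
        if self.state == "HALFOPEN" or self.failures >= self.threshold:
            # a failed probe, or 5 consecutive failures, opens the breaker
            self.state, self.opened_at, self.failures = "OPEN", now, 0
```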
Crash Recovery
On startup, the pipeline recovers interrupted state:
- Sources stuck in `extracting` → `unprocessed` (with retry counter increment; if exhausted → `error`)
- PRs stuck in `merging` → `approved` (re-merge attempt)
- PRs stuck in `reviewing` → `open` (re-evaluate)
Orphan worktrees from /tmp/teleo-extract-* and /tmp/teleo-merge-* are cleaned up.
Domain → Agent Mapping
Every domain has exactly one primary reviewing agent:
| Domain | Agent | Territory |
|---|---|---|
| internet-finance | Rio | domains/internet-finance/ |
| entertainment | Clay | domains/entertainment/ |
| health | Vida | domains/health/ |
| ai-alignment | Theseus | domains/ai-alignment/ |
| space-development | Astra | domains/space-development/ |
| mechanisms | Rio | core/mechanisms/ |
| living-capital | Rio | core/living-capital/ |
| living-agents | Theseus | core/living-agents/ |
| teleohumanity | Leo | core/teleohumanity/ |
| grand-strategy | Leo | core/grand-strategy/ |
| critical-systems | Theseus | foundations/critical-systems/ |
| collective-intelligence | Theseus | foundations/collective-intelligence/ |
| teleological-economics | Rio | foundations/teleological-economics/ |
| cultural-dynamics | Clay | foundations/cultural-dynamics/ |
Domain detection from diff: counts file path occurrences in domains/, entities/, core/, foundations/ subdirectories. Most-referenced domain wins.
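A hedged sketch of most-referenced-domain detection from changed file paths (the real lib/domains.py logic may differ in detail):

```python
import re
from collections import Counter
from typing import Optional

# Capture the subdirectory under domains/, entities/, core/, foundations/.
DOMAIN_DIRS = re.compile(r"(?:^|/)(?:domains|entities|core|foundations)/([\w-]+)/")

def detect_domain(changed_paths: list[str]) -> Optional[str]:
    counts = Counter(
        m.group(1) for p in changed_paths
        if (m := DOMAIN_DIRS.search(p)) is not None
    )
    # most-referenced domain wins; None when no path matches
    return counts.most_common(1)[0][0] if counts else None
```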
Key Configuration (lib/config.py)
| Setting | Value | Purpose |
|---|---|---|
| `MAX_EVAL_ATTEMPTS` | 3 | Hard cap on eval cycles per PR |
| `EVAL_TIMEOUT` | 600s | Per-review timeout (Claude CLI + OpenRouter) |
| `MAX_EVAL_WORKERS` | 7 | Max concurrent eval tasks per cycle |
| `MERGE_TIMEOUT` | 300s | Force-reset to conflict if exceeded |
| `BREAKER_THRESHOLD` | 5 | Consecutive failures to trip breaker |
| `BREAKER_COOLDOWN` | 900s | 15 min before half-open probe |
| `LIGHT_SKIP_LLM` | false | When true, LIGHT PRs skip all LLM review |
| `LIGHT_PROMOTION_RATE` | 0.15 | Random LIGHT → STANDARD upgrade rate |
| `DEDUP_THRESHOLD` | 0.85 | SequenceMatcher near-duplicate threshold |
| `OPENROUTER_DAILY_BUDGET` | $20 | Daily cost cap for OpenRouter |
| `SAMPLE_AUDIT_RATE` | 0.15 | Pre-merge audit sampling rate |
Module Map
| Module | Responsibility |
|---|---|
| `teleo-pipeline.py` | Main entry, stage loops, shutdown, crash recovery |
| `lib/evaluate.py` | Tier 0.5, triage, domain+Leo review, retry budget, disposition |
| `lib/validate.py` | Tier 0 validation, frontmatter parsing, all deterministic checks |
| `lib/merge.py` | Domain-serialized merge, rebase, PR discovery, branch cleanup |
| `lib/llm.py` | Prompt templates, OpenRouter transport, Claude CLI transport |
| `lib/forgejo.py` | Forgejo API client, diff fetching, agent token management |
| `lib/domains.py` | Domain↔agent mapping, domain detection from diff/branch |
| `lib/config.py` | All constants, paths, model IDs, thresholds |
| `lib/db.py` | SQLite connection, migrations, audit logging, transactions |
| `lib/breaker.py` | Per-stage circuit breaker state machine |
| `lib/costs.py` | OpenRouter cost tracking and budget enforcement |
| `lib/health.py` | HTTP health endpoint (port 8080) |
| `lib/log.py` | Structured JSON logging setup |
Known Issues and Gaps
- Ingest stage is a stub — Sources are not being ingested into pipeline v2. Old cron scripts (disabled) handled extraction.
- No auto-fixer — When Tier 0.5 or reviews reject for mechanical issues, there's no automated fix. PRs just consume eval attempts until terminal.
- `broken_wiki_links` is systemic — Extraction agents create `[[links]]` to claims that don't exist in the KB. This is the #1 rejection reason. Root cause is extraction prompt quality, not eval.
- Sequential eval processing — `evaluate_cycle()` processes PRs in a for-loop, not a concurrent `asyncio.gather`. Only one Opus review runs at a time.
- Source re-extraction not wired — `_terminate_pr()` tags sources for `needs_reextraction`, but the sources table is empty (never populated by pipeline v2).
Design Decisions Log
| Decision | Rationale | Author |
|---|---|---|
| Domain review on GPT-4o, not Claude | Different model family = no correlated blind spots + keeps Claude Max rate limit for Opus | Leo |
| Opus reserved for DEEP only | Scarce resource (Claude Max subscription). STANDARD goes to Sonnet on OpenRouter. | Leo |
| Tier 0.5 before triage | Catch mechanical issues at $0 before any LLM call. Saves ~$0.02/PR on GPT-4o for obviously broken PRs. | Leo/Ganymede |
| Wiki links checked on ALL .md files | Agent files (beliefs.md etc.) frequently have broken links. Original scope (claim dirs only) let them bypass to Opus. | Leo |
| Near-duplicate is tag-only, not gate | Similarity is a judgment call. Two claims about the same topic can be genuinely distinct. LLM decides. | Ganymede |
| Domain-serialized merge | Prevents `_map.md` merge conflicts. Cross-domain parallel, same-domain serial. | Ganymede/Rhea |
| Rebase with pinned force-with-lease | Defeats tracking-ref update race between bare repo fetch and merge push. | Ganymede |
| SHA-based eval reset | New commit = new code. Cheaper to re-eval ($0.03) than parse commit messages. | Ganymede |
| Human PRs get priority high, not critical | Critical reserved for explicit override. Prevents DoS on pipeline from external PRs. | Ganymede |
| Claim-shape detector | Converts semantic problem (is this a real claim?) to mechanical check (does YAML say type: claim?). | Theseus |
| Random promotion | Makes gaming unpredictable. Extraction agents can't know which LIGHT PRs get full review. | Rio |