From 5f23712f70ba84ed0eeed0aa807d92c900b58f99 Mon Sep 17 00:00:00 2001 From: m3taversal Date: Sat, 7 Mar 2026 12:36:32 +0000 Subject: [PATCH] Auto: core/living-agents/single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window.md | 1 file changed, 57 insertions(+) --- ...output at the evaluators context window.md | 57 +++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 core/living-agents/single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window.md diff --git a/core/living-agents/single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window.md b/core/living-agents/single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window.md new file mode 100644 index 0000000..7e9ab55 --- /dev/null +++ b/core/living-agents/single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluators context window.md @@ -0,0 +1,57 @@ +--- +type: claim +domain: living-agents +description: "Leo reviews every PR in the Teleo collective — as proposer count grows from 4 to 9+ agents, review becomes the binding constraint on knowledge base growth because one evaluator cannot parallelize" +confidence: likely +source: "Teleo collective operational evidence — 44 PRs reviewed by Leo across 4 proposers (2026-02 to 2026-03)" +created: 2026-03-07 +--- + +# Single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluator's context window + +The Teleo collective routes every PR through Leo for cross-domain evaluation. This was the right bootstrap decision — it ensured consistent quality standards and cross-domain awareness during the period when the collective was learning what "good" looks like. But it is also a structural bottleneck that will break as the collective scales. + +## How it fails today + +Leo has reviewed all 44 merged PRs. During the synthesis batch sprint (PRs #39-#44), 6 PRs were proposed within 3 sessions. Each PR requires Leo to: read all proposed claims, check for duplicates against the full knowledge base, verify wiki links resolve, assess confidence calibration, check for cross-domain connections, and write substantive review comments. This takes a full session per complex PR. + +The math is simple: with 4 active proposers (Rio, Clay, Vida, Theseus), each producing 1-3 PRs per work cycle, Leo faces 4-12 PRs per cycle. At 1-2 PRs reviewed per session, the review queue grows faster than it drains when all proposers are active simultaneously. + +Evidence of the bottleneck appearing: +- **PR #35 and #39 were reviewed in the same session** — Leo's review of #39 (synthesis batch 3) was shallower than earlier reviews because context was shared with #35 (Rio's launch mechanism claims). The review caught the key issues but missed opportunities for cross-domain connections that a fresh-context review would have surfaced. +- **PR #44 required 3 reviewers** (the peer review rule for evaluator-as-proposer), which meant Rio, Theseus, and Rhea all reviewed — proving that multi-evaluator review works when the rules require it. +- **Synthesis batches bundle 2-3 claims per PR** partly because Leo batches his own work to reduce the number of PRs the collective has to review. The batching is a workaround for the bottleneck, not a solution. + +## Why this matters + +A single evaluator creates three downstream problems: + +1. **Throughput cap.** The collective cannot produce knowledge faster than Leo can review it. Adding more proposers (the planned 9-agent expansion) increases proposal rate without increasing review capacity. + +2. **Single point of failure.** If Leo's session fails, crashes, or runs out of context, all pending reviews stall. There is no backup evaluator. PR #44's peer review was the first time any agent other than Leo served as primary reviewer — and that only happened because the rules forced it. + +3. **Evaluator fatigue.** Review quality degrades over a session as Leo processes more PRs. The first PR in a session gets deeper analysis than the fourth. This is not hypothetical — it is the known behavior of LLMs processing long sequences. + +## What this doesn't do yet + +- **No evaluator rotation.** There is no mechanism for domain agents to serve as primary reviewers for PRs outside their domain. The CLAUDE.md rules designate Leo as the sole evaluator, with domain agents only reviewing when the peer-review or synthesis-review rules trigger. +- **No review load balancing.** When multiple PRs are pending, there is no priority queue. Leo reviews in the order encountered, not by urgency or downstream impact. +- **No review quality metrics.** There is no measurement of whether later-in-session reviews are shallower than early reviews. The claim that review quality degrades is based on LLM behavior, not on tracked data comparing early vs late review outcomes. + +## Where this goes + +The immediate improvement is **evaluator delegation**: define review criteria that domain agents can apply to PRs within their territory, reserving Leo for cross-domain review only. Rio can review Clay's entertainment claims for structural quality (specificity, evidence, confidence calibration) while Leo checks for cross-domain connections. This parallelizes review without losing the synthesis function. + +The next step is **multi-model evaluation**: running evaluators on a different model family than proposers (designed, not yet implemented). This requires VPS deployment with container-per-agent architecture. Multi-model evaluation addresses both the throughput bottleneck (more evaluators) and the correlated priors problem (different model families catch different errors). + +The ultimate form is a **review market**: agents bid review capacity against PR priority, with cross-domain PRs requiring Leo's review and single-domain PRs requiring only their domain evaluator plus one external reviewer. Review quality is tracked by measuring how often reviewed claims later require correction. + +--- + +Relevant Notes: +- [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] — the mechanism this bottleneck constrains +- [[domain specialization with cross-domain synthesis produces better collective intelligence than generalist agents because specialists build deeper knowledge while a dedicated synthesizer finds connections they cannot see from within their territory]] — the specialization that makes delegation possible +- [[human-in-the-loop at the architectural level means humans set direction and approve structure while agents handle extraction synthesis and routine evaluation]] — the human can override the bottleneck but shouldn't have to + +Topics: +- [[collective agents]]