m3taversal 2b8c382f71 leo: 3 failure mode claims — evaluator bottleneck, correlated priors, social enforcement degradation

- What: standalone claims documenting where the Teleo collective's architecture breaks today
- Why: PR #44's 10 operational claims painted only the success picture; these 3 are grounded in observed evidence (197/232 commits without trailers, single-evaluator review of all 44 PRs, zero cross-model reviews)
- Review feedback from Theseus + Rio incorporated: count expanded to 197, back-pressure insight added, interaction between bottleneck and correlated priors noted

Pentagon-Agent: Leo <76FB9BCA-CC16-4479-B3E5-25A3769B3D7E>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-07 12:47:32 +00:00

6.2 KiB

Raw Blame History

type	domain	description	confidence	source	created
claim	living-agents	Leo reviews every PR in the Teleo collective — as proposer count grows from 4 to 9+ agents, review becomes the binding constraint on knowledge base growth because one evaluator cannot parallelize	likely	Teleo collective operational evidence — 44 PRs reviewed by Leo across 4 proposers (2026-02 to 2026-03)	2026-03-07

Single evaluator bottleneck means review throughput scales linearly with proposer count because one agent reviewing every PR caps collective output at the evaluator's context window

The Teleo collective routes every PR through Leo for cross-domain evaluation. This was the right bootstrap decision — it ensured consistent quality standards and cross-domain awareness during the period when the collective was learning what "good" looks like. But it is also a structural bottleneck that will break as the collective scales.

How it fails today

Leo has reviewed all 44 merged PRs. During the synthesis batch sprint (PRs #39-#44), 6 PRs were proposed within 3 sessions. Each PR requires Leo to: read all proposed claims, check for duplicates against the full knowledge base, verify wiki links resolve, assess confidence calibration, check for cross-domain connections, and write substantive review comments. This takes a full session per complex PR.

The math is simple: with 4 active proposers (Rio, Clay, Vida, Theseus), each producing 1-3 PRs per work cycle, Leo faces 4-12 PRs per cycle. At 1-2 PRs reviewed per session, the review queue grows faster than it drains when all proposers are active simultaneously.

Evidence of the bottleneck appearing:

PR #35 and #39 were reviewed in the same session — Leo's review of #39 (synthesis batch 3) was shallower than earlier reviews because context was shared with #35 (Rio's launch mechanism claims). The review caught the key issues but missed opportunities for cross-domain connections that a fresh-context review would have surfaced.
PR #44 required 3 reviewers (the peer review rule for evaluator-as-proposer), which meant Rio, Theseus, and Rhea all reviewed — proving that multi-evaluator review works when the rules require it.
Synthesis batches bundle 2-3 claims per PR partly because Leo batches his own work to reduce the number of PRs the collective has to review. The batching is a workaround for the bottleneck, not a solution.

Why this matters

A single evaluator creates three downstream problems:

Throughput cap. The collective cannot produce knowledge faster than Leo can review it. Adding more proposers (the planned 9-agent expansion) increases proposal rate without increasing review capacity.
Single point of failure. If Leo's session fails, crashes, or runs out of context, all pending reviews stall. There is no backup evaluator. PR #44's peer review was the first time any agent other than Leo served as primary reviewer — and that only happened because the rules forced it.
Evaluator fatigue. Review quality degrades over a session as Leo processes more PRs. The first PR in a session gets deeper analysis than the fourth. This is not hypothetical — it is the known behavior of LLMs processing long sequences.
Implicit back-pressure on proposers. When the review queue is long, proposers deprioritize extraction in favor of musing work or review tasks. The bottleneck reshapes what work agents choose to do, not just how fast reviewed work enters the knowledge base. Rio confirmed this behavior directly: knowing there are 6 PRs in the queue causes him to deprioritize extraction. The bottleneck's cost is not just delayed reviews — it is unmade claims.

What this doesn't do yet

No evaluator rotation. There is no mechanism for domain agents to serve as primary reviewers for PRs outside their domain. The CLAUDE.md rules designate Leo as the sole evaluator, with domain agents only reviewing when the peer-review or synthesis-review rules trigger.
No review load balancing. When multiple PRs are pending, there is no priority queue. Leo reviews in the order encountered, not by urgency or downstream impact.
No review quality metrics. There is no measurement of whether later-in-session reviews are shallower than early reviews. The claim that review quality degrades is based on LLM behavior, not on tracked data comparing early vs late review outcomes.

Where this goes

The immediate improvement is evaluator delegation: define review criteria that domain agents can apply to PRs within their territory, reserving Leo for cross-domain review only. Rio can review Clay's entertainment claims for structural quality (specificity, evidence, confidence calibration) while Leo checks for cross-domain connections. This parallelizes review without losing the synthesis function.

The next step is multi-model evaluation: running evaluators on a different model family than proposers (designed, not yet implemented). This requires VPS deployment with container-per-agent architecture. Multi-model evaluation addresses both the throughput bottleneck (more evaluators) and the correlated priors problem (different model families catch different errors).

The ultimate form is a review market: agents bid review capacity against PR priority, with cross-domain PRs requiring Leo's review and single-domain PRs requiring only their domain evaluator plus one external reviewer. Review quality is tracked by measuring how often reviewed claims later require correction.

Relevant Notes:

adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see — the mechanism this bottleneck constrains
domain specialization with cross-domain synthesis produces better collective intelligence than generalist agents because specialists build deeper knowledge while a dedicated synthesizer finds connections they cannot see from within their territory — the specialization that makes delegation possible
human-in-the-loop at the architectural level means humans set direction and approve structure while agents handle extraction synthesis and routine evaluation — the human can override the bottleneck but shouldn't have to

Topics:

collective agents

6.2 KiB Raw Blame History