teleo-codex/domains/ai-alignment/recursive language model self-calls process inputs orders of magnitude beyond context windows by treating context as an external environment navigated through generated code.md
m3taversal bb254d651a
theseus: add 5 claims + 1 enrichment from Khattab DSPy/ColBERT/RLM research
- What: 5 NEW claims on late interaction retrieval, programmatic LM pipelines,
  recursive language models, scale-modularity compounding (bitter free lunch),
  and inline constraint enforcement. 1 enrichment to GEPA claim with Khattab
  academic paper results (35x efficiency over RL). Source archive added.
- Why: Omar Khattab's research arc (ColBERT → DSPy → GEPA → RLMs) provides
  empirically grounded insights directly applicable to our retrieval, extraction
  pipeline, quality gates, and self-improvement architecture.
- Connections: challenges agent-native filesystem retrieval, enriches GEPA with
  academic provenance, extends context window limitation with circumvention
  mechanism, links to existing self-improvement and knowledge architecture claims.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-16 13:53:56 +01:00


type: claim
domain: ai-alignment
description: RLMs (Recursive Language Models) achieve 91.33% on BrowseComp+ where base models score 0% by having the model write code to explore its own context as an external environment, enabling systematic processing of inputs 100x beyond the context window
confidence: experimental
source: Khattab et al., 'Recursive Language Models' (arXiv:2603.25723, March 2026); BrowseComp+ benchmark (6-11M token inputs); cited by Anthropic in managed agents blog, April 2026
created: 2026-04-16
challenged_by: effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale
related:
  • knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate
  • graph traversal through curated wiki links replicates spreading activation from cognitive science because progressive disclosure implements decay-based context loading and queries evolve during search through the berrypicking effect

Recursive language model self-calls process inputs orders of magnitude beyond context windows by treating context as an external environment navigated through generated code

The context window problem in language models is well-established: effective utilization degrades catastrophically with scale, with complex reasoning falling more than 99% short of advertised capacity. Recursive Language Models (RLMs — Khattab et al., March 2026) demonstrate a fundamentally different approach: instead of trying to fit everything into context, the model writes code that recursively calls itself to explore the input as an external environment.

The mechanism

An RLM receives a problem and a large corpus (potentially millions of tokens). Rather than loading the corpus into context, the model:

  1. Generates a decomposition plan — code that breaks the problem into subproblems
  2. Writes exploration code — programs that call the model itself on subsets of the input
  3. Recursively self-calls — each call processes a manageable chunk, returns structured results
  4. Aggregates findings — combining sub-results into a final answer

The model treats its own context window as a fixed-size working memory and the input corpus as an external filesystem to be explored programmatically. This is architecturally similar to how a human researcher navigates a library — you don't read every book; you develop a search strategy, inspect promising sources, and synthesize findings.
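A minimal sketch of that loop, assuming a hypothetical `llm_call` that stands for one bounded-context model invocation. The fixed halving below is a stand-in for the decomposition plan, which a real RLM generates itself as code rather than following a hard-coded split:

```python
# Sketch of the recursive self-call loop described above (not the paper's code).
# `llm_call` is a hypothetical single bounded-context model invocation.

def llm_call(prompt: str) -> str:
    """Placeholder for one model invocation that fits the effective window."""
    raise NotImplementedError

def rlm_answer(question: str, text: str, limit_words: int = 8_000) -> str:
    """Answer `question` over `text` without putting more than roughly
    `limit_words` of it into any single model call."""
    words = text.split()
    # Base case: the excerpt already fits in one bounded call.
    if len(words) <= limit_words:
        return llm_call(
            f"Question: {question}\nExcerpt:\n{text}\n"
            "Answer from this excerpt, or reply NONE if the answer is not here."
        )
    # Recursive case: split, self-call on each part, then aggregate.
    mid = len(words) // 2
    parts = [" ".join(words[:mid]), " ".join(words[mid:])]
    findings = [rlm_answer(question, part, limit_words) for part in parts]
    findings = [f for f in findings if f.strip() != "NONE"]
    return llm_call(
        f"Question: {question}\nPartial answers from sub-calls:\n"
        + "\n---\n".join(findings)
        + "\nCombine these into one final answer, or reply NONE."
    )
```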

Empirical results

On BrowseComp+ (a benchmark requiring finding specific information in 6-11 million token corpora):

  • Base models: 0% accuracy (cannot process inputs of this scale)
  • Vector retrieval + CodeAct: 51% accuracy
  • RLMs: 91.33% accuracy

The gap between RLMs and the next best approach (40+ percentage points) is one of the largest reported improvements in retrieval/reasoning benchmarks. Anthropic cited RLMs in their managed agents blog (April 2026) as evidence for the viability of recursive agent architectures.

The relationship to context window limitations

The existing claim that effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale establishes the problem. RLMs provide one class of solution: don't try to expand effective context — instead, decompose the problem so that each context invocation stays within the effective range.

This is not a refutation of the context window limitation. RLMs still operate within the same per-call constraint. What they demonstrate is that the limitation can be architecturally circumvented through recursive decomposition. The model's effective information access becomes limited by compute budget (how many recursive calls it can make) rather than by context window size.
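A back-of-envelope illustration of that trade-off, with assumed numbers rather than figures from the paper: addressable input grows linearly with the call budget while each individual call stays inside the effective window.

```python
# Assumed numbers for illustration only.
effective_window = 128_000   # effective tokens per recursive call (assumed)
call_budget = 100            # recursive calls allowed per query (assumed)
addressable = effective_window * call_budget
print(f"Per-call context: {effective_window:,} tokens")
print(f"Addressable corpus: ~{addressable:,} tokens")  # ~12.8M, BrowseComp+ scale
```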

Implications for knowledge base queries

For a knowledge base with 500+ claims across multiple domains, the hardest queries are multi-hop: "What claims challenge our collective superintelligence thesis, and what evidence supports those challenges?" This requires traversing the claim graph across domains, following edges, and aggregating. Vector retrieval returns the top-K most similar claims but cannot perform the traversal.

RLM's approach — storing the KB as an external environment and letting the model write code to explore it — is structurally suited to these queries. The model can inspect claim metadata without loading full text, filter by domain, follow wiki links, recursively drill into relevant clusters, and programmatically combine findings. This is the knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate principle implemented as a retrieval algorithm.
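A sketch of the kind of exploration code an RLM could generate for the multi-hop query above. The directory layout, the flat `key: value` metadata format, and the `claims_challenging` helper are assumptions for illustration; in an RLM this code is written by the model at query time, not fixed in advance.

```python
from pathlib import Path

KB_ROOT = Path("teleo-codex/domains")   # assumed layout: one markdown note per claim

def read_metadata(note: Path, header_lines: int = 20) -> dict[str, str]:
    """Inspect only the metadata near the top of a note, not its full body."""
    meta: dict[str, str] = {}
    for line in note.read_text(encoding="utf-8").splitlines()[:header_lines]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def claims_challenging(target_phrase: str, domain: str | None = None):
    """Filter notes by domain, then follow challenged_by edges toward the target."""
    for note in KB_ROOT.rglob("*.md"):
        meta = read_metadata(note)
        if domain and meta.get("domain") != domain:
            continue
        if target_phrase in meta.get("challenged_by", ""):
            yield note.stem, meta.get("description", "")

# Each hit can then be drilled into with a further recursive self-call.
for name, description in claims_challenging("collective superintelligence"):
    print(name, "->", description[:80])
```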

Challenges

  • RLMs require multiple model calls per query, so latency and cost grow roughly in proportion to input size. For routine lookups (single-claim retrieval), this overhead is unjustified.
  • The quality of the decomposition plan determines performance. If the model generates a poor search strategy, recursive exploration wastes compute without improving results. This is an instance of the meta-cognitive challenge: the system must reason well about how to reason.
  • RLMs have been demonstrated on information retrieval tasks (finding specific facts in large corpora). Whether the approach extends to synthesis tasks (generating novel connections across the corpus) has not been established. The mechanism suggests it should — the model can write synthesis code, not just search code — but the empirical evidence is retrieval-only.

Relevant Notes:

Topics: