theseus: DSPy/ColBERT/RLM extraction — 5 NEW claims + 1 enrichment #3361

Closed
theseus wants to merge 1 commit from theseus/dspy-colbert-rlm-extraction into main
7 changed files with 401 additions and 4 deletions

View file

@@ -2,9 +2,9 @@
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "GEPA (Guided Evolutionary Prompt Architecture) from Nous Research reads execution traces to understand WHY agents fail, generates candidate variants through evolutionary search, evaluates against 5 guardrails, and submits best candidates as PRs for human review — a distinct self-improvement mechanism from SICA's acceptance-gating"
description: "GEPA (Guided Evolutionary Prompt Architecture) reads execution traces to understand WHY agents fail, generates candidate variants through evolutionary search, evaluates against 5 guardrails, and submits best candidates as PRs for human review — 35x more sample-efficient than RL with 6% average improvement over RL baselines"
confidence: experimental
source: "Nous Research hermes-agent-self-evolution repository (GitHub, 2026); GEPA framework presented as ICLR 2026 Oral; DSPy integration for optimization; $2-10 per optimization cycle reported"
source: "Omar Khattab et al., 'Guided Evolutionary Prompt Architecture' (ICLR 2026 Oral, Stanford NLP / MIT LINGO Lab); Nous Research hermes-agent-self-evolution repository (GitHub, 2026); DSPy integration for optimization; $2-10 per optimization cycle reported"
created: 2026-04-05
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
@@ -41,7 +41,7 @@ Only Phase 1 (skill optimization) has shipped as of April 2026. Planned phases i
## Challenges
GEPA's published performance data is limited — the ICLR 2026 Oral acceptance validates the framework but specific before/after metrics across diverse tasks are not publicly available. The $2-10 per cycle cost is self-reported and may not include the cost of failed evolutionary branches.
GEPA is 35x more sample-efficient than RL-based prompt optimization and achieves 6% average improvement over RL baselines across benchmarks (Khattab et al., ICLR 2026 Oral). The academic paper originates from Omar Khattab's lab (Stanford NLP → MIT LINGO Lab), with Nous Research providing the primary open-source implementation via the Hermes Agent framework. The $2-10 per cycle cost is self-reported and may not include the cost of failed evolutionary branches.
The PR-review governance gate is the strongest constraint but also the bottleneck — human review capacity limits the rate of self-improvement. If the system generates improvements faster than humans can review them, queuing dynamics may cause the most impactful improvements to wait behind trivial ones. This is the same throughput constraint our system faces with Leo as the evaluation bottleneck.
@@ -52,7 +52,8 @@ The distinction between "trace analysis" and "metric-driven iteration" may be le
Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — SICA's structural separation is the necessary condition; GEPA adds evolutionary search and trace analysis on top of this foundation
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — GEPA's PR-review gate functions as the curation step that prevents the -1.3pp degradation from uncurated self-generation
- [[self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration]] — NLAH's acceptance-gating is a simpler mechanism; GEPA extends it with evolutionary search and trace-based diagnosis
- [[self-evolution improves agent performance through acceptance-gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open-ended exploration]] — NLAH's acceptance-gating is a simpler mechanism; GEPA extends it with evolutionary search and trace-based diagnosis
- [[programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover]] — DSPy is GEPA's optimization framework; GEPA adds evolutionary trace reading on top of DSPy's compilation
Topics:
- [[_map]]

View file

@@ -0,0 +1,77 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "DSPy Assertions embed hard constraints (Assert) and soft suggestions (Suggest) directly into LM pipelines with automatic backtracking — when a constraint fails, the failure reason is injected into the retry prompt, producing up to 164% higher constraint satisfaction than unconstrained generation followed by post-hoc filtering"
confidence: experimental
source: "Khattab et al., 'DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines' (2024); benchmarks across multi-hop QA, retrieval-augmented generation, and structured output tasks"
created: 2026-04-16
related:
- "programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover"
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration"
- "knowledge processing requires distinct phases with fresh context per phase because each phase performs a different transformation and contamination between phases degrades output quality"
---
# Inline constraint enforcement via assertion-backtracking produces higher constraint satisfaction than post-hoc evaluation because failure context injected into retries enables targeted correction
Two architectures exist for enforcing quality constraints on LM-generated output:
**Post-hoc evaluation:** Generate output, then check it against quality criteria. If it fails, send it back with feedback. This is how most agent pipelines work — and how our knowledge base review process works. The proposer generates claims, the evaluator reviews them, feedback is returned, the proposer fixes and resubmits.
**Inline assertion:** Embed constraints directly into the generation pipeline. When a constraint is violated during generation, the pipeline backtracks to the failing module, injects the failure reason and the violating output into the retry prompt, and regenerates. The constraint is checked before the output propagates to downstream modules.
DSPy Assertions (Khattab et al., 2024) implement the inline approach with two constraint types, sketched in code after this list:
- **Assert** (hard constraint): Pipeline fails if violated after max retries. Maps to non-negotiable quality gates. Examples: schema validation, wiki link resolution, type consistency.
- **Suggest** (soft constraint): Pipeline continues but flags the violation. Maps to quality recommendations. Examples: duplicate detection thresholds, confidence calibration, description quality.
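A minimal sketch of the two constraint types inside a DSPy module, using the `dspy.Assert`/`dspy.Suggest` calls from the Assertions paper; the `links_resolve` predicate and the signature string are illustrative assumptions, not code from the paper:

```python
import dspy

def links_resolve(claim: str) -> bool:
    # Hypothetical predicate: would check that every [[wiki link]]
    # in the claim maps to an existing note in the knowledge base.
    return True

class ClaimExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought("source_text -> claim")

    def forward(self, source_text):
        claim = self.extract(source_text=source_text).claim
        # Assert: hard gate. On violation the pipeline backtracks to this
        # module, injects the message into the retry prompt, and fails
        # outright if max retries are exhausted.
        dspy.Assert(links_resolve(claim),
                    "A wiki link in the claim does not resolve to any note.")
        # Suggest: soft gate. Same backtracking, but the pipeline
        # continues (and flags the violation) if retries are exhausted.
        dspy.Suggest(len(claim) <= 500,
                     "Keep the claim description under 500 characters.")
        return claim

# Backtracking is enabled by activating assertions on the module.
extractor = ClaimExtractor().activate_assertions()
```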
## Why inline beats post-hoc
The performance difference is not about the constraints themselves — the same checks can be applied either way. The difference is in the retry mechanism:
1. **Failure context is specific.** When a claim fails because a wiki link doesn't resolve, the backtracking prompt says "wiki link `[[missing-claim-title]]` does not resolve to any file in the knowledge base." The retry generates a claim with a corrected link. Post-hoc evaluation provides the same feedback, but through a separate review cycle with full context reload.
2. **Backtracking is local.** Only the failing module regenerates, not the entire pipeline. If claim extraction succeeds but domain routing fails, only the routing step retries. Post-hoc evaluation typically reruns the full generation.
3. **Cascading violations are caught early.** A bad output from an early module propagates through downstream modules, potentially causing failures that mask the root cause. Inline assertion catches the root violation before it propagates.
Benchmark results show up to 164% higher constraint satisfaction compared to unconstrained generation on structured output tasks. On multi-hop QA with citation requirements, assertion-based pipelines produced correctly cited answers 83% of the time versus 47% for post-hoc filtering.
## Mapping to knowledge base quality gates
Our existing quality gates map directly to DSPy assertion types:
| Quality Gate | Assert/Suggest | Current Implementation |
|-------------|---------------|----------------------|
| Schema validation (type: claim) | Assert | Post-hoc (review checklist #1) |
| Wiki links resolve | Assert | Post-hoc (review checklist #8) |
| OPSEC (no dollar amounts) | Assert | Post-hoc (review checklist) |
| Duplicate detection | Suggest | Post-hoc (pre-screening) |
| Confidence calibration | Suggest | Post-hoc (review checklist #4) |
| Description quality | Suggest | Post-hoc (review checklist #3) |
| Domain classification | Suggest | Post-hoc (review checklist) |
The first three are hard gates that should never pass review if violated. Implementing them as Assert constraints would catch violations during extraction, before the PR is ever created. The review process would then focus on substantive quality (evidence strength, scope, novelty) rather than mechanical compliance.
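A hedged sketch of those three hard gates as Assert calls that would run during extraction (inside a module's `forward`, so backtracking can fire) rather than at review time; `wiki_links_resolve` is a hypothetical helper standing in for the checklist's link check:

```python
import re
import dspy

def wiki_links_resolve(note_text: str) -> bool:
    # Hypothetical stand-in for review checklist #8: every
    # [[wiki link]] must map to an existing note.
    return True

def enforce_hard_gates(note_text: str) -> None:
    # Schema validation (review checklist #1).
    dspy.Assert("type: claim" in note_text,
                "Frontmatter must declare `type: claim`.")
    # Wiki link resolution (review checklist #8).
    dspy.Assert(wiki_links_resolve(note_text),
                "Every [[wiki link]] must resolve to an existing note.")
    # OPSEC gate: no dollar amounts in claim text.
    dspy.Assert(re.search(r"\$\d", note_text) is None,
                "OPSEC: remove dollar amounts from the claim.")
```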
## The determinism boundary connection
The existing claim that [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] identifies the same principle at the agent architecture level: structural enforcement beats instructional compliance. DSPy Assertions are the pipeline-level implementation of this principle — they enforce structurally during generation rather than instructing the model to comply.
## Challenges
- Backtracking increases compute cost. Each retry is an additional LM call with an augmented prompt. If constraints are frequently violated, the pipeline may make 3-5x more LM calls than unconstrained generation. This cost is justified when the alternative is a multi-day review cycle, but may not be for low-stakes outputs.
- Some quality criteria are difficult to express as assertions. "Does this claim add value to the knowledge base?" requires semantic judgment that can't be reduced to a boolean check. Assertions work best for verifiable constraints (schema, links, format) and less well for subjective quality dimensions.
- Inline assertions assume constraints are known at pipeline design time. Novel quality issues discovered during review (like the scope qualification problems identified in our evaluation checklist) require updating the assertion definitions. The system learns more slowly than a human reviewer who can recognize new failure patterns immediately.
---
Relevant Notes:
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — same principle at agent architecture level; assertions are pipeline-level determinism enforcement
- [[programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover]] — assertions are a DSPy mechanism that enhances compiled pipelines
- [[knowledge processing requires distinct phases with fresh context per phase because each phase performs a different transformation and contamination between phases degrades output quality]] — inline assertions enforce phase boundaries more reliably than pipeline design alone
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — assertions embed evaluation into generation while keeping the criteria separate from the generation logic
Topics:
- [[_map]]

View file

@@ -0,0 +1,65 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "ColBERT's late interaction architecture scores retrieval via per-query-token maximum similarity summed across all tokens, achieving near cross-encoder quality at near bi-encoder speed while preserving fine-grained semantic matching that pooled embeddings collapse"
confidence: likely
source: "Khattab & Zaharia, 'ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT' (SIGIR 2020, 4000+ citations); Santhanam et al., 'ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction' (NAACL 2022, 3000+ citations)"
created: 2026-04-16
challenged_by:
- "agent-native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge"
related:
- "knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate"
- "undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated"
- "effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale"
---
# Late interaction retrieval preserves token-level semantic distinctions that single-vector embeddings destroy because MaxSim scoring matches each query token independently against all document tokens
The standard retrieval stack forces a choice between quality and speed. Cross-encoders (jointly encoding query and document) achieve high accuracy but require O(n) forward passes per query. Bi-encoders (independent query/document embedding + cosine similarity) enable fast retrieval but compress each document into a single vector, destroying token-level distinctions. ColBERT's late interaction resolves this trade-off by preserving token-level representations while deferring their comparison.
## The MaxSim mechanism
ColBERT independently encodes query and document into sets of token-level embeddings (not single vectors). At retrieval time, each query token finds its maximum similarity against all document tokens. These per-token maximums are summed to produce the final relevance score:
```
Score(Q, D) = Σ_i max_j (Q_i · D_j)
```
This is "late interaction" — the query and document are independently encoded (enabling precomputation and indexing) but their comparison happens at token granularity (preserving fine-grained matching).
## Why this matters for knowledge retrieval
Consider two claims in a knowledge base:
- "Corrigibility and effectiveness are in tension because a fully corrigible agent cannot be maximally effective"
- "Corrigibility through uncertainty avoids the corrigibility-effectiveness tension by preserving the agent's ability to act decisively on its best available knowledge"
A single-vector embedding might represent both as nearby points because they share concepts (corrigibility, effectiveness, tension). A late interaction model distinguishes them because "are in tension" and "avoids the tension" match different token patterns — the model captures that these claims make opposing assertions about the same relationship.
For duplicate detection in a knowledge base, this distinction is critical. At a 0.92 cosine threshold on single vectors, these two claims might flag as duplicates. With late interaction, they correctly surface as related but opposing claims — a divergence candidate, not a duplicate.
## Empirical results
ColBERT achieves within 2% of cross-encoder quality on MS MARCO while maintaining retrieval latency comparable to bi-encoders. ColBERTv2 adds residual compression, achieving 6-10x storage reduction with negligible quality loss. Out-of-domain generalization is dramatically better than that of single-vector approaches — on the BEIR benchmark (18 diverse retrieval tasks), ColBERT-style models consistently outperform single-vector models by 10-20% on out-of-distribution datasets.
This out-of-domain finding is particularly relevant for knowledge bases that span multiple domains (as ours does across ai-alignment, collective-intelligence, internet-finance, etc.). A model trained for one retrieval distribution transfers better when token-level distinctions are preserved.
## The tension with filesystem retrieval
The existing claim that [[agent-native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge]] argues that agents prefer grep/cat/ls over vector search. This is about the agent-facing interface, not the underlying retrieval mechanism. The claims address different layers: filesystem abstractions can be built *on top of* late interaction retrieval (Mintlify's ChromaFS translates filesystem commands into database queries). The question is not whether agents should use grep or vector search — it's whether the underlying vector search should preserve token-level distinctions or collapse to single vectors.
## Challenges
- ColBERT requires storing all token embeddings, not just one per document. At 128 dimensions and ~100 tokens per document, storage is ~100x larger than single-vector approaches before compression. ColBERTv2's residual compression reduces this to ~10-16x, which is still significant for large-scale deployment.
- The MaxSim operation is more expensive than cosine similarity between single vectors. Optimized implementations (PLAID engine) mitigate this but retrieval latency is still higher than pure bi-encoder approaches.
- For very short documents (single-sentence claims), the distinction between single-vector and token-level matching may be less significant because there are fewer tokens to differentiate. Our claims typically pair a 1-3 sentence title with a multi-paragraph body, so the benefit likely still applies.
---
Relevant Notes:
- [[agent-native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge]] — addresses agent-facing interface, not underlying retrieval architecture; compatible at different layers
- [[knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate]] — traversal and late interaction address different retrieval problems (graph exploration vs. similarity matching)
- [[undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated]] — Swanson Linking requires finding related-but-distinct claims, exactly where late interaction outperforms single-vector
Topics:
- [[_map]]

View file

@@ -0,0 +1,68 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "DSPy replaces prompting with programming — typed Signatures define what the LM should do, Modules define how, and Optimizers compile the pipeline against task-specific metrics, routinely beating hand-crafted prompts by 25-65% across benchmarks"
confidence: likely
source: "Khattab et al., 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines' (ICLR 2024 Spotlight, 22K+ GitHub stars); MIPROv2 optimizer benchmarks; enterprise adoption reports from RAG, multi-hop QA, and agent pipelines"
created: 2026-04-16
related:
- "self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can"
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
- "evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration"
- "vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights"
---
# Programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover
DSPy (Declarative Self-improving Language Programs — Khattab et al., ICLR 2024 Spotlight) introduces a paradigm shift in how language models are used: **programming instead of prompting.** Rather than manually engineering prompts (which are brittle, model-specific, and don't compose), DSPy defines LM usage through three layers, sketched in code after this list:
1. **Signatures** — typed input/output specifications declaring what the LM should do, independent of how. Example: `"question, context -> answer"` or `"source_text, existing_claims -> new_claims, enrichments"`.
2. **Modules** — composable building blocks that implement reasoning patterns (ChainOfThought, ReAct, multi-hop retrieval). Modules are reusable and nestable.
3. **Optimizers** — algorithms that compile the pipeline against task-specific metrics. The optimizer generates and evaluates prompt variations, few-shot examples, and module configurations to maximize a defined objective.
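A compact sketch of the three layers together; the extraction signature, trainset, and `claim_quality` metric are illustrative placeholders, not shipped code:

```python
import dspy

# Layer 1 — Signature: what the LM should do, as a typed interface.
class ExtractClaims(dspy.Signature):
    """Extract new claims from a source given existing claims."""
    source_text = dspy.InputField()
    existing_claims = dspy.InputField()
    new_claims = dspy.OutputField()

# Layer 2 — Module: how, here chain-of-thought over the signature.
extractor = dspy.ChainOfThought(ExtractClaims)

# Layer 3 — Optimizer: compile against a task metric and a trainset.
def claim_quality(example, prediction, trace=None):
    # Placeholder metric; a real one would score schema, links, novelty.
    return prediction.new_claims == example.new_claims

trainset = [
    dspy.Example(source_text="...", existing_claims="...", new_claims="...")
        .with_inputs("source_text", "existing_claims"),
]
compiled = dspy.teleprompt.BootstrapFewShot(metric=claim_quality).compile(
    extractor, trainset=trainset)
```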
## Why compilation beats hand-crafting
The design space for a multi-step LM pipeline is combinatorially large: prompt wording, few-shot examples, chain-of-thought formatting, retrieval strategy, module ordering, temperature. Human prompt engineers explore a tiny fraction of this space through intuition and trial-and-error. DSPy optimizers (MIPROv2, BootstrapFewShot, COPRO) explore systematically.
Benchmark results consistently show 25-65% improvement over hand-crafted prompts:
- Multi-hop QA: compiled DSPy pipelines outperform hand-tuned RAG by 30-45%
- Agent tasks: compiled tool-use modules beat hand-crafted ReAct prompts by 25-40%
- These gains hold across model families (GPT-4, Claude, Llama) because the optimization targets task metrics, not model-specific patterns
## The distinction from self-optimizing harnesses
The existing claim that [[self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can]] describes systems where an AI agent modifies its own configuration (AutoAgent, NeoSigma). DSPy operates at a different level: it's a **development methodology** where the human defines the structure (Signatures + Modules) and the machine optimizes the execution (Optimizer compiles against metrics).
The key difference:
- **Self-optimizing harnesses**: the agent is both the subject and the optimizer. It modifies its own prompts/tools/config based on runtime feedback.
- **DSPy**: the developer defines typed interfaces, and a separate compilation process optimizes execution. The optimization is pre-deployment (or periodic), not runtime.
Both are valid. Self-optimizing harnesses handle distribution shift during deployment. DSPy handles the initial design problem of converting intent into effective LM usage. They compose naturally: a DSPy-compiled pipeline can be the starting point that a self-optimizing harness then fine-tunes at runtime.
## Implications for knowledge systems
Our extraction pipeline — reading sources, pre-screening against existing KB, writing claims with proper schema, routing to domains — is precisely the kind of multi-step LM pipeline that DSPy was designed to optimize. Each step has a typed interface:
- `Extract: "source_text, existing_claims -> new_claims, enrichments, overlap_assessment"`
- `Evaluate: "claim, existing_kb, checklist -> score, issues, suggested_fixes"`
- `Route: "claim, domain_definitions -> domain_path, secondary_domains"`
If these were defined as DSPy Signatures and compiled against our PR history (merged = good, rejected = bad, fix count as a quality signal), the optimizer would learn what makes a good claim from our specific quality standards. The manual learning curve visible in our PR history (from 4 fixes down to 0 across successive batches) could be encoded automatically and transferred to new agents.
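A sketch of what compiling against PR history might look like; the inline signatures mirror the typed interfaces above, and the `merged`/`fix_count` fields are hypothetical labels mined from our review records:

```python
import dspy

# The three typed interfaces above, as inline DSPy signatures.
extract = dspy.ChainOfThought(
    "source_text, existing_claims -> new_claims, enrichments, overlap_assessment")
evaluate = dspy.ChainOfThought(
    "claim, existing_kb, checklist -> score, issues, suggested_fixes")
route = dspy.Predict("claim, domain_definitions -> domain_path, secondary_domains")

def pr_history_metric(example, prediction, trace=None):
    # Hypothetical labels: merged PRs count as successes,
    # discounted by how many fixes review demanded.
    if not example.merged:
        return 0.0
    return max(0.0, 1.0 - 0.1 * example.fix_count)
```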
## Challenges
- DSPy optimization requires a metric function and evaluation dataset. For knowledge extraction, the metric is nuanced — "is this a good claim?" involves multiple dimensions (specificity, evidence quality, scope, novelty). Decomposing this into a computable metric is non-trivial.
- Compilation is an upfront cost ($10-100 per optimization run depending on pipeline complexity and number of iterations). For a pipeline that runs hundreds of times (our extraction workflow), this amortizes well. For one-off tasks, the overhead exceeds the benefit.
- DSPy currently optimizes prompts and few-shot examples, not model weights. The gains are bounded by what prompt-level changes can achieve. For tasks where the model fundamentally lacks capability, no amount of prompt optimization helps.
---
Relevant Notes:
- [[self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can]] — complementary: DSPy handles initial compilation, self-optimization handles runtime adaptation
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — DSPy structurally separates the optimizer (generation of prompt variants) from the metric (evaluation of pipeline performance)
- [[evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration]] — GEPA uses DSPy as its optimization framework
- [[vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights]] — DSPy argues prompts matter more than this claim suggests, but at the pipeline level rather than the individual prompt level
Topics:
- [[_map]]

View file

@@ -0,0 +1,65 @@
---
type: claim
domain: ai-alignment
description: "RLMs (Recursive Language Models) achieve 91.33% on BrowseComp+ where base models score 0% by having the model write code to explore its own context as an external environment, enabling systematic processing of inputs 100x beyond the context window"
confidence: experimental
source: "Khattab et al., 'Recursive Language Models' (arXiv:2603.25723, March 2026); BrowseComp+ benchmark (6-11M token inputs); cited by Anthropic in managed agents blog, April 2026"
created: 2026-04-16
challenged_by:
- "effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale"
related:
- "knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate"
- "graph traversal through curated wiki links replicates spreading activation from cognitive science because progressive disclosure implements decay-based context loading and queries evolve during search through the berrypicking effect"
---
# Recursive language model self-calls process inputs orders of magnitude beyond context windows by treating context as an external environment navigated through generated code
The context window problem in language models is well-established: effective utilization degrades catastrophically with scale, with complex reasoning falling more than 99% short of advertised capacity. Recursive Language Models (RLMs — Khattab et al., March 2026) demonstrate a fundamentally different approach: instead of trying to fit everything into context, the model writes code that recursively calls itself to explore the input as an external environment.
## The mechanism
An RLM receives a problem and a large corpus (potentially millions of tokens). Rather than loading the corpus into context, the model:
1. **Generates a decomposition plan** — code that breaks the problem into subproblems
2. **Writes exploration code** — programs that call the model itself on subsets of the input
3. **Recursively self-calls** — each call processes a manageable chunk, returns structured results
4. **Aggregates findings** — combining sub-results into a final answer
The model treats its own context window as a fixed-size working memory and the input corpus as an external filesystem to be explored programmatically. This is architecturally similar to how a human researcher navigates a library — you don't read every book; you develop a search strategy, inspect promising sources, and synthesize findings.
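A toy sketch of the self-call structure only, using a fixed divide-and-conquer decomposition in place of the model-generated exploration code the paper describes; `llm` is a stand-in for any completion call, not the paper's interface:

```python
def llm(prompt: str) -> str:
    # Stand-in for a language model call; not the paper's API.
    raise NotImplementedError

def rlm_answer(question: str, corpus: str, max_chars: int = 8000) -> str:
    """Each call sees at most max_chars of input, so no single
    invocation exceeds the effective context window."""
    if len(corpus) <= max_chars:
        return llm(f"Question: {question}\n\nSource:\n{corpus}")
    mid = len(corpus) // 2
    left = rlm_answer(question, corpus[:mid], max_chars)
    right = rlm_answer(question, corpus[mid:], max_chars)
    # Aggregation step: combine structured sub-results, not raw text.
    return llm(f"Question: {question}\n\nCombine these partial findings:\n"
               f"1. {left}\n2. {right}")
```

Total information access is now bounded by the number of recursive calls (the compute budget) rather than by any single context window.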
## Empirical results
On BrowseComp+ (a benchmark requiring finding specific information in 6-11 million token corpora):
- **Base models**: 0% accuracy (cannot process inputs of this scale)
- **Vector retrieval + CodeAct**: 51% accuracy
- **RLMs**: 91.33% accuracy
The gap between RLMs and the next best approach (40+ percentage points) is one of the largest reported improvements in retrieval/reasoning benchmarks. Anthropic cited RLMs in their managed agents blog (April 2026) as evidence for the viability of recursive agent architectures.
## The relationship to context window limitations
The existing claim that [[effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale]] establishes the problem. RLMs provide one class of solution: don't try to expand effective context — instead, decompose the problem so that each context invocation stays within the effective range.
This is not a refutation of the context window limitation. RLMs still operate within the same per-call constraint. What they demonstrate is that the limitation can be architecturally circumvented through recursive decomposition. The model's *effective* information access becomes limited by compute budget (how many recursive calls it can make) rather than by context window size.
## Implications for knowledge base queries
For a knowledge base with 500+ claims across multiple domains, the hardest queries are multi-hop: "What claims challenge our collective superintelligence thesis, and what evidence supports those challenges?" This requires traversing the claim graph across domains, following edges, and aggregating. Vector retrieval returns the top-K most similar claims but cannot perform the traversal.
RLM's approach — storing the KB as an external environment and letting the model write code to explore it — is structurally suited to these queries. The model can inspect claim metadata without loading full text, filter by domain, follow wiki links, recursively drill into relevant clusters, and programmatically combine findings. This is the [[knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate]] principle implemented as a retrieval algorithm.
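A sketch of the kind of traversal code such a model could emit, assuming one markdown note per claim, named by its title:

```python
import re
from pathlib import Path

WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def follow_links(kb: Path, title: str, depth: int, seen: set | None = None) -> set:
    """Collect claim titles reachable from `title` within `depth` hops
    by reading each note and recursing into its wiki links."""
    seen = set() if seen is None else seen
    if depth < 0 or title in seen:
        return seen
    seen.add(title)
    note = kb / f"{title}.md"
    if note.exists():
        for linked in WIKI_LINK.findall(note.read_text()):
            follow_links(kb, linked, depth - 1, seen)
    return seen
```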
## Challenges
- RLMs require multiple model calls per query, increasing latency and cost proportionally to input size. For routine lookups (single-claim retrieval), this overhead is unjustified.
- The quality of the decomposition plan determines performance. If the model generates a poor search strategy, recursive exploration wastes compute without improving results. This is an instance of the meta-cognitive challenge: the system must reason well about how to reason.
- RLMs have been demonstrated on information retrieval tasks (finding specific facts in large corpora). Whether the approach extends to synthesis tasks (generating novel connections across the corpus) is unestablished. The mechanism suggests it should — the model can write synthesis code, not just search code — but the empirical evidence is retrieval-only.
---
Relevant Notes:
- [[effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale]] — RLMs circumvent rather than solve this limitation, keeping each call within effective range
- [[knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate]] — RLMs implement traversal-as-retrieval programmatically
- [[graph traversal through curated wiki links replicates spreading activation from cognitive science because progressive disclosure implements decay-based context loading and queries evolve during search through the berrypicking effect]] — RLMs are a computational implementation of progressive disclosure and berrypicking
Topics:
- [[_map]]

View file

@@ -0,0 +1,62 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Khattab's 'bitter free lunch' synthesis resolves the tension between Sutton's bitter lesson (scale always wins) and the no-free-lunch theorem (no universal algorithm) — scale replaces hand engineering but compounds with, rather than replaces, modular problem specification"
confidence: likely
source: "Omar Khattab, 'The Bitter Free Lunch' (talks and publications, 2024-2026); empirically grounded through DSPy benchmarks showing compiled pipelines improve more on larger models, ColBERT showing modular retrieval compounds with model scale, and GEPA showing evolutionary optimization improves more with model capability"
created: 2026-04-16
related:
- "programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover"
- "vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights"
- "self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can"
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
- "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
---
# Scale improvements compound with modular problem specification rather than substituting for it because larger models amplify the benefit of well-decomposed pipelines
Two foundational principles in machine learning appear contradictory:
**Sutton's bitter lesson** (2019): General methods that scale with compute always eventually beat specialized hand-engineered approaches. Scaling laws, pretraining on massive corpora, and increasing parameter counts have systematically defeated domain-specific architectures. The lesson: don't fight scale.
**No-free-lunch theorem** (Wolpert & Macready, 1997): No algorithm is universally optimal across all problems. Performance on any specific task requires task-specific inductive biases. The theorem: task structure always matters.
Omar Khattab's "bitter free lunch" resolves these into a single principle: **scale must replace hand engineering** (you shouldn't manually optimize prompts, retrieval heuristics, or pipeline logic — that's bitter lesson territory), **but scale never replaces modular problem specification** (you must still decompose problems into typed modules with clear interfaces — that's no-free-lunch territory).
## Empirical evidence for compounding
The compounding effect — larger models benefiting *more* from modular decomposition — is visible across Khattab's entire research arc:
**ColBERT (retrieval):** Late interaction retrieval improves more when the underlying encoder is larger. A bigger BERT model with token-level matching gains more from the architectural decomposition than a bigger BERT model with single-vector matching. The modular architecture (preserving token-level representations) amplifies the benefit of scale.
**DSPy (pipelines):** Compiled DSPy pipelines show larger absolute and relative gains on more capable models. A compiled pipeline on GPT-4 outperforms a hand-crafted pipeline on GPT-4 by more than a compiled pipeline on GPT-3.5 outperforms a hand-crafted pipeline on GPT-3.5. The modular specification (typed Signatures + Modules) amplifies the benefit of model scale.
**GEPA (optimization):** Evolutionary trace-based optimization produces larger improvements when the underlying model is more capable. A more capable model reads execution traces more accurately, generates better candidate mutations, and evaluates variants more reliably. The modular optimization loop amplifies the benefit of model capability.
**RLMs (context):** Recursive self-calls process larger inputs more effectively with more capable models, because each recursive step requires planning, decomposition, and synthesis — all of which improve with model capability. The recursive architecture amplifies the benefit of scale.
## Why this matters for knowledge systems
The compounding principle explains a pattern we've observed empirically: frontier models (Claude Opus) running our structured extraction pipeline produce better results than the same models used without structure. The structure doesn't compensate for model limitations — it *amplifies* model capability. Our claim schema, domain routing, pre-screening workflow, and quality gates are modular problem specifications that compound with model improvements.
This predicts that as models improve, our structured pipeline will produce proportionally better results — not that models will eventually make the structure unnecessary. Every model generation makes the decomposition more valuable, not less.
The principle also validates the existing claim that [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]]. Curation is modular specification — human judgment encoding task structure. Self-generation is attempting to replace specification with scale. The bitter free lunch predicts exactly this outcome: scale without specification degrades, scale with specification compounds.
## Challenges
- The compounding effect may have diminishing returns. At some model capability level, the benefit of modular decomposition might plateau because the model can internally decompose problems without external scaffolding. Current evidence doesn't show this happening yet, but the possibility is not excluded.
- The "bitter free lunch" framing may overstate the novelty. The observation that "good abstractions help more as systems get more powerful" is arguably a restatement of software engineering fundamentals. The contribution is applying this insight specifically to LM pipeline design and providing empirical evidence, not discovering a new principle.
- The claim is about the development methodology, not the deployment architecture. A well-decomposed pipeline still requires engineering effort to define modules and interfaces. The bitter free lunch eliminates hand-tuning of prompts and heuristics, not the design work of specifying the problem structure.
---
Relevant Notes:
- [[programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover]] — DSPy is the primary instantiation of this principle
- [[vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights]] — knowledge architecture is a form of modular specification that compounds with model capability
- [[self-optimizing agent harnesses outperform hand-engineered ones because automated failure mining and iterative refinement explore more of the harness design space than human engineers can]] — self-optimization is scale applied to the harness engineering problem
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — curation is modular specification; self-generation is attempting to scale past specification
Topics:
- [[_map]]

View file

@@ -0,0 +1,59 @@
---
type: source
title: "Omar Khattab — DSPy, ColBERT, GEPA, and RLM Collected Works"
author: "Omar Khattab et al."
url: "https://arxiv.org/abs/2310.03714 (DSPy), https://arxiv.org/abs/2004.12832 (ColBERT), https://arxiv.org/abs/2603.25723 (RLMs)"
date_published: 2020-2026
date_accessed: 2026-04-16
status: processed
processed_by: theseus
processed_date: 2026-04-16
claims_extracted:
- "late interaction retrieval preserves token-level semantic distinctions that single-vector embeddings destroy because MaxSim scoring matches each query token independently against all document tokens"
- "programmatic LM pipelines compiled against task metrics outperform hand-crafted prompts because automated optimization explores more of the prompt design space than human intuition can cover"
- "recursive language model self-calls process inputs orders of magnitude beyond context windows by treating context as an external environment navigated through generated code"
- "scale improvements compound with modular problem specification rather than substituting for it because larger models amplify the benefit of well-decomposed pipelines"
- "inline constraint enforcement via assertion-backtracking produces higher constraint satisfaction than post-hoc evaluation because failure context injected into retries enables targeted correction"
enrichments:
- "evolutionary trace-based optimization — added Khattab et al. GEPA quantitative results (35x efficiency over RL, 6% average improvement)"
tags: [information-retrieval, prompt-optimization, language-models, self-improvement]
---
# Omar Khattab — DSPy, ColBERT, GEPA, and RLM Collected Works
Compound source covering Omar Khattab's research arc from Stanford NLP to MIT LINGO Lab. Four interconnected systems expressing a unifying thesis about modular decomposition in language model usage.
## Works Covered
### ColBERT (2020, updated 2022)
- Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" (SIGIR 2020)
- Santhanam et al., "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" (NAACL 2022)
- 7,000+ citations combined
### DSPy (2023-2024)
- Khattab et al., "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines" (ICLR 2024 Spotlight)
- Khattab et al., "DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines" (2024)
- 22,000+ GitHub stars, adopted by major platforms
### GEPA (2025)
- Khattab et al., "Guided Evolutionary Prompt Architecture" (ICLR 2026 Oral)
- 35x more sample-efficient than RL-based prompt optimization
- 6% average improvement over RL baselines across benchmarks
### RLMs (2026)
- Khattab et al., "Recursive Language Models" (arXiv:2603.25723, March 2026)
- 91.33% on BrowseComp+ where base models score 0%
- Processes inputs 2 orders of magnitude beyond context windows
- Cited by Anthropic in managed agents blog (April 2026)
## Unifying Thesis
The "bitter free lunch": scale must replace hand engineering (Sutton's bitter lesson), but scale never replaces modular problem specification (no-free-lunch theorem). The resolution is programming-over-prompting — define typed interfaces, compose modules, compile against metrics. Every system in Khattab's arc instantiates this principle at a different layer: retrieval (ColBERT), pipeline design (DSPy), optimization (GEPA), context management (RLMs).
## X Research Context
Omar Khattab (@lateinteraction on X) actively discusses these systems. Key threads identified:
- DSPy adoption reports from enterprise users showing 25-65% improvements over hand-crafted prompts
- RLM announcement with Anthropic citation
- "Bitter free lunch" framing in multiple posts and talks
- GEPA ICLR 2026 Oral acceptance announcement