teleo-codex/inbox/archive/2026-02-24-nous-research-hermes-agent-self-evolution-gepa.md
m3taversal 1de60685be theseus: add 5 Nous Research source archives for codex ingestion
- GEPA self-evolution system (trace-based evolutionary prompt optimization)
- DeMo: Decoupled Momentum Optimization (Peng, Kingma et al. — 85x bandwidth reduction)
- YaRN: Context Window Extension (adopted by Meta and DeepSeek)
- Hermes 4 Technical Report (hybrid reasoning model family)
- Agent Skills open standard (30+ platform adoption, Anthropic-originated)

Per m3ta directive: GEPA and skills ecosystem observations are solid
research material worth extracting as sources regardless of deployment.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-07 14:56:03 +00:00


---
type: source
title: "Hermes Agent Self-Evolution: Evolutionary Self-Improvement via DSPy + GEPA"
author: Nous Research (Teknium, Jeffrey Quesnelle, Karan Malhotra)
url: https://github.com/NousResearch/hermes-agent-self-evolution
date: 2026-02-24
domain: ai-alignment
intake_tier: research-task
rationale: >-
  GEPA is a trace-based evolutionary prompt optimizer that outperforms
  RL-based methods. Key evidence for agent self-improvement claims and
  the skills-as-codification thesis.
proposed_by: theseus
format: whitepaper
status: processed
processed_by: theseus
processed_date: 2026-04-07
claims_extracted:
  - GEPA evolutionary trace-based optimization is distinct from acceptance-gating and RL approaches because it reads why failures happen rather than just that they failed
  - curated agent skills persist and improve through use, producing flat token scaling at 40 skills equivalent to 200 skills
enrichments:
tags:
  - nous-research
  - gepa
  - self-evolution
  - prompt-optimization
  - agent-skills
  - dspy
---

# GEPA: Genetic-Pareto Prompt Evolution

GEPA (Genetic-Pareto Prompt Evolution) is Nous Research's evolutionary optimizer for agent self-improvement. It is implemented in the `hermes-agent-self-evolution` repository (704 stars, MIT license) and integrates DSPy for prompt optimization with evolutionary trace analysis.

## Core Mechanism

GEPA is a reflective evolutionary optimizer that examines *why* components fail, not merely *that* they fail. The system reads execution traces to understand concrete failure modes, then proposes targeted improvements. This trace-based analysis distinguishes GEPA both from simpler mutation approaches (random perturbation) and from RL-based methods (a reward signal without causal explanation).

## Evolutionary Process

  1. Read current skill/prompt/tool definition
  2. Generate evaluation dataset (synthetic or from real session history via SQLite)
  3. Execute candidates and capture full execution traces
  4. GEPA optimizer analyzes traces and proposes targeted mutations
  5. Evaluate variants against 5 constraint gates
  6. Select best performer via Pareto front
  7. Submit as pull request for human review
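Step 6's Pareto-front selection can be sketched as a non-domination filter over per-task score vectors. The score representation and dominance rule below are standard multi-objective conventions, assumed rather than taken from the repository:

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """a dominates b if a is at least as good on every task and strictly
    better on at least one (higher score = better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates: dict[str, list[float]]) -> set[str]:
    """Keep every candidate not dominated by any other. This preserves
    variants that excel on *some* tasks instead of collapsing to a
    single average-best prompt."""
    return {
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }

# Scores of three prompt variants on three evaluation tasks (hypothetical):
scores = {
    "baseline":  [0.6, 0.6, 0.6],
    "variant_a": [0.9, 0.5, 0.6],  # best on task 1, worse on task 2
    "variant_b": [0.5, 0.5, 0.5],  # dominated by baseline
}
front = pareto_front(scores)  # → {"baseline", "variant_a"}
```

Keeping the whole front, rather than a single winner, retains diversity for the next round of mutation.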

## Five Constraint Gates (Guardrails)

Every evolved variant must satisfy all five gates before consideration:

  1. Full Test Suite: `pytest tests/ -q` must pass 100%
  2. Size Limits: Skills ≤15KB, tool descriptions ≤500 characters
  3. Caching Compatibility: No mid-conversation changes allowed
  4. Semantic Preservation: Variants must not drift from original intent
  5. PR Review: All changes go through human review, never direct commit
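The mechanical gates (1-3) reduce to simple predicates. A minimal sketch, using the limits stated above; the function names and variant structure are hypothetical:

```python
import subprocess

SKILL_SIZE_LIMIT = 15 * 1024  # gate 2: skills ≤ 15 KB
TOOL_DESC_LIMIT = 500         # gate 2: tool descriptions ≤ 500 characters

def gate_tests() -> bool:
    """Gate 1: the full test suite must pass (exit code 0)."""
    return subprocess.run(["pytest", "tests/", "-q"]).returncode == 0

def gate_size(skill_text: str, tool_desc: str) -> bool:
    """Gate 2: hard size limits on evolved artifacts."""
    return (len(skill_text.encode()) <= SKILL_SIZE_LIMIT
            and len(tool_desc) <= TOOL_DESC_LIMIT)

def gate_caching(changed_mid_conversation: bool) -> bool:
    """Gate 3: prompt-cache compatibility, i.e. no mid-conversation edits."""
    return not changed_mid_conversation

# Gates 4 (semantic preservation) and 5 (PR review) require a judge model
# and a human respectively, so they are not reducible to predicates here.
ok = gate_size("# SKILL.md\nDo the thing.", "Reads a file.") and gate_caching(False)
```

A variant that fails any predicate is discarded before it ever reaches the Pareto comparison.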

The fifth gate, PR-review governance, ensures no evolved variant reaches production without human approval. This is structurally equivalent to the acceptance-gating pattern in SICA (SWE-Bench self-improvement), but GEPA adds a trace-based explanation of *why* each mutation was proposed.

## What Gets Optimized (Phased Rollout)

  - Phase 1 (Implemented): Skill files (`SKILL.md`) — procedural memory
  - Phase 2 (Planned): Tool descriptions — capability interfaces
  - Phase 3 (Planned): System prompt sections — behavioral tuning
  - Phase 4 (Planned): Tool implementation code via Darwinian Evolver
  - Phase 5 (Planned): Continuous improvement loop

## Architecture Split

The system distinguishes between:

  - Reflective text evolution (DSPy + GEPA) — for prompts, descriptions, skills
  - Code evolution (Darwinian Evolver, AGPL v3) — for implementation code

This split lets each artifact type use an optimization strategy suited to it. Text evolution operates entirely via API calls: mutating natural language, evaluating results, and selecting the best variants. Cost: roughly $2-10 per optimization run.
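In code, the split amounts to a dispatch on artifact type. The enum values and optimizer labels below mirror the description above, but the interface itself is invented for illustration:

```python
from enum import Enum

class Artifact(Enum):
    SKILL = "skill"            # SKILL.md files (Phase 1)
    TOOL_DESC = "tool_desc"    # natural-language tool descriptions
    SYSTEM_PROMPT = "system"   # system prompt sections
    TOOL_CODE = "tool_code"    # implementation code (Phase 4)

def choose_optimizer(kind: Artifact) -> str:
    """Route text artifacts to reflective text evolution (DSPy + GEPA)
    and code artifacts to the external Darwinian Evolver CLI."""
    if kind is Artifact.TOOL_CODE:
        return "darwinian-evolver"  # AGPL v3, invoked as an external CLI only
    return "dspy+gepa"              # API-only text mutation

assert choose_optimizer(Artifact.SKILL) == "dspy+gepa"
assert choose_optimizer(Artifact.TOOL_CODE) == "darwinian-evolver"
```

Keeping the AGPL v3 code evolver behind a CLI boundary also keeps its license isolated from the MIT-licensed core.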

## Integration with DSPy

DSPy provides the prompt optimization framework. GEPA adds the evolutionary trace analysis on top. Combined, they mutate natural language descriptions of skills, tool behaviors, and system instructions with causal grounding in observed failure modes.

## Key Distinctions from Other Self-Improvement Approaches

| Approach | Signal Type | Causal? | Governance |
|---|---|---|---|
| SICA (SWE-Bench) | Pass/fail acceptance gate | No | Metric threshold |
| NLAH (Pan et al.) | Module ablation | Partial | Researcher manual |
| GRPO (RL) | Reward signal | No | Training objective |
| GEPA | Execution trace analysis | Yes | 5-gate + PR review |

GEPA's distinguishing feature is that it reads the execution trace to understand the causal chain of failure, then proposes mutations that address the root cause rather than randomly perturbing until something works.

## Development Status

Repository: 704 stars, 64 forks, 7 commits, actively under development. The core is MIT-licensed; the Darwinian Evolver is AGPL v3 and is invoked only as an external CLI.