---
type: source
title: "Hermes Agent Self-Evolution: Evolutionary Self-Improvement via DSPy + GEPA"
author: "Nous Research (Teknium, Jeffrey Quesnelle, Karan Malhotra)"
url: https://github.com/NousResearch/hermes-agent-self-evolution
date: 2026-02-24
domain: ai-alignment
intake_tier: research-task
rationale: "GEPA is a trace-based evolutionary prompt optimizer that outperforms RL-based methods. Key evidence for agent self-improvement claims and the skills-as-codification thesis."
proposed_by: theseus
format: whitepaper
status: processed
processed_by: theseus
processed_date: 2026-04-07
claims_extracted:
  - "GEPA evolutionary trace-based optimization is distinct from acceptance-gating and RL approaches because it reads why failures happen rather than just that they failed"
enrichments:
  - "curated agent skills persist and improve through use producing flat token scaling at 40 skills equivalent to 200 skills"
tags: [nous-research, gepa, self-evolution, prompt-optimization, agent-skills, dspy]
---

## GEPA: Genetic-Pareto Prompt Evolution

GEPA (Genetic-Pareto Prompt Evolution) is Nous Research's evolutionary optimizer for agent self-improvement. It is implemented in the `hermes-agent-self-evolution` repository (704 stars, MIT license) and integrates DSPy for prompt optimization with evolutionary trace analysis.

### Core Mechanism

GEPA is a **reflective evolutionary optimizer** that examines WHY components fail, not merely THAT they fail. The system reads execution traces to understand concrete failure modes, then proposes targeted improvements. This trace-based analysis distinguishes GEPA from simpler mutation approaches (random perturbation) and from RL-based methods (reward signal without causal explanation).

### Evolutionary Process

1. Read current skill/prompt/tool definition
2. Generate evaluation dataset (synthetic or from real session history via SQLite)
3. Execute candidates and capture full execution traces
4. GEPA optimizer analyzes traces and proposes targeted mutations
5. Evaluate variants against 5 constraint gates
6. Select best performer via Pareto front
7. Submit as pull request for human review

### Five Constraint Gates (Guardrails)

Every evolved variant must satisfy all five gates before consideration:

1. **Full Test Suite:** `pytest tests/ -q` must pass 100%
2. **Size Limits:** Skills ≤15KB, tool descriptions ≤500 characters
3. **Caching Compatibility:** No mid-conversation changes allowed
4. **Semantic Preservation:** Variants must not drift from original intent
5. **PR Review:** All changes go through human review, never direct commit

The fifth gate, PR-review governance, ensures no evolved variant reaches production without human approval. This is structurally equivalent to the acceptance-gating pattern in SICA (SWE-Bench self-improvement), but GEPA adds trace-based explanation of WHY the mutation was proposed.

### What Gets Optimized (Phased Rollout)

- **Phase 1 (Implemented):** Skill files (SKILL.md) for procedural memory
- **Phase 2 (Planned):** Tool descriptions for capability interfaces
- **Phase 3 (Planned):** System prompt sections for behavioral tuning
- **Phase 4 (Planned):** Tool implementation code via Darwinian Evolver
- **Phase 5 (Planned):** Continuous improvement loop

### Architecture Split

The system distinguishes between:

- **Reflective text evolution** (DSPy + GEPA): for prompts, descriptions, skills
- **Code evolution** (Darwinian Evolver, AGPL v3): for implementation code

This separation applies appropriate optimization strategies per artifact type. Text evolution operates entirely via API calls: mutating natural language, evaluating results, selecting best variants. Cost: ~$2-10 per optimization run.

### Integration with DSPy

DSPy provides the prompt optimization framework. GEPA adds the evolutionary trace analysis on top. Combined, they mutate natural language descriptions of skills, tool behaviors, and system instructions with causal grounding in observed failure modes.

### Key Distinctions from Other Self-Improvement Approaches

| Approach | Signal Type | Causal? | Governance |
|----------|-------------|---------|------------|
| SICA (SWE-Bench) | Pass/fail acceptance gate | No | Metric threshold |
| NLAH (Pan et al.) | Module ablation | Partial | Researcher manual |
| GRPO (RL) | Reward signal | No | Training objective |
| **GEPA** | Execution trace analysis | Yes | 5-gate + PR review |

GEPA's distinguishing feature is that it reads the execution trace to understand the causal chain of failure, then proposes mutations that address the root cause rather than randomly perturbing until something works.

### Development Status

Repository: 704 stars, 64 forks, 7 commits, actively under development. MIT license for core; Darwinian Evolver uses AGPL v3 as external CLI only.
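
### Illustrative Sketch: Gate Filtering and Pareto Selection

Steps 5 and 6 of the evolutionary process (gate evaluation, then Pareto-front selection) can be pictured with a small sketch. This is not the repository's code. The `Variant` record, the function names, the per-objective `scores` tuple, and the reading of "15KB" as 15 × 1024 bytes are all assumptions made for illustration. Only the first two gates (test suite and size limits) are modeled; caching compatibility, semantic preservation, and PR review are left out.

```python
from dataclasses import dataclass

# Limits quoted in the gate list; treating "15KB" as 15 * 1024 bytes is an assumption.
MAX_SKILL_BYTES = 15 * 1024
MAX_TOOL_DESC_CHARS = 500


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one evolved candidate (not the repository's type)."""
    name: str
    kind: str           # "skill" or "tool_description"
    text: str
    tests_passed: bool  # result of the test-suite run, supplied by the caller
    scores: tuple       # one score per objective, higher is better


def passes_size_gate(v: Variant) -> bool:
    """Gate 2: skills are limited by byte size, tool descriptions by characters."""
    if v.kind == "skill":
        return len(v.text.encode("utf-8")) <= MAX_SKILL_BYTES
    if v.kind == "tool_description":
        return len(v.text) <= MAX_TOOL_DESC_CHARS
    return False  # unknown artifact kinds are rejected in this sketch


def passes_automatic_gates(v: Variant) -> bool:
    """Gates 1 and 2 only; the remaining gates are outside this sketch."""
    return v.tests_passed and passes_size_gate(v)


def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(variants: list) -> list:
    """Keep every variant that no other variant dominates."""
    return [
        v for v in variants
        if not any(dominates(o.scores, v.scores) for o in variants if o is not v)
    ]


def select_candidates(variants: list) -> list:
    """Filter by the automatic gates, then return the non-dominated survivors."""
    eligible = [v for v in variants if passes_automatic_gates(v)]
    return pareto_front(eligible)


candidates = [
    Variant("a", "skill", "short skill", True, (0.9, 0.4)),
    Variant("b", "skill", "short skill v2", True, (0.6, 0.8)),
    Variant("c", "skill", "short skill v3", True, (0.5, 0.3)),  # dominated by a and b
    Variant("d", "skill", "x" * 20000, True, (1.0, 1.0)),       # exceeds the size limit
    Variant("e", "skill", "short", False, (1.0, 1.0)),          # failed the test suite
]
print([v.name for v in select_candidates(candidates)])  # prints ['a', 'b']
```

A Pareto front can hold more than one variant, as it does here. How the actual system reduces the front to a single "best performer" before opening a pull request is not specified in this summary, so the sketch stops at the front itself.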