- What: model empathy boundary condition (challenges multi-model eval), GEPA evolutionary self-improvement mechanism, progressive disclosure scaling principle, plus enrichments to Agent Skills, three-space memory, and curated skills claims
- Why: Nous Research Hermes Agent (26K+ stars) is the largest open-source agent framework — its architecture decisions provide independent evidence for existing KB claims and one genuine challenge to our eval spec
- Connections: challenges multi-model eval architecture (task-dependent diversity optima), extends SICA/NLAH self-improvement chain, corroborates three-space memory taxonomy with a potential 4th space
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "GEPA (Guided Evolutionary Prompt Architecture) from Nous Research reads execution traces to understand WHY agents fail, generates candidate variants through evolutionary search, evaluates against 5 guardrails, and submits best candidates as PRs for human review — a distinct self-improvement mechanism from SICA's acceptance-gating"
confidence: experimental
source: "Nous Research hermes-agent-self-evolution repository (GitHub, 2026); GEPA framework presented as ICLR 2026 Oral; DSPy integration for optimization; $2-10 per optimization cycle reported"
created: 2026-04-05
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
- "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
---
# Evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration
Nous Research's Guided Evolutionary Prompt Architecture (GEPA) implements a self-improvement mechanism structurally different from both SICA's acceptance-gating and NLAH's retry-based self-evolution. The key difference is the input: GEPA reads execution traces to understand WHY things failed, not just THAT they failed.
## The mechanism
1. **Trace analysis** — the system examines full execution traces of agent behavior, identifying specific decision points where the agent made suboptimal choices. This is diagnostic, not metric-driven.
2. **Evolutionary search** — generates candidate variants of prompts, skills, or orchestration logic. Uses DSPy's optimization framework for structured prompt variation.
3. **Constraint evaluation** — each candidate is evaluated against 5 guardrails before advancing:
   - 100% test pass rate (no regressions)
   - Size limits (skills capped at 15KB)
   - Caching compatibility (changes must not break cached behavior)
   - Semantic preservation (the skill's core function must survive mutation)
   - Human PR review (the governance gate)
4. **PR submission** — the best candidate is submitted as a pull request for human review. The improvement does not persist until a human approves it.
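The loop above can be sketched in Python. This is an illustrative reconstruction, not the Nous Research implementation: `diagnose`, `mutate`, `passes_guardrails`, and the scoring stub are all invented names, and only the automatable guardrails are modeled (human PR review happens outside the loop).

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

MAX_SKILL_BYTES = 15 * 1024  # size guardrail: skills capped at 15KB

@dataclass
class Candidate:
    skill_text: str
    score: float = 0.0

def diagnose(trace: list[str]) -> list[str]:
    """Trace analysis (step 1): flag decision points marked suboptimal."""
    return [step for step in trace if "suboptimal" in step]

def mutate(skill: str, findings: list[str]) -> str:
    """Evolutionary variation (step 2): perturb the skill guided by diagnosed failures."""
    if not findings:
        return skill
    return skill + f"\n# addresses: {random.choice(findings)}"

def passes_guardrails(c: Candidate, run_tests: Callable[[str], bool]) -> bool:
    """Constraint evaluation (step 3); caching-compatibility and
    semantic-preservation checks would slot in here as well."""
    return (
        run_tests(c.skill_text)                             # 100% test pass rate
        and len(c.skill_text.encode()) <= MAX_SKILL_BYTES   # size limit
    )

def evolve(skill: str, trace: list[str],
           run_tests: Callable[[str], bool], generations: int = 5) -> Optional[Candidate]:
    """Return the best guardrail-passing candidate, destined for a PR (step 4)."""
    findings = diagnose(trace)
    best = None
    for _ in range(generations):
        cand = Candidate(mutate(skill, findings))
        if passes_guardrails(cand, run_tests):
            cand.score = 1.0  # stand-in for a real evaluation metric
            if best is None or cand.score > best.score:
                best = cand
    return best  # None means nothing cleared the guardrails; no PR is opened
```

The essential structural point survives even in this toy version: failed candidates die silently inside the loop, and the only artifact that escapes is a single guardrail-passing candidate awaiting human approval.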
## How it differs from existing self-improvement mechanisms
**vs SICA (acceptance-gating):** SICA improves by tightening retry loops — running more attempts and accepting only passing results. It doesn't modify the agent's skills or prompts. GEPA modifies the actual procedural knowledge the agent uses. SICA is behavioral iteration; GEPA is structural evolution.
**vs NLAH self-evolution:** NLAH's self-evolution mechanism accepts or rejects module changes based on performance metrics (+4.8pp on SWE-Bench). GEPA uses trace analysis to understand failure causes before generating fixes. NLAH asks "did this help?"; GEPA asks "why did this fail and what would fix it?"
## The governance model
The PR-review-as-governance-gate is the most architecturally interesting feature. The 5 guardrails map closely to our quality gates (schema validation, test pass, size limits, semantic preservation, human review). The economic cost ($2-10 per optimization cycle) makes this viable for continuous improvement at scale.
Only Phase 1 (skill optimization) has shipped as of April 2026. Planned phases include: Phase 2 (tool optimization), Phase 3 (orchestration optimization), Phase 4 (memory optimization), Phase 5 (full agent optimization). The progression from skills → tools → orchestration → memory → full agent mirrors our own engineering acceleration roadmap.
## Challenges
GEPA's published performance data is limited — the ICLR 2026 Oral acceptance validates the framework but specific before/after metrics across diverse tasks are not publicly available. The $2-10 per cycle cost is self-reported and may not include the cost of failed evolutionary branches.
The PR-review governance gate is the strongest constraint but also the bottleneck — human review capacity limits the rate of self-improvement. If the system generates improvements faster than humans can review them, queuing dynamics may cause the most impactful improvements to wait behind trivial ones. This is the same throughput constraint our system faces with Leo as the evaluation bottleneck.
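The queuing concern can be made concrete with a small sketch. The PR titles and impact scores below are invented for illustration; the point is only that FIFO review order and impact-ordered review treat the same queue very differently.

```python
import heapq

# A hypothetical review queue: (title, estimated impact), in arrival order.
prs = [
    ("trivial typo fix", 0.1),
    ("trivial rename", 0.1),
    ("fixes tool-selection bug", 0.9),
]

# FIFO review: the high-impact PR waits behind everything that arrived first.
fifo_position = [title for title, _ in prs].index("fixes tool-selection bug")

# Impact-ordered review: a max-heap on impact surfaces it immediately.
heap = [(-impact, title) for title, impact in prs]
heapq.heapify(heap)
priority_order = [heapq.heappop(heap)[1] for _ in range(len(prs))]
```

Under FIFO the bug fix sits at position 2 with two trivial PRs ahead of it; under impact ordering it is reviewed first. Either way the human review rate is unchanged, which is why prioritization mitigates but does not remove the throughput bottleneck.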
The distinction between "trace analysis" and "metric-driven iteration" may be less sharp in practice. Both ultimately depend on observable signals of failure — traces are richer but noisier than metrics. Whether the richer input produces meaningfully better improvements at scale is an open empirical question.
---
Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — SICA's structural separation is the necessary condition; GEPA adds evolutionary search and trace analysis on top of this foundation
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — GEPA's PR-review gate functions as the curation step that prevents the -1.3pp degradation from uncurated self-generation
- [[self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration]] — NLAH's acceptance-gating is a simpler mechanism; GEPA extends it with evolutionary search and trace-based diagnosis
Topics:
- [[_map]]
|