| type | domain | description | confidence | source | created | depends_on | challenged_by | related | reweave_edges |
|---|---|---|---|---|---|---|---|---|---|
| claim | ai-alignment | Self-evolution module showed the clearest positive effect in controlled ablation (+4.8pp SWE, +2.7pp OSWorld) by tightening the solve loop around acceptance criteria, not by expanding into larger search trees | experimental | Pan et al. 'Natural-Language Agent Harnesses', arXiv:2603.25723, March 2026. Table 3 + case analysis (scikit-learn__scikit-learn-25747). SWE-bench Verified (125 samples) + OSWorld (36 samples), GPT-5.4, Codex CLI. | 2026-03-31 | | | | |
|
Self-evolution improves agent performance through acceptance-gated retry, not expanded search, because disciplined attempt loops with explicit failure reflection outperform open-ended exploration
Pan et al. (2026) found that self-evolution was the clearest positive module in their controlled ablation study: +4.8pp on SWE-bench Verified (80.0 vs 75.2 Basic) and +2.7pp on OSWorld (44.4 vs 41.7 Basic). In the score-cost view (Figure 4a), self-evolution is the only module that moves upward (higher score) without moving far right (higher cost).
The mechanism is not open-ended reflection or expanded search. The self-evolution module runs an explicit retry loop with a real baseline attempt first and a default cap of five attempts. After every failed or stalled attempt, it reflects on concrete failure signals before planning the next attempt, redesigning along three axes: prompt, tool, and workflow evolution. It stops when an attempt is judged successful or when the attempt cap is reached, and it reports the run as incomplete rather than pretending the last attempt passed.
The case of scikit-learn__scikit-learn-25747 illustrates the favorable regime: Basic fails this sample, but self-evolution resolves it. The module organizes the run around an explicit attempt contract where Attempt 1 is treated as successful only if the task acceptance gate is satisfied. The system closes after Attempt 1 succeeds rather than expanding into a larger retry tree, and the evaluator confirms the final patch fixes the target FAIL_TO_PASS tests. The extra structure makes the first repair attempt more disciplined and better aligned with the benchmark gate.
This is a significant refinement of the "iterative self-improvement" concept. The gain comes not from more iterations or bigger search, but from tighter coupling between failure signals and next-attempt design. The module's constraint structure (explicit cap, forced reflection, acceptance-gated stopping) is what produces the benefit.
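The constraint structure described above (baseline attempt first, five-attempt cap, forced reflection on failure signals, acceptance-gated stopping, honest incomplete reporting) can be sketched as a small loop. This is a minimal illustration under my own naming assumptions, not the paper's published code; `attempt_fn` and `acceptance_gate` are hypothetical stand-ins for the agent's solve step and the task's acceptance criteria.

```python
# Hypothetical sketch of an acceptance-gated retry loop. All names are
# illustrative; the paper (Pan et al. 2026) does not publish this code.

def acceptance_gated_retry(task, attempt_fn, acceptance_gate, max_attempts=5):
    """Run a baseline attempt, then retry up to the cap. Each retry must
    consume the previous attempt's concrete failure signals before
    redesigning its approach."""
    reflection = None  # no failure signals before the baseline attempt
    for attempt in range(1, max_attempts + 1):
        result = attempt_fn(task, reflection)
        if acceptance_gate(result):  # stop only when the gate passes
            return {"status": "success", "attempts": attempt, "result": result}
        # Forced reflection: extract concrete failure signals to drive the
        # next attempt's redesign along the prompt / tool / workflow axes.
        reflection = {
            "failure_signals": result.get("errors", []),
            "redesign_axes": ["prompt", "tool", "workflow"],
        }
    # Report incomplete rather than claiming the last attempt passed.
    return {"status": "incomplete", "attempts": max_attempts, "result": result}
```

The two structural points are that the gate, not the model's self-assessment, decides success, and that the loop terminates deterministically at the cap: both are enforced by control flow rather than by instructions the model could drift away from under context load.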
Challenges:
The challenged_by link to curated vs self-generated skills is important context: self-evolution works here because it operates within a bounded retry loop with explicit acceptance criteria, not because self-generated modifications are generally beneficial. The +4.8pp is from a 125-sample subset; the authors note they plan full-benchmark reruns. Whether the acceptance-gating mechanism transfers to tasks without clean acceptance criteria (creative tasks, open-ended research) is untested.
Relevant Notes:
- iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation — the NLAH self-evolution module is a concrete implementation: structurally separated evaluation (acceptance gate) drives the retry loop
- curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive — self-evolution here succeeds because it modifies approach within a curated structure (the harness), not because it generates new skills from scratch
- the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load — the self-evolution module's attempt cap and forced reflection are deterministic hooks, not instructions; this is why it works where unconstrained self-modification fails
Topics: