theseus: Hermes Agent extraction — 3 NEW claims + 3 enrichments
- What: model empathy boundary condition (challenges multi-model eval), GEPA evolutionary self-improvement mechanism, progressive disclosure scaling principle, plus enrichments to Agent Skills, three-space memory, and curated skills claims
- Why: Nous Research Hermes Agent (26K+ stars) is the largest open-source agent framework — its architecture decisions provide independent evidence for existing KB claims and one genuine challenge to our eval spec
- Connections: challenges multi-model eval architecture (task-dependent diversity optima), extends SICA/NLAH self-improvement chain, corroborates three-space memory taxonomy with a potential 4th space

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
parent ca2b126d16
commit 7a3ef65dfe
6 changed files with 173 additions and 0 deletions
@@ -54,6 +54,10 @@ The marketplace dynamics could drive toward either concentration (dominant platf
The rapid adoption timeline (months, not years) may reflect low barriers to creating skill files rather than high value from using them. Many published skills may be shallow procedural wrappers rather than genuine expertise codification.

## Additional Evidence (supporting)
**Hermes Agent (Nous Research)** — the largest open-source agent framework (26K+ GitHub stars, 262 contributors) has native agentskills.io compatibility. Skills are stored as markdown files in `~/.hermes/skills/` and auto-created after 5+ tool calls on similar tasks, error recovery patterns, or user corrections. 40+ bundled skills ship with the framework. A Community Skills Hub enables sharing and discovery. This represents the open-source ecosystem converging on the same codification standard — not just commercial platforms but the largest community-driven framework independently adopting the same format. The auto-creation mechanism is structurally identical to Taylor's observation step: the system watches work being done and extracts the pattern into a reusable instruction card without explicit human design effort.
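The trigger logic described above reduces to a small counter per task pattern. A hypothetical sketch (class name, event names, and the exact trigger semantics are illustrative, not the Hermes implementation; only the 5-call threshold comes from the write-up):

```python
from collections import Counter

class SkillAutoCreator:
    """Hypothetical sketch of Hermes-style auto-creation triggers.

    The real framework watches live tool-call streams; here we only
    count events per task signature and fire once a trigger is met.
    """
    TOOL_CALL_THRESHOLD = 5  # "5+ tool calls on similar tasks"

    def __init__(self):
        self.tool_calls = Counter()  # task signature -> tool calls seen
        self.created = set()         # signatures already extracted

    def observe(self, signature: str, event: str) -> bool:
        """Record one event; return True when a skill should be extracted."""
        if event == "tool_call":
            self.tool_calls[signature] += 1
            if self.tool_calls[signature] >= self.TOOL_CALL_THRESHOLD:
                return self._create(signature)
        elif event in ("error_recovery", "user_correction"):
            # Recovery patterns and corrections trigger extraction directly.
            return self._create(signature)
        return False

    def _create(self, signature: str) -> bool:
        if signature in self.created:
            return False  # pattern already codified as a skill file
        self.created.add(signature)
        return True
```

The `created` set makes extraction idempotent: a recurring pattern yields one skill file, which later refinements would modify rather than recreate.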
---

Relevant Notes:
@@ -32,6 +32,10 @@ The resolution is altitude-specific: 2-3 skills per task is optimal, and beyond
A scaling wall emerges at 50-100 available skills: flat selection breaks entirely without hierarchical routing, creating a phase transition in agent performance. The ecosystem of community skills will hit this wall. The next infrastructure challenge is organizing the skills that already exist, not creating more.
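Hierarchical routing is the standard escape from this wall. A minimal sketch, assuming skills are pre-grouped into categories (the names and the `pick` callback are illustrative): choosing a category first, then a skill within it, keeps each decision over roughly 10 options instead of 100.

```python
def route_flat(all_skills: list[str], pick) -> str:
    # Flat selection: one decision over every skill name at once.
    # This is the mode that breaks down around 50-100 options.
    return pick(all_skills)

def route_hierarchical(skills_by_category: dict[str, list[str]], pick) -> str:
    # Two-level routing: a category decision, then a within-category
    # decision, each over a small option set.
    category = pick(sorted(skills_by_category))
    return pick(skills_by_category[category])
```

With 100 skills in 10 categories, the hierarchical router asks the model to weigh 10 + 10 options rather than 100, at the cost of a second routing step.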
## Additional Evidence (supporting)
**Hermes Agent (Nous Research)** defaults to patch-over-edit for skill modification — the system modifies only changed text rather than rewriting the entire skill file. This design decision embodies the curated > self-generated principle: constrained modification of existing curated skills preserves more of the original domain judgment than unconstrained generation. Full rewrites risk breaking functioning workflows; patches preserve the curated structure while allowing targeted improvement. The auto-creation triggers (5+ tool calls on similar tasks, error recovery, user corrections) are conservative thresholds that prevent premature codification — the system waits for repeated patterns before extracting a skill, implicitly filtering for genuine recurring expertise rather than one-off procedures.
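The patch-over-edit principle can be shown as a constrained modification function. A hypothetical sketch (not the Hermes code): the patch must anchor on exactly one span of existing text, so every other line of the curated skill survives by construction.

```python
def apply_patch(skill_text: str, old: str, new: str) -> str:
    """Constrained skill modification: replace one exact occurrence.

    Rejecting missing or ambiguous anchors is what makes this safer
    than a full rewrite: the untouched curated structure is preserved
    by construction, not by trusting a model to reproduce it.
    """
    if skill_text.count(old) != 1:
        raise ValueError("patch anchor must match exactly once")
    return skill_text.replace(old, new, 1)
```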
## Challenges
This finding creates a tension with our self-improvement architecture. If agents generate their own skills without curation oversight, the -1.3pp degradation applies — self-improvement loops that produce uncurated skills will make agents worse, not better. The resolution is that self-improvement must route through a curation gate (Leo's eval role for skill upgrades). The 3-strikes-then-propose rule Leo defined is exactly this gate. However, the boundary between "curated" and "self-generated" may blur as agents improve at self-evaluation — the SICA pattern suggests that with structural separation between generation and evaluation, self-generated improvements can be positive. The key variable may be evaluation quality, not generation quality.
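The 3-strikes-then-propose gate reduces to a small counter keyed by failure pattern. A hypothetical sketch (class name and default threshold are illustrative; the three-strikes rule itself is from the text above):

```python
class CurationGate:
    """Hypothetical 3-strikes-then-propose gate.

    Self-generated skill changes are not applied directly; a change is
    only *proposed* for curator review after the same failure pattern
    has recurred enough times to look like genuine recurring expertise.
    """
    def __init__(self, strikes_required: int = 3):
        self.strikes: dict[str, int] = {}
        self.required = strikes_required

    def record_failure(self, pattern: str) -> bool:
        """Return True when the pattern has earned a curation proposal."""
        self.strikes[pattern] = self.strikes.get(pattern, 0) + 1
        return self.strikes[pattern] >= self.required
```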
@@ -0,0 +1,46 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "AutoAgent's finding that same-family meta/task agent pairs outperform cross-model pairs in optimization challenges Kim et al.'s finding that cross-family evaluation breaks correlated blind spots — the resolution is task-dependent: evaluation needs diversity, optimization needs empathy"
confidence: likely
source: "AutoAgent (MarkTechPost coverage, April 2026) — same-family meta/task pairs achieve SOTA on SpreadsheetBench (96.5%) and TerminalBench (55.1%); Kim et al. ICML 2025 — ~60% error agreement within same-family models on evaluation tasks"
created: 2026-04-05
depends_on:
- "multi-model evaluation architecture"
challenged_by:
- "multi-model evaluation architecture"
---

# Evaluation and optimization have opposite model-diversity optima because evaluation benefits from cross-family diversity while optimization benefits from same-family reasoning pattern alignment

Two independent findings appear contradictory but resolve into a task-dependent boundary condition.

**Evaluation benefits from diversity.** Kim et al. (ICML 2025) demonstrated ~60% error agreement within same-family models on evaluation tasks. When the same model family evaluates its own output, correlated blind spots mean both models miss the same errors. Cross-family evaluation (e.g., GPT-4o evaluating Claude output) breaks these correlations because different model families have different failure patterns. This is the foundation of our multi-model evaluation architecture.

**Optimization benefits from empathy.** AutoAgent (April 2026) found that same-family meta/task agent pairs outperform cross-model pairs in optimization tasks. A Claude meta-agent optimizing a Claude task-agent diagnoses failures more accurately than a GPT meta-agent optimizing the same Claude task-agent. The team calls this "model empathy" — shared reasoning patterns enable the meta-agent to understand WHY the task-agent failed, not just THAT it failed. AutoAgent achieved #1 on SpreadsheetBench (96.5%) and top GPT-5 score on TerminalBench (55.1%) using this same-family approach.

**The resolution is task-dependent.** Evaluation (detecting errors in output) and optimization (diagnosing causes and proposing fixes) are structurally different operations with opposite diversity requirements:

1. **Error detection** requires diversity — you need a system that fails differently from the system being evaluated. Same-family evaluation produces agreement that feels like validation but may be shared blindness.
2. **Failure diagnosis** requires empathy — you need a system that can reconstruct the reasoning path that produced the error. Cross-family diagnosis produces generic fixes because the diagnosing model cannot model the failing model's reasoning.

The practical implication: systems that evaluate agent output should use cross-family models (our multi-model eval spec is correct for this). Systems that optimize agent behavior — self-improvement loops, prompt tuning, skill refinement — should use same-family models. Mixing these up degrades both operations.
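This routing rule is mechanical enough to encode directly. A hypothetical sketch (family names and the selection policy are illustrative):

```python
from enum import Enum

class TaskKind(Enum):
    EVALUATION = "evaluation"      # detect errors in output
    OPTIMIZATION = "optimization"  # diagnose causes, propose fixes

def pick_partner(task: TaskKind, target_family: str, families: list[str]) -> str:
    """Choose the model family that should evaluate or optimize `target_family`."""
    if task is TaskKind.OPTIMIZATION:
        # Same family: shared reasoning patterns ("model empathy")
        # help reconstruct WHY the target failed.
        return target_family
    # Evaluation: any family that fails differently helps break
    # correlated blind spots.
    others = [f for f in families if f != target_family]
    if not others:
        raise ValueError("no cross-family evaluator available")
    return others[0]
```

The point of making the policy explicit is that the failure mode the text warns about, mixing the two regimes, becomes a type error rather than a silent misconfiguration.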

## Challenges

The "model empathy" evidence is primarily architectural — AutoAgent's results demonstrate that same-family optimization works, but the controlled comparison (same-family vs cross-family optimization on identical tasks, controlling for capability differences) has not been published. The SpreadsheetBench and TerminalBench results show the system works, not that model empathy is the specific mechanism. It's possible that the gains come from other architectural choices rather than the same-family pairing specifically.

The boundary between "evaluation" and "optimization" may blur in practice. Evaluation that includes suggested fixes is partially optimization. Optimization that includes quality checks is partially evaluation. The clean task-dependent resolution may need refinement as these operations converge in real systems.

Additionally, as model families converge in training methodology and data, the diversity benefit of cross-family evaluation may decrease over time. If all major model families share similar training distributions, cross-family evaluation may not break blind spots as effectively as Kim et al. observed.

---

Relevant Notes:
- [[multi-model evaluation architecture]] — our eval spec uses cross-family evaluation to break blind spots (correct for evaluation), but should use same-family optimization if self-improvement loops are added
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — SICA's acceptance-gating mechanism should use same-family optimization per this finding; the evaluation gate should use cross-family per Kim et al.
- [[self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration]] — NLAH's self-evolution mechanism is an optimization task where model empathy would help

Topics:
- [[_map]]
@@ -0,0 +1,58 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "GEPA (Guided Evolutionary Prompt Architecture) from Nous Research reads execution traces to understand WHY agents fail, generates candidate variants through evolutionary search, evaluates against 5 guardrails, and submits best candidates as PRs for human review — a distinct self-improvement mechanism from SICA's acceptance-gating"
confidence: experimental
source: "Nous Research hermes-agent-self-evolution repository (GitHub, 2026); GEPA framework presented as ICLR 2026 Oral; DSPy integration for optimization; $2-10 per optimization cycle reported"
created: 2026-04-05
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
- "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
---

# Evolutionary trace-based optimization submits improvements as pull requests for human review creating a governance-gated self-improvement loop distinct from acceptance-gating or metric-driven iteration

Nous Research's Guided Evolutionary Prompt Architecture (GEPA) implements a self-improvement mechanism structurally different from both SICA's acceptance-gating and NLAH's retry-based self-evolution. The key difference is the input: GEPA reads execution traces to understand WHY things failed, not just THAT they failed.

## The mechanism

1. **Trace analysis** — the system examines full execution traces of agent behavior, identifying specific decision points where the agent made suboptimal choices. This is diagnostic, not metric-driven.
2. **Evolutionary search** — generates candidate variants of prompts, skills, or orchestration logic. Uses DSPy's optimization framework for structured prompt variation.
3. **Constraint evaluation** — each candidate is evaluated against 5 guardrails before advancing:
   - 100% test pass rate (no regressions)
   - Size limits (skills capped at 15KB)
   - Caching compatibility (changes must not break cached behavior)
   - Semantic preservation (the skill's core function must survive mutation)
   - Human PR review (the governance gate)
4. **PR submission** — the best candidate is submitted as a pull request for human review. The improvement does not persist until a human approves it.

## How it differs from existing self-improvement mechanisms

**vs SICA (acceptance-gating):** SICA improves by tightening retry loops — running more attempts and accepting only passing results. It doesn't modify the agent's skills or prompts. GEPA modifies the actual procedural knowledge the agent uses. SICA is behavioral iteration; GEPA is structural evolution.

**vs NLAH self-evolution:** NLAH's self-evolution mechanism accepts or rejects module changes based on performance metrics (+4.8pp on SWE-Bench). GEPA uses trace analysis to understand failure causes before generating fixes. NLAH asks "did this help?"; GEPA asks "why did this fail and what would fix it?"

## The governance model

The PR-review-as-governance-gate is the most architecturally interesting feature. The 5 guardrails map closely to our quality gates (schema validation, test pass, size limits, semantic preservation, human review). The economic cost ($2-10 per optimization cycle) makes this viable for continuous improvement at scale.

Only Phase 1 (skill optimization) has shipped as of April 2026. Planned phases include Phase 2 (tool optimization), Phase 3 (orchestration optimization), Phase 4 (memory optimization), and Phase 5 (full agent optimization). The progression from skills → tools → orchestration → memory → full agent mirrors our own engineering acceleration roadmap.

## Challenges

GEPA's published performance data is limited — the ICLR 2026 Oral acceptance validates the framework but specific before/after metrics across diverse tasks are not publicly available. The $2-10 per cycle cost is self-reported and may not include the cost of failed evolutionary branches.

The PR-review governance gate is the strongest constraint but also the bottleneck — human review capacity limits the rate of self-improvement. If the system generates improvements faster than humans can review them, queuing dynamics may cause the most impactful improvements to wait behind trivial ones. This is the same throughput constraint our system faces with Leo as the evaluation bottleneck.

The distinction between "trace analysis" and "metric-driven iteration" may be less sharp in practice. Both ultimately depend on observable signals of failure — traces are richer but noisier than metrics. Whether the richer input produces meaningfully better improvements at scale is an open empirical question.

---

Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — SICA's structural separation is the necessary condition; GEPA adds evolutionary search and trace analysis on top of this foundation
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — GEPA's PR-review gate functions as the curation step that prevents the -1.3pp degradation from uncurated self-generation
- [[self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration]] — NLAH's acceptance-gating is a simpler mechanism; GEPA extends it with evolutionary search and trace-based diagnosis

Topics:
- [[_map]]
@@ -24,6 +24,16 @@ The three spaces have different metabolic rates reflecting different cognitive f
The flow between spaces is directional. Observations can graduate to knowledge notes when they resolve into genuine insight. Operational wisdom can migrate to the self space when it becomes part of how the agent works rather than what happened in one session. But knowledge does not flow backward into operational state, and identity does not dissolve into ephemeral processing. The metabolism has direction — nutrients flow from digestion to tissue, not the reverse.

## Additional Evidence (supporting)

**Hermes Agent (Nous Research, 26K+ stars)** implements a 4-tier memory system that independently converges on the three-space taxonomy while adding a fourth space:
- **Prompt Memory (MEMORY.md)** — 3,575-character hard cap, always loaded, curated identity and preferences. Maps to the episodic/self space.
- **Session Search (SQLite+FTS5)** — LLM-summarized session history with lineage preservation. Maps to the semantic/knowledge space. Retrieved on demand, not always loaded.
- **Skills (procedural)** — markdown procedure files with progressive disclosure (names first, full content on relevance detection). Maps to the procedural/methodology space.
- **Honcho (dialectic user modeling)** — optional 4th tier with 12 identity layers modeling the user, not the agent. This is a genuinely new space absent from the three-space taxonomy — user modeling as a distinct memory type with its own metabolic rate (evolves per-interaction but slower than session state).
The 4-tier system corroborates the three-space architecture while suggesting the taxonomy may be incomplete: user/interlocutor modeling may constitute a fourth memory space not captured by Tulving's agent-centric framework. Cache-aware design ensures that learning (adding knowledge) doesn't grow the token bill — the memory spaces grow independently of inference cost.

## Challenges

The three-space mapping is Cornelius's application of Tulving's established cognitive science framework to vault design, not an empirical discovery about agent architectures. Whether three spaces is the right number (versus two, or four) for agent systems specifically has not been tested through controlled comparison. The metabolic rate differences are observed in one system's operation, not measured across multiple architectures. Additionally, the directional flow constraint (knowledge never flows backward into operational state) may be too rigid — there are cases where a knowledge claim should directly modify operational behavior without passing through the identity layer.
@@ -0,0 +1,51 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Hermes Agent's architecture demonstrates that loading only skill names and summaries by default, with full content loaded on relevance detection, makes 40 skills cost approximately the same tokens as 200 skills — a design principle where knowledge base growth does not proportionally increase inference cost"
confidence: likely
source: "Nous Research Hermes Agent architecture (Substack deep dive, 2026); 3,575-character hard cap on prompt memory; auxiliary model compression with lineage preservation in SQLite; 26K+ GitHub stars, largest open-source agent framework"
created: 2026-04-05
depends_on:
- "memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds"
- "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
---

# Progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance-gated expansion avoids the linear cost of full context loading

Agent systems face a scaling dilemma: more knowledge should improve performance, but loading more knowledge into context increases token cost linearly and degrades attention quality. Progressive disclosure resolves this by loading knowledge at multiple tiers of specificity, expanding to full detail only when relevance is detected.

## The design principle

Hermes Agent (Nous Research, 26K+ GitHub stars) implements this through a tiered loading architecture:

1. **Tier 0 — Always loaded:** A 3,575-character prompt memory file (MEMORY.md) contains the agent's core identity, preferences, and active context. Hard-capped to prevent growth.
2. **Tier 1 — Names only:** All available skills are listed by name and one-line summary. The agent sees what it knows how to do without paying the token cost of the full procedures.
3. **Tier 2 — Relevance-gated expansion:** When the agent detects that a skill is relevant to the current task, the full skill content loads into context. Only the relevant skills pay full token cost.
4. **Tier 3 — Session search:** Historical context is stored in SQLite with FTS5 indexing. Retrieved on demand, not loaded by default. An auxiliary model compresses session history while preserving lineage information.
The result: 40 skills and 200 skills have approximately the same base token cost, because most skills exist only as names in the prompt. Growth in the knowledge base does not proportionally increase inference cost. The system scales with relevance, not with total knowledge.
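A context builder following this tiered scheme can be sketched as below. This is hypothetical: the 3,575-character cap is the figure reported above, while the names, formatting, and relevance set are illustrative. Base cost is the prompt memory plus one summary line per skill; only skills the relevance gate selects pay for their full body.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    summary: str  # one line, always in context (Tier 1)
    body: str     # full procedure, loaded only on relevance (Tier 2)

def build_context(memory_md: str, skills: list[Skill], relevant: set[str]) -> str:
    """Assemble the prompt under progressive disclosure."""
    if len(memory_md) > 3575:
        raise ValueError("Tier 0 prompt memory exceeds hard cap")
    lines = [memory_md, "## Skills"]
    lines += [f"- {s.name}: {s.summary}" for s in skills]    # Tier 1: names only
    lines += [s.body for s in skills if s.name in relevant]  # Tier 2: expansion
    return "\n".join(lines)
```

Adding skills grows the prompt by one summary line each, so 40 and 200 skills cost nearly the same tokens as long as the relevance set stays small.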

## Why this matters architecturally

This is the practical implementation of the context≠memory distinction. Naive approaches treat context window size as the memory constraint — load everything, hope attention handles it. Progressive disclosure treats context as a precious resource to be allocated based on relevance, with the full knowledge base available but not loaded.

The 3,575-character hard cap on prompt memory is an engineering decision that embodies a principle: the always-on context should be minimal and curated, not a growing dump of everything the agent has learned. Compression via an auxiliary model allows the system to preserve information while respecting the cap.

## Challenges

The "flat scaling" claim is based on Hermes's architecture design and reported behavior, not a controlled experiment comparing flat-loaded vs progressively-disclosed knowledge bases on identical tasks. The token cost savings are real (fewer tokens in prompt), but whether performance is equivalent — whether the agent makes equally good decisions with names-only vs full-content loading — has not been systematically measured.

Relevance detection is the critical bottleneck. If the system fails to detect that a skill is relevant, it won't load the full content, and the agent operates without knowledge it has but didn't access. False negatives in relevance detection trade token efficiency for capability loss. The quality of the relevance gate determines whether progressive disclosure is genuinely "flat scaling" or merely "cheaper at the cost of sometimes being wrong."

The 3,575-character cap is specific to Hermes and may not generalize. Different agent architectures, task domains, and model capabilities may require different cap sizes. The principle (a hard cap on always-on context) is likely general; the specific number is engineering judgment.

---

Relevant Notes:
- [[memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds]] — progressive disclosure operates primarily within the procedural memory space, loading methodology on demand rather than storing it all in active context
- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — progressive disclosure is the architectural mechanism that implements the context≠memory distinction in practice: the knowledge base grows (memory) while the active context stays flat (not-memory)
- [[current AI models use less than one percent of their advertised context capacity effectively because attention degradation and information density combine to create a sharp effectiveness frontier well inside the nominal window]] — the >99% shortfall in effective context use is exactly what progressive disclosure addresses: load less, use it better

Topics:
- [[_map]]