theseus: add 5 Nous Research source archives for codex ingestion

- GEPA self-evolution system (trace-based evolutionary prompt optimization)
- DeMo: Decoupled Momentum Optimization (Peng, Kingma et al. — 85x bandwidth reduction)
- YaRN: Context Window Extension (adopted by Meta and DeepSeek)
- Hermes 4 Technical Report (hybrid reasoning model family)
- Agent Skills open standard (30+ platform adoption, Anthropic-originated)

Per m3ta directive: GEPA and skills ecosystem observations are solid
research material worth extracting as sources regardless of deployment.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
This commit is contained in:
m3taversal 2026-04-07 15:53:09 +01:00 committed by Teleo Agents
parent efe23f931a
commit 1de60685be
5 changed files with 356 additions and 0 deletions


@@ -0,0 +1,48 @@
---
type: source
title: "YaRN: Efficient Context Window Extension of Large Language Models"
author: "Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole"
url: https://arxiv.org/abs/2309.00071
date: 2023-08-31
domain: ai-alignment
intake_tier: research-task
rationale: "YaRN is Nous Research's context extension method adopted by Meta and DeepSeek. Demonstrates open-source research influencing frontier labs — evidence for knowledge diffusion patterns in AI development."
proposed_by: theseus
format: paper
status: unprocessed
tags: [nous-research, context-window, rotary-embeddings, yarn, meta, deepseek]
---
## YaRN: Efficient Context Window Extension of Large Language Models
arXiv:2309.00071 (August 2023, revised February 2026). First significant research publication from Nous Research.
### Problem
Transformer-based language models cannot generalize beyond their original training sequence length. This limits practical utility for tasks requiring long-context reasoning (document analysis, codebase understanding, multi-turn conversation).
### Methodology
YaRN (Yet another RoPE extensioN method) builds on Rotary Position Embeddings (RoPE). The key innovation is a compute-efficient interpolation method that extends context windows without requiring full retraining.
### Key Results
- **10x fewer tokens** required for context extension fine-tuning compared to previous methods
- **2.5x fewer training steps** than prior approaches
- Enables LLaMA models to handle 128K token contexts
- State-of-the-art performance in context window extension at time of publication
- Demonstrates ability to extrapolate beyond the fine-tuning dataset length
### Adoption
YaRN was adopted by:
- **Meta** — incorporated into Llama model family
- **DeepSeek** — used in their long-context model training
This adoption pattern is significant: a small open-source research lab (Nous Research, pre-funding) produced a technique that was adopted by two of the largest AI labs. This demonstrates that in AI research, the quality of the technique matters more than the institutional prestige of the lab — open-source research can directly influence frontier model development.
### Technical Details
The method modifies how RoPE embeddings handle positions beyond the training length. Rather than simple linear interpolation (which degrades quality) or full retraining (which is expensive), YaRN uses a frequency-based decomposition that preserves the geometric properties of RoPE while efficiently extending to longer sequences.
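In simplified form, that frequency-based blend can be sketched as follows. This is a minimal illustration of the idea, not the paper's exact parameterization: the wavelength thresholds (`alpha`, `beta`, here in token units) and the linear ramp shape are assumptions chosen for clarity.

```python
import numpy as np

def yarn_frequencies(dim, scale, base=10000.0, alpha=32.0, beta=512.0):
    """Frequency-dependent RoPE interpolation (simplified YaRN-style blend).

    High-frequency dimensions (short wavelengths, well inside the training
    context) are left untouched; low-frequency dimensions (wavelengths longer
    than the training context) are divided by the context scale factor;
    dimensions in between are blended with a linear ramp.
    """
    # Standard RoPE inverse frequencies: theta_d = base^(-2d/dim)
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    wavelengths = 2 * np.pi / freqs
    # ramp = 0 -> keep frequency as-is; ramp = 1 -> interpolate (divide by scale)
    ramp = np.clip((wavelengths - alpha) / (beta - alpha), 0.0, 1.0)
    return freqs * (1.0 - ramp) + (freqs / scale) * ramp
```

Compared with plain linear interpolation (dividing every frequency by `scale`), this leaves the short-wavelength dimensions intact, which is what preserves local positional resolution.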
Code publicly available on GitHub. Licensed under CC BY 4.0.


@@ -0,0 +1,56 @@
---
type: source
title: "DeMo: Decoupled Momentum Optimization"
author: "Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, Qiang Liu"
url: https://arxiv.org/abs/2411.19870
date: 2024-11-29
domain: ai-alignment
intake_tier: research-task
rationale: "DeMo enables distributed training across the internet with 85x less communication bandwidth. Key infrastructure for decentralized AI training (Psyche network) and compute governance research."
proposed_by: theseus
format: paper
status: unprocessed
tags: [nous-research, distributed-training, optimization, decentralized-ai, compute-governance, kingma]
---
## DeMo: Decoupled Momentum Optimization
arXiv:2411.19870 (November 2024, revised February 2026). Co-authored by Diederik P. Kingma (founding member of OpenAI, co-inventor of the Adam optimizer).
### Problem
Communication bandwidth is the primary bottleneck in distributed neural network training. Standard approaches (AllReduce, DDP) require transmitting full gradient tensors between nodes, making training across datacenters or over the internet impractical.
### Methodology
DeMo implements three core components:
1. **Decoupled local momentum updates** — separates momentum computation from gradient communication, allowing nodes to maintain local momentum state
2. **Fast orthonormal transformation with sparsification** — applies DCT (Discrete Cosine Transform) followed by top-k filtering to compress gradient data before transmission
3. **Momentum-based error feedback** — reuses momentum buffers for error correction during reconstruction, maintaining convergence despite heavy compression
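Components 2 and 3 can be sketched in a few lines of numpy. The function names and per-step structure here are illustrative, not DeMo's actual API; the point is the shape of the mechanism: orthonormal transform, top-k selection, and a residual kept locally as error feedback.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; orthonormality means the inverse is the transpose."""
    j = np.arange(n)
    M = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def compress(momentum, k):
    """Transform local momentum and keep only the k largest coefficients."""
    coeffs = dct_matrix(momentum.size) @ momentum
    idx = np.argsort(np.abs(coeffs))[-k:]     # top-k by magnitude
    return idx, coeffs[idx]

def decompress(idx, vals, n):
    """Reconstruct a dense tensor from the sparse (index, value) pairs."""
    coeffs = np.zeros(n)
    coeffs[idx] = vals
    return dct_matrix(n).T @ coeffs

def step(momentum, k):
    """One communication step: transmit the sparse pairs, keep the part
    that was NOT transmitted as local error feedback for later steps."""
    idx, vals = compress(momentum, k)
    residual = momentum - decompress(idx, vals, momentum.size)
    return (idx, vals), residual
```

Transmitting `k` index/value pairs instead of all `n` momentum entries is where the bandwidth reduction comes from; the residual ensures dropped information is retried rather than lost.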
### Key Results
**Communication Efficiency:**
- Reduces per-step communication by up to two orders of magnitude with minimal computational overhead
- Transmits up to **85x less data per GPU** than AdamW-DDP in tested language model training
**Convergence:**
- Achieves comparable loss and accuracy to standard AdamW-DDP despite drastically lower communication
- Validated on 300M and 1B-parameter language models
**System Properties:**
- Topology-agnostic design supporting multi-datacenter and Ethernet-based configurations
- Does not require high-speed interconnects (InfiniBand), making commodity hardware viable
### Significance
DeMo is the theoretical foundation for Nous Research's **Psyche network** — their decentralized training infrastructure where contributors provide GPUs and earn NOUS tokens. By reducing communication bandwidth by 85x, DeMo makes it practical to train large language models across geographically distributed commodity hardware connected by regular internet links.
This has direct implications for compute governance research: if training can be effectively distributed across many participants using commodity hardware, centralized compute control (export restrictions, datacenter regulation) becomes structurally harder to enforce.
### Related Work
DeMo builds on and extends the gradient compression literature (1-bit Adam, PowerSGD) but achieves better convergence through the momentum decoupling mechanism. Co-authorship by Kingma (co-inventor of the Adam optimizer) lends theoretical credibility to the approach.
Code available on GitHub. Used in production for Psyche network training runs including Consilience (40B parameters, 20T tokens — the largest pretraining run over the internet).


@@ -0,0 +1,55 @@
---
type: source
title: "Hermes 4 Technical Report"
author: "Ryan Teknium, Roger Jin, Jai Suphavadeeprasit, Dakota Mahan, Jeffrey Quesnelle, Joe Li, Chen Guang, Shannon Sands, Karan Malhotra"
url: https://arxiv.org/abs/2508.18255
date: 2025-08-25
domain: ai-alignment
intake_tier: research-task
rationale: "Hermes 4 is the model family underlying the Hermes Agent. Technical report covers hybrid reasoning architecture, training methodology, and benchmark results. Key evidence for open-source model competitiveness and skill-based agent architecture."
proposed_by: theseus
format: paper
status: unprocessed
tags: [nous-research, hermes-4, hybrid-reasoning, open-source-models, training-methodology]
---
## Hermes 4 Technical Report
arXiv:2508.18255 (August 2025). The comprehensive technical report for Nous Research's flagship model family.
### Overview
Hermes 4 is a family of hybrid reasoning models that combine structured, multi-turn reasoning with broad instruction-following ability. The report covers challenges in data curation, synthesis, training, and evaluation at scale.
### Model Family
- **Hermes-4-Llama-3.1-405B** — frontier hybrid-mode reasoning model (802GB)
- **Hermes-4-Llama-3.1-70B** — smaller variant with shared improvements (140GB)
- **Hermes-4-14B** — dense model for local inference (28GB)
- **Hermes-4.3-Seed-36B** — post-trained entirely on the Psyche decentralized network (72GB)
### Hybrid Reasoning Architecture
The key innovation is the ability to switch between structured reasoning mode (chain-of-thought, step-by-step) and direct instruction-following mode. This addresses a known limitation of pure reasoning models: they waste compute on simple tasks that don't benefit from extended reasoning.
### Training Methodology
The report addresses challenges in:
- Data curation at scale — quality filtering, decontamination, domain balancing
- Synthetic data generation — using stronger models to generate training data
- Multi-stage training pipeline — pre-training → supervised fine-tuning → alignment
- Evaluation across mathematical reasoning, coding, knowledge, comprehension, and alignment benchmarks
### Benchmark Results
Comprehensive benchmarking across multiple domains. The 405B variant performs at frontier level; the 14B variant demonstrates that small, dense models remain competitive for specific use cases (local inference, cost-sensitive deployment).
### Decentralized Training (Hermes 4.3)
Hermes-4.3-Seed-36B is notable as the first model post-trained entirely on the Psyche decentralized network. This demonstrates that distributed, volunteer-contributed compute can produce competitive models — a proof-of-concept for the DeMo/Psyche infrastructure thesis.
### Significance for Agent Architecture
Hermes 4 is the default model powering the Hermes Agent. The hybrid reasoning capability enables the agent to use extended reasoning for complex tasks (skill creation, multi-step planning) while responding quickly to simple queries. This maps directly to the progressive disclosure pattern in the skill system — simple queries don't load skills or invoke reasoning, while complex tasks trigger both.
Model weights publicly released via Hugging Face. Licensed under CC BY 4.0.


@@ -0,0 +1,85 @@
---
type: source
title: "Hermes Agent Self-Evolution: Evolutionary Self-Improvement via DSPy + GEPA"
author: "Nous Research (Teknium, Jeffrey Quesnelle, Karan Malhotra)"
url: https://github.com/NousResearch/hermes-agent-self-evolution
date: 2026-02-24
domain: ai-alignment
intake_tier: research-task
rationale: "GEPA is a trace-based evolutionary prompt optimizer that outperforms RL-based methods. Key evidence for agent self-improvement claims and the skills-as-codification thesis."
proposed_by: theseus
format: whitepaper
status: processed
processed_by: theseus
processed_date: 2026-04-07
claims_extracted:
- "GEPA evolutionary trace-based optimization is distinct from acceptance-gating and RL approaches because it reads why failures happen rather than just that they failed"
enrichments:
- "curated agent skills persist and improve through use producing flat token scaling at 40 skills equivalent to 200 skills"
tags: [nous-research, gepa, self-evolution, prompt-optimization, agent-skills, dspy]
---
## GEPA: Genetic-Pareto Prompt Evolution
GEPA (Genetic-Pareto Prompt Evolution) is Nous Research's evolutionary optimizer for agent self-improvement. It is implemented in the `hermes-agent-self-evolution` repository (704 stars, MIT license) and integrates DSPy for prompt optimization with evolutionary trace analysis.
### Core Mechanism
GEPA is a **reflective evolutionary optimizer** that examines WHY components fail, not merely THAT they fail. The system reads execution traces to understand concrete failure modes, then proposes targeted improvements. This trace-based analysis distinguishes GEPA from simpler mutation approaches (random perturbation) and from RL-based methods (reward signal without causal explanation).
### Evolutionary Process
1. Read current skill/prompt/tool definition
2. Generate evaluation dataset (synthetic or from real session history via SQLite)
3. Execute candidates and capture full execution traces
4. GEPA optimizer analyzes traces and proposes targeted mutations
5. Evaluate variants against 5 constraint gates
6. Select best performer via Pareto front
7. Submit as pull request for human review
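Step 6 can be sketched as a Pareto filter over per-objective scores. The `Candidate` structure below is hypothetical (the repository's actual data model is not documented here); it only shows what "select via Pareto front" means: keep every variant that no other variant beats on all objectives at once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    prompt: str
    scores: tuple  # one score per evaluation objective, higher is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """a is at least as good on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a.scores, b.scores))
            and any(x > y for x, y in zip(a.scores, b.scores)))

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

Using a front rather than a single scalar score preserves variants that trade off differently across objectives (e.g. accuracy vs. token cost), which is the "Pareto" half of Genetic-Pareto.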
### Five Constraint Gates (Guardrails)
Every evolved variant must satisfy all five gates before consideration:
1. **Full Test Suite:** `pytest tests/ -q` must pass 100%
2. **Size Limits:** Skills ≤15KB, tool descriptions ≤500 characters
3. **Caching Compatibility:** No mid-conversation changes allowed
4. **Semantic Preservation:** Variants must not drift from original intent
5. **PR Review:** All changes go through human review, never direct commit
The fifth gate — PR-review governance — ensures no evolved variant reaches production without human approval. This is structurally equivalent to the acceptance-gating pattern in SICA (SWE-Bench self-improvement), but GEPA adds trace-based explanation of WHY the mutation was proposed.
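Of the five gates, only the size limits can be checked in isolation; a minimal sketch follows. The function name and constants mirror the limits stated above; gates 1 and 3-5 need the test suite, conversation state, and human review, so they are not reproduced here.

```python
MAX_SKILL_BYTES = 15 * 1024      # gate 2: skill files <= 15 KB
MAX_TOOL_DESC_CHARS = 500        # gate 2: tool descriptions <= 500 characters

def passes_size_gate(skill_md: str, tool_description: str) -> bool:
    """Check only the size-limit gate for an evolved variant."""
    return (len(skill_md.encode("utf-8")) <= MAX_SKILL_BYTES
            and len(tool_description) <= MAX_TOOL_DESC_CHARS)
```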
### What Gets Optimized (Phased Rollout)
- **Phase 1 (Implemented):** Skill files (SKILL.md) — procedural memory
- **Phase 2 (Planned):** Tool descriptions — capability interfaces
- **Phase 3 (Planned):** System prompt sections — behavioral tuning
- **Phase 4 (Planned):** Tool implementation code via Darwinian Evolver
- **Phase 5 (Planned):** Continuous improvement loop
### Architecture Split
The system distinguishes between:
- **Reflective text evolution** (DSPy + GEPA) — for prompts, descriptions, skills
- **Code evolution** (Darwinian Evolver, AGPL v3) — for implementation code
This separation applies appropriate optimization strategies per artifact type. Text evolution operates entirely via API calls — mutating natural language, evaluating results, selecting best variants. Cost: ~$2-10 per optimization run.
### Integration with DSPy
DSPy provides the prompt optimization framework. GEPA adds the evolutionary trace analysis on top. Combined, they mutate natural language descriptions of skills, tool behaviors, and system instructions with causal grounding in observed failure modes.
### Key Distinctions from Other Self-Improvement Approaches
| Approach | Signal Type | Causal? | Governance |
|----------|------------|---------|------------|
| SICA (SWE-Bench) | Pass/fail acceptance gate | No | Metric threshold |
| NLAH (Pan et al.) | Module ablation | Partial | Researcher manual |
| GRPO (RL) | Reward signal | No | Training objective |
| **GEPA** | Execution trace analysis | Yes | 5-gate + PR review |
GEPA's distinguishing feature is that it reads the execution trace to understand the causal chain of failure, then proposes mutations that address the root cause rather than randomly perturbing until something works.
### Development Status
Repository: 704 stars, 64 forks, 7 commits, actively under development. MIT license for core; Darwinian Evolver uses AGPL v3 as external CLI only.


@@ -0,0 +1,112 @@
---
type: source
title: "Agent Skills: An Open Standard for Giving Agents New Capabilities"
author: "Anthropic (originator), AgentSkills community"
url: https://agentskills.io
date: 2026-03-01
domain: ai-alignment
intake_tier: research-task
rationale: "Agent Skills is the open standard for SKILL.md files, adopted by 30+ platforms including Claude Code, Cursor, GitHub Copilot, VS Code, OpenAI Codex, Hermes Agent, and JetBrains Junie. This is the primary evidence for our 'Agent Skills as industrial codification' claim — the largest real-world instance of procedural knowledge standardization for AI agents."
proposed_by: theseus
format: whitepaper
status: processed
processed_by: theseus
processed_date: 2026-04-07
claims_extracted: []
enrichments:
- "agent skills as industrial codification pattern mirrors historical skill decomposition from craft guilds through scientific management to algorithmic management"
tags: [agent-skills, skill-md, open-standard, anthropic, codification, interoperability]
---
## Agent Skills: Open Standard Overview
Agent Skills is an open format for giving AI agents new capabilities and domain expertise. Originally developed by Anthropic, released as an open standard, and adopted by 30+ agent platforms as of April 2026.
### What Agent Skills Are
Skills are folders of instructions, scripts, and resources that agents can discover and use to perform tasks more accurately and efficiently. A skill consists of:
```
skill-name/
├── SKILL.md # Required: metadata + instructions
├── scripts/ # Optional: executable code
├── references/ # Optional: documentation
├── assets/ # Optional: templates, resources
└── ... # Any additional files
```
### SKILL.md Specification
The core file has YAML frontmatter with required fields:
- `name` — lowercase alphanumeric + hyphens, max 64 chars, must match directory name
- `description` — max 1024 chars, describes what the skill does AND when to use it
Optional fields: `license`, `compatibility`, `metadata` (arbitrary key-value), `allowed-tools` (experimental pre-approved tool list).
The Markdown body contains instructions with no format restrictions. Recommended: step-by-step procedures, input/output examples, edge cases.
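A minimal SKILL.md satisfying the required fields might look like this (the skill name, description, and body are hypothetical, written only to illustrate the format):

```markdown
---
name: pdf-extraction
description: Extract text and tables from PDF files. Use when the user provides a PDF and asks for its contents, a summary, or structured data.
license: MIT
metadata:
  version: "0.1"
---

## Procedure
1. Run `scripts/extract.py <file>` to dump raw text.
2. For tables, consult `references/tables.md` before parsing.
```

Note that `name` matches the directory name (`pdf-extraction/`) and the `description` states both what the skill does and when to use it, as the spec requires.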
### Progressive Disclosure (Token Efficiency)
Skills are structured for efficient context usage across three tiers:
1. **Metadata** (~100 tokens) — `name` and `description` loaded at startup for ALL skills
2. **Instructions** (<5000 tokens recommended) — full SKILL.md body, loaded when the skill is activated
3. **Resources** (as needed) — scripts, references, assets loaded only when required
This means an agent can have hundreds of skills available with minimal token overhead. Only the names and descriptions are in context at startup; the full instructions load on demand.
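The first two tiers can be sketched as a loader. This is a minimal illustration, not any platform's actual implementation: it assumes well-formed frontmatter and uses a naive `---` split rather than a real YAML parser.

```python
from pathlib import Path

def skill_metadata(skill_dir: Path) -> dict:
    """Tier 1: only name/description (~100 tokens per skill),
    loaded at startup for every installed skill."""
    header = (skill_dir / "SKILL.md").read_text().split("---")[1]
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("name", "description"):
            meta[key.strip()] = value.strip()
    return meta

def skill_instructions(skill_dir: Path) -> str:
    """Tier 2: the full SKILL.md body, loaded only when the skill activates."""
    return (skill_dir / "SKILL.md").read_text().split("---", 2)[2]
```

The startup cost is therefore proportional to the number of skills times ~100 tokens, while the larger instruction bodies are paid for only by the skills a task actually uses.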
### Adopting Platforms (30+)
**Major platforms confirmed:**
- **Anthropic:** Claude Code, Claude (platform)
- **Microsoft/GitHub:** VS Code, GitHub Copilot
- **OpenAI:** Codex
- **Google:** Gemini CLI
- **Cursor**
- **JetBrains:** Junie
- **AWS:** Kiro
- **Nous Research:** Hermes Agent
- **Letta** (stateful agents with memory)
- **Block:** Goose
- **OpenHands** (cloud coding agents)
- **Roo Code**
- **Mistral AI:** Vibe
- **Databricks:** Genie Code
- **Snowflake:** Cortex Code
- **Factory** (AI-native development)
- **Spring AI** (Java ecosystem)
- **TRAE** (ByteDance)
- **Qodo** (code integrity)
- **Laravel Boost**
- **Amp**, Autohand, Mux, OpenCode, Firebender, Piebald, pi, Command Code, Ona, VT Code, Emdash, Agentman
### Why This Matters
The Agent Skills standard is the largest real-world instance of industrial codification for AI agents. The pattern mirrors historical skill decomposition:
1. **Craft guilds** — tacit knowledge held by individuals
2. **Scientific management (Taylor)** — explicit process documentation
3. **Algorithmic management** — automated process enforcement
4. **Agent Skills** — AI-readable procedural knowledge that agents discover, load, and execute
The key difference: Agent Skills are designed for **interoperability**. A skill written for Claude Code works in Cursor, Hermes Agent, GitHub Copilot, etc. This creates a marketplace dynamic (agentskills.io) where procedural knowledge becomes portable, tradeable, and composable across platforms.
### Hermes Agent's Implementation
Hermes Agent was one of the earliest adopters and extends the standard with:
- **Auto-creation:** Complex tasks (5+ tool calls) trigger automatic skill generation
- **Self-evolution:** GEPA optimizes existing skills via trace-based mutation
- **Progressive disclosure at scale:** 40 skills cost roughly the same startup tokens as 200 skills
- **Community marketplace:** Skills Hub at agentskills.io for sharing/installing
### Validation and Tooling
The `skills-ref` reference library provides validation:
```bash
skills-ref validate ./my-skill
```
This checks frontmatter validity and naming conventions. Available on GitHub at agentskills/agentskills.
### Open Development
The standard is governed via open development on GitHub (agentskills/agentskills) and Discord. Contributions from any platform are accepted. The spec is versioned and evolving — `allowed-tools` is explicitly marked as experimental.