- GEPA self-evolution system (trace-based evolutionary prompt optimization)
- DeMo: Decoupled Momentum Optimization (Peng, Kingma et al. — 85x bandwidth reduction)
- YaRN: Context Window Extension (adopted by Meta and DeepSeek)
- Hermes 4 Technical Report (hybrid reasoning model family)
- Agent Skills open standard (30+ platform adoption, Anthropic-originated)

Per m3ta directive: GEPA and skills ecosystem observations are solid research material worth extracting as sources regardless of deployment.

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| source | YaRN: Efficient Context Window Extension of Large Language Models | Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole | https://arxiv.org/abs/2309.00071 | 2023-08-31 | ai-alignment | research-task | YaRN is Nous Research's context extension method adopted by Meta and DeepSeek. Demonstrates open-source research influencing frontier labs — evidence for knowledge diffusion patterns in AI development. | theseus | paper | unprocessed | |
# YaRN: Efficient Context Window Extension of Large Language Models
arXiv:2309.00071 (August 2023, revised February 2026). First significant research publication from Nous Research.
## Problem
Transformer-based language models fail to generalize to sequences longer than their training context length. This limits their practical utility for tasks requiring long-context reasoning: document analysis, codebase understanding, and extended multi-turn conversation.
## Methodology
YaRN (Yet another RoPE extensioN method) builds on Rotary Position Embeddings (RoPE). The key innovation is a compute-efficient interpolation method that extends context windows without requiring full retraining.
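As background for the interpolation idea, the simplest RoPE extension baseline ("position interpolation") rescales all positions uniformly so that a longer target context maps back into the trained range. A minimal sketch (the function names, dimensions, and lengths here are illustrative, not from the paper's code):

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=10000.0):
    """Standard RoPE: each even-indexed dimension pair rotates at its own frequency."""
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, freqs)  # shape: (len(positions), head_dim // 2)

def interpolated_angles(positions, trained_len=4096, target_len=32768, **kw):
    """Naive linear position interpolation: squeeze target_len positions
    into the trained range. Every frequency is compressed by the same
    factor, which is what degrades quality and what YaRN's per-frequency
    scheme avoids."""
    return rope_angles(np.asarray(positions) * (trained_len / target_len), **kw)
```

Under uniform interpolation, position 32768 at the target length produces exactly the same rotation angles as position 4096 did during training; YaRN instead scales each frequency band differently, as described below.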
## Key Results
- 10x fewer tokens required for context extension fine-tuning compared to previous methods
- 2.5x fewer training steps than prior approaches
- Enables LLaMA models to handle 128K token contexts
- State-of-the-art performance in context window extension at time of publication
- Demonstrates ability to extrapolate beyond the fine-tuning dataset length
## Adoption
YaRN was adopted by:
- Meta — incorporated into Llama model family
- DeepSeek — used in their long-context model training
This adoption pattern is significant: a small open-source research lab (Nous Research, pre-funding) produced a technique that was adopted by two of the largest AI labs. This demonstrates that in AI research, the quality of the technique matters more than the institutional prestige of the lab — open-source research can directly influence frontier model development.
## Technical Details
The method modifies how RoPE embeddings handle positions beyond the training length. Rather than simple linear interpolation (which degrades quality) or full retraining (which is expensive), YaRN uses a frequency-based decomposition that preserves the geometric properties of RoPE while efficiently extending to longer sequences.
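The frequency-based decomposition can be sketched as follows. This is a simplified reading of the paper's "NTK-by-parts" scheme, not its reference implementation: dimensions whose wavelengths complete many rotations within the original context are left untouched, dimensions with long wavelengths are fully interpolated, and a linear ramp blends the region in between (the `alpha`/`beta` thresholds and default values follow the paper's LLaMA settings; other parameter names are illustrative):

```python
import numpy as np

def yarn_rope_frequencies(head_dim=128, base=10000.0, orig_ctx=4096,
                          scale=16.0, alpha=1.0, beta=32.0):
    """YaRN-style per-frequency RoPE interpolation (sketch).

    ramp = 1 -> keep the original frequency (high-frequency dims);
    ramp = 0 -> divide the frequency by `scale` (low-frequency dims);
    in between, blend linearly.
    """
    dims = np.arange(0, head_dim, 2)
    freqs = base ** (-dims / head_dim)          # original RoPE frequencies
    rotations = orig_ctx * freqs / (2 * np.pi)  # full rotations within orig_ctx
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return freqs * ((1.0 - ramp) / scale + ramp)

def yarn_attention_scale(scale):
    """YaRN's attention temperature adjustment: logits are softened by
    this factor (0.1 * ln(s) + 1) as the context is stretched by s."""
    return 0.1 * np.log(scale) + 1.0
```

The design point is that the highest-frequency dimensions encode fine-grained relative position and are damaged most by uniform compression, so YaRN preserves them exactly while interpolating only the dimensions where extrapolation would otherwise fail.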
Code is publicly available on GitHub; the paper is licensed under CC BY 4.0.