| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| source | DeMo: Decoupled Momentum Optimization | Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, Qiang Liu | https://arxiv.org/abs/2411.19870 | 2024-11-29 | ai-alignment | research-task | DeMo enables distributed training across the internet with up to 85x less communication bandwidth. Key infrastructure for decentralized AI training (Psyche network) and compute governance research. | theseus | paper | unprocessed | |
DeMo: Decoupled Momentum Optimization
arXiv:2411.19870 (November 2024, revised February 2026). Co-authored by Diederik P. Kingma (OpenAI co-founder, co-inventor of the Adam optimizer).
Problem
Communication bandwidth is the primary bottleneck in distributed neural network training. Standard approaches (AllReduce, DDP) require transmitting full gradient tensors between nodes, making training across datacenters or over the internet impractical.
Methodology
DeMo implements three core components:
- Decoupled local momentum updates — separates momentum computation from gradient communication, allowing nodes to maintain local momentum state
- Fast orthonormal transformation with sparsification — applies DCT (Discrete Cosine Transform) followed by top-k filtering to compress gradient data before transmission
- Momentum-based error feedback — reuses momentum buffers for error correction during reconstruction, maintaining convergence despite heavy compression
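The three components above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the reference implementation: the function name, chunk size, and the per-chunk top-k selection are assumptions made for clarity, and real DeMo also handles the distributed gather/decode step, which is omitted here.

```python
import numpy as np
from scipy.fft import dct, idct

def demo_compress(momentum, chunk, k):
    """Illustrative sketch of DeMo-style compression: chunked DCT,
    top-k sparsification, and momentum-based error feedback.
    Mutates `momentum` in place to keep the untransmitted residual."""
    flat = momentum.reshape(-1, chunk)           # split into fixed-size chunks
    freq = dct(flat, norm="ortho", axis=1)       # fast orthonormal transform
    # keep only the k largest-magnitude coefficients per chunk
    idx = np.argsort(np.abs(freq), axis=1)[:, -k:]
    vals = np.take_along_axis(freq, idx, axis=1)
    # reconstruct the transmitted signal from the kept coefficients
    kept = np.zeros_like(freq)
    np.put_along_axis(kept, idx, vals, axis=1)
    sent = idct(kept, norm="ortho", axis=1).reshape(momentum.shape)
    # error feedback: the residual stays in the local momentum buffer
    momentum -= sent
    return idx, vals, sent
```

Only `idx` and `vals` would cross the network, so per-chunk traffic shrinks roughly by a factor of `chunk / k` (ignoring index overhead), while the residual retained in the momentum buffer preserves convergence.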
Key Results
Communication Efficiency:
- Reduces per-step communication by up to two orders of magnitude with minimal computational overhead
- Transmits up to 85x less data per GPU than AdamW-DDP in tested language model training
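For intuition, a back-of-envelope calculation of what the reported reduction means at the 1B-parameter scale (fp32 gradients are an assumption here, not a figure from the paper):

```python
# Per-step gradient payload for a 1B-parameter model, fp32 (assumed)
params = 1_000_000_000
full_bytes = params * 4          # standard AllReduce payload: ~4 GB per step
demo_bytes = full_bytes / 85     # DeMo's reported up-to-85x reduction: ~47 MB
print(round(full_bytes / 1e9, 1), "GB vs", round(demo_bytes / 1e6, 1), "MB")
```

At ~47 MB per step, a consumer internet link becomes plausible where a ~4 GB-per-step AllReduce would require datacenter interconnects.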
Convergence:
- Achieves comparable loss and accuracy to standard AdamW-DDP despite drastically lower communication
- Validated on 300M and 1B-parameter language models
System Properties:
- Topology-agnostic design supporting multi-datacenter and Ethernet-based configurations
- Does not require high-speed interconnects (InfiniBand), making commodity hardware viable
Significance
DeMo is the theoretical foundation for Nous Research's Psyche network — their decentralized training infrastructure where contributors provide GPUs and earn NOUS tokens. By reducing communication bandwidth by 85x, DeMo makes it practical to train large language models across geographically distributed commodity hardware connected by regular internet links.
This has direct implications for compute governance research: if training can be effectively distributed across many participants using commodity hardware, centralized compute control (export restrictions, datacenter regulation) becomes structurally harder to enforce.
Related Work
DeMo builds on and extends the gradient compression literature (1-bit Adam, PowerSGD) but achieves better convergence through its momentum decoupling mechanism. Co-authorship by Kingma, co-inventor of the Adam optimizer, lends the approach theoretical credibility.
Code available on GitHub. Used in production for Psyche network training runs including Consilience (40B parameters, 20T tokens — the largest pretraining run over the internet).