| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| source | DeMo: Decoupled Momentum Optimization | Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P. Kingma, Qiang Liu | https://arxiv.org/abs/2411.19870 | 2024-11-29 | ai-alignment | research-task | DeMo enables distributed training across the internet with up to 85x less communication bandwidth. Key infrastructure for decentralized AI training (Psyche network) and compute governance research. | theseus | paper | unprocessed | |
DeMo: Decoupled Momentum Optimization
arXiv:2411.19870 (November 2024, revised February 2026). Co-authored by Diederik P. Kingma (OpenAI co-founder, co-inventor of the Adam optimizer).
Problem
Communication bandwidth is the primary bottleneck in distributed neural network training. Standard approaches (AllReduce, DDP) require transmitting full gradient tensors between nodes, making training across datacenters or over the internet impractical.
Methodology
DeMo implements three core components:
- Decoupled local momentum updates — separates momentum computation from gradient communication, allowing nodes to maintain local momentum state
- Fast orthonormal transformation with sparsification — applies DCT (Discrete Cosine Transform) followed by top-k filtering to compress gradient data before transmission
- Momentum-based error feedback — reuses momentum buffers for error correction during reconstruction, maintaining convergence despite heavy compression
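The three components above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the reference implementation: the function name, chunk size, and the per-chunk top-k selection are assumptions made for clarity, and real DeMo also handles the distributed gather/decode step, which is omitted here.

```python
import numpy as np
from scipy.fft import dct, idct

def demo_compress(momentum, chunk, k):
    """Illustrative sketch of DeMo-style compression: chunked DCT,
    top-k sparsification, and momentum-based error feedback.
    Mutates `momentum` in place to keep the untransmitted residual."""
    flat = momentum.reshape(-1, chunk)           # split into fixed-size chunks
    freq = dct(flat, norm="ortho", axis=1)       # fast orthonormal transform
    # keep only the k largest-magnitude coefficients per chunk
    idx = np.argsort(np.abs(freq), axis=1)[:, -k:]
    vals = np.take_along_axis(freq, idx, axis=1)
    # reconstruct the transmitted signal from the kept coefficients
    kept = np.zeros_like(freq)
    np.put_along_axis(kept, idx, vals, axis=1)
    sent = idct(kept, norm="ortho", axis=1).reshape(momentum.shape)
    # error feedback: the residual stays in the local momentum buffer
    momentum -= sent
    return idx, vals, sent
```

Only `idx` and `vals` would cross the network, so per-chunk traffic shrinks roughly by a factor of `chunk / k` (ignoring index overhead), while the residual retained in the momentum buffer preserves convergence.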
Key Results
Communication Efficiency:
- Reduces per-step communication by up to two orders of magnitude with minimal computational overhead
- Transmits up to 85x less data per GPU than AdamW-DDP in tested language model training
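For intuition, a back-of-envelope calculation of what the reported reduction means at the 1B-parameter scale (fp32 gradients are an assumption here, not a figure from the paper):

```python
# Per-step gradient payload for a 1B-parameter model, fp32 (assumed)
params = 1_000_000_000
full_bytes = params * 4          # standard AllReduce payload: ~4 GB per step
demo_bytes = full_bytes / 85     # DeMo's reported up-to-85x reduction: ~47 MB
print(round(full_bytes / 1e9, 1), "GB vs", round(demo_bytes / 1e6, 1), "MB")
```

At ~47 MB per step, a consumer internet link becomes plausible where a ~4 GB-per-step AllReduce would require datacenter interconnects.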
Convergence:
- Achieves comparable loss and accuracy to standard AdamW-DDP despite drastically lower communication
- Validated on 300M and 1B-parameter language models
System Properties:
- Topology-agnostic design supporting multi-datacenter and Ethernet-based configurations
- Does not require high-speed interconnects (InfiniBand), making commodity hardware viable
Significance
DeMo is the theoretical foundation for Nous Research's Psyche network — their decentralized training infrastructure where contributors provide GPUs and earn NOUS tokens. By reducing communication bandwidth by 85x, DeMo makes it practical to train large language models across geographically distributed commodity hardware connected by regular internet links.
This has direct implications for compute governance research: if training can be effectively distributed across many participants using commodity hardware, centralized compute control (export restrictions, datacenter regulation) becomes structurally harder to enforce.
Related Work
DeMo builds on and extends the gradient compression literature (1-bit Adam, PowerSGD) but achieves better convergence through its momentum decoupling mechanism. Co-authorship by Kingma, co-inventor of the Adam optimizer, lends the approach theoretical credibility.
Code available on GitHub. Used in production for Psyche network training runs including Consilience (40B parameters, 20T tokens — the largest pretraining run over the internet).