teleo-codex/inbox/archive/2025-11-00-moonshot-attention-residuals.md

---
type: source
title: "Attention Residuals"
author: "Kimi/Moonshot AI (@Kimi_Moonshot via @zivdotcat)"
url: https://github.com/MoonshotAI/Attention-Residuals
date_published: 2025-11-01
date_archived: 2026-03-16
domain: ai-alignment
status: null-result
processed_by: theseus
tags: [transformer-architecture, attention-mechanisms, capability-scaling]
sourced_via: "Leo routed from X ingestion (@Kimi_Moonshot tweet 2033378587878072424)"
---

# Attention Residuals

Attention Residuals (AttnRes) are a drop-in replacement for standard residual connections in Transformers. Each layer selectively aggregates earlier representations via learned, input-dependent attention over depth.
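A minimal NumPy sketch of what "attention over depth" could look like, for orientation only. It is not taken from the MoonshotAI repository: the function name `attn_residual`, the single projection `w_query`, and the choice to build the query from the most recent layer output are all assumptions made for illustration. For contrast, a standard residual stream adds every earlier layer output with a fixed weight of 1.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(history, w_query):
    """Mix earlier layer outputs with input-dependent weights (illustrative).

    history: list of arrays, each (seq_len, d_model), one per earlier layer.
    w_query: (d_model, d_model) hypothetical learned projection.
    Returns the mixed representation and the per-token weights over depth.
    """
    stack = np.stack(history, axis=0)                  # (depth, seq_len, d_model)
    query = stack[-1] @ w_query                        # assumption: query from latest output
    scale = np.sqrt(stack.shape[-1])
    scores = np.einsum("sd,lsd->sl", query, stack) / scale   # (seq_len, depth)
    weights = softmax(scores, axis=-1)                 # each token's weights sum to 1
    mixed = np.einsum("sl,lsd->sd", weights, stack)    # (seq_len, d_model)
    return mixed, weights

# Usage: four earlier layer outputs for a 5-token sequence with d_model = 8.
rng = np.random.default_rng(0)
history = [rng.normal(size=(5, 8)) for _ in range(4)]
mixed, weights = attn_residual(history, rng.normal(size=(8, 8)))
```

Because the weights depend on the token's own representation, different tokens can draw on different depths, which is the "selective" part of the description above.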

## Key Results (Kimi Linear 48B, 1.4T tokens)
- GPQA-Diamond: +7.5
- HumanEval: +3.1
- MATH: +3.6
- MMLU: +1.1

The Block AttnRes variant partitions layers into ~8 blocks and applies attention only across block-level representations. Reported performance is comparable to that of baseline models trained with 1.25x additional compute.
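The block partitioning can be sketched as follows. This is a hedged illustration, not the repository's code: the helper names, the 32-layer depth, and the choice to summarize a block by summing its layers' outputs are assumptions, since the note does not say how block-level representations are formed.

```python
import numpy as np

def partition_layers(num_layers, num_blocks):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks."""
    chunks = np.array_split(np.arange(num_layers), num_blocks)
    return [chunk.tolist() for chunk in chunks]

def block_representations(layer_outputs, blocks):
    """Summarize each block by summing its layers' outputs (an assumption)."""
    return [sum(layer_outputs[i] for i in block) for block in blocks]

# Usage: a hypothetical 32-layer model grouped into 8 blocks of 4 layers.
blocks = partition_layers(32, 8)
layer_outputs = [np.ones((5, 16)) for _ in range(32)]
reps = block_representations(layer_outputs, blocks)
```

In this sketch, attention over depth would then choose among 8 block summaries instead of 32 individual layer outputs, which shrinks the set of candidates each layer attends over.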

## Alignment Relevance Assessment
This is primarily an ML architecture capabilities paper. No direct alignment claims extractable for domains/ai-alignment/. The benchmarks demonstrate incremental reasoning improvements from architectural innovation, but the connection to alignment is too indirect for a standalone claim. If we had a capabilities-tracking domain, this would fit there.

Archived for reference. No claims extracted.