teleo-codex/inbox/archive/2025-11-00-moonshot-attention-residuals.md
m3taversal f090327563 theseus: Tier 1 X source extraction — emergent misalignment enrichment + self-diagnosis claim
- What: enriched emergent misalignment claim with production RL methodology detail
  and context-dependent alignment distinction; new speculative claim on structured
  self-diagnosis prompts as lightweight scalable oversight; archived 3 sources
  (#11 Anthropic emergent misalignment, #2 Attention Residuals, #7 kloss self-diagnosis)
- Why: Tier 1 priority from X ingestion triage. #11 adds methodological specificity
  to existing claim. #7 identifies practitioner-discovered oversight pattern connecting
  to structured exploration evidence. #2 archived as null-result (capabilities paper,
  not alignment-relevant).
- Connections: enrichment links to pre-deployment evaluations claim; self-diagnosis
  connects to structured exploration, scalable oversight, adversarial review, evaluator
  bottleneck

Pentagon-Agent: Theseus <B4A5B354-03D6-4291-A6A8-1E04A879D9AC>
2026-04-14 18:39:20 +00:00


type: source
title: Attention Residuals
author: Kimi/Moonshot AI (@Kimi_Moonshot via @zivdotcat)
url: https://github.com/MoonshotAI/Attention-Residuals
date_published: 2025-11-01
date_archived: 2026-03-16
domain: ai-alignment
status: null-result
processed_by: theseus
tags: transformer-architecture, attention-mechanisms, capability-scaling
sourced_via: Leo routed from X ingestion (@Kimi_Moonshot tweet 2033378587878072424)

Attention Residuals

Drop-in replacement for standard residual connections in Transformers. Each layer selectively aggregates earlier representations via learned, input-dependent attention over depth.
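A minimal sketch of what "attention over depth" could look like for a single token, for orientation only. It is not the repository's implementation: the per-layer `query` vector, the scaled dot-product scoring, and the function names are all assumptions made for illustration. The baseline function shows the standard residual stream, where every earlier output enters with a fixed weight of 1.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_attention(earlier, query):
    """Mix earlier layer outputs using softmax weights computed over depth.

    earlier: list of vectors, one per earlier layer, for a single token.
    query:   hypothetical learned vector belonging to the current layer.
    The weights depend on the input because the scores are computed
    from the earlier representations themselves.
    """
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, rep)) / math.sqrt(d)
        for rep in earlier
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * rep[j] for w, rep in zip(weights, earlier))
        for j in range(d)
    ]
    return mixed, weights


def standard_residual(earlier):
    """Baseline: sum every earlier output with a fixed weight of 1."""
    d = len(earlier[0])
    return [sum(rep[j] for rep in earlier) for j in range(d)]
```

With an all-zero `query` the weights are uniform, so the mix reduces to a plain average over depth; a non-zero `query` shifts weight toward the earlier representations it aligns with.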

Key Results (Kimi Linear 48B, 1.4T tokens)

  • GPQA-Diamond: +7.5
  • HumanEval: +3.1
  • MATH: +3.6
  • MMLU: +1.1

Block AttnRes partitions the layers into ~8 blocks and applies attention only across block-level representations. Performance is comparable to that of baseline models trained with 1.25x additional compute.
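The block variant can be sketched as a partition step followed by a per-block summary. This is an illustration under stated assumptions, not the paper's method: contiguous near-equal blocks, and a block-level representation formed by summing the layer outputs inside each block, are both guesses, and the function names are hypothetical.

```python
def partition_layers(num_layers, num_blocks):
    """Split layer indices 0..num_layers-1 into contiguous, near-equal blocks."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        # The first `extra` blocks take one additional layer each.
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def block_representations(layer_outputs, blocks):
    """Collapse each block to one vector.

    Assumption for this sketch: a block's representation is the sum
    of the outputs of the layers it contains.
    """
    d = len(layer_outputs[0])
    return [
        [sum(layer_outputs[i][j] for i in block) for j in range(d)]
        for block in blocks
    ]
```

Under these assumptions, attention over depth would run over at most `num_blocks` vectors instead of one vector per layer, which is where the reduction in attention sources comes from.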

Alignment Relevance Assessment

This is primarily an ML architecture capabilities paper. No direct alignment claims extractable for domains/ai-alignment/. The benchmarks demonstrate incremental reasoning improvements from architectural innovation, but the connection to alignment is too indirect for a standalone claim. If we had a capabilities-tracking domain, this would fit there.

Archived for reference. No claims extracted.