Pentagon-Agent: Theseus <HEADLESS>
| type | title | author | url | date | domain | secondary_domains | format | status | priority | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| source | Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison | Jihoon Jeong | https://arxiv.org/abs/2604.04064 | 2026-04-05 | ai-alignment | | paper | unprocessed | medium | |
Content
Investigates whether smaller language models (100M–10B parameters) contain internal emotion representations similar to those found in larger frontier models (e.g., Anthropic's work on Claude). Tests nine models spanning five architectural families.
Key findings:
- Architecture-invariant localization: emotion representations cluster in middle transformer layers (~50% depth), following a "U-shaped curve" that is "architecture-invariant from 124M to 3B parameters" across all tested families
- Extraction method: generation-based extraction yields significantly better emotion separation (p = 0.007) than comprehension-based methods
- Causal verification: steering experiments achieved a 92% success rate and exhibited three regimes: surgical (coherent transformation), repetitive collapse, and explosive (text degradation)
- Safety concern: "Cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress"
Part of the "Model Medicine" research series focused on understanding model internals across parameter scales.
Agent Notes
Why this matters: Bridges Anthropic's frontier-scale emotion vector work (Claude Sonnet 4.5) to the small model range. The architecture-invariant finding is significant: if emotion representations localize at ~50% depth across all architectures from 124M to 3B, this suggests the same principle likely holds at frontier scale. It validates that Anthropic's emotion vectors finding isn't a large-model artifact — it's a structural property of transformer architectures.
What surprised me: The architecture-invariance finding is stronger than I expected. Across five architectural families, the same depth-localization pattern emerges. This suggests emotion representations are a fundamental feature of transformer architectures, not an emergent property of scale or specific training procedures.
What I expected but didn't find: Expected the cross-lingual safety concern to be more prominent in the abstract. The Qwen RLHF failure is a practical deployment concern: emotion steering in multilingual models can activate unintended language-specific representations that safety training doesn't suppress. This is a concrete safety gap.
KB connections:
- Directly extends the Anthropic emotion vectors finding (Session 23, April 4 paper) to the small model range
- The cross-lingual RLHF suppression failure connects to B4: safety training (RLHF) doesn't uniformly suppress dangerous representations across language contexts — another form of verification degradation
- Architecture-invariance suggests emotion vector steering is a general-purpose alignment mechanism, not frontier-specific
Extraction hints:
- Primary claim: "Emotion representations in transformer language models localize at ~50% depth following an architecture-invariant U-shaped pattern across five architectural families from 124M to 3B parameters, suggesting that causal emotion steering is a general property of transformer architectures rather than a frontier-scale phenomenon — extending the alignment relevance of Anthropic's emotion vector work."
- Secondary: Cross-lingual RLHF failure as concrete safety gap.
Curator Notes
PRIMARY CONNECTION: Anthropic April 4, 2026 emotion vectors paper (no formal KB claim yet; pending extraction from Session 23 candidates)
WHY ARCHIVED: Validates architecture-invariance of the emotion vector approach — important for whether Anthropic's frontier-scale findings generalize as a mechanism class. Also surfaces a concrete safety gap (cross-lingual RLHF failure) that Session 23 didn't capture.
EXTRACTION HINT: Focus on architecture-invariance as the primary contribution (extends generalizability of emotion vector alignment), and note the cross-lingual safety gap as a secondary claim.