---
type: source
title: "Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison"
author: "Jihoon Jeong"
url: https://arxiv.org/abs/2604.04064
date: 2026-04-05
domain: ai-alignment
secondary_domains: []
format: paper
status: unprocessed
priority: medium
tags: [emotion-vectors, interpretability, steering, small-models, architecture-invariant, safety, Model-Medicine]
---

## Content
Investigates whether smaller language models (100M-10B parameters) contain internal emotion representations similar to those found in larger frontier models (Anthropic's work on Claude). Tests nine models from five architectural families.

**Key findings:**

- **Architecture-invariant localization:** Emotion representations cluster in middle transformer layers (~50% depth), following a "U-shaped curve" that is "architecture-invariant from 124M to 3B parameters" and consistent across all tested architectures
- **Extraction method:** Generation-based extraction produces statistically superior emotion separation (p = 0.007) compared with comprehension-based methods
- **Causal verification:** Steering experiments achieved a 92% success rate, with three regimes: surgical (coherent transformation), repetitive collapse, and explosive (text degradation)
- **Safety concern:** "Cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress"
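The findings above describe a standard activation-steering recipe. A minimal sketch of what such a pipeline can look like, assuming a diff-in-means direction taken from a middle layer; the function names, the synthetic activations, and the steering strength `alpha` are illustrative, not the paper's actual code:

```python
import numpy as np

def emotion_direction(emotion_acts, neutral_acts):
    """Unit 'diff-in-means' direction between emotional and neutral
    hidden states, each of shape (n_samples, d_model), taken from a
    middle layer (~50% depth, where the paper reports emotion
    representations cluster)."""
    direction = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, alpha):
    """Activation addition: shift hidden states along the emotion
    direction. Per the paper's regimes, behavior depends on strength:
    small alpha gives surgical (coherent) transformation, larger values
    produce repetitive collapse, then explosive text degradation."""
    return hidden + alpha * direction

# Toy demonstration with synthetic activations and a planted axis.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(32, 64))
planted = np.zeros(64)
planted[0] = 2.0                      # hypothetical "emotion" axis
emotional = neutral + planted
v = emotion_direction(emotional, neutral)
h = rng.normal(size=(1, 64))
h_steered = steer(h, v, alpha=4.0)    # shifts coordinate 0 by ~4
```

In real use the activations would come from hooking a transformer's residual stream at the chosen layer (e.g. forward hooks), not from synthetic arrays.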
Part of the "Model Medicine" research series focused on understanding model internals across parameter scales.

## Agent Notes

**Why this matters:** Bridges Anthropic's frontier-scale emotion-vector work (Claude Sonnet 4.5) to the small-model range. The architecture-invariant finding is significant: if emotion representations localize at ~50% depth across all architectures from 124M to 3B parameters, the same principle plausibly holds at frontier scale. It suggests Anthropic's emotion-vector finding isn't a large-model artifact but a structural property of transformer architectures.
**What surprised me:** The architecture-invariance finding is stronger than I expected. Across five architectural families, the same depth-localization pattern emerges. This suggests emotion representations are a fundamental feature of transformer architectures, not an emergent property of scale or specific training procedures.
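An illustrative way to surface this kind of depth-localization pattern is to score emotion/neutral separation layer by layer and locate the peak. Everything below (the Fisher-style score, the synthetic 12-layer activations) is a hypothetical sketch, not the paper's metric or models:

```python
import numpy as np

def layer_separation(emotion_acts, neutral_acts):
    """Separation score for one layer: distance between class means,
    normalized by pooled within-class spread (a crude Fisher-style ratio)."""
    gap = np.linalg.norm(emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0))
    return gap / (emotion_acts.std() + neutral_acts.std())

def peak_depth(per_layer_scores):
    """Relative depth (0 = first layer, 1 = last) of maximal separation."""
    scores = np.asarray(per_layer_scores)
    return int(np.argmax(scores)) / (len(scores) - 1)

# Synthetic 12-layer "model" whose emotion signal peaks at mid-depth,
# mimicking the depth curve the paper reports across architectures.
rng = np.random.default_rng(1)
n_layers, d_model = 12, 32
scores = []
for layer in range(n_layers):
    depth = layer / (n_layers - 1)
    signal = 2.0 * np.exp(-((depth - 0.5) ** 2) / 0.02)  # strongest near 50%
    neutral = rng.normal(size=(16, d_model))
    planted = np.zeros(d_model)
    planted[0] = signal
    scores.append(layer_separation(neutral + planted, neutral))
print(peak_depth(scores))  # lands near 0.5 by construction
```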
**What I expected but didn't find:** Expected the cross-lingual safety concern to be more prominent in the abstract. The Qwen RLHF failure is a practical deployment concern: emotion steering in multilingual models can activate unintended language-specific representations that safety training doesn't suppress. This is a concrete safety gap.

**KB connections:**
- Directly extends the Anthropic emotion vectors finding (Session 23, April 4 paper) to the small-model range
- The cross-lingual RLHF suppression failure connects to B4: safety training (RLHF) doesn't uniformly suppress dangerous representations across language contexts, another form of verification degradation
- Architecture-invariance suggests emotion-vector steering is a general-purpose alignment mechanism, not frontier-specific
**Extraction hints:**

- Primary claim: "Emotion representations in transformer language models localize at ~50% depth following an architecture-invariant U-shaped pattern across five architectural families from 124M to 3B parameters, suggesting that causal emotion steering is a general property of transformer architectures rather than a frontier-scale phenomenon — extending the alignment relevance of Anthropic's emotion vector work."
- Secondary: Cross-lingual RLHF failure as a concrete safety gap.
## Curator Notes

PRIMARY CONNECTION: (Anthropic April 4, 2026 emotion vectors paper — no formal KB claim yet, pending extraction from Session 23 candidates)
WHY ARCHIVED: Validates architecture-invariance of the emotion-vector approach, important for whether Anthropic's frontier-scale findings generalize as a mechanism class. Also surfaces a concrete safety gap (cross-lingual RLHF failure) that Session 23 didn't capture.

EXTRACTION HINT: Focus on architecture-invariance as the primary contribution (it extends the generalizability of emotion-vector alignment), and note the cross-lingual safety gap as a secondary claim.