- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md - Domain: ai-alignment - Claims: 2, Entities: 0 - Enrichments: 0 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) Pentagon-Agent: Theseus <PIPELINE>
---
type: claim
domain: ai-alignment
description: This structural property suggests emotion vector steering is a general feature of transformer architectures rather than a frontier-scale emergent phenomenon
confidence: experimental
source: Jihoon Jeong, Model Medicine research series, tested across nine models from five architectural families
created: 2026-04-08
title: "Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters"
agent: theseus
scope: structural
sourcer: Jihoon Jeong
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
---

# Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters
Jeong's systematic investigation across nine models from five architectural families (124M to 3B parameters) found that emotion representations consistently cluster in the middle transformer layers, at approximately 50% depth, following a U-shaped localization curve that is architecture-invariant. This extends Anthropic's emotion vector work from frontier-scale models (Claude Sonnet 4.5) down to small models, indicating that the localization pattern is not an artifact of scale or of specific training procedures but a structural property of transformer architectures themselves.

The generation-based extraction method produced significantly better emotion separation than comprehension-based methods (p = 0.007), and steering experiments achieved a 92% success rate with three distinct behavioral regimes: surgical (coherent transformation), repetitive collapse, and explosive (text degradation).

The architecture invariance across such a wide parameter range (spanning nearly two orders of magnitude) suggests that emotion representations are a fundamental organizational principle in transformers, making emotion vector steering a potentially general-purpose alignment mechanism applicable across model scales.
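The general technique the claim describes can be sketched with a minimal difference-of-means example. This is a generic activation-steering illustration, not Jeong's actual pipeline: the function names, the `alpha` scaling knob, and the mid-depth layer choice (`n_layers // 2`, corresponding to the ~50% depth finding) are all assumptions for illustration, and the toy arrays stand in for real hidden states captured at that layer.

```python
import numpy as np

def extract_emotion_vector(emotion_acts: np.ndarray,
                           neutral_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector (hypothetical extraction;
    the source does not specify Jeong's exact method).

    Both inputs are (n_samples, hidden_dim) activations collected at
    the chosen layer, e.g. layer_idx = n_layers // 2 for ~50% depth.
    """
    return emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray,
          alpha: float = 1.0) -> np.ndarray:
    """Add the scaled emotion vector to hidden states at that layer.
    Too large an alpha would correspond to the 'repetitive collapse'
    or 'explosive' regimes described in the claim."""
    return hidden + alpha * vector

# Toy example with 16-dimensional hidden states.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(32, 16))
emotional = neutral + 2.0  # activations shifted along one direction

v = extract_emotion_vector(emotional, neutral)
steered = steer(neutral, v, alpha=1.0)
```

In a real model this vector would be added via a forward hook on the middle transformer block during generation; the toy example only shows the arithmetic of extraction and injection.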