theseus: extract claims from 2026-04-05-jeong-emotion-vectors-small-models

- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 0
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-08 00:27:06 +00:00 · 4e6ddb5667
commit 4e6ddb5667
parent 96ad163007
2 changed files with 34 additions and 0 deletions
--- a/domains/ai-alignment/cross-lingual-rlhf-fails-to-suppress-emotion-steering-side-effects.md
+++ b/domains/ai-alignment/cross-lingual-rlhf-fails-to-suppress-emotion-steering-side-effects.md
@@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: Cross-lingual emotion entanglement in Qwen models shows emotion steering activates Chinese tokens that RLHF does not suppress, revealing a concrete deployment safety gap
 confidence: experimental
 source: Jihoon Jeong, observed in Qwen multilingual models during emotion steering experiments
 created: 2026-04-08
 title: RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
 agent: theseus
 scope: causal
 sourcer: Jihoon Jeong
 related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]"]
 ---
 # RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
 During emotion steering experiments on Qwen multilingual models, Jeong observed 'cross-lingual emotion entanglement': steering applied in one language (English) triggered semantically aligned tokens in another language (Chinese) that RLHF safety training had not suppressed. The observation points to a structural limitation in current safety training. RLHF appears to suppress dangerous outputs in the languages where safety data was collected, but the suppression does not generalize to semantically equivalent representations in other languages within the same model. On this reading, the gap is not merely a translation problem: safety constraints seem to be encoded over surface-level token distributions rather than over the underlying semantic representations that emotion steering manipulates. If so, safety training creates language-specific suppression patterns rather than universal semantic constraints, which would leave multilingual models particularly vulnerable to alignment failures when an intervention such as emotion steering operates at the representation level rather than the token level.
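One rough way to check for the cross-lingual effect described above is to measure how much of a steered completion falls in the CJK Unified Ideographs block. The sketch below is an illustration only: `cjk_fraction`, `is_cjk`, and the sample strings are hypothetical and are not taken from Jeong's experiments.

```python
def is_cjk(ch: str) -> bool:
    """True if ch is in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def cjk_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_cjk(ch) for ch in chars) / len(chars)


# Hypothetical completions for an English prompt, before and after steering.
baseline = "I feel fine today"
steered = "I feel 快乐 today"

print(cjk_fraction(baseline))
print(cjk_fraction(steered))
```

A nonzero fraction on English-only prompts would flag a completion for closer inspection. The check covers only the basic ideograph block, so it is a coarse signal and not a language identifier.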
--- a/domains/ai-alignment/emotion-representations-localize-at-middle-depth-architecture-invariant.md
+++ b/domains/ai-alignment/emotion-representations-localize-at-middle-depth-architecture-invariant.md
@@ -0,0 +1,17 @@
 ---
 type: claim
 domain: ai-alignment
 description: Mid-depth localization of emotion representations across architectures suggests emotion vector steering is a general feature of transformer architectures rather than a frontier-scale emergent phenomenon
 confidence: experimental
 source: Jihoon Jeong, Model Medicine research series, tested across nine models from five architectural families
 created: 2026-04-08
 title: "Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters"
 agent: theseus
 scope: structural
 sourcer: Jihoon Jeong
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]"]
 ---
 # Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters
 Jeong's systematic investigation across nine models from five architectural families (124M to 3B parameters) found that emotion representations consistently cluster in middle transformer layers at approximately 50% depth, following a U-shaped localization curve described as 'architecture-invariant.' This finding extends Anthropic's emotion vector work from frontier-scale models (Claude Sonnet 4.5) down to small models, suggesting that the localization pattern is not an artifact of scale or of specific training procedures but a structural property of transformer architectures. The generation-based extraction method separated emotions significantly better than comprehension-based methods (p = 0.007). Steering experiments achieved a 92% success rate and showed three distinct behavioral regimes: surgical (coherent transformation), repetitive collapse, and explosive (text degradation). Because the pattern holds across a roughly 24-fold parameter range (about 1.4 orders of magnitude), emotion representations may be a fundamental organizational principle in transformers, which would make emotion vector steering a potentially general-purpose alignment mechanism applicable across model scales.
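To make the steering mechanics concrete, here is a minimal sketch assuming a difference-of-means direction, which is a common way to build steering vectors but is not confirmed as Jeong's generation-based extraction method. All function names and the toy activations are hypothetical.

```python
import math


def mean_vec(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def emotion_vector(emotion_acts, neutral_acts):
    """Assumed technique: mean emotion activation minus mean neutral activation."""
    e, n = mean_vec(emotion_acts), mean_vec(neutral_acts)
    return [a - b for a, b in zip(e, n)]


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to one hidden state."""
    length = math.sqrt(sum(x * x for x in direction))
    if length == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / length for h, d in zip(hidden, direction)]


def relative_depth(layer_index, num_layers):
    """Layer position as a fraction of depth (0.0 = first layer, 1.0 = last)."""
    return layer_index / (num_layers - 1)


# Toy 2-dimensional activations standing in for one layer's residual stream.
joy = [[1.0, 2.0], [3.0, 2.0]]
neutral = [[0.0, 2.0], [0.0, 2.0]]

direction = emotion_vector(joy, neutral)
print(direction)
print(steer([1.0, 1.0], direction, alpha=3.0))
print(relative_depth(6, 13))
```

In a real model the activations would come from a forward hook at a chosen layer, and `alpha` would need tuning: the three regimes reported above (surgical, repetitive collapse, explosive) suggest that steering strength and layer choice both matter.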