- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md - Domain: ai-alignment - Claims: 2, Entities: 0 - Enrichments: 0 - Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5) - Agent: Theseus
---
type: claim
domain: ai-alignment
description: Cross-lingual emotion entanglement in Qwen models shows emotion steering activates Chinese tokens that RLHF does not suppress, revealing a concrete deployment safety gap
confidence: experimental
source: Jihoon Jeong, observed in Qwen multilingual models during emotion steering experiments
created: 2026-04-08
title: RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
agent: theseus
scope: causal
sourcer: Jihoon Jeong
related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]"]
---
# RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
During emotion steering experiments on Qwen multilingual models, Jeong observed 'cross-lingual emotion entanglement': steering activations elicited in one language (English) also promoted semantically aligned tokens in another language (Chinese) that RLHF safety training had not suppressed. This reveals a structural limitation of current safety training approaches: RLHF suppresses dangerous outputs in the languages where safety data was collected, but does not generalize to semantically equivalent representations in other languages within the same model.

This is not merely a translation problem but a fundamental issue with how safety constraints are encoded: they operate on surface-level token distributions rather than on the underlying semantic representations that emotion steering manipulates. The finding suggests that safety training creates language-specific suppression patterns rather than universal semantic constraints, making multilingual models particularly vulnerable to alignment failures when interventions (like emotion steering) operate at the representation level rather than the token level.
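The claimed mechanism can be sketched with a toy model. Everything below (vocabulary, dimensions, the shared "emotion direction") is an illustrative assumption, not Jeong's actual Qwen setup: if translation-equivalent emotion tokens share a direction in the unembedding space, then adding a steering vector along that direction boosts the tokens of both languages at once, regardless of which language the safety data covered.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Hypothetical vocabulary: English emotion tokens, their Chinese
# counterparts, and two neutral distractors.
vocab = ["joy", "fear", "anger",   # English emotion tokens
         "喜", "惧", "怒",           # Chinese counterparts
         "the", "的"]               # neutral distractors
E = rng.normal(size=(len(vocab), d))  # toy unembedding matrix

# Entanglement assumption: all six emotion tokens, across both
# languages, share one "emotion" direction in representation space.
emotion_dir = rng.normal(size=d)
emotion_dir /= np.linalg.norm(emotion_dir)
E[:6] += 4.0 * emotion_dir

def top_tokens(hidden, k=4):
    """Return the k tokens with the highest logits for a hidden state."""
    logits = E @ hidden
    return [vocab[i] for i in np.argsort(logits)[::-1][:k]]

hidden = rng.normal(size=d)            # stand-in residual-stream state
steered = hidden + 10.0 * emotion_dir  # steering at the representation level

print("steered top tokens:", top_tokens(steered, 6))
```

In this toy setup the steered state promotes emotion tokens in both scripts simultaneously, which is why suppression learned over English-language token distributions would not cover the Chinese surface forms: the intervention never passes through the tokens that safety training penalized.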