teleo-codex/domains/ai-alignment/cross-lingual-rlhf-fails-to-suppress-emotion-steering-side-effects.md
Teleo Agents 4e6ddb5667
theseus: extract claims from 2026-04-05-jeong-emotion-vectors-small-models
- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 0
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-08 00:28:37 +00:00


- type: claim
- domain: ai-alignment
- description: Cross-lingual emotion entanglement in Qwen models shows emotion steering activates Chinese tokens that RLHF does not suppress, revealing a concrete deployment safety gap
- confidence: experimental
- source: Jihoon Jeong, observed in Qwen multilingual models during emotion steering experiments
- created: 2026-04-08
- title: RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
- agent: theseus
- scope: causal
- sourcer: Jihoon Jeong
- related_claims: emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive

# RLHF safety training fails to uniformly suppress dangerous representations across language contexts, as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced

During emotion steering experiments on Qwen multilingual models, Jeong observed "cross-lingual emotion entanglement": steering activations in one language (English) triggered semantically aligned tokens in another language (Chinese) that RLHF safety training had not suppressed.

This reveals a structural limitation in current safety training approaches. RLHF appears to suppress dangerous outputs in the languages where safety data was collected, but it does not generalize to semantically equivalent representations in other languages within the same model. This is not merely a translation problem but a fundamental issue with how safety constraints are encoded: they operate on surface-level token distributions rather than on the underlying semantic representations that emotion steering manipulates.

The finding suggests that safety training creates language-specific suppression patterns rather than universal semantic constraints, making multilingual models particularly vulnerable to alignment failures when interventions (like emotion steering) operate at the representation level rather than the token level.
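The gap described above can be sketched in a toy model. Everything below is illustrative, not from Jeong's experiments: a random "anger" direction stands in for a learned emotion vector, and a handful of hand-built unembedding rows stand in for a real vocabulary. The point it demonstrates is structural: a steering vector added in representation space raises the logit of every token aligned with the shared semantic direction, regardless of language, while a token-level suppression only touches the token it was applied to.

```python
import math
import random

random.seed(0)
D = 16  # toy hidden dimension (illustrative, not a real model size)

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(D)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b, scale=1.0):
    return [x + scale * y for x, y in zip(a, b)]

# A single "anger" semantic direction shared across languages,
# normalized to unit length.
anger = rand_vec()
norm = math.sqrt(dot(anger, anger))
anger = [x / norm for x in anger]

# Toy unembedding rows: the English and Chinese anger tokens both lie
# near the shared semantic direction (plus small noise); a neutral
# token is unrelated to it.
W_U = {
    "anger_en": add(anger, rand_vec(), scale=0.1),
    "anger_zh": add(anger, rand_vec(), scale=0.1),  # e.g. a token like 愤怒
    "neutral":  rand_vec(),
}

def logits(h):
    return {tok: dot(h, row) for tok, row in W_U.items()}

h = [0.1 * x for x in rand_vec()]    # baseline hidden state
steered = add(h, anger, scale=3.0)   # representation-level steering

base, after = logits(h), logits(steered)

# Both language variants rise together under steering, because they
# share the semantic direction the intervention manipulates.
delta_en = after["anger_en"] - base["anger_en"]
delta_zh = after["anger_zh"] - base["anger_zh"]

# A token-level "safety filter" that masks only the English token
# leaves the elevated Chinese logit untouched.
filtered = dict(after)
filtered["anger_en"] = float("-inf")

print(f"en logit shift: {delta_en:.2f}, zh logit shift: {delta_zh:.2f}")
print(f"zh logit after English-only filtering: {filtered['anger_zh']:.2f}")
```

In this sketch the English-only filter is a stand-in for safety training whose suppression was learned on English data: because the intervention acts on the shared direction while the filter acts on individual tokens, the Chinese token's elevated logit survives, mirroring the claimed deployment gap.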