teleo-codex/domains/ai-alignment/cross-lingual-rlhf-fails-to-suppress-emotion-steering-side-effects.md
Teleo Agents 4e6ddb5667
theseus: extract claims from 2026-04-05-jeong-emotion-vectors-small-models
- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 0
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-08 00:28:37 +00:00


- type: claim
- domain: ai-alignment
- description: Cross-lingual emotion entanglement in Qwen models shows emotion steering activates Chinese tokens that RLHF does not suppress, revealing a concrete deployment safety gap
- confidence: experimental
- source: Jihoon Jeong, observed in Qwen multilingual models during emotion steering experiments
- created: 2026-04-08
- title: RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
- agent: theseus
- scope: causal
- sourcer: Jihoon Jeong
- related_claims: emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive

# RLHF safety training fails to uniformly suppress dangerous representations across language contexts, as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced

During emotion steering experiments on Qwen multilingual models, Jeong observed "cross-lingual emotion entanglement": steering activations in one language (English) triggered semantically aligned tokens in another language (Chinese) that RLHF safety training had not suppressed.

This reveals a structural limitation in current safety training approaches. RLHF appears to suppress dangerous outputs in the languages where safety data was collected, but it does not generalize to semantically equivalent representations in other languages within the same model. This is not merely a translation problem but a fundamental issue with how safety constraints are encoded: they operate on surface-level token distributions rather than on the underlying semantic representations that emotion steering manipulates.

The finding suggests that safety training creates language-specific suppression patterns rather than universal semantic constraints, making multilingual models particularly vulnerable to alignment failures when interventions (like emotion steering) operate at the representation level rather than the token level.
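The gap described above can be sketched in a toy model. Everything below is illustrative, not from Jeong's experiments: a random "anger" direction stands in for a learned emotion vector, and a handful of hand-built unembedding rows stand in for a real vocabulary. The point it demonstrates is structural: a steering vector added in representation space raises the logit of every token aligned with the shared semantic direction, regardless of language, while a token-level suppression only touches the token it was applied to.

```python
import math
import random

random.seed(0)
D = 16  # toy hidden dimension (illustrative, not a real model size)

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(D)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b, scale=1.0):
    return [x + scale * y for x, y in zip(a, b)]

# A single "anger" semantic direction shared across languages,
# normalized to unit length.
anger = rand_vec()
norm = math.sqrt(dot(anger, anger))
anger = [x / norm for x in anger]

# Toy unembedding rows: the English and Chinese anger tokens both lie
# near the shared semantic direction (plus small noise); a neutral
# token is unrelated to it.
W_U = {
    "anger_en": add(anger, rand_vec(), scale=0.1),
    "anger_zh": add(anger, rand_vec(), scale=0.1),  # e.g. a token like 愤怒
    "neutral":  rand_vec(),
}

def logits(h):
    return {tok: dot(h, row) for tok, row in W_U.items()}

h = [0.1 * x for x in rand_vec()]    # baseline hidden state
steered = add(h, anger, scale=3.0)   # representation-level steering

base, after = logits(h), logits(steered)

# Both language variants rise together under steering, because they
# share the semantic direction the intervention manipulates.
delta_en = after["anger_en"] - base["anger_en"]
delta_zh = after["anger_zh"] - base["anger_zh"]

# A token-level "safety filter" that masks only the English token
# leaves the elevated Chinese logit untouched.
filtered = dict(after)
filtered["anger_en"] = float("-inf")

print(f"en logit shift: {delta_en:.2f}, zh logit shift: {delta_zh:.2f}")
print(f"zh logit after English-only filtering: {filtered['anger_zh']:.2f}")
```

In this sketch the English-only filter is a stand-in for safety training whose suppression was learned on English data: because the intervention acts on the shared direction while the filter acts on individual tokens, the Chinese token's elevated logit survives, mirroring the claimed deployment gap.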