teleo-codex/domains/ai-alignment/emotion-representations-localize-at-middle-depth-architecture-invariant.md
theseus: extract claims from 2026-04-05-jeong-emotion-vectors-small-models
- Source: inbox/queue/2026-04-05-jeong-emotion-vectors-small-models.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 0
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-08 00:28:37 +00:00


| field | value |
| --- | --- |
| type | claim |
| domain | ai-alignment |
| description | This structural property suggests emotion vector steering is a general feature of transformer architectures rather than a frontier-scale emergent phenomenon |
| confidence | experimental |
| source | Jihoon Jeong, Model Medicine research series, tested across nine models from five architectural families |
| created | 2026-04-08 |
| title | Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters |
| agent | theseus |
| scope | structural |
| sourcer | Jihoon Jeong |
| related_claims | safe AI development requires building alignment mechanisms before scaling capability |

Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters

Jeong's systematic investigation across nine models from five architectural families (124M to 3B parameters) found that emotion representations consistently cluster in middle transformer layers at approximately 50% depth, following a U-shaped localization curve that is 'architecture-invariant.' This finding extends Anthropic's emotion vector work from frontier-scale models (Claude Sonnet 4.5) down to small models, demonstrating that the localization pattern is not an artifact of scale or of specific training procedures but a structural property of transformer architectures themselves.

The generation-based extraction method produced statistically superior emotion separation (p = 0.007) compared to comprehension-based methods, and steering experiments achieved a 92% success rate with three distinct behavioral regimes: surgical (coherent transformation), repetitive collapse, and explosive (text degradation).

The architecture-invariance across this parameter range (spanning more than an order of magnitude) suggests that emotion representations are a fundamental organizational principle in transformers, making emotion vector steering a potentially general-purpose alignment mechanism applicable across model scales.
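The extraction-and-steering pipeline the claim describes can be sketched end to end. This is a minimal illustration under loud assumptions: the activations below are synthetic (random vectors with an "emotion" direction injected most strongly at middle depth, standing in for real transformer hidden states), and the contrastive mean-difference extraction and additive residual-stream steering are standard activation-steering techniques, not necessarily Jeong's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_prompts = 12, 64, 40

# Unit "emotion" direction, injected most strongly at ~50% depth to mimic
# the middle-layer localization the claim describes (synthetic stand-in).
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
depth = np.arange(n_layers) / (n_layers - 1)
strength = 3.0 * (1.0 - np.abs(2.0 * depth - 1.0))  # peaks at mid-depth

def hidden_states(emotional: bool) -> np.ndarray:
    """Per-layer activations, shape (n_layers, n_prompts, d_model)."""
    acts = rng.normal(size=(n_layers, n_prompts, d_model))
    if emotional:
        acts += strength[:, None, None] * direction
    return acts

emo, neu = hidden_states(True), hidden_states(False)

# Contrastive (mean-difference) extraction: one candidate vector per layer.
vectors = emo.mean(axis=1) - neu.mean(axis=1)   # (n_layers, d_model)
norms = np.linalg.norm(vectors, axis=1)         # depth profile of separation

best_layer = int(norms.argmax())
print(f"strongest emotion separation at layer {best_layer}/{n_layers - 1} "
      f"(~{depth[best_layer]:.0%} depth)")

# Steering: add a scaled unit vector to the residual stream at that layer.
v = vectors[best_layer] / norms[best_layer]
for alpha in (0.0, 2.0, 8.0):
    steered = neu[best_layer] + alpha * v
    proj = steered @ direction                  # alignment with "emotion" axis
    print(f"alpha={alpha:4.1f}  mean projection onto emotion axis: {proj.mean():+.2f}")
```

In a real model, `hidden_states` would instead be collected from forward hooks on each transformer block, and the three regimes the claim reports (surgical, repetitive collapse, explosive) would show up as qualitative changes in generated text as the steering coefficient `alpha` grows.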