---
type: source
title: "Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in Transformers"
author: "Francesca Bianco, Derek Shiller"
url: https://arxiv.org/abs/2602.19159
date: 2026-02-26
domain: ai-alignment
secondary_domains: []
format: paper
status: null-result
priority: low
tags: [valence, mechanistic-interpretability, emotion, pain-pleasure, causal-intervention, AI-welfare, interpretability]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

Mechanistic study of how Gemma-2-9B-it processes valence (pain vs. pleasure framing) in decision tasks. Uses layer-wise linear probing, causal testing through activation interventions, and dose-response quantification.

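The layer-wise probing step can be illustrated with a minimal sketch. It runs on synthetic vectors, not Gemma-2-9B-it activations, and it uses a difference-of-means probe as a simple stand-in for whatever probe the paper trains. The function names, the choice of axis 0 as the valence direction, and the per-layer signal strengths are all assumptions made for illustration.

```python
import random


def make_layer_activations(n, dim, signal, rng):
    """Synthetic stand-in for activations at one layer (not real model data)."""
    data = []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1      # +1 = pleasure framing, -1 = pain framing
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal * label             # assumed valence direction: axis 0
        data.append((vec, label))
    return data


def fit_mean_difference_probe(train):
    """Linear probe: w = mean(pleasure) - mean(pain), threshold at the midpoint."""
    dim = len(train[0][0])
    pos = [v for v, y in train if y == 1]
    neg = [v for v, y in train if y == -1]
    mu_pos = [sum(v[j] for v in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(v[j] for v in neg) / len(neg) for j in range(dim)]
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wj * (p + q) / 2 for wj, p, q in zip(w, mu_pos, mu_neg))
    return w, b


def probe_accuracy(w, b, test):
    """Fraction of held-out examples whose valence sign the probe recovers."""
    correct = 0
    for v, y in test:
        score = sum(wj * vj for wj, vj in zip(w, v)) + b
        correct += (1 if score > 0 else -1) == y
    return correct / len(test)


def layerwise_probe(signals, n=400, dim=16, seed=0):
    """Fit and evaluate one probe per layer; signals[i] is layer i's signal strength."""
    rng = random.Random(seed)
    accs = []
    for signal in signals:
        data = make_layer_activations(n, dim, signal, rng)
        train, test = data[: n // 2], data[n // 2:]
        w, b = fit_mean_difference_probe(train)
        accs.append(probe_accuracy(w, b, test))
    return accs


if __name__ == "__main__":
    # Layer 0 carries no valence signal; layers 1 and 2 carry a strong one.
    for layer, acc in enumerate(layerwise_probe([0.0, 6.0, 6.0])):
        print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

In a real replication the synthetic vectors would be replaced by cached activations per layer and stream, and the per-layer accuracy curve is what a separability claim rests on. A layer with no signal should sit near chance, which is the control the first entry provides.
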
**Key findings:**
- Valence sign (pain vs. pleasure) is "perfectly linearly separable across stream families from very early layers (L0-L1)": emotional framing is encoded almost immediately
- Graded intensity peaks in mid-to-late layers
- Decision alignment is highest shortly before final token generation
- Causal demonstration: steering along valence directions modulates choice margins in late-layer attention outputs

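The steering and dose-response idea behind the causal finding can be sketched on a toy linear readout. This is not the paper's procedure: the hidden states, the valence direction, and the A-versus-B readout vector are random and hypothetical, and every function name is an assumption. The sketch adds `alpha * direction` to each hidden state, sweeps `alpha`, and fits a slope to the mean choice margin.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Activation addition: shift a hidden state by alpha along a direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def mean_choice_margin(hiddens, readout, direction, alpha):
    """Mean logit margin (option A minus option B) after steering."""
    margins = [dot(readout, steer(h, direction, alpha)) for h in hiddens]
    return sum(margins) / len(margins)


def least_squares_slope(xs, ys):
    """Slope of the best-fit line through (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def dose_response(alphas, dim=16, n=50, seed=0):
    """Sweep steering strength and measure the shift in choice margin."""
    rng = random.Random(seed)
    hiddens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    direction = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])  # hypothetical valence direction
    readout = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # hypothetical A-vs-B readout
    margins = [mean_choice_margin(hiddens, readout, direction, a) for a in alphas]
    slope = least_squares_slope(alphas, margins)
    return margins, slope, dot(readout, direction)


if __name__ == "__main__":
    alphas = [-4.0, -2.0, 0.0, 2.0, 4.0]
    margins, slope, expected = dose_response(alphas)
    for a, m in zip(alphas, margins):
        print(f"alpha = {a:+.1f}  mean margin = {m:+.3f}")
    print(f"fitted slope = {slope:.3f}, readout . direction = {expected:.3f}")
```

Because the toy readout is linear, the margin changes linearly in `alpha` with slope equal to the readout's projection onto the steering direction. In an actual transformer the downstream computation is nonlinear, so the dose-response curve has to be measured empirically and need not be a straight line.
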
**Framing:** Supports "evidence-driven debate on AI sentience and welfare" and governance decisions for auditing and safety safeguards.

## Agent Notes

**Why this matters:** Complements the emotion vectors work along a different axis: not emotion type (desperation, calm) but valence polarity (pain/pleasure). The finding that valence is linearly separable from L0-L1 (the earliest layers) is structurally significant: if emotional framing is encoded from the very first layers and causally influences decisions downstream, this suggests a richer picture of how internal representations shape behavior throughout the computation.

**What surprised me:** The governance framing around AI welfare is a secondary but emerging thread. If valence representations causally modulate decisions, this is relevant to both AI welfare questions and alignment (a model whose activations carry "pain" representations may behave differently). This is a low-priority KB concern for now but worth tracking.

**What I expected but didn't find:** A connection to safety interventions. The paper focuses on understanding rather than intervening: it maps where valence lives but doesn't test whether you can steer away from harm-associated valuations, as Anthropic did with blackmail/desperation.

**KB connections:**
- Extends the Anthropic emotion vectors work by adding valence polarity to the picture (that work focused on named emotion concepts like desperation/calm; this focuses on the fundamental pain/pleasure axis)
- The early-layer encoding of valence complements SafeThink's "early crystallization" finding: if safety-relevant representations form in early layers, there may be a detection window even before reasoning unfolds

**Extraction hints:**
- Low priority for an independent claim; better used as supporting evidence for emotion vector claims extracted from the Anthropic paper
- If extracted: "Valence polarity is linearly separable in transformer activations from the earliest layers (L0-L1) and causally influences decision outcomes in late-layer attention, establishing that emotional framing enters model computation immediately and shapes behavior throughout the reasoning chain."

## Curator Notes

PRIMARY CONNECTION: (Anthropic emotion vectors paper, Session 23 claim candidates)
WHY ARCHIVED: Completes the mechanistic picture of how affect enters transformer computation: early-layer encoding + causal late-layer modulation. Supports the emotion vector claim series.
EXTRACTION HINT: Use as supporting evidence for the emotion vectors claim series rather than standalone. The L0-L1 early encoding finding is the novel contribution.