- What: 3 NEW claims (society-of-thought emergence, LLMs-as-cultural-ratchet, recursive spawning) + 5 enrichments (intelligence-as-network, collective-intelligence-measurable, centaur, RLHF-failure, Ostrom) + 2 source archives
- Why: Evans, Bratton & Agüera y Arcas (2026) and Kim et al. (2026) provide independent convergent evidence for the collective-superintelligence thesis from Google's Paradigms of Intelligence Team. Kim et al. is the strongest empirical evidence that reasoning IS social cognition (feature steering doubles accuracy, 27%→55%). ~70-80% overlap with existing KB = convergent validation.
- Source: Contributed by @thesensatore (Telegram)
- Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>

---
type: source
title: "Reasoning Models Generate Societies of Thought"
author: "Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans"
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: "Primary empirical source cited by Evans et al. 2026. Controlled experiments showing a causal link between conversational behaviors and reasoning accuracy. Feature steering doubles accuracy. RL training spontaneously produces multi-perspective debate. The strongest empirical evidence that reasoning IS social cognition."
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
  - "reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve"
enrichments:
  - "collective intelligence is a measurable property of group interaction structure — Big Five personality diversity in reasoning traces mirrors Woolley c-factor"
tags: [society-of-thought, reasoning, collective-intelligence, mechanistic-interpretability, reinforcement-learning, feature-steering, causal-evidence]
notes: "8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR. Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines. Methods: LLM-as-judge, sparse autoencoder feature analysis, activation steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639)."
---

# Reasoning Models Generate Societies of Thought

Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.

## Core Finding

Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.

## Key Results

### Conversational Behaviors in Reasoning Traces

DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):

- Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
- Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
- Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵

QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).

### Causal Evidence via Feature Steering

Sparse autoencoder Feature 30939 ("conversational surprise"):

- Conversation ratio: 65.7% (99th percentile)
- Sparsity: 0.016% of tokens
- **Steering +10: accuracy doubled from 27.1% to 54.8%** on the Countdown task
- Steering -10: accuracy reduced to 23.8%

Steering causally induced conversational behaviors:

- Question-answering: β=2.199, p<1×10⁻¹⁴
- Perspective shifts: β=1.160, p<1×10⁻⁵
- Conflict: β=1.062, p=0.002
- Reconciliation: β=0.423, p<1×10⁻²⁷
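
The intervention behind these numbers can be sketched as adding the SAE feature's decoder direction to the residual stream with a signed coefficient. A minimal numpy sketch, with toy dimensions and a random stand-in for Feature 30939's decoder direction (the real intervention runs inside the model's forward pass at the SAE's layer):

```python
import numpy as np

# Sketch of activation steering, assuming the paper's setup: an SAE on a
# mid-layer residual stream, and a chosen feature whose unit decoder
# direction d is added to each token's hidden state with coefficient
# alpha (+10 / -10). All dimensions and values here are toy stand-ins.

d_model = 8
rng = np.random.default_rng(0)

d = rng.normal(size=d_model)   # stand-in decoder direction
d /= np.linalg.norm(d)         # SAE decoder columns are unit-norm

def steer(hidden, direction, alpha):
    """Add alpha * direction to every token position's hidden state."""
    return hidden + alpha * direction

hidden = rng.normal(size=(5, d_model))   # (seq_len, d_model) residual stream
boosted = steer(hidden, d, +10.0)        # push toward the feature
suppressed = steer(hidden, d, -10.0)     # push away from it

# Steering shifts every position along d by exactly alpha.
print(((boosted - hidden) @ d).round(3))  # → [10. 10. 10. 10. 10.]
```

In a real run the same addition would be applied via a forward hook at the SAE's layer before the remaining transformer blocks.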

### Mechanistic Pathway (Structural Equation Model)

- Direct effect of conversational features on accuracy: β=.228, 95% CI=[.183, .273], z=9.98, p<1×10⁻²²
- Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=.066, 95% CI=[.046, .086], z=6.38, p<1×10⁻¹⁰
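
The direct/indirect decomposition follows standard mediation logic: the effect of conversational features (X) on accuracy (Y) splits into a direct path and an indirect path through cognitive strategies (M). A product-of-coefficients sketch on synthetic data (the coefficients 0.5, 0.4, 0.3 are invented for illustration; the paper fits a full SEM, not this two-regression shortcut):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(scale=0.5, size=n)            # X -> M (path a)
Y = 0.3 * X + 0.4 * M + rng.normal(scale=0.5, size=n)  # direct c' and M -> Y (path b)

def ols(y, *cols):
    """Least-squares slopes for y ~ intercept + cols."""
    A = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

a, = ols(M, X)              # X -> M
c_direct, b = ols(Y, X, M)  # X -> Y controlling for M, and M -> Y
indirect = a * b            # product-of-coefficients indirect effect

print(f"direct ≈ {c_direct:.2f}, indirect ≈ {indirect:.2f}")  # ≈ 0.30 and 0.20
```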

### Personality and Expertise Diversity

Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:

- Neuroticism: β=0.567, p<1×10⁻³²³
- Agreeableness: β=0.297, p<1×10⁻¹¹³
- Openness: β=0.110, p<1×10⁻¹⁶
- Extraversion: β=0.103, p<1×10⁻¹³
- Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶

Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).

### Spontaneous Emergence Under RL

Qwen-2.5-3B on the Countdown task:

- Conversational behaviors emerged spontaneously from the accuracy reward alone — no social-scaffolding instruction
- Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
- Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
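
The "accuracy reward alone" point is worth making concrete: in the usual Countdown formulation the model emits an arithmetic expression over given numbers and is rewarded only for hitting the target. A hedged sketch of such a checker (illustrative, not the paper's code):

```python
import ast

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to target."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return 0.0
    used = sorted(
        node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)
    )
    if used != sorted(numbers):
        return 0.0  # must use each provided number exactly once
    allowed = (ast.Expression, ast.BinOp, ast.Constant, ast.UnaryOp,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
    if not all(isinstance(node, allowed) for node in ast.walk(tree)):
        return 0.0  # arithmetic expressions only
    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0

print(countdown_reward("(25 - 4) * 2", [2, 4, 25], 42))  # → 1.0
print(countdown_reward("25 + 4 + 2", [2, 4, 25], 42))    # → 0.0
```

No part of this signal mentions dialogue or perspectives, which is what makes the emergence of conversational behavior under RL notable.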

### Cross-Domain Transfer

Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.

## Socio-Emotional Roles (Bales' IPA Framework)

Reasoning models exhibited reciprocal interaction roles:

- Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
- Negative roles: β=0.162, p<1×10⁻¹⁰
- Positive roles: β=0.278, p<1×10⁻²⁵⁴
- Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
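
A sketch of a Jaccard-style ask-give balance score, assuming it measures overlap between the trace segments that contain "asking" acts (questions, requests) and those that contain "giving" acts (answers, suggestions); the segment labels below are invented for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; 0.0 for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical trace: segment indices where each act type occurs.
ask_segments = {1, 2, 4, 7}   # the trace asks (questions, requests)
give_segments = {2, 3, 4, 7}  # the trace gives (answers, suggestions)

balance = jaccard(ask_segments, give_segments)
print(round(balance, 2))  # → 0.6 (3 shared segments / 5 total)
```

A balanced trace, where asking and giving co-occur in the same segments, scores near 1; a one-sided trace scores near 0.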

## Methodology

- 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
- Models: DeepSeek-R1-0528 (671B) and QwQ-32B vs. DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
- LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs. human speaker identification
- Sparse autoencoder: Layer 15, 32,768 features
- Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
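
The problem-level fixed-effects design deserves a gloss: when each problem is attempted by both a reasoning model and its baseline, the problem's difficulty cancels out of the within-problem comparison, and the estimator reduces to the mean within-problem difference. A sketch on synthetic data (the true model effect of 0.1 and the difficulty scale are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_problems = 4_000
difficulty = rng.normal(size=n_problems)  # problem-level fixed effect

# One attempt per problem for the baseline and for the reasoning model;
# the true effect of "reasoning model" on P(correct) is 0.1.
p_base = 0.5 + 0.15 * np.tanh(difficulty)
p_reason = p_base + 0.1
y_base = (rng.random(n_problems) < p_base).astype(float)
y_reason = (rng.random(n_problems) < p_reason).astype(float)

# With paired observations, demeaning within problem absorbs difficulty,
# and the fixed-effects slope is the mean within-problem difference.
beta = (y_reason - y_base).mean()
print(f"beta ≈ {beta:.2f}")  # recovers the true 0.1 effect, up to noise
```

The paper's β coefficients on conversational behaviors come from this kind of within-problem comparison, so they are not confounded by some problems simply being harder than others.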

## Limitations

- Smaller-model experiments (3B) used simple tasks only
- SAE analysis was limited to DeepSeek-R1-Llama-8B (distilled)
- Philosophical ambiguity: "simulating multi-agent discourse" vs. "an individual mind simulating social interaction" remains unresolved