---
type: source
title: "Reasoning Models Generate Societies of Thought"
author: "Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans"
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: "Primary empirical source cited by Evans et al. 2026. Controlled experiments showing a causal link between conversational behaviors and reasoning accuracy. Feature steering doubles accuracy. RL training spontaneously produces multi-perspective debate. The strongest empirical evidence that reasoning IS social cognition."
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
  - "reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve"
enrichments:
  - "collective intelligence is a measurable property of group interaction structure — Big Five personality diversity in reasoning traces mirrors Woolley's c-factor"
tags: [society-of-thought, reasoning, collective-intelligence, mechanistic-interpretability, reinforcement-learning, feature-steering, causal-evidence]
notes: "8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR. Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines. Methods: LLM-as-judge, sparse autoencoder feature analysis, activation steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639)."
---

# Reasoning Models Generate Societies of Thought

Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.
## Core Finding

Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.

## Key Results

### Conversational Behaviors in Reasoning Traces

DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):

- Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
- Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
- Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵

QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).

### Causal Evidence via Feature Steering

Sparse autoencoder Feature 30939 ("conversational surprise"):

- Conversation ratio: 65.7% (99th percentile)
- Sparsity: 0.016% of tokens
- **Steering +10: accuracy doubled from 27.1% to 54.8%** on the Countdown task
- Steering -10: accuracy reduced to 23.8%

Steering causally induced conversational behaviors:

- Question-answering: β=2.199, p<1×10⁻¹⁴
- Perspective shifts: β=1.160, p<1×10⁻⁵
- Conflict: β=1.062, p=0.002
- Reconciliation: β=0.423, p<1×10⁻²⁷

### Mechanistic Pathway (Structural Equation Model)

- Direct effect of conversational features on accuracy: β=0.228, 95% CI=[0.183, 0.273], z=9.98, p<1×10⁻²²
- Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=0.066, 95% CI=[0.046, 0.086], z=6.38, p<1×10⁻¹⁰

### Personality and Expertise Diversity

Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:

- Neuroticism: β=0.567, p<1×10⁻³²³
- Agreeableness: β=0.297, p<1×10⁻¹¹³
- Openness: β=0.110, p<1×10⁻¹⁶
- Extraversion: β=0.103, p<1×10⁻¹³
- Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶

Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).
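The steering intervention above amounts to adding a scaled SAE decoder direction to residual-stream activations at the hooked layer. A minimal sketch of that arithmetic follows; the unit-normalization, array shapes, and toy data are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def steer_activations(resid, decoder, feature_idx, scale):
    """Add a scaled SAE decoder direction to residual-stream activations.

    resid:       (seq_len, d_model) activations at the hooked layer
    decoder:     (n_features, d_model) SAE decoder matrix
    feature_idx: which learned feature to steer (e.g. 30939 in the paper)
    scale:       steering coefficient (+10 amplifies, -10 suppresses)
    """
    direction = decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit-norm direction (assumed)
    return resid + scale * direction  # broadcast across all token positions

# Toy example: every token position shifts along one fixed direction.
rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 8))      # 4 tokens, d_model=8 (toy sizes)
decoder = rng.normal(size=(16, 8))   # 16 SAE features (toy sizes)
steered = steer_activations(resid, decoder, feature_idx=3, scale=10.0)
```

In practice such a hook would run at every forward pass during generation (here, layer 15 of the distilled 8B model per the Methodology section), leaving the model's weights untouched.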
### Spontaneous Emergence Under RL

Qwen-2.5-3B on the Countdown task:

- Conversational behaviors emerged spontaneously from the accuracy reward alone, with no instruction to use social scaffolding
- Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
- Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)

### Cross-Domain Transfer

Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.

## Socio-Emotional Roles (Bales' IPA Framework)

Reasoning models exhibited reciprocal interaction roles:

- Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
- Negative roles: β=0.162, p<1×10⁻¹⁰
- Positive roles: β=0.278, p<1×10⁻²⁵⁴
- Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹

## Methodology

- 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
- Models: DeepSeek-R1-0528 (671B) and QwQ-32B vs. DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
- LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ against human speaker identification
- Sparse autoencoder: layer 15, 32,768 features
- Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors

## Limitations

- Smaller-model experiments (3B) used simple tasks only
- SAE analysis limited to DeepSeek-R1-Llama-8B (distilled)
- Philosophical ambiguity remains unresolved: "simulating multi-agent discourse" vs. "an individual mind simulating social interaction"
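The ask-give balance reported under Socio-Emotional Roles is a Jaccard overlap. A minimal sketch, assuming asking and giving behaviors have each been coded as the set of reasoning turns in which they occur (this turn-set encoding is my assumption, not confirmed by the paper):

```python
def jaccard_balance(ask_turns, give_turns):
    """Jaccard overlap between turns containing asking vs. giving behaviors.

    Higher values mean asking and giving co-occur across the same turns,
    i.e. the trace behaves like a reciprocal exchange rather than a
    one-sided stream of assertions or questions.
    """
    ask, give = set(ask_turns), set(give_turns)
    if not ask and not give:
        return 0.0  # no coded behavior at all: define balance as zero
    return len(ask & give) / len(ask | give)

# Toy trace of 5 turns: asking coded in turns {0, 2, 3}, giving in {2, 3, 4}.
balance = jaccard_balance([0, 2, 3], [2, 3, 4])  # → 2/4 = 0.5
```

Such a per-trace score would then serve as the dependent variable in the fixed-effects comparison of reasoning vs. instruction-tuned models (the β=0.222 contrast above).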