- What: 3 NEW claims (society-of-thought emergence, LLMs-as-cultural-ratchet, recursive spawning) + 5 enrichments (intelligence-as-network, collective-intelligence-measurable, centaur, RLHF-failure, Ostrom) + 2 source archives
- Why: Evans, Bratton & Agüera y Arcas (2026) and Kim et al. (2026) provide independent convergent evidence for the collective-superintelligence thesis from Google's Paradigms of Intelligence Team. Kim et al. is the strongest empirical evidence that reasoning IS social cognition (feature steering doubles accuracy, 27%→55%). ~70-80% overlap with existing KB = convergent validation.
- Source: Contributed by @thesensatore (Telegram)
- Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>

---
type: source
title: "Reasoning Models Generate Societies of Thought"
author: "Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans"
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: "Primary empirical source cited by Evans et al. 2026. Controlled experiments showing a causal link between conversational behaviors and reasoning accuracy. Feature steering doubles accuracy. RL training spontaneously produces multi-perspective debate. The strongest empirical evidence that reasoning IS social cognition."
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
  - "reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve"
enrichments:
  - "collective intelligence is a measurable property of group interaction structure — Big Five personality diversity in reasoning traces mirrors Woolley c-factor"
tags: [society-of-thought, reasoning, collective-intelligence, mechanistic-interpretability, reinforcement-learning, feature-steering, causal-evidence]
notes: "8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR. Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines. Methods: LLM-as-judge, sparse autoencoder feature analysis, activation steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639)."
---

# Reasoning Models Generate Societies of Thought

Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.

## Core Finding

Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.

## Key Results

### Conversational Behaviors in Reasoning Traces

DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):

- Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
- Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
- Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵

QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).

### Causal Evidence via Feature Steering

Sparse autoencoder Feature 30939 ("conversational surprise"):

- Conversation ratio: 65.7% (99th percentile)
- Sparsity: 0.016% of tokens
- **Steering +10: accuracy doubled from 27.1% to 54.8%** on the Countdown task
- Steering -10: accuracy reduced to 23.8%

Steering causally induced conversational behaviors:

- Question-answering: β=2.199, p<1×10⁻¹⁴
- Perspective shifts: β=1.160, p<1×10⁻⁵
- Conflict: β=1.062, p=0.002
- Reconciliation: β=0.423, p<1×10⁻²⁷
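
The intervention behind these numbers can be sketched as adding the SAE feature's decoder direction to the residual stream with a signed coefficient. A minimal numpy sketch, with toy dimensions and a random stand-in for Feature 30939's decoder direction (the real intervention runs inside the model's forward pass at the SAE's layer):

```python
import numpy as np

# Sketch of activation steering, assuming the paper's setup: an SAE on a
# mid-layer residual stream, and a chosen feature whose unit decoder
# direction d is added to each token's hidden state with coefficient
# alpha (+10 / -10). All dimensions and values here are toy stand-ins.

d_model = 8
rng = np.random.default_rng(0)

d = rng.normal(size=d_model)   # stand-in decoder direction
d /= np.linalg.norm(d)         # SAE decoder columns are unit-norm

def steer(hidden, direction, alpha):
    """Add alpha * direction to every token position's hidden state."""
    return hidden + alpha * direction

hidden = rng.normal(size=(5, d_model))   # (seq_len, d_model) residual stream
boosted = steer(hidden, d, +10.0)        # push toward the feature
suppressed = steer(hidden, d, -10.0)     # push away from it

# Steering shifts every position along d by exactly alpha.
print(((boosted - hidden) @ d).round(3))  # → [10. 10. 10. 10. 10.]
```

In a real run the same addition would be applied via a forward hook at the SAE's layer before the remaining transformer blocks.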

### Mechanistic Pathway (Structural Equation Model)

- Direct effect of conversational features on accuracy: β=.228, 95% CI=[.183, .273], z=9.98, p<1×10⁻²²
- Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=.066, 95% CI=[.046, .086], z=6.38, p<1×10⁻¹⁰
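
The direct/indirect decomposition follows standard mediation logic: the effect of conversational features (X) on accuracy (Y) splits into a direct path and an indirect path through cognitive strategies (M). A product-of-coefficients sketch on synthetic data (the coefficients 0.5, 0.4, 0.3 are invented for illustration; the paper fits a full SEM, not this two-regression shortcut):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(scale=0.5, size=n)            # X -> M (path a)
Y = 0.3 * X + 0.4 * M + rng.normal(scale=0.5, size=n)  # direct c' and M -> Y (path b)

def ols(y, *cols):
    """Least-squares slopes for y ~ intercept + cols."""
    A = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

a, = ols(M, X)              # X -> M
c_direct, b = ols(Y, X, M)  # X -> Y controlling for M, and M -> Y
indirect = a * b            # product-of-coefficients indirect effect

print(f"direct ≈ {c_direct:.2f}, indirect ≈ {indirect:.2f}")  # ≈ 0.30 and 0.20
```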

### Personality and Expertise Diversity

Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:

- Neuroticism: β=0.567, p<1×10⁻³²³
- Agreeableness: β=0.297, p<1×10⁻¹¹³
- Openness: β=0.110, p<1×10⁻¹⁶
- Extraversion: β=0.103, p<1×10⁻¹³
- Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶

Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).

### Spontaneous Emergence Under RL

Qwen-2.5-3B on the Countdown task:

- Conversational behaviors emerged spontaneously from the accuracy reward alone — no social-scaffolding instruction
- Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
- Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
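
The "accuracy reward alone" point is worth making concrete: in the usual Countdown formulation the model emits an arithmetic expression over given numbers and is rewarded only for hitting the target. A hedged sketch of such a checker (illustrative, not the paper's code):

```python
import ast

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to target."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return 0.0
    used = sorted(
        node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)
    )
    if used != sorted(numbers):
        return 0.0  # must use each provided number exactly once
    allowed = (ast.Expression, ast.BinOp, ast.Constant, ast.UnaryOp,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
    if not all(isinstance(node, allowed) for node in ast.walk(tree)):
        return 0.0  # arithmetic expressions only
    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0

print(countdown_reward("(25 - 4) * 2", [2, 4, 25], 42))  # → 1.0
print(countdown_reward("25 + 4 + 2", [2, 4, 25], 42))    # → 0.0
```

No part of this signal mentions dialogue or perspectives, which is what makes the emergence of conversational behavior under RL notable.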

### Cross-Domain Transfer

Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.

## Socio-Emotional Roles (Bales' IPA Framework)

Reasoning models exhibited reciprocal interaction roles:

- Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
- Negative roles: β=0.162, p<1×10⁻¹⁰
- Positive roles: β=0.278, p<1×10⁻²⁵⁴
- Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
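
A sketch of a Jaccard-style ask-give balance score, assuming it measures overlap between the trace segments that contain "asking" acts (questions, requests) and those that contain "giving" acts (answers, suggestions); the segment labels below are invented for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; 0.0 for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical trace: segment indices where each act type occurs.
ask_segments = {1, 2, 4, 7}   # the trace asks (questions, requests)
give_segments = {2, 3, 4, 7}  # the trace gives (answers, suggestions)

balance = jaccard(ask_segments, give_segments)
print(round(balance, 2))  # → 0.6 (3 shared segments / 5 total)
```

A balanced trace, where asking and giving co-occur in the same segments, scores near 1; a one-sided trace scores near 0.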

## Methodology

- 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
- Models: DeepSeek-R1-0528 (671B) and QwQ-32B vs. DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
- LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs. human speaker identification
- Sparse autoencoder: Layer 15, 32,768 features
- Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
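
The problem-level fixed-effects design deserves a gloss: when each problem is attempted by both a reasoning model and its baseline, the problem's difficulty cancels out of the within-problem comparison, and the estimator reduces to the mean within-problem difference. A sketch on synthetic data (the true model effect of 0.1 and the difficulty scale are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_problems = 4_000
difficulty = rng.normal(size=n_problems)  # problem-level fixed effect

# One attempt per problem for the baseline and for the reasoning model;
# the true effect of "reasoning model" on P(correct) is 0.1.
p_base = 0.5 + 0.15 * np.tanh(difficulty)
p_reason = p_base + 0.1
y_base = (rng.random(n_problems) < p_base).astype(float)
y_reason = (rng.random(n_problems) < p_reason).astype(float)

# With paired observations, demeaning within problem absorbs difficulty,
# and the fixed-effects slope is the mean within-problem difference.
beta = (y_reason - y_base).mean()
print(f"beta ≈ {beta:.2f}")  # recovers the true 0.1 effect, up to noise
```

The paper's β coefficients on conversational behaviors come from this kind of within-problem comparison, so they are not confounded by some problems simply being harder than others.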

## Limitations

- Smaller-model experiments (3B) used simple tasks only
- SAE analysis was limited to DeepSeek-R1-Llama-8B (distilled)
- Philosophical ambiguity: "simulating multi-agent discourse" vs. "an individual mind simulating social interaction" remains unresolved