- What: 3 new claims (society-of-thought emergence, LLMs-as-cultural-ratchet, recursive spawning) + 5 enrichments (intelligence-as-network, collective-intelligence-measurable, centaur, RLHF-failure, Ostrom) + 2 source archives
- Why: Evans, Bratton & Agüera y Arcas (2026) and Kim et al. (2026), both from Google's Paradigms of Intelligence Team, provide independent convergent evidence for the collective-superintelligence thesis. Kim et al. is the strongest empirical evidence that reasoning IS social cognition (feature steering doubles accuracy, 27%→55%). ~70-80% overlap with the existing KB = convergent validation.
- Source: Contributed by @thesensatore (Telegram)

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
| type | title | author | url | date | domain | intake_tier | rationale | proposed_by | format | status | processed_by | processed_date | claims_extracted | enrichments | tags | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| source | Reasoning Models Generate Societies of Thought | Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans | https://arxiv.org/abs/2601.10825 | 2026-01-15 | collective-intelligence | research-task | Primary empirical source cited by Evans et al. 2026. Controlled experiments showing a causal link between conversational behaviors and reasoning accuracy. Feature steering doubles accuracy. RL training spontaneously produces multi-perspective debate. The strongest empirical evidence that reasoning IS social cognition. | Theseus | paper | processed | theseus | 2026-04-14 | | | | 8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR. Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines. Methods: LLM-as-judge, sparse autoencoder feature analysis, activation steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639). |
Reasoning Models Generate Societies of Thought
Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.
Core Finding
Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.
Key Results
Conversational Behaviors in Reasoning Traces
DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):
- Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
- Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
- Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵
QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).
Causal Evidence via Feature Steering
Sparse autoencoder Feature 30939 ("conversational surprise"):
- Conversation ratio: 65.7% (99th percentile)
- Sparsity: 0.016% of tokens
- Steering +10: accuracy doubled from 27.1% to 54.8% on Countdown task
- Steering -10: reduced to 23.8%
Steering causally induced the conversational behaviors themselves:
- Question-answering: β=2.199, p<1×10⁻¹⁴
- Perspective shifts: β=1.160, p<1×10⁻⁵
- Conflict: β=1.062, p=0.002
- Reconciliation: β=0.423, p<1×10⁻²⁷
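The steering intervention can be sketched as follows. This is a minimal toy illustration, not the authors' code: it assumes a PyTorch module standing in for a transformer block, a unit-norm random vector standing in for the SAE decoder direction of Feature 30939, and a forward hook that shifts the layer's hidden states along that direction by the steering coefficient (+10 in the paper).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
layer = nn.Linear(d_model, d_model)  # stand-in for a transformer block

# Stand-in for the SAE decoder row of the steered feature (unit-norm direction)
steer_dir = torch.randn(d_model)
steer_dir = steer_dir / steer_dir.norm()

def make_steering_hook(direction, alpha):
    # A forward hook that returns a value replaces the module's output
    def hook(module, inputs, output):
        return output + alpha * direction
    return hook

x = torch.randn(4, d_model)
baseline = layer(x)

handle = layer.register_forward_hook(make_steering_hook(steer_dir, alpha=10.0))
steered = layer(x)
handle.remove()

# Every hidden state is shifted by alpha * direction relative to baseline
delta = steered - baseline
print(torch.allclose(delta, 10.0 * steer_dir.expand_as(delta), atol=1e-5))  # True
```

In the paper's setup the hook would sit on layer 15 of the distilled 8B model; here the mechanics are the same but the model is a toy.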
Mechanistic Pathway (Structural Equation Model)
- Direct effect of conversational features on accuracy: β=0.228, 95% CI=[0.183, 0.273], z=9.98, p<1×10⁻²²
- Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=0.066, 95% CI=[0.046, 0.086], z=6.38, p<1×10⁻¹⁰
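As a sanity check on the decomposition above: in standard SEM mediation arithmetic the total effect is the sum of the direct and indirect paths. The proportion-mediated figure below is computed here for illustration, not reported in the paper.

```python
# Reported SEM path coefficients
direct = 0.228    # conversational features -> accuracy
indirect = 0.066  # via verification, backtracking, subgoal setting, backward chaining

total = direct + indirect
prop_mediated = indirect / total
print(f"total effect = {total:.3f}")                  # 0.294
print(f"proportion mediated = {prop_mediated:.1%}")   # 22.4%
```

So roughly a fifth of the conversational features' benefit flows through the named cognitive strategies; the bulk is direct.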
Personality and Expertise Diversity
Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:
- Neuroticism: β=0.567, p<1×10⁻³²³
- Agreeableness: β=0.297, p<1×10⁻¹¹³
- Openness: β=0.110, p<1×10⁻¹⁶
- Extraversion: β=0.103, p<1×10⁻¹³
- Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶
Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).
Spontaneous Emergence Under RL
Qwen-2.5-3B on Countdown task:
- Conversational behaviors emerged spontaneously from the accuracy reward alone, with no instruction to use social scaffolding
- Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
- Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
Cross-Domain Transfer
Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.
Socio-Emotional Roles (Bales' IPA Framework)
Reasoning models exhibited reciprocal interaction roles:
- Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
- Negative roles: β=0.162, p<1×10⁻¹⁰
- Positive roles: β=0.278, p<1×10⁻²⁵⁴
- Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
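The paper's exact Jaccard operationalization of ask-give balance is not spelled out here; one plausible reading, sketched below with hypothetical IPA category sets, is the overlap between the asking and giving behavior categories a reasoning trace exhibits.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical example: categories active on the asking side vs. the
# giving side of a trace (names are illustrative, not the paper's labels)
asking = {"asks_for_opinion", "asks_for_suggestion", "asks_for_orientation"}
giving = {"gives_opinion", "gives_suggestion", "asks_for_orientation"}
print(round(jaccard(asking, giving), 2))  # 0.2
```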
Methodology
- 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
- Models: DeepSeek-R1-0528 (671B), QwQ-32B vs DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
- LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs human speaker identification
- Sparse autoencoder: Layer 15, 32,768 features
- Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
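A fixed-effects linear probability model of this kind can be sketched on synthetic data. This is not the authors' estimation code: it absorbs the problem fixed effect by within-problem demeaning, simulates a true +0.15 effect of the reasoning model on success probability, and omits the clustered standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_problems = 200
problem_difficulty = rng.normal(0, 0.1, n_problems)  # the problem fixed effect

rows = []
for p in range(n_problems):
    for is_reasoning in (0, 1):
        # Simulated truth: the reasoning model adds +0.15 to success probability
        prob = np.clip(0.4 + 0.15 * is_reasoning + problem_difficulty[p], 0, 1)
        correct = rng.random() < prob
        rows.append((p, is_reasoning, float(correct)))

data = np.array(rows)
pid, x, y = data[:, 0].astype(int), data[:, 1], data[:, 2]

# Demean regressor and outcome within each problem (absorbs the fixed effect)
for p in np.unique(pid):
    m = pid == p
    x[m] -= x[m].mean()
    y[m] -= y[m].mean()

# OLS slope on the demeaned data; should land near the true effect of 0.15
beta = (x @ y) / (x @ x)
print(f"estimated model effect beta = {beta:.3f}")
```

With two observations per problem, the demeaned estimator reduces to the average within-problem accuracy gap, which is exactly what the problem-level fixed effects buy: between-problem difficulty differences cannot contaminate the model comparison.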
Limitations
- Smaller model experiments (3B) used simple tasks only
- SAE analysis limited to DeepSeek-R1-Llama-8B (distilled)
- Philosophical ambiguity: "simulating multi-agent discourse" vs. "individual mind simulating social interaction" remains unresolved