---
type: source
title: "Reasoning Models Generate Societies of Thought"
author: "Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans"
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: "Primary empirical source cited by Evans et al. 2026. Controlled experiments showing causal link between conversational behaviors and reasoning accuracy. Feature steering doubles accuracy. RL training spontaneously produces multi-perspective debate. The strongest empirical evidence that reasoning IS social cognition."
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
- "reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve"
enrichments:
- "collective intelligence is a measurable property of group interaction structure — Big Five personality diversity in reasoning traces mirrors Woolley c-factor"
tags: [society-of-thought, reasoning, collective-intelligence, mechanistic-interpretability, reinforcement-learning, feature-steering, causal-evidence]
notes: "8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR. Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines. Methods: LLM-as-judge, sparse autoencoder feature analysis, activation steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639)."
---
# Reasoning Models Generate Societies of Thought
Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.
## Core Finding
Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.
## Key Results
### Conversational Behaviors in Reasoning Traces
DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):
- Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
- Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
- Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵
QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293 to 0.459).
### Causal Evidence via Feature Steering
Sparse autoencoder Feature 30939 ("conversational surprise"):
- Conversation ratio: 65.7% (99th percentile)
- Sparsity: 0.016% of tokens
- **Steering +10: accuracy doubled from 27.1% to 54.8%** on Countdown task
- Steering -10: reduced to 23.8%
Steering causally induced conversational behaviors:
- Question-answering: β=2.199, p<1×10⁻¹⁴
- Perspective shifts: β=1.160, p<1×10⁻⁵
- Conflict: β=1.062, p=0.002
- Reconciliation: β=0.423, p<1×10⁻²⁷
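The steering intervention adds a scaled copy of one SAE feature's decoder direction to the model's residual-stream activations. A minimal NumPy sketch of that arithmetic is below; the dimensions are toy-sized and the decoder matrix is random, so only the feature count (32,768) and the feature index (30939) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 32768      # paper's SAE has 32,768 features (d_model here is illustrative)
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows

def steer(h, feature_idx, alpha):
    """Add alpha times the SAE decoder direction for one feature
    to a residual-stream activation h (shape: [d_model])."""
    return h + alpha * W_dec[feature_idx]

h = rng.standard_normal(d_model)
h_plus = steer(h, feature_idx=30939, alpha=+10.0)   # Feature 30939: "conversational surprise"
h_minus = steer(h, feature_idx=30939, alpha=-10.0)

# With unit-norm decoder rows, steering moves the activation exactly
# |alpha| along the feature direction.
print(np.linalg.norm(h_plus - h))
```

In the paper this shift is applied at every token during generation; the sketch shows only the per-activation update.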
### Mechanistic Pathway (Structural Equation Model)
- Direct effect of conversational features on accuracy: β=.228, 95% CI=[.183, .273], z=9.98, p<1×10⁻²²
- Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=.066, 95% CI=[.046, .086], z=6.38, p<1×10⁻¹⁰
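Under the usual additive mediation decomposition, the total standardized effect is the sum of the direct and indirect paths. A quick arithmetic check on the reported betas:

```python
# Effect decomposition reported in the structural equation model (standardized betas):
direct = 0.228    # conversational features -> accuracy
indirect = 0.066  # via verification, backtracking, subgoal setting, backward chaining
total = direct + indirect
share_mediated = indirect / total  # fraction of the total effect routed through cognitive strategies
print(f"total effect = {total:.3f}, share mediated = {share_mediated:.1%}")
```

So roughly a fifth of the conversational-feature effect on accuracy runs through the named cognitive strategies, with the rest direct.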
### Personality and Expertise Diversity
Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:
- Neuroticism: β=0.567, p<1×10⁻³²³
- Agreeableness: β=0.297, p<1×10⁻¹¹³
- Openness: β=0.110, p<1×10⁻¹⁶
- Extraversion: β=0.103, p<1×10⁻¹³
- Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶
Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).
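The note does not spell out the diversity metric; one plausible operationalization is the spread of per-voice trait scores within a single reasoning trace. The sketch below uses the population standard deviation over hypothetical voices; the voice labels and scores are illustrative, not the paper's instrument.

```python
import statistics

# Hypothetical Big Five scores (0-1) assigned to distinct "voices"
# identified in one reasoning trace (illustrative data only).
trace_voices = {
    "skeptic":   {"neuroticism": 0.8, "agreeableness": 0.2},
    "explainer": {"neuroticism": 0.3, "agreeableness": 0.9},
    "verifier":  {"neuroticism": 0.5, "agreeableness": 0.6},
}

def trait_diversity(voices, trait):
    """Within-trace diversity of one trait: population std dev across voices."""
    scores = [v[trait] for v in voices.values()]
    return statistics.pstdev(scores)

print(round(trait_diversity(trace_voices, "neuroticism"), 3))
```

Comparing this statistic between reasoning models and instruction-tuned baselines, trace by trace, is the shape of the contrast the betas above summarize.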
### Spontaneous Emergence Under RL
Qwen-2.5-3B on Countdown task:
- Conversational behaviors emerged spontaneously from the accuracy reward alone, with no social-scaffolding instruction
- Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
- Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
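The key point is that the reward contained no social signal at all: a Countdown reward checks only whether the proposed arithmetic expression uses the given numbers and hits the target. A minimal sketch of such an accuracy-only reward (not the paper's exact code):

```python
import ast
import re

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Accuracy-only reward for the Countdown task: 1.0 iff the
    expression uses exactly the given numbers (each once) and
    evaluates to the target; otherwise 0.0. Illustrative sketch."""
    used = sorted(int(tok) for tok in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0
    try:
        # Parse in eval mode and evaluate with builtins disabled.
        value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"),
                     {"__builtins__": {}}, {})
    except Exception:
        return 0.0
    return 1.0 if value == target else 0.0

print(countdown_reward("(25 - 4) * 2", [25, 4, 2], 42))  # → 1.0
```

Nothing in this signal mentions questions, perspectives, or reconciliation; the conversational structure that emerges under RL is therefore instrumentally selected, not instructed.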
### Cross-Domain Transfer
Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.
## Socio-Emotional Roles (Bales' IPA Framework)
Reasoning models exhibited reciprocal interaction roles:
- Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
- Negative roles: β=0.162, p<1×10⁻¹⁰
- Positive roles: β=0.278, p<1×10⁻²⁵⁴
- Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
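The ask-give balance statistic is reported as a Jaccard index; a plausible reading is the overlap between the trace segments where the model asks (questions, requests for orientation) and those where it gives (answers, opinions). A sketch with hypothetical segment sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|: 1.0 when asks and gives
    occur in exactly the same segments, 0.0 when they never co-occur."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical segment indices in one reasoning trace where the model
# asks a question vs. gives an answer/opinion (illustrative data).
ask_segments = {1, 3, 4, 7}
give_segments = {2, 3, 4, 7, 9}

print(round(jaccard(ask_segments, give_segments), 3))  # → 0.5
```

A high index means asking and giving are interleaved in the same stretches of reasoning, i.e. reciprocal exchange rather than a one-way monologue.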
## Methodology
- 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
- Models: DeepSeek-R1-0528 (671B), QwQ-32B vs DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
- LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs human speaker identification
- Sparse autoencoder: Layer 15, 32,768 features
- Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
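A linear probability model with problem-level fixed effects is equivalent to demeaning the outcome and the regressor within each problem and running OLS on the residuals (the within-transformation). The sketch below recovers a known effect from synthetic paired data; the data-generating process and sample size are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_problems, true_beta = 200, 0.3

# Synthetic paired data: for each problem, one reasoning-model trace (x=1)
# and one instruction-tuned baseline trace (x=0); y is a binary behavior
# indicator whose probability shifts by true_beta (illustrative only).
problem = np.repeat(np.arange(n_problems), 2)
x = np.tile([1.0, 0.0], n_problems)
base = rng.uniform(0.1, 0.6, n_problems)[problem]   # problem-level intercept
y = (rng.uniform(size=2 * n_problems) < base + true_beta * x).astype(float)

def demean_by(group, v):
    """Subtract each group's mean from v (within-transformation)."""
    sums = np.zeros(group.max() + 1)
    np.add.at(sums, group, v)
    counts = np.bincount(group).astype(float)
    return v - (sums / counts)[group]

# OLS slope on within-demeaned data = fixed-effects beta.
y_t, x_t = demean_by(problem, y), demean_by(problem, x)
beta_hat = (x_t @ y_t) / (x_t @ x_t)
print(round(beta_hat, 2))  # close to true_beta
```

The paper additionally clusters standard errors at the problem level, which the point-estimate sketch above does not cover.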
## Limitations
- Smaller model experiments (3B) used simple tasks only
- SAE analysis limited to DeepSeek-R1-Llama-8B (distilled)
- Philosophical ambiguity: "simulating multi-agent discourse" vs. "individual mind simulating social interaction" remains unresolved