teleo-codex/inbox/archive/foundations/2026-01-15-kim-reasoning-models-societies-of-thought.md
theseus: extract 3 claims + 5 enrichments from Evans/Kim collective intelligence papers
- What: 3 NEW claims (society-of-thought emergence, LLMs-as-cultural-ratchet, recursive spawning) + 5 enrichments (intelligence-as-network, collective-intelligence-measurable, centaur, RLHF-failure, Ostrom) + 2 source archives
- Why: Evans, Bratton & Agüera y Arcas (2026) and Kim et al. (2026) provide independent convergent evidence for collective superintelligence thesis from Google's Paradigms of Intelligence Team. Kim et al. is the strongest empirical evidence that reasoning IS social cognition (feature steering doubles accuracy 27%→55%). ~70-80% overlap with existing KB = convergent validation.
- Source: Contributed by @thesensatore (Telegram)

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-14 08:37:01 +00:00


---
type: source
title: Reasoning Models Generate Societies of Thought
author: Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: >-
  Primary empirical source cited by Evans et al. 2026. Controlled experiments
  showing causal link between conversational behaviors and reasoning accuracy.
  Feature steering doubles accuracy. RL training spontaneously produces
  multi-perspective debate. The strongest empirical evidence that reasoning IS
  social cognition.
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
  - reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve
  - "collective intelligence is a measurable property of group interaction structure: Big Five personality diversity in reasoning traces mirrors Woolley's c-factor"
enrichments:
tags:
  - society-of-thought
  - reasoning
  - collective-intelligence
  - mechanistic-interpretability
  - reinforcement-learning
  - feature-steering
  - causal-evidence
notes: >-
  8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR.
  Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines.
  Methods: LLM-as-judge, sparse autoencoder feature analysis, activation
  steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human
  judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639).
---

Reasoning Models Generate Societies of Thought

Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.

Core Finding

Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.

Key Results

Conversational Behaviors in Reasoning Traces

DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):

  • Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
  • Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
  • Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵

QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).
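The β values above come from fixed-effects linear probability models: a binary "behavior present" indicator is regressed on a reasoning-model dummy, with problem-level fixed effects absorbing per-problem difficulty. A minimal sketch of the within-estimator on toy data (the rows are hypothetical, not the paper's annotations):

```python
# Within-estimator for a fixed-effects linear probability model: demean
# x (reasoning-model dummy) and y (behavior present) inside each problem,
# then take the pooled OLS slope on the demeaned data.
from statistics import mean

# (problem_id, is_reasoning_model, behavior_present) -- toy rows
rows = [
    (0, 1, 1), (0, 0, 0),
    (1, 1, 1), (1, 0, 1),
    (2, 1, 0), (2, 0, 0),
    (3, 1, 1), (3, 0, 0),
]

def fe_beta(rows):
    """OLS slope after within-problem demeaning (the 'within' estimator)."""
    by_problem = {}
    for pid, x, y in rows:
        by_problem.setdefault(pid, []).append((x, y))
    num = den = 0.0
    for obs in by_problem.values():
        mx = mean(x for x, _ in obs)
        my = mean(y for _, y in obs)
        for x, y in obs:
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den

print(fe_beta(rows))  # → 0.5
```

On these toy rows the reasoning model shows the behavior in 2 of the 4 problems where the baseline does not, giving β=0.5: the behavior is 50 percentage points more likely in reasoning-model traces, holding the problem fixed.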

Causal Evidence via Feature Steering

Sparse autoencoder Feature 30939 ("conversational surprise"):

  • Conversation ratio: 65.7% (99th percentile)
  • Sparsity: 0.016% of tokens
  • Steering +10: accuracy doubled from 27.1% to 54.8% on Countdown task
  • Steering -10: reduced to 23.8%

Steering induced conversational behaviors causally:

  • Question-answering: β=2.199, p<1×10⁻¹⁴
  • Perspective shifts: β=1.160, p<1×10⁻⁵
  • Conflict: β=1.062, p=0.002
  • Reconciliation: β=0.423, p<1×10⁻²⁷
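Mechanically, activation steering of an SAE feature adds a scaled copy of that feature's decoder direction to the residual-stream activations at the feature's layer. A toy sketch of the operation (dimensions, the decoder matrix, and feature index are made up; the study used a 32,768-feature SAE at layer 15):

```python
import numpy as np

# Toy SAE-based activation steering: add the (unit-normalized) decoder
# direction of one feature to every token's activations, scaled by a
# steering coefficient -- the analogue of the paper's +10 / -10 settings.
d_model, n_features = 8, 16
rng = np.random.default_rng(0)
decoder = rng.standard_normal((n_features, d_model))  # hypothetical SAE decoder

def steer(activations: np.ndarray, feature: int, coeff: float) -> np.ndarray:
    """Shift activations by coeff along the feature's decoder direction."""
    direction = decoder[feature] / np.linalg.norm(decoder[feature])
    return activations + coeff * direction

acts = rng.standard_normal((4, d_model))            # 4 tokens of activations
steered_up = steer(acts, feature=3, coeff=10.0)     # analogue of "+10"
steered_down = steer(acts, feature=3, coeff=-10.0)  # analogue of "-10"
```

The same shift is applied at every token position; whether the real intervention normalizes the direction or rescales by feature activation statistics is an implementation detail not specified in this summary.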

Mechanistic Pathway (Structural Equation Model)

  • Direct effect of conversational features on accuracy: β=0.228, 95% CI=[0.183, 0.273], z=9.98, p<1×10⁻²²
  • Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=0.066, 95% CI=[0.046, 0.086], z=6.38, p<1×10⁻¹⁰
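In the standard product-of-coefficients decomposition, the total effect is the direct path plus the indirect path through the mediators. Applying that to the point estimates above (a simplification; the paper's full SEM presumably has more structure than this two-path sketch):

```python
# Mediation decomposition: total effect = direct + indirect.
# Point estimates taken from the structural equation model above.
direct = 0.228    # conversational features -> accuracy
indirect = 0.066  # via verification, backtracking, subgoal setting, backward chaining
total = direct + indirect
print(round(total, 3))  # → 0.294
```

So roughly three quarters of the total effect of conversational features on accuracy is direct, with the remainder flowing through the identified cognitive strategies.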

Personality and Expertise Diversity

Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:

  • Neuroticism: β=0.567, p<1×10⁻³²³
  • Agreeableness: β=0.297, p<1×10⁻¹¹³
  • Openness: β=0.110, p<1×10⁻¹⁶
  • Extraversion: β=0.103, p<1×10⁻¹³
  • Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶

Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).
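One natural way to operationalize "trait diversity" within a single trace is to score each conversational segment on a Big Five trait (e.g. with an LLM judge) and take the dispersion across segments; a multi-voice trace spreads out, a monologue does not. This is a hypothetical sketch — the summary does not reproduce the paper's exact diversity measure:

```python
from statistics import stdev

# Hypothetical trait-diversity measure: standard deviation of per-segment
# Big Five trait scores (0-1) across the segments of one reasoning trace.
def trait_diversity(segment_scores: list[float]) -> float:
    return stdev(segment_scores)

monologue = [0.50, 0.52, 0.51, 0.49]  # one near-uniform voice
society = [0.20, 0.80, 0.35, 0.65]    # several distinct voices
assert trait_diversity(society) > trait_diversity(monologue)
```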

Spontaneous Emergence Under RL

Qwen-2.5-3B on Countdown task:

  • Conversational behaviors emerged spontaneously from accuracy reward alone — no social scaffolding instruction
  • Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
  • Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
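The Countdown task asks the model to combine a set of given numbers with arithmetic to hit a target, and "accuracy reward alone" means the reward is 1 exactly when the proposed expression checks out. A minimal verifier under that assumed formulation (the paper's actual reward code is not reproduced here):

```python
import ast
import operator

# Minimal Countdown verifier: reward 1 iff the candidate expression uses
# only the given numbers (each at most once), only + - * /, and evaluates
# to the target. Anything malformed or out-of-pool scores 0.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(expr: str, numbers: list[int], target: int) -> int:
    used = []
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)
            return node.value
        raise ValueError("disallowed syntax")
    try:
        value = ev(ast.parse(expr, mode="eval").body)
    except (ValueError, ZeroDivisionError, SyntaxError):
        return 0
    pool = list(numbers)
    for n in used:          # each given number may be used at most once
        if n in pool:
            pool.remove(n)
        else:
            return 0
    return int(abs(value - target) < 1e-9)

print(countdown_reward("(25 - 5) * 3", [25, 5, 3, 7], 60))  # → 1
```

Because the reward signal is this sparse pass/fail check, any conversational structure in the traces has to emerge as an instrumental strategy for getting more expressions to verify — which is what the fine-tuning comparison above measures.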

Cross-Domain Transfer

Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.

Socio-Emotional Roles (Bales' IPA Framework)

Reasoning models exhibited reciprocal interaction roles:

  • Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
  • Negative roles: β=0.162, p<1×10⁻¹⁰
  • Positive roles: β=0.278, p<1×10⁻²⁵⁴
  • Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
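The Jaccard-based ask-give balance can be read as the overlap between the set of trace segments that exhibit asking behaviors and the set that exhibit giving behaviors: a reciprocal trace interleaves both, pushing the index toward 1. A sketch under that assumed operationalization (segment IDs are toy data):

```python
# Jaccard index between "asking" and "giving" segment sets of one trace
# (Bales' IPA categories). |A ∩ G| / |A ∪ G|; 0 when both sets are empty.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

ask_segments = {1, 3, 5, 7}   # segments where the trace asks
give_segments = {2, 3, 5, 8}  # segments where the trace answers/gives
print(jaccard(ask_segments, give_segments))  # → 0.3333333333333333
```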

Methodology

  • 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
  • Models: DeepSeek-R1-0528 (671B), QwQ-32B vs DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
  • LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs human speaker identification
  • Sparse autoencoder: Layer 15, 32,768 features
  • Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
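The LLM-as-judge validation is a Spearman rank correlation between judge scores and human annotations — Pearson correlation computed on ranks, with ties given average ranks. A stdlib-only sketch on toy scores (not the paper's annotation data):

```python
# Spearman rank correlation: rank both score lists (average ranks for
# ties), then compute Pearson correlation on the ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

judge = [3, 1, 4, 2, 5]  # toy LLM-judge speaker counts
human = [2, 1, 4, 3, 5]  # toy human speaker counts
print(round(spearman(judge, human), 2))  # → 0.9
```

The paper's reported ρ=0.86 against human speaker identification indicates the judge preserves the human ordering almost perfectly, which is what licenses using it at the scale of 8,262 problems.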

Limitations

  • Smaller model experiments (3B) used simple tasks only
  • SAE analysis limited to DeepSeek-R1-Llama-8B (distilled)
  • Philosophical ambiguity: "simulating multi-agent discourse" vs. "individual mind simulating social interaction" remains unresolved