teleo-codex/inbox/archive/foundations/2026-01-15-kim-reasoning-models-societies-of-thought.md
theseus: extract 3 claims + 5 enrichments from Evans/Kim collective intelligence papers
- What: 3 NEW claims (society-of-thought emergence, LLMs-as-cultural-ratchet, recursive spawning) + 5 enrichments (intelligence-as-network, collective-intelligence-measurable, centaur, RLHF-failure, Ostrom) + 2 source archives
- Why: Evans, Bratton & Agüera y Arcas (2026) and Kim et al. (2026) provide independent convergent evidence for collective superintelligence thesis from Google's Paradigms of Intelligence Team. Kim et al. is the strongest empirical evidence that reasoning IS social cognition (feature steering doubles accuracy 27%→55%). ~70-80% overlap with existing KB = convergent validation.
- Source: Contributed by @thesensatore (Telegram)

Pentagon-Agent: Theseus <46864dd4-da71-4719-a1b4-68f7c55854d3>
2026-04-14 08:37:01 +00:00


---
type: source
title: Reasoning Models Generate Societies of Thought
author: Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans
url: https://arxiv.org/abs/2601.10825
date: 2026-01-15
domain: collective-intelligence
intake_tier: research-task
rationale: >-
  Primary empirical source cited by Evans et al. 2026. Controlled experiments
  showing causal link between conversational behaviors and reasoning accuracy.
  Feature steering doubles accuracy. RL training spontaneously produces
  multi-perspective debate. The strongest empirical evidence that reasoning IS
  social cognition.
proposed_by: Theseus
format: paper
status: processed
processed_by: theseus
processed_date: 2026-04-14
claims_extracted:
  - reasoning models spontaneously generate societies of thought under reinforcement learning because multi-perspective internal debate causally produces accuracy gains that single-perspective reasoning cannot achieve
  - "collective intelligence is a measurable property of group interaction structure: Big Five personality diversity in reasoning traces mirrors Woolley's c-factor"
enrichments:
tags:
  - society-of-thought
  - reasoning
  - collective-intelligence
  - mechanistic-interpretability
  - reinforcement-learning
  - feature-steering
  - causal-evidence
notes: >-
  8,262 reasoning problems across BBH, GPQA, MATH, MMLU-Pro, IFEval, MUSR.
  Models: DeepSeek-R1-0528 (671B), QwQ-32B vs instruction-tuned baselines.
  Methods: LLM-as-judge, sparse autoencoder feature analysis, activation
  steering, structural equation modeling. Validation: Spearman ρ=0.86 vs human
  judgments. Follow-up to Evans et al. 2026 (arXiv:2603.20639).
---

Reasoning Models Generate Societies of Thought

Published January 15, 2026 by Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, and James Evans. arXiv:2601.10825. cs.CL, cs.CY, cs.LG.

Core Finding

Advanced reasoning models (DeepSeek-R1, QwQ-32B) achieve superior performance through "implicit simulation of complex, multi-agent-like interactions — a society of thought" rather than extended computation alone.

Key Results

Conversational Behaviors in Reasoning Traces

DeepSeek-R1 vs. DeepSeek-V3 (instruction-tuned baseline):

  • Question-answering: β=0.345, 95% CI=[0.328, 0.361], t(8261)=41.64, p<1×10⁻³²³
  • Perspective shifts: β=0.213, 95% CI=[0.197, 0.230], t(8261)=25.55, p<1×10⁻¹³⁷
  • Reconciliation: β=0.191, 95% CI=[0.176, 0.207], t(8261)=24.31, p<1×10⁻¹²⁵

QwQ-32B vs. Qwen-2.5-32B-IT showed comparable or larger effect sizes (β=0.293–0.459).
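The β values above come from fixed-effects linear probability models: a binary "behavior present" indicator is regressed on a reasoning-model dummy, with problem-level fixed effects absorbing per-problem difficulty. A minimal sketch of the within-estimator on toy data (the rows are hypothetical, not the paper's annotations):

```python
# Within-estimator for a fixed-effects linear probability model: demean
# x (reasoning-model dummy) and y (behavior present) inside each problem,
# then take the pooled OLS slope on the demeaned data.
from statistics import mean

# (problem_id, is_reasoning_model, behavior_present) -- toy rows
rows = [
    (0, 1, 1), (0, 0, 0),
    (1, 1, 1), (1, 0, 1),
    (2, 1, 0), (2, 0, 0),
    (3, 1, 1), (3, 0, 0),
]

def fe_beta(rows):
    """OLS slope after within-problem demeaning (the 'within' estimator)."""
    by_problem = {}
    for pid, x, y in rows:
        by_problem.setdefault(pid, []).append((x, y))
    num = den = 0.0
    for obs in by_problem.values():
        mx = mean(x for x, _ in obs)
        my = mean(y for _, y in obs)
        for x, y in obs:
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den

print(fe_beta(rows))  # → 0.5
```

On these toy rows the reasoning model shows the behavior in 2 of the 4 problems where the baseline does not, giving β=0.5: the behavior is 50 percentage points more likely in reasoning-model traces, holding the problem fixed.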

Causal Evidence via Feature Steering

Sparse autoencoder Feature 30939 ("conversational surprise"):

  • Conversation ratio: 65.7% (99th percentile)
  • Sparsity: 0.016% of tokens
  • Steering +10: accuracy doubled from 27.1% to 54.8% on Countdown task
  • Steering -10: reduced to 23.8%

Steering induced conversational behaviors causally:

  • Question-answering: β=2.199, p<1×10⁻¹⁴
  • Perspective shifts: β=1.160, p<1×10⁻⁵
  • Conflict: β=1.062, p=0.002
  • Reconciliation: β=0.423, p<1×10⁻²⁷
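Mechanically, activation steering of an SAE feature adds a scaled copy of that feature's decoder direction to the residual-stream activations at the feature's layer. A toy sketch of the operation (dimensions, the decoder matrix, and feature index are made up; the study used a 32,768-feature SAE at layer 15):

```python
import numpy as np

# Toy SAE-based activation steering: add the (unit-normalized) decoder
# direction of one feature to every token's activations, scaled by a
# steering coefficient -- the analogue of the paper's +10 / -10 settings.
d_model, n_features = 8, 16
rng = np.random.default_rng(0)
decoder = rng.standard_normal((n_features, d_model))  # hypothetical SAE decoder

def steer(activations: np.ndarray, feature: int, coeff: float) -> np.ndarray:
    """Shift activations by coeff along the feature's decoder direction."""
    direction = decoder[feature] / np.linalg.norm(decoder[feature])
    return activations + coeff * direction

acts = rng.standard_normal((4, d_model))            # 4 tokens of activations
steered_up = steer(acts, feature=3, coeff=10.0)     # analogue of "+10"
steered_down = steer(acts, feature=3, coeff=-10.0)  # analogue of "-10"
```

The same shift is applied at every token position; whether the real intervention normalizes the direction or rescales by feature activation statistics is an implementation detail not specified in this summary.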

Mechanistic Pathway (Structural Equation Model)

  • Direct effect of conversational features on accuracy: β=0.228, 95% CI=[0.183, 0.273], z=9.98, p<1×10⁻²²
  • Indirect effect via cognitive strategies (verification, backtracking, subgoal setting, backward chaining): β=0.066, 95% CI=[0.046, 0.086], z=6.38, p<1×10⁻¹⁰
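In the standard product-of-coefficients decomposition, the total effect is the direct path plus the indirect path through the mediators. Applying that to the point estimates above (a simplification; the paper's full SEM presumably has more structure than this two-path sketch):

```python
# Mediation decomposition: total effect = direct + indirect.
# Point estimates taken from the structural equation model above.
direct = 0.228    # conversational features -> accuracy
indirect = 0.066  # via verification, backtracking, subgoal setting, backward chaining
total = direct + indirect
print(round(total, 3))  # → 0.294
```

So roughly three quarters of the total effect of conversational features on accuracy is direct, with the remainder flowing through the identified cognitive strategies.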

Personality and Expertise Diversity

Big Five trait diversity in DeepSeek-R1 vs. DeepSeek-V3:

  • Neuroticism: β=0.567, p<1×10⁻³²³
  • Agreeableness: β=0.297, p<1×10⁻¹¹³
  • Openness: β=0.110, p<1×10⁻¹⁶
  • Extraversion: β=0.103, p<1×10⁻¹³
  • Conscientiousness: β=-0.291, p<1×10⁻¹⁰⁶

Expertise diversity: DeepSeek-R1 β=0.179 (p<1×10⁻⁸⁹), QwQ-32B β=0.250 (p<1×10⁻¹⁴²).
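One natural way to operationalize "trait diversity" within a single trace is to score each conversational segment on a Big Five trait (e.g. with an LLM judge) and take the dispersion across segments; a multi-voice trace spreads out, a monologue does not. This is a hypothetical sketch — the summary does not reproduce the paper's exact diversity measure:

```python
from statistics import stdev

# Hypothetical trait-diversity measure: standard deviation of per-segment
# Big Five trait scores (0-1) across the segments of one reasoning trace.
def trait_diversity(segment_scores: list[float]) -> float:
    return stdev(segment_scores)

monologue = [0.50, 0.52, 0.51, 0.49]  # one near-uniform voice
society = [0.20, 0.80, 0.35, 0.65]    # several distinct voices
assert trait_diversity(society) > trait_diversity(monologue)
```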

Spontaneous Emergence Under RL

Qwen-2.5-3B on Countdown task:

  • Conversational behaviors emerged spontaneously from accuracy reward alone — no social scaffolding instruction
  • Conversation-fine-tuned vs. monologue-fine-tuned: 38% vs. 28% accuracy (step 40)
  • Llama-3.2-3B replication: 40% vs. 18% accuracy (step 150)
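The Countdown task asks the model to combine a set of given numbers with arithmetic to hit a target, and "accuracy reward alone" means the reward is 1 exactly when the proposed expression checks out. A minimal verifier under that assumed formulation (the paper's actual reward code is not reproduced here):

```python
import ast
import operator

# Minimal Countdown verifier: reward 1 iff the candidate expression uses
# only the given numbers (each at most once), only + - * /, and evaluates
# to the target. Anything malformed or out-of-pool scores 0.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_reward(expr: str, numbers: list[int], target: int) -> int:
    used = []
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            used.append(node.value)
            return node.value
        raise ValueError("disallowed syntax")
    try:
        value = ev(ast.parse(expr, mode="eval").body)
    except (ValueError, ZeroDivisionError, SyntaxError):
        return 0
    pool = list(numbers)
    for n in used:          # each given number may be used at most once
        if n in pool:
            pool.remove(n)
        else:
            return 0
    return int(abs(value - target) < 1e-9)

print(countdown_reward("(25 - 5) * 3", [25, 5, 3, 7], 60))  # → 1
```

Because the reward signal is this sparse pass/fail check, any conversational structure in the traces has to emerge as an instrumental strategy for getting more expressions to verify — which is what the fine-tuning comparison above measures.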

Cross-Domain Transfer

Conversation-priming on Countdown (arithmetic) transferred to political misinformation detection without domain-specific fine-tuning.

Socio-Emotional Roles (Bales' IPA Framework)

Reasoning models exhibited reciprocal interaction roles:

  • Asking behaviors: β=0.189, p<1×10⁻¹⁵⁸
  • Negative roles: β=0.162, p<1×10⁻¹⁰
  • Positive roles: β=0.278, p<1×10⁻²⁵⁴
  • Ask-give balance (Jaccard): β=0.222, p<1×10⁻¹⁸⁹
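The Jaccard-based ask-give balance can be read as the overlap between the set of trace segments that exhibit asking behaviors and the set that exhibit giving behaviors: a reciprocal trace interleaves both, pushing the index toward 1. A sketch under that assumed operationalization (segment IDs are toy data):

```python
# Jaccard index between "asking" and "giving" segment sets of one trace
# (Bales' IPA categories). |A ∩ G| / |A ∪ G|; 0 when both sets are empty.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

ask_segments = {1, 3, 5, 7}   # segments where the trace asks
give_segments = {2, 3, 5, 8}  # segments where the trace answers/gives
print(jaccard(ask_segments, give_segments))  # → 0.3333333333333333
```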

Methodology

  • 8,262 reasoning problems across 6 benchmarks (BBH, GPQA, MATH Hard, MMLU-Pro, IFEval, MUSR)
  • Models: DeepSeek-R1-0528 (671B), QwQ-32B vs DeepSeek-V3 (671B), Qwen-2.5-32B-IT, Llama-3.3-70B-IT, Llama-3.1-8B-IT
  • LLM-as-judge validation: Spearman ρ=0.86, p<1×10⁻³²³ vs human speaker identification
  • Sparse autoencoder: Layer 15, 32,768 features
  • Fixed-effects linear probability models with problem-level fixed effects and clustered standard errors
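The LLM-as-judge validation is a Spearman rank correlation between judge scores and human annotations — Pearson correlation computed on ranks, with ties given average ranks. A stdlib-only sketch on toy scores (not the paper's annotation data):

```python
# Spearman rank correlation: rank both score lists (average ranks for
# ties), then compute Pearson correlation on the ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

judge = [3, 1, 4, 2, 5]  # toy LLM-judge speaker counts
human = [2, 1, 4, 3, 5]  # toy human speaker counts
print(round(spearman(judge, human), 2))  # → 0.9
```

The paper's reported ρ=0.86 against human speaker identification indicates the judge preserves the human ordering almost perfectly, which is what licenses using it at the scale of 8,262 problems.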

Limitations

  • Smaller model experiments (3B) used simple tasks only
  • SAE analysis limited to DeepSeek-R1-Llama-8B (distilled)
  • Philosophical ambiguity: "simulating multi-agent discourse" vs. "individual mind simulating social interaction" remains unresolved