teleo-codex/inbox/null-result/2024-12-01-vaccaro-human-ai-combinations-meta-analysis.md at 2f0f00df2c93cf7d5da7e35f8ed2069085ec0069




Content

Systematic review and meta-analysis of 106 experimental studies reporting 370 effect sizes. Published in Nature Human Behaviour, December 2024. The authors searched interdisciplinary databases for studies published between January 2020 and June 2023.

Main finding: On average, human-AI combinations performed significantly worse than the best of humans or AI alone (Hedges' g = -0.23; 95% CI: -0.39 to -0.07).
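As a quick sanity check on the headline number, the reported interval can be turned back into an approximate standard error and z-statistic. This is a minimal sketch that assumes the interval is a symmetric normal-approximation CI; the paper's actual estimator may differ (for example, a t-based or robust-variance interval), and the function name `back_out_se` is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def back_out_se(g, ci_low, ci_high, level=0.95):
    """Recover SE, z, and a two-sided p-value from a symmetric CI.

    Assumption: the CI was built as g +/- z_crit * SE under a normal
    approximation. This is not confirmed by the source note.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    se = (ci_high - ci_low) / (2 * z_crit)          # half-width divided by z_crit
    z = g / se
    p = erfc(abs(z) / sqrt(2))                      # two-sided p under a standard normal
    return se, z, p


# Values reported in the note: Hedges' g = -0.23, 95% CI -0.39 to -0.07
se, z, p = back_out_se(-0.23, -0.39, -0.07)
print(f"SE ~ {se:.3f}, z ~ {z:.2f}, two-sided p ~ {p:.4f}")
```

Under that assumption the standard error comes out near 0.08 and z near -2.8, which is consistent with an interval that excludes zero.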

Task-type moderation:

- Performance LOSSES in tasks involving decision-making (deepfake classification, demand forecasting, medical diagnosis)
- Performance GAINS in tasks involving content creation (summarizing social media, chatbot responses, generating new content)

Relative performance moderation:

- When humans outperformed AI alone → performance gains in combination
- When AI outperformed humans alone → performance losses in combination
- Human-AI teams performed better than humans alone but failed to surpass AI working independently
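The choice of baseline drives the sign of the effect: the same team score can look like a gain against humans alone and a loss against the better solo performer. The sketch below uses the standard Hedges' g formula (Cohen's d with the small-sample correction J = 1 - 3/(4·df - 1)); all means, standard deviations, and sample sizes are hypothetical illustration values, not data from the meta-analysis.

```python
from math import sqrt


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g for two independent groups (group 1 minus group 2)."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / pooled_sd   # Cohen's d
    j = 1 - 3 / (4 * df - 1)    # small-sample bias correction
    return j * d


# Hypothetical (mean accuracy, SD, n) for each condition
team = (0.80, 0.10, 50)
human = (0.70, 0.10, 50)
ai = (0.85, 0.10, 50)

# Comparison against humans alone
g_vs_human = hedges_g(*team, *human)

# Comparison against whichever solo condition has the higher mean
best_alone = max(human, ai, key=lambda s: s[0])
g_vs_best = hedges_g(*team, *best_alone)

print(f"team vs human alone: g = {g_vs_human:+.2f}")
print(f"team vs best alone:  g = {g_vs_best:+.2f}")
```

With these made-up inputs the team beats humans alone (positive g) yet falls short of the AI alone (negative g), mirroring the pattern described in the list above.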

Implication: On average, human-AI teams do not achieve "synergy": they underperform the better of the human or the AI working alone.

Agent Notes

Triage: [CLAIM] "human-AI teams perform worse than the best of humans or AI alone on average, with the deficit concentrated in decision-making tasks". This is a specific, disagreeable, empirically grounded claim from the strongest possible evidence type (meta-analysis, 370 effect sizes).

Why this matters: Directly challenges the assumption underlying human-in-the-loop alignment: that combining human judgment with AI produces better outcomes. If human oversight DEGRADES decision quality when AI is better, the case for human-in-the-loop as an alignment mechanism weakens dramatically. This also complicates our KB claim about centaur team performance.

What surprised me: The DIRECTION-DEPENDENT finding. Humans help when they're better, hurt when AI is better. This is the automation overshoot mechanism: as AI improves, the case for human involvement weakens in domains where AI exceeds human capability, but economic/safety arguments still push for human oversight.

KB connections:

- centaur team performance depends on role complementarity not mere human-AI combination
- human-in-the-loop clinical AI degrades to worse-than-AI-alone
- economic forces push humans out of every cognitive loop where output quality is independently verifiable

Extraction hints: The task-type moderation is the key insight. Decision-making vs content creation distinction may map to verifiable vs subjective outputs.

Curator Notes

PRIMARY CONNECTION: centaur team performance depends on role complementarity not mere human-AI combination

WHY ARCHIVED: This is the strongest empirical evidence (370 effect sizes, Nature HB) that human-AI combination is NOT automatically beneficial: it depends on relative capability and task type. Directly relevant to the automation overshoot question.

Key Facts

- Meta-analysis covered 106 experimental studies published between January 2020 and June 2023
- 370 effect sizes were analyzed across the studies
- Overall effect size: Hedges' g = -0.23 (95% CI: -0.39 to -0.07)
- Published in Nature Human Behaviour, December 2024
- Task types showing losses: deepfake classification, demand forecasting, medical diagnosis
- Task types showing gains: summarizing social media, chatbot responses, generating new content
