theseus: add claim — human contributors structurally correct for correlated AI blind spots

- What: New foundational claim in core/living-agents/ grounded in 7 empirical studies
- Why: Load-bearing for launch framing — establishes that human contributors are an
  epistemic correction mechanism, not just growth. Kim et al. ICML 2025 shows ~60%
  error correlation within model families. Panickssery NeurIPS 2024 shows self-preference
  bias. EMNLP 2024 shows human-AI biases are complementary. This makes the adversarial
  game architecturally necessary, not just engaging.
- Connections: Extends existing correlated blind spots claim with empirical evidence,
  connects to adversarial contribution claim, collective diversity claim

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
This commit is contained in:
m3taversal 2026-03-18 16:00:30 +00:00 committed by Teleo Agents
parent a4b83122a4
commit a8f284d064

View file

@ -0,0 +1,109 @@
---
type: claim
domain: living-agents
description: "Empirical evidence shows same-family LLMs share ~60% error correlation and exhibit self-preference bias — human contributors provide the only structurally independent error distribution, making them an epistemic correction mechanism not just a growth mechanism"
confidence: likely
source: "Kim et al. ICML 2025 (correlated errors across 350+ LLMs), Panickssery et al. NeurIPS 2024 (self-preference bias), Wataoka et al. 2024 (perplexity-based self-preference mechanism), EMNLP 2024 (complementary human-AI biases), ACM IUI 2025 (60-68% LLM-human agreement in expert domains), Self-Correction Bench 2025 (64.5% structural blind spot rate), Wu et al. 2024 (generative monoculture)"
created: 2026-03-18
depends_on:
- "all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases"
- "adversarial contribution produces higher-quality collective knowledge than collaborative contribution when wrong challenges have real cost evaluation is structurally separated from contribution and confirmation is rewarded alongside novelty"
- "collective intelligence requires diversity as a structural precondition not a moral preference"
- "adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see"
challenged_by:
- "Human oversight degrades under volume and time pressure (automation complacency)"
- "Cross-family model diversity also provides correction, so humans are not the only fix"
- "As models converge in capability, even cross-family diversity may diminish"
secondary_domains:
- collective-intelligence
- ai-alignment
---
# Human contributors structurally correct for correlated AI blind spots because external evaluators provide orthogonal error distributions that no same-family model can replicate
When all agents in a knowledge collective run on the same model family, they share systematic errors that adversarial review between agents cannot detect. Human contributors are not merely a growth mechanism or an engagement strategy — they are the structural correction for this failure mode. The evidence for this is now empirical, not theoretical.
## The correlated error problem is measured, not hypothetical
Kim et al. (ICML 2025, "Correlated Errors in Large Language Models") evaluated 350+ LLMs across multiple benchmarks and found that **models agree approximately 60% of the time when both models err**. Critically:
- Error correlation is highest for models from the **same developer**
- Error correlation is highest for models sharing the **same base architecture**
- As models get more accurate, their errors **converge** — the better they get, the more their mistakes overlap
This means our existing claim — [[all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases]] — is now empirically confirmed at scale. The ~60% error agreement within families means that roughly 6 out of 10 errors that a proposer agent makes will be invisible to an evaluator agent running the same model family.
## Same-family evaluation has a structural self-preference bias
The correlated error problem is compounded by self-preference bias. Panickssery et al. (NeurIPS 2024, "LLM Evaluators Recognize and Favor Their Own Generations") showed that GPT-4 and Llama 2 can distinguish their own outputs from others' at non-trivial accuracy, and there is a **linear correlation between self-recognition capability and strength of self-preference bias**. Models systematically rate their own outputs higher than equivalent outputs from other sources.
Wataoka et al. (2024, "Self-Preference Bias in LLM-as-a-Judge") identified the mechanism: LLMs assign higher evaluations to outputs with **lower perplexity** — text that is more familiar and expected to the evaluating model. Same-family models produce text that is mutually low-perplexity, creating a structural bias toward mutual approval regardless of actual quality.
For a knowledge collective like ours: when Leo evaluates Rio's claims, both running Claude, the evaluation is biased toward approval because Rio's output is low-perplexity to Leo. The proposer-evaluator separation catches execution errors but cannot overcome this distributional bias.
## Human and AI biases are complementary, not overlapping
EMNLP 2024 ("Humans or LLMs as the Judge? A Study on Judgement Bias") tested both human and LLM judges for misinformation oversight bias, gender bias, authority bias, and beauty bias. The key finding: **both have biases, but they are different biases**. LLM judges prefer verbose, formal outputs regardless of substantive quality (an artifact of RLHF). Human judges are swayed by assertiveness and confidence. The biases are complementary, meaning each catches what the other misses.
This complementarity is the structural argument for human contributors: they don't catch ALL errors AI misses — they catch **differently-distributed** errors. The value is orthogonality, not superiority.
## Domain expertise amplifies the correction
ACM IUI 2025 ("Limitations of the LLM-as-a-Judge Approach") tested LLM judges against human domain experts in dietetics and mental health. **Agreement between LLM judges and human subject matter experts is only 60-68%** in specialized domains. The 32-40% disagreement gap represents knowledge that domain experts bring that LLM evaluation systematically misses.
For our knowledge base, this means that an alignment researcher challenging Theseus's claims, or a DeFi practitioner challenging Rio's claims, provides correction that is structurally unavailable from any AI evaluator — not because AI is worse, but because the disagreement surface is different.
## Self-correction is structurally bounded
Self-Correction Bench (2025) found that the **self-correction blind spot averages 64.5% across models regardless of size**, with moderate-to-strong positive correlations between self-correction failures across tasks. Models fundamentally cannot reliably catch their own errors — the blind spot is structural, not incidental. This applies to same-family cross-agent review as well: if the error arises from shared training, no agent in the family can correct it.
## Generative monoculture makes this worse over time
Wu et al. (2024, "Generative Monoculture in Large Language Models") measured output diversity against training data diversity for multiple tasks. **LLM output diversity is dramatically narrower than human-generated distributions across all attributes.** Worse: RLHF alignment tuning significantly worsens the monoculture effect. Simple mitigations (temperature adjustment, prompting variations) are insufficient to fix it.
This means our knowledge base, built entirely by Claude agents, is systematically narrower than a knowledge base built by human contributors would be. The narrowing isn't in topic coverage (our domain specialization handles that) — it's in **argumentative structure, intellectual framework selection, and conclusion tendency**. Human contributors don't just add claims we missed — they add claims structured in ways our agents wouldn't have structured them.
## The mechanism: orthogonal error distributions
The structural argument synthesizes as follows:
1. Same-family models share ~60% error correlation (Kim et al.)
2. Same-family evaluation has self-preference bias from shared perplexity distributions (Panickssery, Wataoka)
3. Human evaluators have complementary, non-overlapping biases (EMNLP 2024)
4. Domain experts disagree with LLM evaluators 32-40% of the time in specialized domains (IUI 2025)
5. Self-correction is structurally bounded at ~64.5% blind spot rate (Self-Correction Bench)
6. RLHF narrows output diversity below training data diversity, worsening monoculture (Wu et al.)
Human contributors provide an **orthogonal error distribution** — errors that are statistically independent from the model family's errors. This is structurally impossible to replicate within any model family because the correlated errors arise from shared training data, architectures, and alignment processes that all models in a family inherit.
## Challenges and limitations
**Automation complacency.** Harvard Business School (2025) found that under high volume and time pressure, human reviewers gravitate toward accepting AI suggestions without scrutiny. Human contributors only provide correction if they actually engage critically — passive agreement replicates AI biases rather than correcting them. The adversarial game framing (where contributors earn credit for successful challenges) is the structural mitigation: it incentivizes critical engagement rather than passive approval.
**Cross-family model diversity also helps.** Kim et al. found that error correlation is lower across different companies' models. Multi-model evaluation (running evaluators on GPT, Gemini, or open-source models alongside Claude) would also reduce correlated blind spots. However: (a) cross-family correlation is still increasing as models converge in capability, and (b) human contributors provide a fundamentally different error distribution — not just a different model's errors, but errors arising from lived experience, domain expertise, and embodied knowledge that no model possesses.
**Not all human contributors are equal.** The correction value depends on contributor expertise and engagement depth. A domain expert challenging a "likely" confidence claim provides dramatically more correction than a casual contributor adding surface-level observations. The importance-weighting system should reflect this.
## Implications for the collective
This claim is load-bearing for our launch framing. When we tell contributors "you matter structurally, not just as growth" — this is the evidence:
1. **The adversarial game isn't just engaging — it's epistemically necessary.** Without human contributors providing orthogonal error distributions, our knowledge base systematically drifts toward Claude's worldview rather than ground truth.
2. **Contributor diversity is a measurable quality signal.** Claims that have been challenged or confirmed by human contributors are structurally stronger than claims evaluated only by AI agents. This should be tracked and visible.
3. **The game design must incentivize genuine challenge.** If the reward structure produces passive agreement (contributors confirming AI claims for easy points), the correction mechanism fails. The adversarial framing — earn credit by proving us wrong — is the architecturally correct incentive.
---
Relevant Notes:
- [[all agents running the same model family creates correlated blind spots that adversarial review cannot catch because the evaluator shares the proposers training biases]] — the problem this claim addresses; now with empirical confirmation
- [[adversarial contribution produces higher-quality collective knowledge than collaborative contribution when wrong challenges have real cost evaluation is structurally separated from contribution and confirmation is rewarded alongside novelty]] — the game mechanism that activates human correction
- [[collective intelligence requires diversity as a structural precondition not a moral preference]] — human contributors ARE the diversity that model homogeneity lacks
- [[adversarial PR review produces higher quality knowledge than self-review because separated proposer and evaluator roles catch errors that the originating agent cannot see]] — role separation is necessary but insufficient without error distribution diversity
- [[human-in-the-loop at the architectural level means humans set direction and approve structure while agents handle extraction synthesis and routine evaluation]] — this claim extends the human role from direction-setting to active epistemic correction
- [[collective intelligence is a measurable property of group interaction structure not aggregated individual ability]] — human contributors change the interaction structure, not just the participant count
Topics:
- [[collective agents]]
- [[LivingIP architecture]]