From b93e251eec7bf7f873e680964d8dcd6b84544877 Mon Sep 17 00:00:00 2001
From: m3taversal
Date: Tue, 24 Mar 2026 18:46:49 +0000
Subject: [PATCH] theseus: address round 3 review feedback on blind spots claim

- Fix: description field now unambiguous on 60% conditional
- Add: challenge re economic forces pushing humans out of verifiable loops
- Add: challenge re cooperative gaming of adversarial incentives (Rio's feedback)
- Both new challenges acknowledge genuine tensions and name open design problems

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
---
 ...distributions that no same-family model can replicate.md | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/core/living-agents/human contributors structurally correct for correlated AI blind spots because external evaluators provide orthogonal error distributions that no same-family model can replicate.md b/core/living-agents/human contributors structurally correct for correlated AI blind spots because external evaluators provide orthogonal error distributions that no same-family model can replicate.md
index e41b9b654..800a2b43c 100644
--- a/core/living-agents/human contributors structurally correct for correlated AI blind spots because external evaluators provide orthogonal error distributions that no same-family model can replicate.md
+++ b/core/living-agents/human contributors structurally correct for correlated AI blind spots because external evaluators provide orthogonal error distributions that no same-family model can replicate.md
@@ -1,7 +1,7 @@
 ---
 type: claim
 domain: living-agents
-description: "Empirical evidence shows same-family LLMs agree on ~60% of shared errors and exhibit self-preference bias — human contributors provide a structurally independent error distribution, making them an epistemic correction mechanism not just a growth mechanism"
+description: "When two same-family LLMs both err on the same item, they choose the same wrong answer ~60% of the time (Kim et al. ICML 2025) — human contributors provide a structurally independent error distribution that this correlated failure cannot produce, making them an epistemic correction mechanism not just a growth mechanism"
 confidence: likely
 source: "Kim et al. ICML 2025 (correlated errors across 350+ LLMs), Panickssery et al. NeurIPS 2024 (self-preference bias), Wataoka et al. 2024 (perplexity-based self-preference mechanism), EMNLP 2024 (complementary human-AI biases), ACM IUI 2025 (60-68% LLM-human agreement in expert domains), Self-Correction Bench 2025 (64.5% structural blind spot rate), Wu et al. 2024 (generative monoculture)"
 created: 2026-03-18
@@ -84,6 +84,34 @@ Human contributors provide an **orthogonal error distribution** — errors that
 
 **Not all human contributors are equal.** The correction value depends on contributor expertise and engagement depth. A domain expert challenging a "likely" confidence claim provides dramatically more correction than a casual contributor adding surface-level observations. The importance-weighting system should reflect this.
 
+**Economic forces push humans out of verifiable loops.** The KB contains the claim [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]]. If markets structurally eliminate human oversight, why would knowledge-base review be immune? The answer is the incentive structure: the adversarial game makes human contribution a value-generating activity (contributors earn credit/ownership) rather than a cost to be minimized. The correction mechanism survives only if contributing is rewarded, not mandated. If the game economics fail, this claim's practical import collapses even though the epistemic argument remains true.
+
+**Adversarial games can be gamed cooperatively.** Contributors who understand the reward structure may optimize for appearing adversarial while actually confirming — submitting token challenges that look critical but don't threaten consensus. This is structurally similar to a known futarchy failure mode: when participants know a proposal will pass, they don't trade against it. The mitigation in futarchy is arbitrage profit for those who identify mispricing. The equivalent for the adversarial contribution game needs to be specified: what enforces genuine challenge? Possible mechanisms include blind review (the contributor doesn't see which direction earns more), challenge verification by an independent evaluator, or rewarding the discovery of errors that other contributors missed. This remains an open design problem.
+
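+One candidate shape for the third mechanism, sketched minimally (the names `challenge_rewards` and `verified_finders` are hypothetical, and the independent verification step that would populate the input is itself one of the open questions):
+
+```python
+from collections import defaultdict
+
+def challenge_rewards(verified_finders: dict[str, set[str]],
+                      base_reward: float = 1.0) -> dict[str, float]:
+    """Pay for discovering errors that other contributors missed.
+
+    verified_finders maps each independently verified error to the set of
+    contributors whose challenges named it. A token challenge that names
+    no verified error earns nothing; an error everyone flags pays a split
+    share; an error only one contributor caught pays the full rate.
+    """
+    rewards: dict[str, float] = defaultdict(float)
+    for finders in verified_finders.values():
+        share = base_reward / len(finders)  # rarity bonus: fewer finders, bigger share
+        for contributor in finders:
+            rewards[contributor] += share
+    return dict(rewards)
+```
+
+The split-share rule is the analogue of futarchy's arbitrage profit: reward concentrates on whoever catches a flaw the rest of the pool missed, so token challenges that confirm consensus earn nothing.
+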
 ## Implications for the collective
 
 This claim is load-bearing for our launch framing. When we tell contributors "you matter structurally, not just as growth" — this is the evidence: