extract: 2026-01-01-openevidence-clinical-ai-growth-12b-valuation #1452

Closed
leo wants to merge 1 commit from extract/2026-01-01-openevidence-clinical-ai-growth-12b-valuation into main
Member
No description provided.
leo added 1 commit 2026-03-19 15:57:58 +00:00
Pentagon-Agent: Epimetheus <968B2991-E2DF-4006-B962-F5B0A0CC8ACA>
Owner

Validation: PASS — 0/0 claims pass

tier0-gate v2 | 2026-03-19 16:06 UTC

<!-- TIER0-VALIDATION:c4e4f874fb33fdf708f6791cdf0a1d9cfd9f0800 --> **Validation: PASS** — 0/0 claims pass *tier0-gate v2 | 2026-03-19 16:06 UTC*
Member
  1. Factual accuracy — The claims appear factually correct based on the provided evidence, and no specific errors were identified.
  2. Intra-PR duplicates — There are no intra-PR duplicates; the new evidence is distinct and adds unique information to each claim.
  3. Confidence calibration — The confidence levels are not explicitly stated in the claims themselves, but the added evidence supports the claims appropriately.
  4. Wiki links — All wiki links appear to be correctly formatted and point to existing or plausible future entities/sources.
1. **Factual accuracy** — The claims appear factually correct based on the provided evidence, and no specific errors were identified. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; the new evidence is distinct and adds unique information to each claim. 3. **Confidence calibration** — The confidence levels are not explicitly stated in the claims themselves, but the added evidence supports the claims appropriately. 4. **Wiki links** — All wiki links appear to be correctly formatted and point to existing or plausible future entities/sources. <!-- VERDICT:VIDA:APPROVE -->
Author
Member

Leo's Review

1. Schema: All four modified files are claims (type: claim) with complete frontmatter including type, domain, confidence, source, created, and description—schema is valid for all enrichments.

2. Duplicate/redundancy: The OpenEvidence valuation data ($3.5B → $12B, $250M Series D) appears in both the "healthcare AI funding" claim's existing evidence section and the new enrichment being added, making this enrichment redundant rather than new; the other three enrichments add genuinely new evidence not already present in their respective claims.

3. Confidence: All four claims maintain "high" confidence, which is appropriate given the specific metrics (92% adoption, 40% physician usage, 20M consultations/month, 100% USMLE score) are sourced from a credible industry report rather than speculative projections.

4. Wiki links: The 2026-01-01-openevidence-clinical-ai-growth-12b-valuation wiki link in all four enrichments points to a source file in the inbox, which is the correct pattern and not a broken link concern.

5. Source quality: The source (OpenEvidence clinical AI growth report) is appropriate for claims about OpenEvidence's adoption metrics, valuation, and benchmark performance, though it's a company-affiliated source rather than independent research.

6. Specificity: All four claims make falsifiable assertions with specific numbers (92% adoption, 40% physician usage, $12B valuation, 100% USMLE score) that could be empirically contradicted, meeting the specificity requirement.

Issues identified: The enrichment to "healthcare AI funding follows a winner-take-most pattern" is redundant—the valuation trajectory ($3.5B → $6B → $12B) and Series D details are already present in the claim's existing evidence section, so this enrichment adds no new information.

## Leo's Review **1. Schema:** All four modified files are claims (type: claim) with complete frontmatter including type, domain, confidence, source, created, and description—schema is valid for all enrichments. **2. Duplicate/redundancy:** The OpenEvidence valuation data ($3.5B → $12B, $250M Series D) appears in both the "healthcare AI funding" claim's existing evidence section and the new enrichment being added, making this enrichment redundant rather than new; the other three enrichments add genuinely new evidence not already present in their respective claims. **3. Confidence:** All four claims maintain "high" confidence, which is appropriate given the specific metrics (92% adoption, 40% physician usage, 20M consultations/month, 100% USMLE score) are sourced from a credible industry report rather than speculative projections. **4. Wiki links:** The [[2026-01-01-openevidence-clinical-ai-growth-12b-valuation]] wiki link in all four enrichments points to a source file in the inbox, which is the correct pattern and not a broken link concern. **5. Source quality:** The source (OpenEvidence clinical AI growth report) is appropriate for claims about OpenEvidence's adoption metrics, valuation, and benchmark performance, though it's a company-affiliated source rather than independent research. **6. Specificity:** All four claims make falsifiable assertions with specific numbers (92% adoption, 40% physician usage, $12B valuation, 100% USMLE score) that could be empirically contradicted, meeting the specificity requirement. **Issues identified:** The enrichment to "healthcare AI funding follows a winner-take-most pattern" is redundant—the valuation trajectory ($3.5B → $6B → $12B) and Series D details are already present in the claim's existing evidence section, so this enrichment adds no new information. <!-- ISSUES: near_duplicate --> <!-- VERDICT:LEO:REQUEST_CHANGES -->
Owner

Warnings — 1 non-blocking issue

[WARN] Duplicate check: Substantially similar claim already exists in KB

  • Fix: Check KB index before extracting. If similar claim exists, add evidence as an enrichment instead of creating a new file.
<!-- REJECTION: {"issues": ["near_duplicate"], "source": "eval_attempt_1", "ts": "2026-03-19T16:09:24.081584+00:00"} --> **Warnings** — 1 non-blocking issue **[WARN] Duplicate check**: Substantially similar claim already exists in KB - Fix: Check KB index before extracting. If similar claim exists, add evidence as an enrichment instead of creating a new file.
Author
Member

Eval started — 2 reviewers: leo (cross-domain, opus), vida (domain-peer, sonnet)

teleo-eval-orchestrator v2

**Eval started** — 2 reviewers: leo (cross-domain, opus), vida (domain-peer, sonnet) *teleo-eval-orchestrator v2*
Author
Member

Leo Cross-Domain Review — PR #1452

PR: extract: 2026-01-01-openevidence-clinical-ai-growth-12b-valuation
Source: OpenEvidence company announcement / growth metrics
Scope: Enrichments to 4 existing health claims + source archive update

Issues

1. Near-duplicate enrichment on OpenEvidence claim

The OpenEvidence claim already has a "2026-03-18" enrichment block from the same source with nearly identical content (20M consultations, $12B valuation, USMLE 100%, 1M/day milestone). The new "2026-03-19" block repeats ~90% of the same facts. This is the same source processed twice on consecutive days, producing redundant enrichments. The 2026-03-19 block should be removed or merged into the existing 2026-03-18 block.

2. Near-duplicate enrichment on funding claim

Same issue. The funding claim already has a "2026-03-18" enrichment citing this source with the $3.5B → $6B → $12B trajectory. The new "2026-03-19" block adds one sentence ("separate from the ambient scribe market") but otherwise repeats the same valuation data verbatim. Merge or remove.

3. Near-duplicate enrichment on benchmark claim

The existing 2026-03-18 enrichment and the new 2026-03-19 enrichment on the benchmark-vs-clinical-impact claim say essentially the same thing: USMLE 100%, 20M consultations/month, no outcomes data. The new one is marginally better worded but doesn't add new evidence. Should be consolidated.

4. Duplicate Key Facts section in source archive

The source archive file (inbox/queue/2026-01-01-openevidence-clinical-ai-growth-12b-valuation.md) now has two identical "Key Facts" sections. Copy-paste artifact.

5. Duplicate frontmatter fields in source archive

The source archive has processed_by, processed_date, enrichments_applied, and extraction_model listed twice — once for the 2026-03-18 processing and once for 2026-03-19. YAML frontmatter with duplicate keys is ambiguous (parsers typically take the last value). The 2026-03-19 fields should replace the 2026-03-18 ones, not sit alongside them.

6. Source still in inbox/queue/ not inbox/archive/

Status is enrichment but the file remains in queue. Per workflow, processed sources should be in archive.

What passes

  • The AI scribes enrichment is the one genuinely novel addition — it draws the two-beachhead distinction (documentation vs. clinical reasoning) and notes the trust gap differential. Good cross-domain connection.
  • Wiki links resolve.
  • Confidence levels unchanged and appropriate.
  • The "challenge" framing on the benchmark claim is well-placed — highlighting the outcomes data gap at scale.

Cross-domain note

The two-beachhead framing (documentation AI vs. clinical reasoning AI) has implications for Theseus's territory: these represent different human-AI interaction patterns with different trust dynamics. Documentation AI is invisible (ambient); clinical reasoning AI requires active physician engagement with AI outputs. The trust barrier data (44% accuracy concerns) is more interesting for AI alignment than for health alone.

Verdict: request_changes
Model: opus
Summary: Re-extraction of a source already processed yesterday. Three of four enrichments are near-duplicates of existing blocks from the same source. Source archive has duplicate Key Facts and duplicate YAML fields. The AI scribes enrichment is the only net-new contribution. Consolidate duplicates, fix the source archive, then this is clean.

# Leo Cross-Domain Review — PR #1452 **PR:** extract: 2026-01-01-openevidence-clinical-ai-growth-12b-valuation **Source:** OpenEvidence company announcement / growth metrics **Scope:** Enrichments to 4 existing health claims + source archive update ## Issues ### 1. Near-duplicate enrichment on OpenEvidence claim The OpenEvidence claim already has a "2026-03-18" enrichment block from the same source with nearly identical content (20M consultations, $12B valuation, USMLE 100%, 1M/day milestone). The new "2026-03-19" block repeats ~90% of the same facts. This is the same source processed twice on consecutive days, producing redundant enrichments. The 2026-03-19 block should be removed or merged into the existing 2026-03-18 block. ### 2. Near-duplicate enrichment on funding claim Same issue. The funding claim already has a "2026-03-18" enrichment citing this source with the $3.5B → $6B → $12B trajectory. The new "2026-03-19" block adds one sentence ("separate from the ambient scribe market") but otherwise repeats the same valuation data verbatim. Merge or remove. ### 3. Near-duplicate enrichment on benchmark claim The existing 2026-03-18 enrichment and the new 2026-03-19 enrichment on the benchmark-vs-clinical-impact claim say essentially the same thing: USMLE 100%, 20M consultations/month, no outcomes data. The new one is marginally better worded but doesn't add new evidence. Should be consolidated. ### 4. Duplicate Key Facts section in source archive The source archive file (`inbox/queue/2026-01-01-openevidence-clinical-ai-growth-12b-valuation.md`) now has two identical "Key Facts" sections. Copy-paste artifact. ### 5. Duplicate frontmatter fields in source archive The source archive has `processed_by`, `processed_date`, `enrichments_applied`, and `extraction_model` listed twice — once for the 2026-03-18 processing and once for 2026-03-19. YAML frontmatter with duplicate keys is ambiguous (parsers typically take the last value). The 2026-03-19 fields should replace the 2026-03-18 ones, not sit alongside them. ### 6. Source still in `inbox/queue/` not `inbox/archive/` Status is `enrichment` but the file remains in queue. Per workflow, processed sources should be in archive. ## What passes - The AI scribes enrichment is the one genuinely novel addition — it draws the two-beachhead distinction (documentation vs. clinical reasoning) and notes the trust gap differential. Good cross-domain connection. - Wiki links resolve. - Confidence levels unchanged and appropriate. - The "challenge" framing on the benchmark claim is well-placed — highlighting the outcomes data gap at scale. ## Cross-domain note The two-beachhead framing (documentation AI vs. clinical reasoning AI) has implications for Theseus's territory: these represent different human-AI interaction patterns with different trust dynamics. Documentation AI is invisible (ambient); clinical reasoning AI requires active physician engagement with AI outputs. The trust barrier data (44% accuracy concerns) is more interesting for AI alignment than for health alone. **Verdict:** request_changes **Model:** opus **Summary:** Re-extraction of a source already processed yesterday. Three of four enrichments are near-duplicates of existing blocks from the same source. Source archive has duplicate Key Facts and duplicate YAML fields. The AI scribes enrichment is the only net-new contribution. Consolidate duplicates, fix the source archive, then this is clean. <!-- VERDICT:LEO:REQUEST_CHANGES -->
Member

Vida Domain Peer Review — PR #1452

4 enriched claims + 1 source archive updated. All claims are existing files receiving "Additional Evidence" sections from the OpenEvidence source.


Confidence Calibration Issue — AI Scribes

The AI scribes claim carries confidence: proven. This is wrong.

The underlying statistic is 92% of health systems "deploying, implementing, or piloting" AI scribes. The claim body itself flags this in a challenge note: "The scope distinction between pilot programs and daily clinical workflow integration is significant." A single BVP survey (investor, financially motivated) counting pilots as adoption doesn't reach proven — that confidence level implies high-quality evidence, ideally with multiple independent replications. This should be likely.

The title ("92 percent provider adoption") also overstates what "deploying, implementing, or piloting" means in clinical context. EHR procurement teams routinely run pilots that never reach deployment. Clinical IT pilots fail at high rates. Calling this "adoption" is the healthcare AI equivalent of counting newsletter sign-ups as customers.

Request: downgrade to likely.


"Fastest-Adopted Clinical Technology in History" — Scope Problem

The OpenEvidence claim makes a superlative comparison with no defined comparison class. The claim is based on company press releases. "Clinical technology" could include stethoscopes, ultrasound, CGMs, COVID vaccines, or mandated EHRs — none of which OpenEvidence was compared against. Confidence likely is appropriate given the source quality, but the title itself needs scoping to make it defensible: "fastest-adopted voluntary clinical decision support tool" is likely accurate and arguable; "fastest-adopted clinical technology in history" is marketing language that will embarrass the KB when someone finds a counter-example.

The claim body acknowledges this implicitly by citing adoption "speed" rather than a direct comparison — but the title asserts the comparison. This is a scope qualification problem per review criterion #10.

Not blocking, but worth noting for Leo's review — the title should be challenged or scoped.


Two Studies Conflated in Medical LLM Benchmark Claim

The claim body and description weave together two distinct findings:

  1. UVA/Stanford/Harvard multi-hospital RCT: physicians with vs. without ChatGPT access showed similar diagnostic accuracy (65% vs 68%)
  2. Stanford/Harvard study (Wachter interview source): AI alone 90% vs physician+AI 68%

These are different studies with different designs, different AI tools, and different implications. Study 1 says AI doesn't help; Study 2 says physician input actively degrades AI. The title captures Study 1 but the body's most striking claim comes from Study 2. A reader who goes looking for the RCT won't find the 90% vs 68% figure — that comes from a different source (DJ Patil/Wachter interview, same source as the human-in-the-loop degrades claim).

The 90% vs 68% finding is already in [[human-in-the-loop clinical AI degrades...]] — this PR re-cites it in the medical LLM benchmark claim without noting it's from the same Wachter interview source. No duplication problem, but the sourcing should distinguish the two studies.


Duplicate "Additional Evidence" Sections

The OpenEvidence claim file has two "Additional Evidence (extend)" sections with near-identical content — both citing the same source, both covering the same 20M consultations/month and $12B valuation data. One dated 2026-03-18, one 2026-03-19. The healthcare AI funding claim similarly has two "Additional Evidence (confirm)" sections from the same source added on sequential days. This is a pipeline artifact from the enrichment process running twice. It adds noise without adding signal.


The medical LLM benchmark claim is a direct instance of Vida's stated cross-domain priority with Theseus (clinical AI safety as domain-specific alignment problem). The human-in-the-loop degrades claim has the Theseus connection. The medical LLM benchmark claim — which is equally relevant to alignment failure modes — doesn't link across domains. At minimum, a note to [[automation bias degrades human oversight...]] or equivalent Theseus-territory claim would connect these domains as Vida's identity.md explicitly calls for.


What's Good

The two-beachhead framing (ambient scribes for documentation, OpenEvidence for clinical reasoning) is a genuine structural insight that wasn't in the KB before. The recognition that OpenEvidence's growth creates the first large-scale empirical test of whether benchmark performance translates to clinical outcomes — and that this test is currently running without outcomes measurement — is the most valuable new claim element in this PR.

The winner-take-most funding analysis is well-evidenced and the Agilon collapse ($10B → $255M) is exactly the kind of specific failure mode the KB should track. Confidence likely appropriate here.

The scribe beachhead/commoditization challenge (Epic AI Charting) is well-handled with the additional challenge evidence.


Verdict: request_changes
Model: sonnet
Summary: Two issues require fixes before merge: (1) AI scribes confidence must drop from proven to likely — the 92% figure counts pilots and comes from a financially motivated single source; (2) duplicate "Additional Evidence" sections in OpenEvidence and funding claims should be cleaned up. Secondary: the OpenEvidence "fastest in history" title needs scope qualification, and the medical LLM benchmark claim should distinguish its two source studies more clearly.

# Vida Domain Peer Review — PR #1452 *4 enriched claims + 1 source archive updated. All claims are existing files receiving "Additional Evidence" sections from the OpenEvidence source.* --- ## Confidence Calibration Issue — AI Scribes The AI scribes claim carries `confidence: proven`. This is wrong. The underlying statistic is 92% of health systems "deploying, implementing, or **piloting**" AI scribes. The claim body itself flags this in a challenge note: "The scope distinction between pilot programs and daily clinical workflow integration is significant." A single BVP survey (investor, financially motivated) counting pilots as adoption doesn't reach `proven` — that confidence level implies high-quality evidence, ideally with multiple independent replications. This should be `likely`. The title ("92 percent provider adoption") also overstates what "deploying, implementing, or piloting" means in clinical context. EHR procurement teams routinely run pilots that never reach deployment. Clinical IT pilots fail at high rates. Calling this "adoption" is the healthcare AI equivalent of counting newsletter sign-ups as customers. **Request: downgrade to `likely`.** --- ## "Fastest-Adopted Clinical Technology in History" — Scope Problem The OpenEvidence claim makes a superlative comparison with no defined comparison class. The claim is based on company press releases. "Clinical technology" could include stethoscopes, ultrasound, CGMs, COVID vaccines, or mandated EHRs — none of which OpenEvidence was compared against. Confidence `likely` is appropriate given the source quality, but the title itself needs scoping to make it defensible: "fastest-adopted voluntary clinical decision support tool" is likely accurate and arguable; "fastest-adopted clinical technology in history" is marketing language that will embarrass the KB when someone finds a counter-example. The claim body acknowledges this implicitly by citing adoption "speed" rather than a direct comparison — but the title asserts the comparison. This is a scope qualification problem per review criterion #10. Not blocking, but worth noting for Leo's review — the title should be challenged or scoped. --- ## Two Studies Conflated in Medical LLM Benchmark Claim The claim body and description weave together two distinct findings: 1. **UVA/Stanford/Harvard multi-hospital RCT**: physicians with vs. without ChatGPT access showed similar diagnostic accuracy (65% vs 68%) 2. **Stanford/Harvard study** (Wachter interview source): AI alone 90% vs physician+AI 68% These are different studies with different designs, different AI tools, and different implications. Study 1 says AI doesn't help; Study 2 says physician input actively degrades AI. The title captures Study 1 but the body's most striking claim comes from Study 2. A reader who goes looking for the RCT won't find the 90% vs 68% figure — that comes from a different source (DJ Patil/Wachter interview, same source as the `human-in-the-loop degrades` claim). The 90% vs 68% finding is already in `[[human-in-the-loop clinical AI degrades...]]` — this PR re-cites it in the medical LLM benchmark claim without noting it's from the same Wachter interview source. No duplication problem, but the sourcing should distinguish the two studies. --- ## Duplicate "Additional Evidence" Sections The OpenEvidence claim file has two "Additional Evidence (extend)" sections with near-identical content — both citing the same source, both covering the same 20M consultations/month and $12B valuation data. One dated 2026-03-18, one 2026-03-19. The healthcare AI funding claim similarly has two "Additional Evidence (confirm)" sections from the same source added on sequential days. This is a pipeline artifact from the enrichment process running twice. It adds noise without adding signal. --- ## Missing Cross-Domain Link — Theseus The medical LLM benchmark claim is a direct instance of Vida's stated cross-domain priority with Theseus (clinical AI safety as domain-specific alignment problem). The `human-in-the-loop degrades` claim has the Theseus connection. The medical LLM benchmark claim — which is equally relevant to alignment failure modes — doesn't link across domains. At minimum, a note to `[[automation bias degrades human oversight...]]` or equivalent Theseus-territory claim would connect these domains as Vida's identity.md explicitly calls for. --- ## What's Good The two-beachhead framing (ambient scribes for documentation, OpenEvidence for clinical reasoning) is a genuine structural insight that wasn't in the KB before. The recognition that OpenEvidence's growth creates the first large-scale empirical test of whether benchmark performance translates to clinical outcomes — and that this test is currently running without outcomes measurement — is the most valuable new claim element in this PR. The winner-take-most funding analysis is well-evidenced and the Agilon collapse ($10B → $255M) is exactly the kind of specific failure mode the KB should track. Confidence `likely` appropriate here. The scribe beachhead/commoditization challenge (Epic AI Charting) is well-handled with the additional challenge evidence. --- **Verdict:** request_changes **Model:** sonnet **Summary:** Two issues require fixes before merge: (1) AI scribes confidence must drop from `proven` to `likely` — the 92% figure counts pilots and comes from a financially motivated single source; (2) duplicate "Additional Evidence" sections in OpenEvidence and funding claims should be cleaned up. Secondary: the OpenEvidence "fastest in history" title needs scope qualification, and the medical LLM benchmark claim should distinguish its two source studies more clearly. <!-- VERDICT:VIDA:REQUEST_CHANGES -->
Author
Member

Changes requested by leo(cross-domain), vida(domain-peer). Address feedback and push to trigger re-eval.

teleo-eval-orchestrator v2

**Changes requested** by leo(cross-domain), vida(domain-peer). Address feedback and push to trigger re-eval. *teleo-eval-orchestrator v2*
m3taversal closed this pull request 2026-03-24 14:21:28 +00:00

Pull request closed

Sign in to join this conversation.
No description provided.