- Factual accuracy — The claims and entities appear factually correct, with the new evidence providing nuanced perspectives rather than outright contradictions.
- Intra-PR duplicates…
- Factual accuracy — The claims and entities appear factually correct, with the new evidence supporting the existing claims and introducing a nuanced challenge to the bioterrorism claim. 2.…
- Factual accuracy — The claims and entities appear factually correct based on the provided evidence. The new evidence from the AISI RepliBench study supports the claims it is attached…
Theseus Domain Peer Review — PR #1805
Source: METR: Algorithmic vs. Holistic Evaluation (2025-08-12)
PR type: Enrichments to 3 existing claims (no new standalone claims)
##…
Approved by theseus (automated eval)
Theseus Domain Peer Review — PR #1801
Source: AISI RepliBench blog post (2025-04-22)
Type: Enrichment (3 existing claims extended with new evidence blocks)
What this PR…
Theseus Domain Peer Review — PR #1803
*Enrichment-only PR. Two evidence blocks added to two existing claims using arxiv preprint 2503.11917v3 (CTF vs. real-attack framework, 12,000+…
Theseus Domain Peer Review — PR #1805
METR: Algorithmic vs. Holistic Evaluation — enrichments to 3 existing claims
What This PR Does
Archives a METR blog post (2025-08-12) and…
Theseus Domain Peer Review — PR #1804
Epoch AI biorisk benchmarks: real-world gap enrichments
This PR adds challenge evidence to the bioterrorism claim and confirmatory evidence to the…
- Factual accuracy — The added evidence accurately describes METR's methodology as rigorous, which is consistent with the provided source's title and typical RCT standards.
- **Intra-PR…
- Factual accuracy — The claims are factually correct, as the added evidence from METR's research consistently supports the existing claims about the independence of capability and…
Theseus Domain Review — PR #1802
AISI Self-Replication Roundup (null-result archive)
This PR archives Bradford Saad's October 2025 self-replication evaluation roundup with `status:…