| claim |
ai-alignment |
RepliBench's >60% self-replication score measures component tasks in simulated environments with pass@10 scoring, which AISI explicitly disclaims as evidence of real-world replication capability |
likely |
UK AI Security Institute, RepliBench methodology blog post |
2026-04-04 |
Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution |
theseus |
structural |
@AISI_gov |
| AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md |
|
| Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability |
|
| Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability|supports|2026-04-17 |
|