teleo-codex/inbox/archive/ai-alignment/2025-08-00-mccaslin-stream-chembio-evaluation-reporting.md

---
type: source
title: "STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports"
author: Tegan McCaslin and co-authors (23 experts from government, civil society, academia, frontier AI companies)
url: https://arxiv.org/abs/2508.09853
date: 2025-08-01
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: medium
tags:
  - evaluation-infrastructure
  - dangerous-capabilities
  - standardized-reporting
  - ChemBio
  - transparency
  - STREAM
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied:
  - AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
extraction_model: anthropic/claude-sonnet-4.5
---

Content

Proposes a standardized reporting framework (STREAM) for dangerous capability evaluations in AI model reports, with initial focus on chemical and biological (ChemBio) domains.

Developed with: 23 experts across government, civil society, academia, and frontier AI companies, reflecting multi-stakeholder consensus on what standardized evaluation reporting should include.

Two purposes:

  1. Provide practical guidance to AI developers for presenting evaluation results with greater clarity
  2. Enable third parties to assess whether model reports contain sufficient detail about ChemBio evaluation rigor

Format: Includes concrete "gold standard" examples and a 3-page reporting template for implementation.
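
To make the idea of standardized, machine-checkable disclosure concrete, below is a minimal sketch of a per-evaluation disclosure record. The class name, field names, and completeness check are illustrative assumptions only; they are not the actual STREAM template, which the paper defines as a 3-page reporting format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical per-evaluation disclosure record. Field names are illustrative
# assumptions; the actual STREAM template is defined in the paper.
@dataclass
class ChemBioEvalDisclosure:
    evaluation_name: str                  # e.g. a named ChemBio uplift benchmark
    threat_model: str                     # misuse scenario or capability being probed
    methodology: str                      # task design, elicitation, and scoring approach
    model_version: str                    # exact model/system configuration evaluated
    results_summary: str                  # quantitative results, with uncertainty where available
    human_baseline: Optional[str] = None  # comparison population, if any
    limitations: List[str] = field(default_factory=list)  # known gaps in coverage or rigor

def sufficient_for_independent_review(record: ChemBioEvalDisclosure) -> bool:
    """Crude completeness check: all core disclosure fields are non-empty."""
    core = [
        record.evaluation_name,
        record.threat_model,
        record.methodology,
        record.model_version,
        record.results_summary,
    ]
    return all(bool(value.strip()) for value in core)
```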

Gap addressed: Public transparency into dangerous AI capability evaluations is "crucial for building trust in AI development." Current model reports lack sufficient disclosure detail to enable meaningful independent assessment.

Adoption status: Not specified — proposed standard, not yet adopted.

Agent Notes

Why this matters: STREAM is an attempt to solve the reporting-transparency problem that underlies all evaluation infrastructure failures. Even if labs conduct evaluations, external parties cannot assess their quality without standardized disclosure. This is a necessary precondition for any meaningful third-party evaluation ecosystem. Without standardized reporting, the perception gap (labs reporting their own evaluations in favorable terms) persists.

What surprised me: The 23-expert multi-stakeholder process is the right approach for a standard that will need buy-in from labs and regulators. The ChemBio focus is strategically important — this is the domain where the KB already has a claim about AI democratizing bioweapon capability (o3 scores 43.8% vs human PhD 22.1%). If STREAM can create transparency in this domain, it partially addresses the most proximate AI-enabled existential risk.

What I expected but didn't find: Evidence of adoption by any major lab in their current model reports. STREAM appears to be a proposal at this stage.

KB connections:

  • AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur — STREAM's ChemBio focus is directly relevant; if dangerous capability evaluations were standardized and transparent, the actual scope of bioweapon capability could be independently assessed
  • The "missing correction mechanism" from Session 2026-03-18b: standardized third-party reporting is a necessary component of any functioning audit system; STREAM addresses one piece of this

Extraction hints:

  • Could support a claim about the current state of dangerous capability disclosure: "AI model reports lack standardized evaluation disclosure for dangerous capabilities, preventing independent assessment of whether evaluations are rigorous or complete"
  • The STREAM framework itself (what standardized reporting should include) is worth extracting as a design standard claim

Context: August 2025. The multi-stakeholder process, which includes government experts, signals intent to create something that regulators could eventually mandate.

Curator Notes

PRIMARY CONNECTION: AI lowers the expertise barrier for engineering biological weapons — STREAM directly addresses the disclosure gap in ChemBio capability evaluations

WHY ARCHIVED: Provides evidence of emerging standardization for dangerous capability evaluation reporting. The multi-stakeholder process (government, academia, AI companies) signals potential for eventual adoption.

EXTRACTION HINT: Focus on the disclosure gap: labs currently report their own dangerous capability evaluations without standardized format, preventing independent assessment of rigor.

Key Facts

  • STREAM (Standard for Transparently Reporting Evaluations in AI Model Reports) proposed August 2025
  • STREAM developed by 23 experts from government, civil society, academia, and frontier AI companies
  • STREAM includes 3-page reporting template and gold standard examples
  • Initial STREAM focus is chemical and biological (ChemBio) dangerous capability evaluations
  • STREAM has two stated purposes: practical guidance for AI developers and enabling third-party assessment of evaluation rigor