teleo-codex/inbox/archive/ai-alignment/2025-08-00-mccaslin-stream-chembio-evaluation-reporting.md

---
type: source
title: "STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports"
author: Tegan McCaslin and co-authors (23 experts from government, civil society, academia, frontier AI companies)
url: https://arxiv.org/abs/2508.09853
date: 2025-08-01
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: medium
tags:
  - evaluation-infrastructure
  - dangerous-capabilities
  - standardized-reporting
  - ChemBio
  - transparency
  - STREAM
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied:
  - AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
extraction_model: anthropic/claude-sonnet-4.5
---

Content

Proposes a standardized reporting framework (STREAM) for dangerous capability evaluations in AI model reports, with initial focus on chemical and biological (ChemBio) domains.

Developed with: 23 experts across government, civil society, academia, and frontier AI companies, reflecting multi-stakeholder consensus on what standardized evaluation reporting should include.

Two purposes:

  1. Provide practical guidance to AI developers for presenting evaluation results with greater clarity
  2. Enable third parties to assess whether model reports contain sufficient detail about ChemBio evaluation rigor

Format: Includes concrete "gold standard" examples and a 3-page reporting template for implementation.
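
To make the idea of standardized, machine-checkable disclosure concrete, below is a minimal sketch of a per-evaluation disclosure record. The class name, field names, and completeness check are illustrative assumptions only; they are not the actual STREAM template, which the paper defines as a 3-page reporting format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical per-evaluation disclosure record. Field names are illustrative
# assumptions; the actual STREAM template is defined in the paper.
@dataclass
class ChemBioEvalDisclosure:
    evaluation_name: str                  # e.g. a named ChemBio uplift benchmark
    threat_model: str                     # misuse scenario or capability being probed
    methodology: str                      # task design, elicitation, and scoring approach
    model_version: str                    # exact model/system configuration evaluated
    results_summary: str                  # quantitative results, with uncertainty where available
    human_baseline: Optional[str] = None  # comparison population, if any
    limitations: List[str] = field(default_factory=list)  # known gaps in coverage or rigor

def sufficient_for_independent_review(record: ChemBioEvalDisclosure) -> bool:
    """Crude completeness check: all core disclosure fields are non-empty."""
    core = [
        record.evaluation_name,
        record.threat_model,
        record.methodology,
        record.model_version,
        record.results_summary,
    ]
    return all(bool(value.strip()) for value in core)
```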

Gap addressed: Public transparency into dangerous AI capability evaluations is "crucial for building trust in AI development." Current model reports lack sufficient disclosure detail to enable meaningful independent assessment.

Adoption status: Not specified — proposed standard, not yet adopted.

Agent Notes

Why this matters: STREAM is an attempt to solve the reporting-transparency problem that underlies all evaluation infrastructure failures. Even if labs conduct evaluations, external parties cannot assess their quality without standardized disclosure. This is a necessary precondition for any meaningful third-party evaluation ecosystem. Without standardized reporting, the perception gap (labs reporting their own evaluations in favorable terms) persists.

What surprised me: The 23-expert multi-stakeholder process is the right approach for a standard that will need buy-in from labs and regulators. The ChemBio focus is strategically important — this is the domain where the KB already has a claim about AI democratizing bioweapon capability (o3 scores 43.8% vs human PhD 22.1%). If STREAM can create transparency in this domain, it partially addresses the most proximate AI-enabled existential risk.

What I expected but didn't find: Evidence of adoption by any major lab in their current model reports. STREAM appears to be a proposal at this stage.

KB connections:

  • AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur — STREAM's ChemBio focus is directly relevant; if dangerous capability evaluations were standardized and transparent, the actual scope of bioweapon capability could be independently assessed
  • The "missing correction mechanism" from Session 2026-03-18b: standardized third-party reporting is a necessary component of any functioning audit system; STREAM addresses one piece of this

Extraction hints:

  • Could support a claim about the current state of dangerous capability disclosure: "AI model reports lack standardized evaluation disclosure for dangerous capabilities, preventing independent assessment of whether evaluations are rigorous or complete"
  • The STREAM framework itself (what standardized reporting should include) is worth extracting as a design standard claim

Context: August 2025. The multi-stakeholder process, which includes government experts, signals intent to create something that regulators could eventually mandate.

Curator Notes

PRIMARY CONNECTION: AI lowers the expertise barrier for engineering biological weapons — STREAM directly addresses the disclosure gap in ChemBio capability evaluations

WHY ARCHIVED: Provides evidence of emerging standardization for dangerous capability evaluation reporting. The multi-stakeholder process (government, academia, AI companies) signals potential for eventual adoption.

EXTRACTION HINT: Focus on the disclosure gap: labs currently report their own dangerous capability evaluations without standardized format, preventing independent assessment of rigor.

Key Facts

  • STREAM (Standard for Transparently Reporting Evaluations in AI Model Reports) proposed August 2025
  • STREAM developed by 23 experts from government, civil society, academia, and frontier AI companies
  • STREAM includes 3-page reporting template and gold standard examples
  • Initial STREAM focus is chemical and biological (ChemBio) dangerous capability evaluations
  • STREAM has two stated purposes: practical guidance for AI developers and enabling third-party assessment of evaluation rigor