teleo-codex/inbox/queue/2026-03-26-anthropic-activating-asl3-protections.md

---
type: source
title: Anthropic Activates ASL-3 Protections for Claude Opus 4 Without Confirmed Threshold Crossing
author: Anthropic (@AnthropicAI)
url: https://www.anthropic.com/news/activating-asl3-protections
date: 2025-05-01
domain: ai-alignment
secondary_domains:
format: blog
status: unprocessed
priority: high
tags:
  - ASL-3
  - precautionary-governance
  - CBRN
  - capability-thresholds
  - RSP
  - measurement-uncertainty
  - safety-cases
---

## Content

Anthropic activated ASL-3 safeguards for Claude Opus 4 as a precautionary and provisional measure — explicitly without having confirmed that the model crossed the capability threshold that would ordinarily require those protections.

Key statement: "Clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model." This is a significant departure — prior Claude models could be positively confirmed as below ASL-3 thresholds; Opus 4 could not.

The safety case was built on three converging uncertainty signals:

  1. Experiments with Claude Sonnet 3.7 showed that participants performed measurably better on CBRN weapon acquisition tasks than with standard internet resources alone (uplift in the positive direction, but below the formal threshold)
  2. Performance on the Virology Capabilities Test had been "steadily increasing over time" — trend line pointed toward threshold crossing even if current value was ambiguous
  3. "Dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status"

The RSP explicitly permits — and Anthropic reads it as requiring — erring on the side of caution: policy allows deployment "under a higher standard than we are sure is needed." Uncertainty about threshold crossing triggers more protection, not less.
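The decision rule described above can be sketched as a short Python illustration. This is not Anthropic's actual tooling; the names, levels, and enum values are purely illustrative of the logic "anything short of a confirmed below-threshold result escalates to the higher standard":

```python
from enum import Enum


class Determination(Enum):
    """Possible outcomes of a capability evaluation (illustrative)."""
    BELOW_THRESHOLD = "confirmed below threshold"
    ABOVE_THRESHOLD = "confirmed above threshold"
    CANNOT_RULE_OUT = "cannot rule out crossing"


def required_safeguards(determination: Determination) -> str:
    """Precautionary rule: only a positive confirmation that the model
    is below the threshold permits the lower standard. Both a confirmed
    crossing and unresolved uncertainty trigger the stronger protections.
    """
    if determination is Determination.BELOW_THRESHOLD:
        return "ASL-2"
    return "ASL-3"
```

The key asymmetry is that `CANNOT_RULE_OUT` maps to the same branch as `ABOVE_THRESHOLD`: measurement uncertainty defaults to more protection, not less.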

ASL-3 protections were narrowly scoped: preventing assistance with extended, end-to-end CBRN workflows "in a way that is additive to what is already possible without large language models." Biological weapons were the primary concern.

## Agent Notes

Why this matters: This is the first concrete operationalization of "precautionary AI governance under measurement uncertainty" — a governance mechanism where evaluation difficulty itself triggers escalation. This is conceptually significant: it formalizes the principle that you can't require confirmed threshold crossing before applying safeguards when evaluation near thresholds is inherently unreliable.

What surprised me: The safety case is built on trend lines and uncertainty rather than confirmed capability. Anthropic is essentially saying "we can't rule it out and the trajectory suggests we'll cross it" — that's a very different standard than "we confirmed it crossed." This is more precautionary than I expected from a commercially deployed model.

What I expected but didn't find: Any external verification mechanism. The activation is entirely self-reported and self-assessed; no third-party auditor confirmed that ASL-3 was warranted or that it was correctly implemented.

KB connections:

Extraction hints: Two distinct claims worth extracting: (1) the precautionary governance principle itself ("uncertainty about threshold crossing triggers more protection, not less"), and (2) the structural limitation (self-referential accountability, no independent verification). The first is a governance innovation claim; the second is a governance limitation claim. Both deserve KB representation.

Context: This is the Anthropic RSP framework in action. The ASL (AI Safety Level) system is Anthropic's proprietary capability classification. ASL-3 represents capability levels that "could significantly boost the ability of bad actors to create biological or chemical weapons with mass casualty potential, or that could conduct offensive cyber operations that would be difficult to defend against."

## Curator Notes

PRIMARY CONNECTION: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints

WHY ARCHIVED: First documented precautionary capability threshold activation — governance acting before measurement confirmation rather than after

EXTRACTION HINT: Focus on the logic of precautionary activation (uncertainty triggers more caution) as the claim, not just the CBRN specifics — the governance principle generalizes