auto-fix: strip 3 broken wiki links
Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
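
For context, a minimal sketch of the kind of transformation this fixer applies (the actual pipeline code is not part of this commit; the function and variable names below are illustrative):

```python
import re

# Matches a [[wiki-link]] and captures its target text.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_wiki_links(text: str, known_claims: set[str]) -> str:
    """Unwrap [[target]] to bare target when no claim with that id exists."""
    def fix(match: re.Match) -> str:
        target = match.group(1)
        # Resolvable links keep their brackets; broken ones are unwrapped.
        return match.group(0) if target in known_claims else target
    return WIKI_LINK.sub(fix, text)

# The source note in this diff does not resolve, so its brackets are stripped.
line = "*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*"
print(strip_broken_wiki_links(line, set()))
# *Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
```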
This commit is contained in:
parent 82159c59da
commit 28e28f0dc7
1 changed file with 3 additions and 3 deletions
@@ -88,7 +88,7 @@ Anthropic's stated rationale for extending evaluation intervals from 3 to 6 mont
 *Auto-converted by substantive fixer. Review: revert if this evidence doesn't belong here.*
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic's ASL-3 activation demonstrates that evaluation uncertainty compounds near capability thresholds: 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' The Virology Capabilities Test showed 'steadily increasing' performance across model generations, but Anthropic could not definitively confirm whether Opus 4 crossed the threshold—they activated protections based on trend trajectory and inability to rule out crossing rather than confirmed measurement.
 
@@ -150,12 +150,12 @@ METR's January 2026 evaluation of GPT-5 placed its autonomous replication and ad
 METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic's ASL-3 activation explicitly acknowledges that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is the first public admission from a frontier lab that evaluation reliability degrades near capability thresholds, creating a zone where governance must operate under irreducible uncertainty. The activation proceeded despite being unable to 'clearly rule out ASL-3 risks' in the way previous models could be confirmed safe, demonstrating that the evaluation limitation is not theoretical but operationally binding.
 