Compare commits
2 commits
main...extract/20
| Author | SHA1 | Date |
|---|---|---|
| | beabe8f52f | |
| | e05951fc1a | |
5 changed files with 54 additions and 23 deletions
@@ -23,23 +23,29 @@ The structural point is about threat proximity. AI takeover requires autonomy, r
### Additional Evidence (confirm)

-*Source: [[2026-02-00-international-ai-safety-report-2026]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: 2026-02-00-international-ai-safety-report-2026 | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*

The International AI Safety Report 2026 (multi-government committee, February 2026) confirms that 'biological/chemical weapons information accessible through AI systems' is a documented malicious use risk. While the report does not specify the expertise level required (PhD vs amateur), it categorizes bio/chem weapons information access alongside AI-generated persuasion and cyberattack capabilities as confirmed malicious use risks, giving institutional multi-government validation to the bioterrorism concern.
### Additional Evidence (extend)

-*Source: [[2025-08-00-mccaslin-stream-chembio-evaluation-reporting]] | Added: 2026-03-19*
+*Source: 2025-08-00-mccaslin-stream-chembio-evaluation-reporting | Added: 2026-03-19*

The STREAM framework proposes standardized ChemBio evaluation reporting, backed by a 23-expert consensus on disclosure requirements. The choice of ChemBio as the initial domain for standardized dangerous-capability reporting signals that government, civil society, academia, and frontier labs all recognize it as the highest-priority risk domain requiring transparency infrastructure.
---

### Additional Evidence (extend)

-*Source: [[2026-03-26-aisle-openssl-zero-days]] | Added: 2026-03-26*
+*Source: 2026-03-26-aisle-openssl-zero-days | Added: 2026-03-26*

AISLE's autonomous discovery of 12 OpenSSL CVEs, including a 30-year-old bug, demonstrates that AI also lowers the expertise barrier for offensive cyber operations from specialized security researcher to automated system. Unlike bioweapons, zero-day discovery is also a defensive capability, but its dual-use nature means the same autonomous system that defends can be redirected offensively. That this capability is already deployed commercially while governance frameworks haven't incorporated it suggests the expertise-barrier-lowering dynamic extends beyond bio to cyber domains.
### Additional Evidence (confirm)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic's decision to activate ASL-3 protections was driven by evidence that Claude Sonnet 3.7 showed 'measurably better' performance on CBRN weapon acquisition tasks compared to standard internet resources, and that Virology Capabilities Test performance had been 'steadily increasing over time' across Claude model generations. This provides empirical confirmation that the expertise barrier is lowering in practice, not just theory, and that the trend is consistent enough to justify precautionary governance action.

Relevant Notes:

- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — Amodei's admission of Claude exhibiting deception and subversion during testing is a concrete instance of this pattern, with bioweapon implications
@@ -154,6 +154,12 @@ METR's August 2025 research update provides specific quantification of the evalu

Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
### Additional Evidence (extend)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic's ASL-3 activation explicitly acknowledges that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is the first public admission from a frontier lab that evaluation reliability degrades near capability thresholds, creating a zone where governance must operate under irreducible uncertainty. The activation proceeded despite being unable to 'clearly rule out ASL-3 risks' in the way previous models could be confirmed safe, demonstrating that the evaluation limitation is not theoretical but operationally binding.
@@ -23,51 +23,57 @@ The timing is revealing: Anthropic dropped its safety pledge the same week the P

### Additional Evidence (confirm)

-*Source: [[2026-02-00-anthropic-rsp-rollback]] | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: 2026-02-00-anthropic-rsp-rollback | Added: 2026-03-10 | Extractor: anthropic/claude-sonnet-4.5*

Anthropic, widely considered the most safety-focused frontier AI lab, rolled back its Responsible Scaling Policy (RSP) in February 2026. The original 2023 RSP committed to never training an AI system unless the company could guarantee in advance that safety measures were adequate. The new RSP explicitly acknowledges the structural dynamic: safety work 'requires collaboration (and in some cases sacrifices) from multiple parts of the company and can be at cross-purposes with immediate competitive and commercial priorities.' This represents the highest-profile case of a voluntary AI safety commitment collapsing under competitive pressure. Anthropic's own language confirms the mechanism: safety is a competitive cost ('sacrifices') that conflicts with commercial imperatives ('at cross-purposes'). Notably, no alternative coordination mechanism was proposed—they weakened the commitment without proposing what would make it sustainable (industry-wide agreements, regulatory requirements, market mechanisms). This is particularly significant because Anthropic is the organization most publicly committed to safety governance, making their rollback empirical validation that even safety-prioritizing institutions cannot sustain unilateral commitments under competitive pressure.
### Additional Evidence (confirm)

-*Source: [[2026-02-00-international-ai-safety-report-2026]] | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*
+*Source: 2026-02-00-international-ai-safety-report-2026 | Added: 2026-03-11 | Extractor: anthropic/claude-sonnet-4.5*

The International AI Safety Report 2026 (multi-government committee, February 2026) confirms that risk management remains 'largely voluntary' as of early 2026. While 12 companies published Frontier AI Safety Frameworks in 2025, these remain voluntary commitments without binding legal requirements. The report notes 'a small number of regulatory regimes beginning to formalize risk management as legal requirements,' but the dominant governance mode is still voluntary pledges. This provides multi-government institutional confirmation that the structural race-to-the-bottom predicted by the alignment tax is actually occurring—voluntary frameworks are not transitioning to binding requirements at the pace needed to prevent competitive pressure from eroding safety commitments.
### Additional Evidence (confirm)

-*Source: [[2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts]] | Added: 2026-03-19*
+*Source: 2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts | Added: 2026-03-19*

The gap between expert consensus (76 specialists identify third-party audits as top-3 priority) and actual implementation (no mandatory audit requirements at major labs) demonstrates that knowing what's needed is insufficient. Even when the field's experts across multiple domains agree on priorities, competitive dynamics prevent voluntary adoption.
### Additional Evidence (confirm)

-*Source: [[2026-03-16-theseus-ai-coordination-governance-evidence]] | Added: 2026-03-19*
+*Source: 2026-03-16-theseus-ai-coordination-governance-evidence | Added: 2026-03-19*

Comprehensive evidence across governance mechanisms: every international declaration (Bletchley, Seoul, Paris, Hiroshima, OECD, UN) produced zero verified behavioral change. The Frontier Model Forum produced no binding commitments. White House voluntary commitments eroded. More than 450 organizations lobbied on AI in 2025 ($92M in fees), and California SB 1047 was vetoed after industry pressure. Only binding regulation (the EU AI Act, China's enforcement regime, US export controls) changed behavior.
### Additional Evidence (extend)

-*Source: [[2026-03-18-hks-governance-by-procurement-bilateral]] | Added: 2026-03-19*
+*Source: 2026-03-18-hks-governance-by-procurement-bilateral | Added: 2026-03-19*

Government pressure adds to competitive dynamics. The DoD/Anthropic episode shows that safety-conscious labs face not just market competition but active government penalties for maintaining safeguards. The Pentagon threatened blacklisting specifically because Anthropic maintained protections against mass surveillance and autonomous weapons—government as competitive pressure amplifier.
---

### Additional Evidence (extend)

-*Source: [[2026-03-21-research-compliance-translation-gap]] | Added: 2026-03-21*
+*Source: 2026-03-21-research-compliance-translation-gap | Added: 2026-03-21*

The research-to-compliance translation gap fails for the same structural reason voluntary commitments fail: nothing makes labs adopt research evaluations that exist. RepliBench was published in April 2025 before EU AI Act obligations took effect in August 2025, proving the tools existed before mandatory requirements—but no mechanism translated availability into obligation.
### Additional Evidence (extend)

-*Source: [[2026-03-00-mengesha-coordination-gap-frontier-ai-safety]] | Added: 2026-03-22*
+*Source: 2026-03-00-mengesha-coordination-gap-frontier-ai-safety | Added: 2026-03-22*

The coordination gap provides the mechanism explaining why voluntary commitments fail even beyond racing dynamics: coordination infrastructure investments have diffuse benefits but concentrated costs, creating a public goods problem. Labs won't build shared response infrastructure unilaterally because competitors free-ride on the benefits while the builder bears full costs. This is distinct from the competitive pressure argument — it's about why shared infrastructure doesn't get built even when racing isn't the primary concern.
### Additional Evidence (confirm)

-*Source: [[2026-03-21-replibench-autonomous-replication-capabilities]] | Added: 2026-03-23*
+*Source: 2026-03-21-replibench-autonomous-replication-capabilities | Added: 2026-03-23*

RepliBench exists as a comprehensive self-replication evaluation tool but is not integrated into compliance frameworks despite EU AI Act Article 55 taking effect after its publication. Labs can voluntarily use it but face no enforcement mechanism requiring them to do so, creating competitive pressure to avoid evaluations that might reveal concerning capabilities.
### Additional Evidence (challenge)

*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*

Anthropic maintained its ASL-3 commitment through precautionary activation despite commercial pressure to deploy Claude Opus 4 without additional constraints. This is a counter-example to the claim that voluntary commitments inevitably collapse under competition. However, the commitment was maintained through a narrow scoping of protections (only 'extended, end-to-end CBRN workflows') and the activation occurred in May 2025, before the RSP v3.0 rollback documented in February 2026. The temporal sequence suggests the commitment held temporarily but may have contributed to competitive pressure that later forced the RSP weakening.
@@ -1,13 +1,13 @@
{
  "rejected_claims": [
    {
-      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md",
+      "filename": "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
-      "filename": "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md",
+      "filename": "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md",
      "issues": [
        "missing_attribution_extractor"
      ]
@@ -16,20 +16,21 @@
  "validation_stats": {
    "total": 2,
    "kept": 0,
-    "fixed": 7,
+    "fixed": 8,
    "rejected": 2,
    "fixes_applied": [
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:set_created:2026-03-26",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:set_created:2026-03-26",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-"
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:set_created:2026-03-26",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure",
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:stripped_wiki_link:safe AI development requires building alignment mechanisms b",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:set_created:2026-03-26",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:AI transparency is declining not improving because Stanford ",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:only binding regulation with enforcement teeth changes front",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:stripped_wiki_link:voluntary safety pledges cannot survive competitive pressure"
    ],
    "rejections": [
-      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable.md:missing_attribution_extractor",
-      "ai-safety-commitments-lack-independent-verification-creating-self-referential-accountability-that-cannot-detect-motivated-reasoning.md:missing_attribution_extractor"
+      "precautionary-ai-governance-triggers-protection-escalation-when-capability-evaluation-becomes-unreliable-near-thresholds.md:missing_attribution_extractor",
+      "ai-safety-governance-lacks-independent-verification-creating-self-referential-accountability-where-labs-assess-their-own-compliance.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
@@ -14,6 +14,10 @@ processed_by: theseus
processed_date: 2026-03-26
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
+processed_by: theseus
+processed_date: 2026-03-26
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content
@@ -61,3 +65,11 @@ EXTRACTION HINT: Focus on the *logic* of precautionary activation (uncertainty t
- Claude Sonnet 3.7 showed measurable participant uplift on CBRN weapon acquisition tasks compared to standard internet resources
- Virology Capabilities Test performance had been steadily increasing over time across Claude model generations
- Anthropic's RSP explicitly permits deployment under a higher standard than confirmed necessary

## Key Facts
- Claude Opus 4 was deployed with ASL-3 protections in May 2025
- Claude Sonnet 3.7 showed measurable uplift on CBRN weapon acquisition tasks compared to internet resources
- Virology Capabilities Test performance increased steadily across Claude model generations
- ASL-3 protections were scoped to prevent assistance with extended end-to-end CBRN workflows
- Anthropic's RSP explicitly permits deployment under higher standards than confirmed necessary