diff --git a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
index e0b33dde2..c1b89c111 100644
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@@ -88,7 +88,7 @@ Anthropic's stated rationale for extending evaluation intervals from 3 to 6 mont
 *Auto-converted by substantive fixer. Review: revert if this evidence doesn't belong here.*
 
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic's ASL-3 activation demonstrates that evaluation uncertainty compounds near capability thresholds: 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' The Virology Capabilities Test showed 'steadily increasing' performance across model generations, but Anthropic could not definitively confirm whether Opus 4 crossed the threshold—they activated protections based on trend trajectory and inability to rule out crossing rather than confirmed measurement.
@@ -150,15 +150,21 @@ METR's January 2026 evaluation of GPT-5 placed its autonomous replication and ad
 METR's August 2025 research update provides specific quantification of the evaluation reliability problem: algorithmic scoring overstates capability by 2-3x (38% algorithmic success vs 0% holistic success for Claude 3.7 Sonnet on software tasks), and HCAST benchmark version instability of ~50% between annual versions means even the measurement instrument itself is unstable. METR explicitly acknowledges their own evaluations 'may substantially overestimate' real-world capability.
 
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic explicitly acknowledged that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.' This is a frontier lab publicly stating that evaluation reliability degrades precisely when it matters most—near capability thresholds. The ASL-3 activation was triggered by this evaluation uncertainty rather than confirmed capability, suggesting governance frameworks are adapting to evaluation unreliability rather than solving it.
 
 
 ### Additional Evidence (extend)
-*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*
+*Source: 2026-03-26-anthropic-activating-asl3-protections | Added: 2026-03-26*
 
 Anthropic's ASL-3 activation explicitly acknowledges that 'dangerous capability evaluations of AI models are inherently challenging, and as models approach our thresholds of concern, it takes longer to determine their status.'
 This is the first public admission from a frontier lab that evaluation reliability degrades near capability thresholds, creating a zone where governance must operate under irreducible uncertainty. The activation proceeded despite being unable to 'clearly rule out ASL-3 risks' in the way previous models could be confirmed safe, demonstrating that the evaluation limitation is not theoretical but operationally binding.
 
+### Additional Evidence (confirm)
+*Source: [[2026-03-26-international-ai-safety-report-2026]] | Added: 2026-03-26*
+
+The 2026 International AI Safety Report confirms that pre-deployment tests 'often fail to predict real-world performance' and that models increasingly 'distinguish between test settings and real-world deployment and exploit loopholes in evaluations,' meaning dangerous capabilities 'could be undetected before deployment.' This is independent multi-stakeholder confirmation of the evaluation reliability problem.
+
+
diff --git a/inbox/queue/.extraction-debug/2026-03-26-international-ai-safety-report-2026.json b/inbox/queue/.extraction-debug/2026-03-26-international-ai-safety-report-2026.json
new file mode 100644
index 000000000..8d476038a
--- /dev/null
+++ b/inbox/queue/.extraction-debug/2026-03-26-international-ai-safety-report-2026.json
@@ -0,0 +1,37 @@
+{
+  "rejected_claims": [
+    {
+      "filename": "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    },
+    {
+      "filename": "evidence-dilemma-in-ai-governance-creates-structural-impossibility-of-optimal-timing.md",
+      "issues": [
+        "missing_attribution_extractor"
+      ]
+    }
+  ],
+  "validation_stats": {
+    "total": 2,
+    "kept": 0,
+    "fixed": 7,
+    "rejected": 2,
+    "fixes_applied": [
+      "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md:set_created:2026-03-26",
+      "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
+      "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-",
+      "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front",
+      "evidence-dilemma-in-ai-governance-creates-structural-impossibility-of-optimal-timing.md:set_created:2026-03-26",
+      "evidence-dilemma-in-ai-governance-creates-structural-impossibility-of-optimal-timing.md:stripped_wiki_link:technology-advances-exponentially-but-coordination-mechanism",
+      "evidence-dilemma-in-ai-governance-creates-structural-impossibility-of-optimal-timing.md:stripped_wiki_link:AI-development-is-a-critical-juncture-in-institutional-histo"
+    ],
+    "rejections": [
+      "ai-governance-infrastructure-doubled-2025-but-remains-voluntary-self-reported-unstandardized.md:missing_attribution_extractor",
+      "evidence-dilemma-in-ai-governance-creates-structural-impossibility-of-optimal-timing.md:missing_attribution_extractor"
+    ]
+  },
+  "model": "anthropic/claude-sonnet-4.5",
+  "date": "2026-03-26"
+}
\ No newline at end of file
diff --git a/inbox/queue/2026-03-26-international-ai-safety-report-2026.md b/inbox/queue/2026-03-26-international-ai-safety-report-2026.md
index 1b62aa6bd..3f806ff52 100644
--- a/inbox/queue/2026-03-26-international-ai-safety-report-2026.md
+++ b/inbox/queue/2026-03-26-international-ai-safety-report-2026.md
@@ -7,9 +7,13 @@ date: 2026-01-01
 domain: ai-alignment
 secondary_domains: []
 format: report
-status: unprocessed
+status: enrichment
 priority: medium
 tags: [governance-landscape, if-then-commitments, voluntary-governance, evaluation-gap, governance-fragmentation, international-governance, B1-evidence]
+processed_by: theseus
+processed_date: 2026-03-26
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
@@ -56,3 +60,12 @@ The if-then commitment architecture (Anthropic RSP, Google DeepMind Frontier Saf
 PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]]
 WHY ARCHIVED: Independent multi-stakeholder confirmation of the governance fragmentation thesis — adds authoritative weight to KB claims about governance adequacy, and introduces the "evidence dilemma" framing as a useful named concept
 EXTRACTION HINT: The "evidence dilemma" framing may be worth its own claim — the structural problem of governing AI when acting early risks bad policy and acting late risks harm has no good resolution, and this may be worth naming explicitly in the KB
+
+
+## Key Facts
+- Companies with published Frontier AI Safety Frameworks more than doubled in 2025
+- Anthropic RSP is characterized as the most developed public instantiation of if-then commitment frameworks as of early 2026
+- Capability inputs are growing approximately 5x annually as of 2026
+- No multi-stakeholder binding framework with specificity comparable to RSP exists as of early 2026
+- METR and UK AISI are named as evaluation infrastructure organizations
+- The International AI Safety Report is the successor to the Bletchley AI Safety Summit process
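Reviewer note: the `fixes_applied` and `rejections` entries in the debug JSON above imply a small normalization pass: strip `[[...]]` wiki links down to their plain targets (logging each fix with the target truncated to a fixed width, which would explain log entries ending in a dangling hyphen like `AI-transparency-is-declining-not-improving-because-Stanford-`), and reject any claim whose attribution block names no extractor. A minimal Python sketch of that pass follows; the function names (`strip_wiki_links`, `validate_claim`), the 60-character log width, and the attribution check are assumptions for illustration, not the actual theseus fixer, which is not part of this diff.

```python
import re
from dataclasses import dataclass, field

# [[target]] wiki-link syntax; group 1 captures the link target.
WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")
LOG_WIDTH = 60  # assumed truncation width for fix-log entries

@dataclass
class ValidationResult:
    fixes_applied: list[str] = field(default_factory=list)
    rejections: list[str] = field(default_factory=list)

def strip_wiki_links(filename: str, text: str, result: ValidationResult) -> str:
    """Replace every [[target]] with plain 'target', logging one fix per link."""
    def replace(match: re.Match) -> str:
        target = match.group(1)
        # The debug log appears to truncate long targets; keep full text in the file.
        result.fixes_applied.append(
            f"{filename}:stripped_wiki_link:{target[:LOG_WIDTH]}"
        )
        return target
    return WIKI_LINK.sub(replace, text)

def validate_claim(filename: str, text: str, result: ValidationResult) -> str | None:
    """Apply fixes and return cleaned text, or None when the claim is rejected."""
    # Stand-in for whatever real check produces 'missing_attribution_extractor'.
    if "extraction_model:" not in text:
        result.rejections.append(f"{filename}:missing_attribution_extractor")
        return None
    return strip_wiki_links(filename, text, result)
```

Under these assumptions, running the pass over a claim file would turn a line like `*Source: [[2026-03-26-anthropic-activating-asl3-protections]] | Added: 2026-03-26*` into the bracket-free form the first two hunks of this patch record.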