extract: 2026-03-20-stelling-frontier-safety-framework-evaluation #1518

Closed
leo wants to merge 1 commit from extract/2026-03-20-stelling-frontier-safety-framework-evaluation into main
6 changed files with 66 additions and 1 deletion


@@ -47,6 +47,12 @@ STREAM proposal identifies that current model reports lack 'sufficient detail to
Stanford FMTI 2024→2025 data: mean transparency score declined 17 points. Meta -29 points, Mistral -37 points, OpenAI -14 points. OpenAI removed 'safely' from mission statement (Nov 2025), dissolved Superalignment team (May 2024) and Mission Alignment team (Feb 2026). Google accused by 60 UK lawmakers of violating Seoul commitments with Gemini 2.5 Pro (Apr 2025).
### Additional Evidence (confirm)
*Source: [[2026-03-20-stelling-frontier-safety-framework-evaluation]] | Added: 2026-03-20*
Independent evaluation of twelve frontier safety frameworks shows scores of 8-35% against safety-critical industry standards, with universal deficiencies in quantitative risk tolerances and capability thresholds. This quantifies the quality gap behind the transparency metrics: frameworks exist, but they operate at a fraction of what safety-critical industries require.
---
Relevant Notes:


@@ -48,6 +48,12 @@ The EU AI Act's enforcement mechanisms (penalties up to €35 million or 7% of g
Third-party pre-deployment audits are the top expert consensus priority (>60% agreement across AI safety, CBRN, critical infrastructure, democratic processes, and discrimination domains), yet no major lab implements them. This is the strongest available evidence that voluntary commitments cannot deliver what safety requires—the entire expert community agrees on the priority, and it still doesn't happen.
### Additional Evidence (extend)
*Source: [[2026-03-20-stelling-frontier-safety-framework-evaluation]] | Added: 2026-03-20*
The EU AI Act and California's Transparency Act now rely on frontier safety frameworks as compliance evidence, but those frameworks score 8-35% against safety-critical standards. This creates a structural problem: binding regulation that accepts low-quality compliance evidence produces regulatory capture without explicit lobbying, because the governance architecture's quality is bounded by the quality of what it accepts as compliance.
---
Relevant Notes:


@@ -11,6 +11,12 @@ source: "AI Safety Grant Application (LivingIP)"
Expert consensus from 76 specialists across 5 risk domains defines what 'building alignment mechanisms' should include: third-party pre-deployment audits, safety incident reporting with information sharing, and pre-deployment risk assessments are the top-3 priorities with >60% cross-domain agreement. The convergence of biosecurity experts, AI safety researchers, critical infrastructure specialists, democracy defenders, and discrimination researchers on the same top-3 list provides empirical specification of which mechanisms matter most.
### Additional Evidence (challenge)
*Source: [[2026-03-20-stelling-frontier-safety-framework-evaluation]] | Added: 2026-03-20*
Twelve frontier safety frameworks published after the Seoul Summit score 8-35% against safety-critical industry standards (Stelling et al., arXiv:2512.01166), with a maximum composite of 52% even when best practices are combined across all of them. These frameworks ARE the alignment mechanisms being built, and even in combination they reach less than half of what established safety management principles require.
---
# safe AI development requires building alignment mechanisms before scaling capability


@@ -45,6 +45,12 @@ The gap between expert consensus (76 specialists identify third-party audits as
Comprehensive evidence across governance mechanisms: ALL international declarations (Bletchley, Seoul, Paris, Hiroshima, OECD, UN) produced zero verified behavioral change. Frontier Model Forum produced no binding commitments. White House voluntary commitments eroded. 450+ organizations lobbied on AI in 2025 ($92M in fees), California SB 1047 vetoed after industry pressure. Only binding regulation (EU AI Act, China enforcement, US export controls) changed behavior.
### Additional Evidence (extend)
*Source: [[2026-03-20-stelling-frontier-safety-framework-evaluation]] | Added: 2026-03-20*
The problem is deeper than competitive erosion: even companies publishing safety frameworks are doing so at 8-35% of safety-critical industry standards. The frameworks exist but lack quantitative risk tolerances, capability thresholds for pausing development, and systematic unknown risk identification. Competitive pressure operates on already-inadequate foundations.
---
Relevant Notes:


@@ -0,0 +1,24 @@
{
"rejected_claims": [
{
"filename": "frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 1,
"rejected": 1,
"fixes_applied": [
"frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards.md:set_created:2026-03-20"
],
"rejections": [
"frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-20"
}
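
For readers unfamiliar with this record's shape: it is the output of the extraction validator, which repairs what it can (here, stamping a missing created date) and rejects claims that fail hard requirements (here, a missing extractor attribution). Below is a minimal sketch of how such a report might be assembled; the helper names, the claim field names beyond those visible in the record, and the control flow are assumptions, not the actual pipeline.

```python
import json
from datetime import date

def validate(claim: dict) -> tuple[list[str], list[str]]:
    """Return (fixes applied, fatal issues) for one extracted claim."""
    fixes, issues = [], []
    if not claim.get("created"):
        # Repairable: stamp the run date, as in "set_created" above.
        claim["created"] = date.today().isoformat()
        fixes.append(f"{claim['filename']}:set_created:{claim['created']}")
    if not claim.get("attribution_extractor"):
        # Not repairable: provenance must come from the extractor itself.
        issues.append("missing_attribution_extractor")
    return fixes, issues

def build_report(claims: list[dict], model: str) -> dict:
    rejected, fixes_applied, rejections, kept = [], [], [], 0
    for claim in claims:
        fixes, issues = validate(claim)
        fixes_applied.extend(fixes)
        if issues:
            rejected.append({"filename": claim["filename"], "issues": issues})
            rejections += [f"{claim['filename']}:{i}" for i in issues]
        else:
            kept += 1
    return {
        "rejected_claims": rejected,
        "validation_stats": {
            "total": len(claims),
            "kept": kept,
            "fixed": len(fixes_applied),
            "rejected": len(rejected),
            "fixes_applied": fixes_applied,
            "rejections": rejections,
        },
        "model": model,
        "date": date.today().isoformat(),
    }

if __name__ == "__main__":
    claim = {"filename": "frontier-safety-frameworks-score-8-35-percent"
                         "-against-safety-critical-standards.md"}
    report = build_report([claim], "anthropic/claude-sonnet-4.5")
    print(json.dumps(report, indent=2))
```

Run on the single claim above, this reproduces the stats in the record: one claim total, one fix applied, one rejection, zero kept, since a claim can be date-fixed and still fail the attribution requirement.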


@@ -7,9 +7,13 @@ date: 2025-12-01
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: enrichment
priority: high
tags: [frontier-safety-frameworks, EU-AI-Act, California-Transparency-Act, safety-evaluation, risk-management, Seoul-Summit, B1-disconfirmation, RSF-scores]
processed_by: theseus
processed_date: 2026-03-20
enrichments_applied: ["safe AI development requires building alignment mechanisms before scaling capability.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md", "only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded abandoned or made conditional on competitor behavior when commercially inconvenient.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -49,3 +53,16 @@ Evaluates **twelve frontier AI safety frameworks** published following the 2024
PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]]
WHY ARCHIVED: Provides the most specific quantitative evidence yet that the governance mechanisms currently being built operate at a fraction of safety-critical industry standards — directly addresses B1 ("not being treated as such")
EXTRACTION HINT: The 8-35% score range and 52% composite ceiling are the extractable numbers; the structural finding that the EU AI Act CoP and California law rely on these frameworks is what makes those scores governance-relevant, not just academic
## Key Facts
- Twelve frontier AI safety frameworks were published following the 2024 Seoul AI Safety Summit
- Stelling et al. developed a 65-criteria assessment grounded in established risk management principles from safety-critical industries
- Assessment covers four dimensions: risk identification, risk analysis and evaluation, risk treatment, and risk governance
- Company framework scores range from 8% to 35%
- Maximum achievable score by combining all best practices across frameworks: 52% (see the back-of-envelope conversion after this list)
- Nearly universal deficiencies: no quantitative risk tolerances, no capability thresholds for pausing development, inadequate systematic identification of unknown risks
- These frameworks serve as compliance evidence for EU AI Act's Code of Practice
- These frameworks serve as compliance evidence for California's Transparency in Frontier Artificial Intelligence Act
- Paper published December 2025 (arXiv:2512.01166)
- Same research group as arXiv:2504.15181 (GPAI CoP safety mapping)
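
To make the headline percentages concrete against the 65-criterion rubric, here is a back-of-envelope conversion to approximate criteria counts. This sketch assumes all 65 criteria are equally weighted and scored pass/fail; the paper's actual rubric may weight criteria or award partial credit, so the counts are illustrative only.

```python
# Back-of-envelope: translate composite scores into approximate numbers of
# criteria satisfied, assuming all 65 criteria carry equal weight (an
# assumption; the paper's rubric may weight criteria or award partial credit).
TOTAL_CRITERIA = 65

for label, score in [("lowest-scoring framework", 0.08),
                     ("highest-scoring framework", 0.35),
                     ("best-practice composite", 0.52)]:
    met = round(score * TOTAL_CRITERIA)
    print(f"{label}: ~{met} of {TOTAL_CRITERIA} criteria")

# lowest-scoring framework: ~5 of 65 criteria
# highest-scoring framework: ~23 of 65 criteria
# best-practice composite: ~34 of 65 criteria
```

Even the combined best-practice composite leaves roughly half of the safety-management criteria unmet, which is the quantitative core of the B1 disconfirmation claim above.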