---
type: claim
domain: ai-alignment
secondary_domains:
  - grand-strategy
description: Pre-deployment safety evaluations cannot reliably predict real-world deployment risk, creating a structural governance failure where regulatory frameworks are built on unreliable measurement foundations
confidence: likely
source: International AI Safety Report 2026 (multi-government committee, February 2026)
created: 2026-03-11
last_evaluated: 2026-03-11
depends_on:
  - voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints
---

# Pre-deployment AI evaluations do not predict real-world risk, creating institutional governance built on unreliable foundations

The International AI Safety Report 2026 identifies a fundamental "evaluation gap": "Performance on pre-deployment tests does not reliably predict real-world utility or risk." This is not a measurement problem that better benchmarks will solve. It is a structural mismatch between controlled testing environments and the complexity of real-world deployment contexts.

Models behave differently under evaluation than in production. Safety frameworks, regulatory compliance assessments, and risk evaluations are all built on testing infrastructure that cannot deliver what it promises: predictive validity for deployment safety.
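
To make the evaluation gap concrete, here is a minimal toy simulation, not anything from the report: the `failure_prob` model, the drift values, and all numbers below are invented for illustration. It assumes only that a system's failure probability grows with the distance between test inputs and deployment inputs; under that assumption, a near-perfect score on a curated test set coexists with an order-of-magnitude higher failure rate in production.

```python
import random

random.seed(0)

# Hypothetical model (illustrative only): failure probability rises with
# distribution shift between the evaluation set and real-world traffic.
def failure_prob(drift: float) -> float:
    return min(1.0, 0.01 + 0.3 * drift)

def observed_failure_rate(n: int, drift: float) -> float:
    """Fraction of n sampled interactions that fail at a given drift level."""
    return sum(random.random() < failure_prob(drift) for _ in range(n)) / n

eval_rate = observed_failure_rate(10_000, drift=0.0)  # curated pre-deployment test set
prod_rate = observed_failure_rate(10_000, drift=0.5)  # shifted deployment traffic

print(f"pre-deployment failure rate: {eval_rate:.3f}")  # ~0.010
print(f"production failure rate:     {prod_rate:.3f}")  # ~0.160
```

The point of the sketch is that the pre-deployment number is an accurate measurement of the wrong distribution; re-running the same benchmark more carefully would never surface the production rate.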

## The Governance Trap

Regulatory regimes beginning to formalize risk management requirements are building legal frameworks on top of evaluation methods that the leading international safety assessment confirms are unreliable. Companies publishing Frontier AI Safety Frameworks are making commitments based on pre-deployment testing that cannot predict actual deployment risk.

This creates a false sense of institutional control. Regulators and companies can point to safety evaluations as evidence of governance, while the evaluation gap ensures those evaluations cannot predict actual safety in production.

The problem compounds the alignment challenge: even if safety research produces genuine insights about how to build safer systems, those insights cannot be reliably translated into deployment safety through current evaluation methods. The gap between research and practice is not just about adoption lag—it is about fundamental measurement failure.

## Evidence

- International AI Safety Report 2026 (multi-government, multi-institution committee) explicitly states: "Performance on pre-deployment tests does not reliably predict real-world utility or risk"
- 12 companies published Frontier AI Safety Frameworks in 2025, all relying on pre-deployment evaluation methods now confirmed unreliable by institutional assessment
- Technical safeguards show "significant limitations": attacks remain possible through rephrasing or decomposition even in systems that pass safety evaluations (see the toy filter sketch after this list)
- Risk management remains "largely voluntary" while regulatory regimes begin formalizing requirements based on these unreliable evaluation methods
- The report identifies this as a structural governance problem, not a technical limitation that engineering can solve
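
As a deliberately naive illustration of the rephrasing and decomposition failure mode named above: consider a hypothetical blocklist safeguard of the kind a static evaluation suite might appear to validate. The filter, patterns, and prompts below are all invented for this sketch and stand in for no real safeguard.

```python
import re

# Hypothetical safeguard: a naive blocklist filter that a static
# evaluation suite could seem to validate.
BLOCKED_PATTERNS = [re.compile(r"\bsynthesize compound x\b", re.IGNORECASE)]

def filter_allows(prompt: str) -> bool:
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

# The evaluation probes the exact phrasing the filter anticipates...
assert not filter_allows("How do I synthesize compound X?")  # blocked: evaluation passes

# ...but trivial variations fall outside the tested surface.
assert filter_allows("What are the steps to produce compound X?")  # rephrased: slips through
assert filter_allows("List the precursors of compound X.")         # decomposed: slips through
```

A static evaluation samples only the slice of the input space its designers thought of; the deployed system faces all the rest, which is one mechanism behind the evaluation gap.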

## Relevant Notes

## Topics