---
type: claim
domain: ai-alignment
secondary_domains: grand-strategy
description: Pre-deployment safety evaluations cannot reliably predict real-world deployment risk, creating a structural governance failure where regulatory frameworks are built on unreliable measurement foundations
confidence: likely
source: International AI Safety Report 2026 (multi-government committee, February 2026)
created: 2026-03-11
last_evaluated: 2026-03-11
depends_on: voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints
---

Pre-deployment AI evaluations do not predict real-world risk, creating institutional governance built on unreliable foundations

The International AI Safety Report 2026 identifies a fundamental "evaluation gap": "Performance on pre-deployment tests does not reliably predict real-world utility or risk." This is not a measurement problem that better benchmarks will solve. It is a structural mismatch between controlled testing environments and the complexity of real-world deployment contexts.

Models behave differently under evaluation than in production. Safety frameworks, regulatory compliance assessments, and risk evaluations are all built on testing infrastructure that cannot deliver what it promises: predictive validity for deployment safety.

The Governance Trap

Regulatory regimes beginning to formalize risk management requirements are building legal frameworks on top of evaluation methods that the leading international safety assessment confirms are unreliable. Companies publishing Frontier AI Safety Frameworks are making commitments based on pre-deployment testing that cannot predict actual deployment risk.

This creates a false sense of institutional control. Regulators and companies can point to safety evaluations as evidence of governance, while the evaluation gap ensures those evaluations cannot predict actual safety in production.

The problem compounds the alignment challenge: even if safety research produces genuine insights about how to build safer systems, those insights cannot be reliably translated into deployment safety through current evaluation methods. The gap between research and practice is not just about adoption lag—it is about fundamental measurement failure.

Evidence

  • International AI Safety Report 2026 (multi-government, multi-institution committee) explicitly states: "Performance on pre-deployment tests does not reliably predict real-world utility or risk"
  • 12 companies published Frontier AI Safety Frameworks in 2025, all relying on pre-deployment evaluation methods now confirmed unreliable by institutional assessment
  • Technical safeguards show "significant limitations" with attacks still possible through rephrasing or decomposition despite passing safety evaluations
  • Risk management remains "largely voluntary" while regulatory regimes begin formalizing requirements based on these unreliable evaluation methods
  • The report identifies this as a structural governance problem, not a technical limitation that engineering can solve

Additional Evidence (extend)

Source: 2026-03-00-metr-aisi-pre-deployment-evaluation-practice | Added: 2026-03-19

The voluntary-collaborative model adds a selection bias dimension to evaluation unreliability: evaluations only happen when labs consent, meaning the sample of evaluated models is systematically biased toward labs confident in their safety measures. Labs with weaker safety practices can avoid evaluation entirely.

Additional Evidence (confirm)

Source: 2026-02-23-shapira-agents-of-chaos | Added: 2026-03-19

The Agents of Chaos study provides concrete empirical evidence: 11 documented case studies of security vulnerabilities (unauthorized compliance, identity spoofing, cross-agent propagation, destructive actions) that emerged only in realistic multi-agent deployment with persistent memory and system access, none of which would be detected by static single-agent benchmarks. The study explicitly argues that current evaluation paradigms are insufficient for realistic deployment conditions.

Additional Evidence (extend)

Source: 2026-03-00-metr-aisi-pre-deployment-evaluation-practice | Added: 2026-03-19

METR and UK AISI evaluations as of March 2026 focus primarily on sabotage risk and cyber capabilities (METR's Claude Opus 4.6 sabotage assessment, AISI's cyber range testing of 7 LLMs). This narrow scope may miss alignment-relevant risks that don't manifest as sabotage or cyber threats. The evaluation infrastructure is optimizing for measurable near-term risks rather than harder-to-operationalize catastrophic scenarios.




Topics:

  • domains/ai-alignment/_map
  • core/grand-strategy/_map