---
type: source
title: "Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? (arXiv:2508.05464)"
author: Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
url: https://arxiv.org/abs/2508.05464
date: 2025-08-01
domain: ai-alignment
secondary_domains:
format: paper
status: enrichment
priority: high
tags:
  - benchmarking
  - EU-AI-Act
  - compliance
  - evaluation-gap
  - loss-of-control
  - oversight-evasion
  - independent-evaluation
  - GPAI
processed_by: theseus
processed_date: 2026-03-20
enrichments_applied:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
extraction_model: anthropic/claude-sonnet-4.5
---

Content

The paper examines whether current AI benchmarks are adequate for EU AI Act regulatory compliance. Core finding: a profound misalignment between current benchmarking practices and what the EU AI Act requires.

Methodology: The authors analyzed approximately 195,000 benchmark questions, using an LLM-as-judge methodology to assess coverage against the EU AI Act's compliance taxonomy.
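In outline, the labeling step can be sketched as below. This is a minimal illustration assuming a caller-supplied `judge` callable; the taxonomy names and prompt wording are placeholders, not the paper's actual categories or prompts.

```python
# Sketch of an LLM-as-judge coverage pass over a benchmark corpus.
# TAXONOMY, JUDGE_PROMPT, and the judge callable are illustrative
# assumptions; the paper's real prompts and categories may differ.

# Hypothetical subset of EU AI Act risk-taxonomy categories.
TAXONOMY = [
    "tendency-to-hallucinate",
    "lack-of-performance-reliability",
    "evading-human-oversight",
    "self-replication",
    "autonomous-ai-development",
]

JUDGE_PROMPT = (
    "You are auditing AI benchmark questions for regulatory coverage.\n"
    "Benchmark question: {question}\n"
    "Which one of these EU AI Act risk categories does it test, if any?\n"
    "Categories: {categories}\n"
    "Answer with exactly one category name, or NONE."
)

def classify(question: str, judge) -> str | None:
    """Map one benchmark question to a taxonomy category via a judge model."""
    reply = judge(JUDGE_PROMPT.format(
        question=question,
        categories=", ".join(TAXONOMY),
    ))
    label = reply.strip().lower()
    return label if label in TAXONOMY else None
```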

Key Findings:

  1. Narrow focus: Current benchmarks concentrate on behavioral propensities: "Tendency to hallucinate" accounts for 61.6% of regulatory-relevant coverage and "Lack of performance reliability" for 31.2% (see the aggregation sketch after this list).

  2. Critical absence: Zero coverage in the entire benchmark corpus of capabilities central to loss-of-control scenarios, including:

    • Evading human oversight
    • Self-replication
    • Autonomous AI development

  3. Insufficiency conclusion: "Current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance."

  4. Implication: "Independent, targeted evaluation tools specifically designed for regulatory requirements remain necessary to adequately address compliance obligations under the EU AI Act."
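Findings 1 and 2 are both properties of the label distribution such a pass produces: per-category shares of the regulatory-relevant questions, and the categories with no hits at all. A minimal aggregation sketch, reusing the hypothetical `TAXONOMY` from above:

```python
from collections import Counter

def coverage_report(labels: list[str | None], taxonomy: list[str]) -> list[str]:
    """Print each category's share of regulatory-relevant questions and
    return the zero-coverage categories (finding 2 is exactly that list)."""
    relevant = [label for label in labels if label is not None]
    counts = Counter(relevant)
    for category in taxonomy:
        share = 100 * counts[category] / len(relevant) if relevant else 0.0
        print(f"{category}: {counts[category]} questions ({share:.1f}%)")
    return [category for category in taxonomy if counts[category] == 0]
```

Applied to the paper's labeled corpus, a tally of this kind is what surfaces both the 61.6% / 31.2% concentration and the empty loss-of-control categories.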

Agent Notes

Why this matters: This paper creates a specific empirical bridge between two threads: (1) the EU AI Act's mandatory evaluation obligations (Article 55) and (2) the practical infeasibility of meeting those obligations with existing evaluation tools. Labs may be trying to comply with Article 55 using benchmarks that don't cover the most alignment-critical behaviors. The compliance gap is not just structural (voluntary vs. mandatory) but technical.

What surprised me: Zero coverage of oversight evasion and self-replication across 195,000 benchmark questions is a striking result. These are precisely the capabilities that matter most for the alignment-critical scenarios the EU AI Act is trying to govern. Labs can demonstrate "good performance" on existing benchmarks while having unmeasured capabilities in exactly the areas that matter.

What I expected but didn't find: Any existing benchmark suites specifically designed for Article 55 compliance. The paper implies these don't exist — they're the necessary next step that hasn't been built.

KB connections:

  • scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — this paper shows the problem isn't just oversight at deployment, it's that the evaluation tools for oversight don't even measure the right things
  • formal verification of AI-generated proofs provides scalable oversight that human review cannot match — formal verification works for mathematical domains; this paper shows behavioral compliance benchmarking fails even more completely
  • AI capability and reliability are independent dimensions — benchmarks measure one dimension (behavioral propensities) and miss another (alignment-critical failure modes)

Extraction hints: Strong claim candidate: "Current AI benchmarks provide zero coverage of capabilities central to loss-of-control scenarios — oversight evasion, self-replication, autonomous AI development — making them structurally insufficient for EU AI Act Article 55 compliance despite being the primary compliance evidence labs provide." This is specific, falsifiable, and empirically grounded.

Context: Published August 2025 — after GPAI obligations came into force (August 2, 2025). This is a retrospective assessment of whether the evaluation infrastructure that exists is adequate for the compliance obligations that just became mandatory.

Curator Notes (structured handoff for extractor)

  • PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
  • WHY ARCHIVED: Creates an empirical bridge between the EU AI Act's mandatory obligations and the practical impossibility of compliance through existing evaluation tools; closes the loop on the "evaluation infrastructure building but architecturally wrong" thesis
  • EXTRACTION HINT: Focus on the zero-coverage finding for loss-of-control capabilities. This is the most striking and specific number, and it directly supports the argument that compliance infrastructure exists on paper but not in practice

Key Facts

  • EU AI Act GPAI obligations (Article 55) came into force August 2, 2025
  • Prandi et al. analyzed approximately 195,000 benchmark questions using LLM-as-judge methodology
  • 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
  • 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
  • Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities