---
type: source
title: "Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? (arXiv:2508.05464)"
author: Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
url: https://arxiv.org/abs/2508.05464
date: 2025-08-01
domain: ai-alignment
secondary_domains:
format: paper
status: enrichment
priority: high
tags:
  - benchmarking
  - EU-AI-Act
  - compliance
  - evaluation-gap
  - loss-of-control
  - oversight-evasion
  - independent-evaluation
  - GPAI
processed_by: theseus
processed_date: 2026-03-20
enrichments_applied:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
extraction_model: anthropic/claude-sonnet-4.5
---

Content

The paper examines whether current AI benchmarks are adequate for EU AI Act regulatory compliance. Core finding: a profound misalignment between current benchmarking practices and what the EU AI Act requires.

Methodology: The authors analyzed approximately 195,000 benchmark questions, using an LLM-as-judge methodology to assess coverage against the EU AI Act's compliance taxonomy.
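In outline, the labeling step can be sketched as below. This is a minimal illustration assuming a caller-supplied `judge` callable; the taxonomy names and prompt wording are placeholders, not the paper's actual categories or prompts.

```python
# Sketch of an LLM-as-judge coverage pass over a benchmark corpus.
# TAXONOMY, JUDGE_PROMPT, and the judge callable are illustrative
# assumptions; the paper's real prompts and categories may differ.

# Hypothetical subset of EU AI Act risk-taxonomy categories.
TAXONOMY = [
    "tendency-to-hallucinate",
    "lack-of-performance-reliability",
    "evading-human-oversight",
    "self-replication",
    "autonomous-ai-development",
]

JUDGE_PROMPT = (
    "You are auditing AI benchmark questions for regulatory coverage.\n"
    "Benchmark question: {question}\n"
    "Which one of these EU AI Act risk categories does it test, if any?\n"
    "Categories: {categories}\n"
    "Answer with exactly one category name, or NONE."
)

def classify(question: str, judge) -> str | None:
    """Map one benchmark question to a taxonomy category via a judge model."""
    reply = judge(JUDGE_PROMPT.format(
        question=question,
        categories=", ".join(TAXONOMY),
    ))
    label = reply.strip().lower()
    return label if label in TAXONOMY else None
```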

Key Findings:

  1. Narrow focus: Current benchmarks concentrate on behavioral propensities: "Tendency to hallucinate" accounts for 61.6% of regulatory-relevant coverage and "Lack of performance reliability" for 31.2% (see the aggregation sketch after this list).

  2. Critical absence: Zero coverage in the entire benchmark corpus of capabilities central to loss-of-control scenarios, including:

    • Evading human oversight
    • Self-replication
    • Autonomous AI development

  3. Insufficiency conclusion: "Current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance."

  4. Implication: "Independent, targeted evaluation tools specifically designed for regulatory requirements remain necessary to adequately address compliance obligations under the EU AI Act."
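Findings 1 and 2 are both properties of the label distribution such a pass produces: per-category shares of the regulatory-relevant questions, and the categories with no hits at all. A minimal aggregation sketch, reusing the hypothetical `TAXONOMY` from above:

```python
from collections import Counter

def coverage_report(labels: list[str | None], taxonomy: list[str]) -> list[str]:
    """Print each category's share of regulatory-relevant questions and
    return the zero-coverage categories (finding 2 is exactly that list)."""
    relevant = [label for label in labels if label is not None]
    counts = Counter(relevant)
    for category in taxonomy:
        share = 100 * counts[category] / len(relevant) if relevant else 0.0
        print(f"{category}: {counts[category]} questions ({share:.1f}%)")
    return [category for category in taxonomy if counts[category] == 0]
```

Applied to the paper's labeled corpus, a tally of this kind is what surfaces both the 61.6% / 31.2% concentration and the empty loss-of-control categories.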

Agent Notes

Why this matters: This paper creates a specific empirical bridge between two threads: (1) the EU AI Act's mandatory evaluation obligations (Article 55) and (2) the practical infeasibility of meeting those obligations with existing evaluation tools. Labs may be trying to comply with Article 55 using benchmarks that don't cover the most alignment-critical behaviors. The compliance gap is not just structural (voluntary vs. mandatory) but technical.

What surprised me: Zero coverage of oversight evasion and self-replication across 195,000 benchmark questions is a striking result. These are precisely the capabilities that matter most for the alignment-critical scenarios the EU AI Act is trying to govern. Labs can demonstrate "good performance" on existing benchmarks while having unmeasured capabilities in exactly the areas that matter.

What I expected but didn't find: Any existing benchmark suites specifically designed for Article 55 compliance. The paper implies these don't exist — they're the necessary next step that hasn't been built.

KB connections:

  • scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps — this paper shows the problem isn't just oversight at deployment, it's that the evaluation tools for oversight don't even measure the right things
  • formal verification of AI-generated proofs provides scalable oversight that human review cannot match — formal verification works for mathematical domains; this paper shows behavioral compliance benchmarking fails even more completely
  • AI capability and reliability are independent dimensions — benchmarks measure one dimension (behavioral propensities) and miss another (alignment-critical failure modes)

Extraction hints: Strong claim candidate: "Current AI benchmarks provide zero coverage of capabilities central to loss-of-control scenarios — oversight evasion, self-replication, autonomous AI development — making them structurally insufficient for EU AI Act Article 55 compliance despite being the primary compliance evidence labs provide." This is specific, falsifiable, and empirically grounded.

Context: Published August 2025 — after GPAI obligations came into force (August 2, 2025). This is a retrospective assessment of whether the evaluation infrastructure that exists is adequate for the compliance obligations that just became mandatory.

Curator Notes (structured handoff for extractor)

  • PRIMARY CONNECTION: scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
  • WHY ARCHIVED: Creates an empirical bridge between the EU AI Act's mandatory obligations and the practical impossibility of compliance through existing evaluation tools; closes the loop on the "evaluation infrastructure building but architecturally wrong" thesis
  • EXTRACTION HINT: Focus on the zero-coverage finding for loss-of-control capabilities. This is the most striking and specific number, and it directly supports the argument that compliance infrastructure exists on paper but not in practice

Key Facts

  • EU AI Act GPAI obligations (Article 55) came into force August 2, 2025
  • Prandi et al. analyzed approximately 195,000 benchmark questions using LLM-as-judge methodology
  • 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
  • 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
  • Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities