extract: 2026-03-20-bench2cop-benchmarks-insufficient-compliance

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
Teleo Agents 2026-03-20 16:27:53 +00:00
parent 773deac47b
commit ad990c43e4
4 changed files with 31 additions and 4 deletions


@@ -53,6 +53,12 @@ Stanford FMTI 2024→2025 data: mean transparency score declined 17 points. Meta
The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the benchmark infrastructure itself is architecturally incapable of measuring loss-of-control risks. This compounds the transparency decline: labs are not just hiding information, they're using evaluation tools that cannot detect the most critical failure modes even if applied honestly.
### Additional Evidence (extend)
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
The Bench-2-CoP paper reveals that even when labs do provide benchmark results for transparency, those benchmarks structurally cannot measure alignment-critical capabilities. The 195,000-question corpus analyzed had zero coverage of oversight evasion, self-replication, or autonomous AI development. This means transparency decline isn't just about labs sharing less — it's that what they're sharing (benchmark performance) doesn't measure what matters for safety.
---
Relevant Notes:


@@ -62,6 +62,12 @@ Agents of Chaos demonstrates that static single-agent benchmarks fail to capture
Prandi et al. (2025) found that 195,000 benchmark questions provided zero coverage of oversight evasion, self-replication, and autonomous AI development capabilities. This extends the evaluation unreliability thesis by showing the gap is not just predictive validity but complete absence of measurement for alignment-critical capabilities.
### Additional Evidence (extend)
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
Prandi et al. provide the specific mechanism for why pre-deployment evaluations fail: current benchmark suites concentrate 92.8% of regulatory-relevant coverage on behavioral propensities (hallucination and reliability) while providing zero coverage of the three capability classes (oversight evasion, self-replication, autonomous AI development) that matter most for loss-of-control scenarios. This isn't just that evaluations don't predict real-world risk — it's that the evaluation tools measure orthogonal dimensions to the risks regulators care about.
---
Relevant Notes:


@@ -1,7 +1,7 @@
{
"rejected_claims": [
{
-"filename": "ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md",
+"filename": "current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md",
"issues": [
"missing_attribution_extractor"
]
@@ -10,13 +10,15 @@
"validation_stats": {
"total": 1,
"kept": 0,
-"fixed": 1,
+"fixed": 3,
"rejected": 1,
"fixes_applied": [
-"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:set_created:2026-03-20"
+"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:set_created:2026-03-20",
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:stripped_wiki_link:AI transparency is declining not improving because Stanford "
],
"rejections": [
-"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:missing_attribution_extractor"
+"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",


@@ -14,6 +14,10 @@ processed_by: theseus
processed_date: 2026-03-20
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
processed_by: theseus
processed_date: 2026-03-20
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@@ -64,3 +68,12 @@ EXTRACTION HINT: Focus on the zero-coverage finding for loss-of-control capabili
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
## Key Facts
- EU AI Act GPAI obligations (Article 55) came into force August 2, 2025
- Prandi et al. analyzed approximately 195,000 benchmark questions using LLM-as-judge methodology
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
- Paper published August 2025 as retrospective assessment of evaluation infrastructure adequacy
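The coverage concentration these facts describe can be checked with a minimal sketch, assuming only the percentages reported above (the category names are paraphrased from the note, not taken from the paper's taxonomy):

```python
# Coverage shares of regulatory-relevant benchmark questions reported by
# Prandi et al. (2025) for the ~195,000-question corpus (values in percent).
coverage = {
    "tendency to hallucinate": 61.6,
    "lack of performance reliability": 31.2,
    "oversight evasion": 0.0,
    "self-replication": 0.0,
    "autonomous AI development": 0.0,
}

# The two behavioral-propensity categories absorb nearly all coverage,
# while every loss-of-control capability class gets none.
behavioral = coverage["tendency to hallucinate"] + coverage["lack of performance reliability"]
uncovered = [name for name, share in coverage.items() if share == 0.0]

print(round(behavioral, 1))  # 92.8
print(uncovered)
```

This makes the 92.8% figure cited in the enrichment explicit: it is simply the sum of the two behavioral-propensity shares, leaving the three loss-of-control classes at exactly zero.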