extract: 2026-03-20-bench2cop-benchmarks-insufficient-compliance
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
This commit is contained in:
parent
773deac47b
commit
ad990c43e4
4 changed files with 31 additions and 4 deletions
|
|
@ -53,6 +53,12 @@ Stanford FMTI 2024→2025 data: mean transparency score declined 17 points. Meta
|
||||||
|
|
||||||
The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the benchmark infrastructure itself is architecturally incapable of measuring loss-of-control risks. This compounds the transparency decline: labs are not just hiding information, they're using evaluation tools that cannot detect the most critical failure modes even if applied honestly.
|
The Bench-2-CoP analysis reveals that even when labs do conduct evaluations, the benchmark infrastructure itself is architecturally incapable of measuring loss-of-control risks. This compounds the transparency decline: labs are not just hiding information, they're using evaluation tools that cannot detect the most critical failure modes even if applied honestly.
|
||||||
|
|
||||||
|
|
||||||
|
### Additional Evidence (extend)
|
||||||
|
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
|
||||||
|
|
||||||
|
The Bench-2-CoP paper reveals that even when labs do provide benchmark results for transparency, those benchmarks structurally cannot measure alignment-critical capabilities. The 195,000-question corpus analyzed had zero coverage of oversight evasion, self-replication, or autonomous AI development. This means transparency decline isn't just about labs sharing less — it's that what they're sharing (benchmark performance) doesn't measure what matters for safety.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
Relevant Notes:
|
Relevant Notes:
|
||||||
|
|
|
||||||
|
|
@ -62,6 +62,12 @@ Agents of Chaos demonstrates that static single-agent benchmarks fail to capture
|
||||||
|
|
||||||
Prandi et al. (2025) found that 195,000 benchmark questions provided zero coverage of oversight evasion, self-replication, and autonomous AI development capabilities. This extends the evaluation unreliability thesis by showing the gap is not just predictive validity but complete absence of measurement for alignment-critical capabilities.
|
Prandi et al. (2025) found that 195,000 benchmark questions provided zero coverage of oversight evasion, self-replication, and autonomous AI development capabilities. This extends the evaluation unreliability thesis by showing the gap is not just predictive validity but complete absence of measurement for alignment-critical capabilities.
|
||||||
|
|
||||||
|
|
||||||
|
### Additional Evidence (extend)
|
||||||
|
*Source: [[2026-03-20-bench2cop-benchmarks-insufficient-compliance]] | Added: 2026-03-20*
|
||||||
|
|
||||||
|
Prandi et al. provide the specific mechanism for why pre-deployment evaluations fail: current benchmark suites concentrate 92.8% of regulatory-relevant coverage on behavioral propensities (hallucination and reliability) while providing zero coverage of the three capability classes (oversight evasion, self-replication, autonomous AI development) that matter most for loss-of-control scenarios. This isn't just that evaluations don't predict real-world risk — it's that the evaluation tools measure orthogonal dimensions to the risks regulators care about.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
Relevant Notes:
|
Relevant Notes:
|
||||||
|
|
|
||||||
|
|
@ -1,7 +1,7 @@
|
||||||
{
|
{
|
||||||
"rejected_claims": [
|
"rejected_claims": [
|
||||||
{
|
{
|
||||||
"filename": "ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md",
|
"filename": "current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md",
|
||||||
"issues": [
|
"issues": [
|
||||||
"missing_attribution_extractor"
|
"missing_attribution_extractor"
|
||||||
]
|
]
|
||||||
|
|
@ -10,13 +10,15 @@
|
||||||
"validation_stats": {
|
"validation_stats": {
|
||||||
"total": 1,
|
"total": 1,
|
||||||
"kept": 0,
|
"kept": 0,
|
||||||
"fixed": 1,
|
"fixed": 3,
|
||||||
"rejected": 1,
|
"rejected": 1,
|
||||||
"fixes_applied": [
|
"fixes_applied": [
|
||||||
"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:set_created:2026-03-20"
|
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:set_created:2026-03-20",
|
||||||
|
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:stripped_wiki_link:pre-deployment-AI-evaluations-do-not-predict-real-world-risk",
|
||||||
|
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:stripped_wiki_link:AI transparency is declining not improving because Stanford "
|
||||||
],
|
],
|
||||||
"rejections": [
|
"rejections": [
|
||||||
"ai-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-regulatory-compliance.md:missing_attribution_extractor"
|
"current-AI-benchmarks-provide-zero-coverage-of-loss-of-control-capabilities-making-them-structurally-insufficient-for-EU-AI-Act-compliance.md:missing_attribution_extractor"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"model": "anthropic/claude-sonnet-4.5",
|
"model": "anthropic/claude-sonnet-4.5",
|
||||||
|
|
|
||||||
|
|
@ -14,6 +14,10 @@ processed_by: theseus
|
||||||
processed_date: 2026-03-20
|
processed_date: 2026-03-20
|
||||||
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
|
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
|
||||||
extraction_model: "anthropic/claude-sonnet-4.5"
|
extraction_model: "anthropic/claude-sonnet-4.5"
|
||||||
|
processed_by: theseus
|
||||||
|
processed_date: 2026-03-20
|
||||||
|
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
|
||||||
|
extraction_model: "anthropic/claude-sonnet-4.5"
|
||||||
---
|
---
|
||||||
|
|
||||||
## Content
|
## Content
|
||||||
|
|
@ -64,3 +68,12 @@ EXTRACTION HINT: Focus on the zero-coverage finding for loss-of-control capabili
|
||||||
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
|
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
|
||||||
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
|
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
|
||||||
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
|
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
|
||||||
|
|
||||||
|
|
||||||
|
## Key Facts
|
||||||
|
- EU AI Act GPAI obligations (Article 55) came into force August 2, 2025
|
||||||
|
- Prandi et al. analyzed approximately 195,000 benchmark questions using LLM-as-judge methodology
|
||||||
|
- 61.6% of regulatory-relevant benchmark coverage addresses 'tendency to hallucinate'
|
||||||
|
- 31.2% of regulatory-relevant benchmark coverage addresses 'lack of performance reliability'
|
||||||
|
- Zero benchmark questions in the analyzed corpus covered oversight evasion, self-replication, or autonomous AI development capabilities
|
||||||
|
- Paper published August 2025 as retrospective assessment of evaluation infrastructure adequacy
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue