teleo-codex/domains/ai-alignment/external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md
theseus: extract claims from 2026-01-17-charnock-external-access-dangerous-capability-evals
- Source: inbox/queue/2026-01-17-charnock-external-access-dangerous-capability-evals.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-04 13:41:45 +00:00


---
type: claim
domain: ai-alignment
description: Current evaluation arrangements limit external evaluators to API-only interaction (AL1 access), which prevents the deep probing necessary to uncover latent dangerous capabilities
confidence: experimental
source: "Charnock et al. 2026, arXiv:2601.11916"
created: 2026-04-04
title: External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
agent: theseus
scope: causal
sourcer: Charnock et al.
related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
---
# External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access, including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which creates a systematic bias toward false negatives: evaluations miss dangerous capabilities because evaluators cannot probe model internals, examine reasoning chains, or test edge cases that require architectural knowledge.

This is distinct from the general claim that evaluations are unreliable; it specifically identifies the access restriction as the mechanism producing false negatives. The paper frames this as a critical gap in operationalizing the EU GPAI Code of Practice's requirement for 'appropriate access' in dangerous capability evaluations, and provides the first technical specification of what appropriate access should mean at different capability levels.
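The causal structure of the claim can be sketched as a small model. The AL1–AL3 labels come from the source, but the technique names and the access level each one requires are illustrative assumptions, not the paper's specification:

```python
from enum import IntEnum

class AccessLevel(IntEnum):
    AL1 = 1  # black-box: API-only interaction
    AL2 = 2  # grey-box: moderate access (e.g. fine-tuning, logits)
    AL3 = 3  # white-box: full access, including weights and architecture

# Hypothetical mapping (for illustration only): the minimum access
# level each probing technique plausibly requires.
MIN_ACCESS = {
    "prompt_elicitation":            AccessLevel.AL1,
    "fine_tuning_elicitation":       AccessLevel.AL2,
    "reasoning_chain_inspection":    AccessLevel.AL2,
    "activation_probing":            AccessLevel.AL3,
    "architecture_aware_edge_cases": AccessLevel.AL3,
}

def feasible_techniques(granted: AccessLevel) -> set[str]:
    """Techniques an evaluator can run at a given access level."""
    return {t for t, req in MIN_ACCESS.items() if granted >= req}

def blind_spots(granted: AccessLevel) -> set[str]:
    """Techniques ruled out by the access restriction; each is a
    potential source of false negatives, since a capability only
    detectable by an excluded technique goes unobserved."""
    return set(MIN_ACCESS) - feasible_techniques(granted)
```

Under these assumptions, an AL1 evaluator is limited to prompt-based elicitation, so any dangerous capability that surfaces only under deeper probing registers as a (false) negative; only AL3 leaves no blind spots.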