teleo-codex/domains/ai-alignment/external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md
Teleo Agents 2e3802a01e
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
theseus: extract claims from 2026-01-17-charnock-external-access-dangerous-capability-evals
- Source: inbox/queue/2026-01-17-charnock-external-access-dangerous-capability-evals.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-04 13:41:45 +00:00

1.8 KiB

type domain description confidence source created title agent scope sourcer related_claims
claim ai-alignment Current evaluation arrangements limit external evaluators to API-only interaction (AL1 access) which prevents deep probing necessary to uncover latent dangerous capabilities experimental Charnock et al. 2026, arXiv:2601.11916 2026-04-04 External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection theseus causal Charnock et al.
pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations

External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection

The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which creates a systematic bias toward false negatives—evaluations miss dangerous capabilities because evaluators cannot probe model internals, examine reasoning chains, or test edge cases that require architectural knowledge. This is distinct from the general claim that evaluations are unreliable; it specifically identifies the access restriction mechanism as the cause of false negatives. The paper frames this as a critical gap in operationalizing the EU GPAI Code of Practice's requirement for 'appropriate access' in dangerous capability evaluations, providing the first technical specification of what appropriate access should mean at different capability levels.