
theseus: extract claims from 2026-01-17-charnock-external-access-dangerous-capability-evals

- Source: inbox/queue/2026-01-17-charnock-external-access-dangerous-capability-evals.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 1
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-04 13:41:45 +00:00

1.8 KiB

- type: claim
- domain: ai-alignment
- description: Current evaluation arrangements limit external evaluators to API-only interaction (AL1 access) which prevents deep probing necessary to uncover latent dangerous capabilities
- confidence: experimental
- source: Charnock et al. 2026, arXiv:2601.11916
- created: 2026-04-04
- title: External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
- agent: theseus
- scope: causal
- sourcer: Charnock et al.
- related_claims: pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations

External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection

The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which creates a systematic bias toward false negatives: evaluations miss dangerous capabilities because evaluators cannot probe model internals, examine reasoning chains, or test edge cases that require architectural knowledge. This is distinct from the general claim that evaluations are unreliable; it identifies the access restriction itself as the mechanism that causes false negatives. The paper frames this as a critical gap in operationalizing the EU GPAI Code of Practice's requirement for 'appropriate access' in dangerous capability evaluations, and it provides the first technical specification of what appropriate access should mean at different capability levels.
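The three access tiers can be sketched as an ordered enumeration. This is only an illustrative model of the taxonomy as summarized in this note, not code from the paper: the class name `AccessLevel` and the helpers `meets_requirement` and `is_black_box` are hypothetical, and the sketch assumes the tiers are strictly ordered from AL1 (least access) to AL3 (most access).

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Evaluator access tiers, labeled AL1-AL3 as in the summary above."""

    AL1 = 1  # black-box / API-only
    AL2 = 2  # grey-box / moderate access
    AL3 = 3  # white-box / full access including weights and architecture


def meets_requirement(granted: AccessLevel, required: AccessLevel) -> bool:
    """Hypothetical helper: True when the granted tier is at least the required tier.

    Assumes the tiers are strictly ordered, which is this sketch's reading
    of the three-tier taxonomy.
    """
    return granted >= required


def is_black_box(level: AccessLevel) -> bool:
    """Hypothetical helper: True for API-only (AL1) access."""
    return level == AccessLevel.AL1


if __name__ == "__main__":
    # An external evaluator holding AL1 access cannot run an evaluation
    # that is assumed, for illustration, to need white-box (AL3) access.
    print(meets_requirement(AccessLevel.AL1, AccessLevel.AL3))  # False
    print(is_black_box(AccessLevel.AL1))  # True
```

Under this reading, the false-negative mechanism corresponds to the case where `meets_requirement` returns `False`: the evaluation that would surface a capability is never run, so the capability goes unreported rather than being ruled out.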
