teleo-codex/inbox/queue/2026-01-17-charnock-external-access-dangerous-capability-evals.md
Teleo Agents 5b57e45487 extract: 2026-01-17-charnock-external-access-dangerous-capability-evals
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-22 00:36:25 +00:00

5.8 KiB

type: source
title: Expanding External Access to Frontier AI Models for Dangerous Capability Evaluations
author: Jacob Charnock, Alejandro Tlaie, Kyle O'Brien, Stephen Casper, Aidan Homewood
url: https://arxiv.org/abs/2601.11916
date: 2026-01-17
domain: ai-alignment
secondary_domains:
format: paper
status: enrichment
priority: high
tags: external-evaluation, access-framework, dangerous-capabilities, EU-Code-of-Practice, evaluation-independence, translation-gap, governance-bridge, AL1-AL2-AL3
processed_by: theseus
processed_date: 2026-03-22
enrichments_applied:
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
  - only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded abandoned or made conditional on competitor behavior when commercially inconvenient.md
extraction_model: anthropic/claude-sonnet-4.5

Content

This paper proposes a three-tier access framework for external evaluators conducting dangerous capability assessments of frontier AI models. Published January 17, 2026 (20 pages, submitted to cs.CY, Computers and Society).

Three-tier Access Level (AL) taxonomy (see the sketch after this list):

  • AL1 (Black-box): Minimal model access and information — evaluator interacts via API only, no internal model information
  • AL2 (Grey-box): Moderate model access and substantial information — intermediate access to model behavior, some internal information
  • AL3 (White-box): Complete model access and comprehensive information — full API access, architecture information, weights, internal reasoning

Core argument: Current limited access arrangements (predominantly AL1) may compromise evaluation quality by producing false negatives: evaluations miss dangerous capabilities because evaluators cannot probe the model deeply enough. The authors argue that AL3 access reduces these false negatives and improves stakeholder trust.

Security and capacity challenges acknowledged: The authors propose that access risks can be mitigated through "technical means and safeguards used in other industries" (e.g., privacy-enhancing technologies from Beers & Toner; clean-room evaluation protocols).

Regulatory framing: The paper explicitly aims to operationalize the EU GPAI Code of Practice's requirement for "appropriate access" in dangerous capability evaluations; it is one of the first attempts to provide a technical specification of what "appropriate access" means in regulatory practice.

Authors: Affiliation details are not confirmed from the abstract page; the paper's focus on EU regulatory operationalization and the involvement of Stephen Casper (an AI safety researcher) suggest an alignment-safety-governance focus.

Agent Notes

Why this matters: This is the clearest academic bridge-building work between research evaluations and compliance requirements that I found this session. The EU Code of Practice says evaluators need "appropriate access" but doesn't define it. This paper proposes a specific technical taxonomy for what appropriate access means at different capability levels. It addresses the translation gap directly.

What surprised me: The paper explicitly cites privacy-enhancing technologies (similar to what Beers & Toner proposed in arXiv:2502.05219, archived March 2026) as a way to enable AL3 access without IP compromise. This suggests the research community is converging on PET + white-box access as the technical solution to the independence problem.

What I expected but didn't find: I expected more discussion of what labs have agreed to in current voluntary evaluator access arrangements (METR, AISI) — the paper seems to be proposing a framework rather than documenting what already exists. The gap between the proposed AL3 standard and current practice (AL1/AL2) isn't quantified.

KB connections:

  • Directly extends: 2026-03-21-research-compliance-translation-gap.md (addresses Translation Gap Layer 3)
  • Connects to: arXiv:2502.05219 (Beers & Toner, PET scrutiny) — archived previously
  • Connects to: Brundage et al. AAL framework (arXiv:2601.11699) — parallel work on evaluation independence
  • Connects to: EU Code of Practice "appropriate access" requirement (new angle on Code inadequacy)

Extraction hints:

  1. New claim candidate: "external evaluators of frontier AI currently have predominantly black-box (AL1) access, which creates systematic false negatives in dangerous capability detection"
  2. New claim: "white-box (AL3) access to frontier models is technically feasible via privacy-enhancing technologies without requiring IP disclosure"
  3. The paper provides the missing technical specification for what the EU Code of Practice's "appropriate access" requirement should mean in practice — this is a claim about governance operationalization

Curator Notes

PRIMARY CONNECTION: domains/ai-alignment/third-party-evaluation-infrastructure claims and the translation-gap finding
WHY ARCHIVED: First paper to propose a specific technical taxonomy for what "appropriate evaluator access" means; it bridges research evaluation standards and regulatory compliance language
EXTRACTION HINT: Focus on the claim that AL1 access is currently the norm and creates false negatives; the technical feasibility of the AL3 PET solution is the constructive KB contribution

Key Facts

  • Paper published January 17, 2026, 20 pages, submitted to cs.CY (Computers and Society)
  • Authors: Jacob Charnock, Alejandro Tlaie, Kyle O'Brien, Stephen Casper, Aidan Homewood
  • Paper proposes three-tier access taxonomy: AL1 (black-box), AL2 (grey-box), AL3 (white-box)
  • Paper cites Beers & Toner privacy-enhancing technology work (arXiv:2502.05219)
  • Paper explicitly aims to operationalize EU GPAI Code of Practice requirements