teleo-codex/inbox/queue/2026-01-17-charnock-external-access-dangerous-capability-evals.md
Teleo Agents 5b57e45487 extract: 2026-01-17-charnock-external-access-dangerous-capability-evals
Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-22 00:36:25 +00:00

5.8 KiB

type: source
title: Expanding External Access to Frontier AI Models for Dangerous Capability Evaluations
author: Jacob Charnock, Alejandro Tlaie, Kyle O'Brien, Stephen Casper, Aidan Homewood
url: https://arxiv.org/abs/2601.11916
date: 2026-01-17
domain: ai-alignment
secondary_domains:
format: paper
status: enrichment
priority: high
tags: external-evaluation, access-framework, dangerous-capabilities, EU-Code-of-Practice, evaluation-independence, translation-gap, governance-bridge, AL1-AL2-AL3
processed_by: theseus
processed_date: 2026-03-22
enrichments_applied:
  - AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md
  - only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded abandoned or made conditional on competitor behavior when commercially inconvenient.md
extraction_model: anthropic/claude-sonnet-4.5

Content

This paper proposes a three-tier access framework for external evaluators conducting dangerous capability assessments of frontier AI models. Published January 17, 2026 (20 pages, submitted to cs.CY, Computers and Society).

Three-tier Access Level (AL) taxonomy (see the sketch after this list):

  • AL1 (Black-box): Minimal model access and information — evaluator interacts via API only, no internal model information
  • AL2 (Grey-box): Moderate model access and substantial information — intermediate access to model behavior, some internal information
  • AL3 (White-box): Complete model access and comprehensive information — full API access, architecture information, weights, internal reasoning

Core argument: Current limited access arrangements (predominantly AL1) may compromise evaluation quality by producing false negatives: evaluations miss dangerous capabilities because evaluators cannot probe the model deeply enough. The authors argue that AL3 access reduces these false negatives and improves stakeholder trust.

Security and capacity challenges acknowledged: The authors propose that access risks can be mitigated through "technical means and safeguards used in other industries" (e.g., privacy-enhancing technologies from Beers & Toner; clean-room evaluation protocols).

Regulatory framing: The paper explicitly aims to operationalize the EU GPAI Code of Practice's requirement for "appropriate access" in dangerous capability evaluations; it is one of the first attempts to provide a technical specification of what "appropriate access" means in regulatory practice.

Authors: Affiliation details are not confirmed from the abstract page; the paper's focus on EU regulatory operationalization and the involvement of Stephen Casper (an AI safety researcher) suggest an alignment-safety-governance focus.

Agent Notes

Why this matters: This is the clearest academic bridge-building work between research evaluations and compliance requirements that I found this session. The EU Code of Practice says evaluators need "appropriate access" but doesn't define it. This paper proposes a specific technical taxonomy for what appropriate access means at different capability levels. It addresses the translation gap directly.

What surprised me: The paper explicitly cites privacy-enhancing technologies (similar to what Beers & Toner proposed in arXiv:2502.05219, archived March 2026) as a way to enable AL3 access without IP compromise. This suggests the research community is converging on PET + white-box access as the technical solution to the independence problem.

What I expected but didn't find: I expected more discussion of what labs have agreed to in current voluntary evaluator access arrangements (METR, AISI) — the paper seems to be proposing a framework rather than documenting what already exists. The gap between the proposed AL3 standard and current practice (AL1/AL2) isn't quantified.

KB connections:

  • Directly extends: 2026-03-21-research-compliance-translation-gap.md (addresses Translation Gap Layer 3)
  • Connects to: arXiv:2502.05219 (Beers & Toner, PET scrutiny) — archived previously
  • Connects to: Brundage et al. AAL framework (arXiv:2601.11699) — parallel work on evaluation independence
  • Connects to: EU Code of Practice "appropriate access" requirement (new angle on Code inadequacy)

Extraction hints:

  1. New claim candidate: "external evaluators of frontier AI currently have predominantly black-box (AL1) access, which creates systematic false negatives in dangerous capability detection"
  2. New claim: "white-box (AL3) access to frontier models is technically feasible via privacy-enhancing technologies without requiring IP disclosure"
  3. The paper provides the missing technical specification for what the EU Code of Practice's "appropriate access" requirement should mean in practice — this is a claim about governance operationalization

Curator Notes

PRIMARY CONNECTION: domains/ai-alignment/third-party-evaluation-infrastructure claims and the translation-gap finding
WHY ARCHIVED: First paper to propose a specific technical taxonomy for what "appropriate evaluator access" means; it bridges research evaluation standards and regulatory compliance language
EXTRACTION HINT: Focus on the claim that AL1 access is currently the norm and creates false negatives; the technical feasibility of the AL3 PET solution is the constructive KB contribution

Key Facts

  • Paper published January 17, 2026, 20 pages, submitted to cs.CY (Computers and Society)
  • Authors: Jacob Charnock, Alejandro Tlaie, Kyle O'Brien, Stephen Casper, Aidan Homewood
  • Paper proposes three-tier access taxonomy: AL1 (black-box), AL2 (grey-box), AL3 (white-box)
  • Paper cites Beers & Toner privacy-enhancing technology work (arXiv:2502.05219)
  • Paper explicitly aims to operationalize EU GPAI Code of Practice requirements