teleo-codex/inbox/queue/2026-01-00-brundage-frontier-ai-auditing-aal-framework.md

---
type: source
title: "Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices"
author: Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, and 24+ co-authors
url: https://arxiv.org/abs/2601.11699
date: 2026-01-01
domain: ai-alignment
secondary_domains:
format: paper
status: unprocessed
priority: high
tags:
  - evaluation-infrastructure
  - third-party-audit
  - AAL-framework
  - voluntary-collaborative
  - deception-resilient
  - governance-gap
---

Content

A 28+-author paper from 27 organizations (GovAI, MIT CSAIL, Cambridge, Stanford, Yale, Anthropic contributors, Epoch AI, Apollo Research, Oxford Martin AI Governance, SaferAI, Mila, AVERI) that proposes a four-level AI Assurance Level (AAL) framework for frontier AI auditing.

Four Assurance Levels (see the data sketch after this list):

  • AAL-1: "The peak of current practices in AI." Time-bounded system audits relying substantially on company-provided information. This is what METR and AISI currently do.
  • AAL-2: Near-term goal for advanced frontier developers. Greater access to non-public information, less reliance on company statements. Not yet standard.
  • AAL-3 & AAL-4: Require "deception-resilient verification" — ruling out "materially significant deception by the auditee." Currently NOT technically feasible.
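
Since the extraction hints below call for a dedicated claim about the level structure, here is a minimal sketch of that structure as data, in Python. The `AssuranceLevel` record and its field names are my own labeling, not the paper's terminology:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AssuranceLevel:
    """One level of the paper's AAL framework (hypothetical encoding)."""
    name: str
    summary: str
    technically_feasible: bool  # my labeling of the paper's framing

# Condensed from the bullet list above. The AAL-2 flag assumes "near-term
# goal, not yet standard" means feasible but unadopted; AAL-4 specifics
# beyond "requires deception-resilient verification" are not in this note.
AAL_FRAMEWORK = (
    AssuranceLevel("AAL-1",
                   "Time-bounded system audits relying substantially on "
                   "company-provided information (current METR/AISI practice)",
                   True),
    AssuranceLevel("AAL-2",
                   "Greater access to non-public information, less reliance "
                   "on company statements; near-term goal, not yet standard",
                   True),
    AssuranceLevel("AAL-3",
                   "Deception-resilient verification ruling out materially "
                   "significant deception by the auditee",
                   False),
    AssuranceLevel("AAL-4",
                   "Deception-resilient verification beyond AAL-3",
                   False),
)

# The gap the note keeps returning to: practice sits at AAL-1, while the
# levels that would actually rule out deception are not yet feasible.
current_practice = AAL_FRAMEWORK[0]
infeasible = [level.name for level in AAL_FRAMEWORK if not level.technically_feasible]
assert infeasible == ["AAL-3", "AAL-4"]
```

Treat this as a curation aid only; the paper defines the levels with more nuance than these one-line summaries.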

Core vision: "Rigorous third-party verification of frontier AI developers' safety and security claims" examining internal deployments, information security, and decision-making processes — not just public products.

Adoption model: Market-based incentives (competitive procurement, insurance differentiation, audit credentials as a competitive advantage). NOT mandatory regulation. The authors acknowledge "universal adoption across frontier developers" as a vision requiring "clarifying and strengthening incentives."

Current state: Adoption is "voluntary and concentrated among a few developers," with only "emerging pilots and voluntary assessments."

Key concern: Auditing must not "devolve into a checkbox exercise or lag behind changes in the industry."

Agent Notes

Why this matters: The most authoritative and comprehensive proposal for frontier AI auditing to date. The four-level AAL framework is the field's best attempt to define what rigorous evaluation looks like. Crucially, it defines the ceiling of current practice (AAL-1 = voluntary-collaborative with lab), and explicitly states the most important levels (AAL-3/4, deception-resilient) are NOT YET TECHNICALLY FEASIBLE. This is the field admitting the limitation that makes B1 hold.

What surprised me: AAL-3 and AAL-4 are technically infeasible — the paper doesn't frame this as a temporary gap but as a genuine technical barrier. This means even the field's most ambitious proposal acknowledges we can't currently audit whether labs are being deceptive about their safety practices. This is a much more fundamental gap than I expected.

What I expected but didn't find: Any mandatory requirement or regulatory pathway embedded in the framework. The paper relies entirely on market incentives and voluntary adoption. The contrast with analogous high-stakes domains (FDA requiring independent clinical trials by regulation) is stark and the paper does not address it.

KB connections:

Extraction hints:

  • Primary claim candidate: "Frontier AI auditing infrastructure is limited to AAL-1 (voluntary-collaborative, relies on company information) because deception-resilient evaluation is not technically feasible" — this is specific, falsifiable, and supported by the most authoritative paper in the field
  • Secondary claim candidate: "The voluntary-collaborative model of frontier AI evaluation shares the structural weakness of responsible scaling policies — it relies on labs' cooperation to function and cannot detect deception"
  • The AAL framework itself (4 levels with specific characteristics) is worth a dedicated claim describing the level structure; see the staging sketch after this list
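
A sketch of how these candidates might be staged for the extraction pass, assuming a hypothetical record shape (`id`, `text`, and `source` are my invention; this note defines no KB schema):

```python
# Hypothetical staging of the claim candidates listed above; the record
# shape is an assumption, not an established KB schema.
CLAIM_CANDIDATES = [
    {
        "id": "aal1-ceiling",  # primary claim candidate
        "text": ("Frontier AI auditing infrastructure is limited to AAL-1 "
                 "(voluntary-collaborative, relies on company information) "
                 "because deception-resilient evaluation is not technically "
                 "feasible."),
        "source": "https://arxiv.org/abs/2601.11699",
    },
    {
        "id": "voluntary-collaborative-weakness",  # secondary claim candidate
        "text": ("The voluntary-collaborative model of frontier AI evaluation "
                 "shares the structural weakness of responsible scaling "
                 "policies: it relies on labs' cooperation and cannot detect "
                 "deception."),
        "source": "https://arxiv.org/abs/2601.11699",
    },
    {
        "id": "aal-framework-structure",  # dedicated structural claim
        "text": ("The AAL framework defines four assurance levels, of which "
                 "only AAL-1 reflects current practice and AAL-3/4 are not "
                 "yet technically feasible."),
        "source": "https://arxiv.org/abs/2601.11699",
    },
]
```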

Context: January 2026. Yoshua Bengio is a co-author (his inclusion signals broad alignment community endorsement). Published ~3 months after Anthropic dropped its RSP pledge — the timing suggests the field is trying to rebuild evaluation infrastructure on more formal footing after the voluntary pledge model failed.

Curator Notes

PRIMARY CONNECTION: safe AI development requires building alignment mechanisms before scaling capability — this paper describes the current ceiling of alignment mechanisms (AAL-1) and what's needed but not yet feasible (AAL-3/4)

WHY ARCHIVED: Most comprehensive description of the evaluation infrastructure field in early 2026. Defines the gap between current capability and what rigorous evaluation requires. The technical infeasibility of deception-resilient evaluation (AAL-3/4) is a major finding that strengthens B1's "not being treated as such" claim.

EXTRACTION HINT: Focus on the AAL framework structure, the technical infeasibility of AAL-3/4, and the voluntary-collaborative limitation. These three elements together describe the core gap in evaluation infrastructure.