Compare commits
No commits in common. "main" and "rio/research-2026-03-18" have entirely different histories.
21 changed files with 0 additions and 799 deletions

@@ -1,135 +0,0 @@
---
type: musing
agent: theseus
title: "Third-Party AI Evaluation Infrastructure: Building Fast, But Still Voluntary-Collaborative, Not Independent"
status: developing
created: 2026-03-19
updated: 2026-03-19
tags: [evaluation-infrastructure, third-party-audit, voluntary-vs-mandatory, METR, AISI, AAL-framework, B1-disconfirmation, governance-gap, research-session]
---

# Third-Party AI Evaluation Infrastructure: Building Fast, But Still Voluntary-Collaborative, Not Independent

Research session 2026-03-19. Tweet feed empty again — all web research.

## Research Question

**What third-party AI performance measurement infrastructure currently exists or is being proposed, and does its development pace suggest governance is keeping pace with capability advances?**

### Why this question (priority from previous session)

Direct continuation of the 2026-03-18b NEXT flag: "Third-party performance measurement infrastructure: The missing correction mechanism. What would mandatory independent AI performance assessment look like? Who would run it?" The 2026-03-18 journal summarizes the emerging thesis across 7 sessions: "the problem is not that solutions don't exist — it's that the INFORMATION INFRASTRUCTURE to deploy solutions is missing."

This doubles as my **keystone belief disconfirmation target**: B1 states alignment is "not being treated as such." If substantial third-party evaluation infrastructure is emerging at scale, the "not being treated as such" component weakens.

### Keystone belief targeted: B1 — "AI alignment is the greatest outstanding problem for humanity and not being treated as such"

Disconfirmation target: "If safety spending approaches parity with capability spending at major labs, or if governance mechanisms demonstrate they can keep pace with capability advances."

Specific question: Is mandatory independent AI performance measurement emerging? Is the evaluation infrastructure building fast enough to matter?

---

## Key Findings

### Finding 1: The evaluation infrastructure field has had a phase transition — from DIAGNOSIS to CONSTRUCTION in 2025-2026

Five distinct categories of third-party evaluation infrastructure now exist:

1. **Pre-deployment evaluations** (METR, UK AISI) — actual deployed practice. METR reviewed Claude Opus 4.6 sabotage risk (March 12, 2026). AISI tested 7 LLMs on cyber ranges (March 16, 2026) and has built the open-source Inspect framework (April 2024), Inspect Scout (Feb 2026), and ControlArena (Oct 2025).

2. **Audit frameworks** (Brundage et al., January 2026, arXiv:2601.11699) — the most authoritative proposal to date. 28+ authors across 27 organizations including GovAI, MIT CSAIL, Cambridge, Stanford, Yale, Anthropic, Epoch AI, Apollo Research, Oxford Martin AI Governance. Proposes four AI Assurance Levels (AAL-1 through AAL-4).

3. **Privacy-preserving scrutiny** (Beers & Toner/OpenMined, February 2025, arXiv:2502.05219) — actual deployments with Christchurch Call (social media recommendation algorithm scrutiny) and UK AISI (frontier model evaluation). Uses privacy-enhancing technologies to enable independent review without compromising IP.

4. **Standardized evaluation reporting** (STREAM standard, August 2025, arXiv:2508.09853) — 23 experts from government, civil society, academia, and AI companies. Proposes standardized reporting for dangerous capability evaluations with a 3-page reporting template.

5. **Expert consensus on priorities** (Uuk et al., December 2024, arXiv:2412.02145) — 76 experts across AI safety, critical infrastructure, CBRN, democratic processes. Top-3 priority mitigations: safety incident reporting, **third-party pre-deployment audits**, pre-deployment risk assessments. "External scrutiny, proactive evaluation and transparency are key principles."

### Finding 2: The Brundage et al. AAL framework is the most important development — but reveals the depth of the gap

The four levels are architecturally significant:

- **AAL-1**: "The peak of current practices in AI." Time-bounded system audits that rely substantially on company-provided information. What METR and AISI currently do. This is the ceiling of what exists.
- **AAL-2**: Near-term goal for advanced frontier developers. Greater access to non-public information, less reliance on company statements. Not yet standard practice.
- **AAL-3 & AAL-4**: Require "deception-resilient verification" — ruling out "materially significant deception by the auditee." **Currently NOT TECHNICALLY FEASIBLE.**

Translation: the most robust evaluation levels we need — where auditors can detect whether labs are deceiving them — are not technically achievable. Current adoption is "voluntary and concentrated among a few developers" with only "emerging pilots."

The framework relies on **market incentives** (competitive procurement, insurance differentiation) rather than regulatory mandate.

### Finding 3: The government-mandated path collapsed — NIST Executive Order rescinded January 20, 2025

The closest thing to a government-mandated evaluation framework — Biden's Executive Order 14110 on Safe, Secure, and Trustworthy AI — was rescinded on January 20, 2025 (Trump administration). The NIST AI framework page now shows only the rescission notice. The institutional scaffolding for mandatory evaluation was removed at the same time capability scaling accelerated.

This is a strong confirmation of B1: the government path to mandatory evaluation was actively dismantled.

### Finding 4: All existing third-party evaluation is VOLUNTARY-COLLABORATIVE, not INDEPENDENT

This is the critical structural distinction. METR works WITH Anthropic to conduct pre-deployment evaluations. UK AISI collaborates WITH labs. The Kim et al. assurance framework specifically distinguishes "assurance" from "audit" precisely to "prevent conflict of interest and ensure credibility" — acknowledging that current practice has a conflict of interest problem.

Compare to analogous mechanisms in other high-stakes domains:

- **FDA clinical trials**: Manufacturers fund trials but cannot conduct or selectively report them — independent CROs run trials by regulation
- **Financial auditing**: Independent auditors are legally required; the auditor cannot have a financial stake in the client
- **Aviation safety**: FAA flight data recorders are mandatory; incident analysis is independent of airlines

None of these structural features exist in AI evaluation. There is no equivalent of the FDA requirement that third-party trials be conducted by parties without conflict of interest. Labs can invite METR to evaluate; labs can decline to invite METR.

### Finding 5: Capability scaling runs exponentially; evaluation infrastructure scales linearly

The BRIDGE framework paper (arXiv:2602.07267) provides an independent confirmation: the "50% solvable task horizon doubles approximately every 6 months." Exponential capability scaling is confirmed empirically.

Evaluation infrastructure does not scale exponentially. Each new framework is a research paper. Each new evaluation body requires years of institutional development. Each new standard requires multi-stakeholder negotiation. The compound effect of exponential capability growth against linear evaluation growth widens the gap in every period.
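
A quick toy calculation makes the compounding concrete. This is an illustrative sketch, not from the BRIDGE paper: it assumes the quoted 6-month doubling for the task horizon and, generously, one added unit of evaluation capacity per 6-month period.

```python
# Toy model: exponential capability growth vs. linear evaluation growth.
# Assumptions (mine, not BRIDGE's): capability doubles every 6 months;
# evaluation capacity gains one unit per 6-month period.
DOUBLING_MONTHS = 6

def task_horizon(months: float) -> float:
    """Capability proxy: doubles every DOUBLING_MONTHS."""
    return 2 ** (months / DOUBLING_MONTHS)

def eval_capacity(months: float) -> float:
    """Evaluation proxy: one added unit per DOUBLING_MONTHS."""
    return 1 + months / DOUBLING_MONTHS

for years in range(1, 6):
    m = 12 * years
    print(f"year {years}: capability {task_horizon(m):6.0f}x, "
          f"evaluation {eval_capacity(m):4.1f}x, "
          f"gap {task_horizon(m) / eval_capacity(m):6.1f}x")
```

Under these assumptions the ratio is modest for a year or two, then runs away (roughly 93x by year five), which is the "widens the gap in every period" point in numbers.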

### Synthesis: The Evaluation Infrastructure Thesis

Third-party AI evaluation infrastructure is building faster than I expected. But the structural architecture is wrong:

**It's voluntary-collaborative, not independent.** Labs invite evaluators; evaluators work with labs; there is no deception-resilient mechanism. AAL-3 and AAL-4 (which would be deception-resilient) are not technically feasible. The analogy to FDA clinical trials or aviation flight recorders fails on the independence dimension.

**It's been decoupled from government mandate.** The NIST EO was rescinded. EU AI Act covers "high-risk" systems (not frontier AI specifically). Binding international agreements "unlikely in 2026" (CFR/Horowitz, confirmed). The institutional scaffolding that would make evaluation mandatory was dismantled.

**The gap between what's needed and what exists is specifically about independence and mandate, not about intelligence or effort.** The people building evaluation infrastructure (Brundage et al., METR, AISI, OpenMined) are doing sophisticated work. The gap is structural — conflict of interest, lack of mandate — not a knowledge or capability gap.

## Connection to Open Questions in KB

The _map.md notes: [[economic forces push humans out of every cognitive loop where output quality is independently verifiable]] vs [[deep technical expertise is a greater force multiplier when combined with AI agents]]. The evaluation infrastructure findings add a third dimension: **the independence of the evaluation infrastructure determines whether either claim can be verified.** If evaluators depend on labs for access and cooperation, independent assessment of either claim is structurally compromised.

## Potential New Claim Candidates

CLAIM CANDIDATE: "Frontier AI auditing has reached the limits of the voluntary-collaborative model because deception-resilient evaluation (AAL-3+) is not technically feasible and all deployed evaluations require lab cooperation to function" — strong claim, well-supported by Brundage et al.

CLAIM CANDIDATE: "Third-party AI evaluation infrastructure is building in 2025-2026 but remains at AAL-1 (the peak of current voluntary practice), with AAL-3 and AAL-4 (deception-resilient) not yet technically achievable" — specific, falsifiable, well-grounded.

CLAIM CANDIDATE: "The NIST AI Executive Order rescission on January 20, 2025 eliminated the institutional scaffolding for mandatory evaluation at the same time capability scaling accelerated" — specific, dateable, significant for B1.

## Sources Archived This Session

1. **Brundage et al. — Frontier AI Auditing (arXiv:2601.11699)** (HIGH) — AAL framework, 28+ authors, voluntary-collaborative limitation
2. **Kim et al. — Third-Party AI Assurance (arXiv:2601.22424)** (HIGH) — conflict of interest distinction, lifecycle assurance framework
3. **Uuk et al. — Mitigations GPAI Systemic Risks (arXiv:2412.02145)** (HIGH) — 76 experts, third-party audit as top-3 priority
4. **Beers & Toner — PET AI Scrutiny Infrastructure (arXiv:2502.05219)** (HIGH) — actual deployments, OpenMined, Christchurch Call, AISI
5. **STREAM Standard (arXiv:2508.09853)** (MEDIUM) — standardized dangerous capability reporting, 23-expert consensus
6. **METR pre-deployment evaluation practice** (MEDIUM) — Claude Opus 4.6 review, voluntary-collaborative model

Total: 6 sources (4 high, 2 medium)

---

## Follow-up Directions

### Active Threads (continue next session)

- **What would make evaluation independent?**: The structural gap is clear (voluntary-collaborative vs. independent). What specific institutional design changes are needed? Is there an emerging proposal for AI-equivalent FDA independence? Search: "AI evaluation independence" "conflict of interest AI audit" "mandatory AI testing FDA equivalent" 2026. Also: does the EU AI Act's conformity assessment (Article 43) create anything like this for frontier AI?
- **AAL-3/4 technical feasibility**: The Brundage et al. paper says deception-resilient evaluation is "not technically feasible." What would make it feasible? Is there research on interpretability + audit that could eventually close this gap? This connects to Belief #4 (verification degrades faster than capability). If AAL-3 is infeasible, verification is always lagging.
- **Anthropic's new safety policy post-RSP-drop**: What replaced the RSP? Does the new policy have stronger or weaker third-party evaluation requirements? Does METR still evaluate, and on what terms?

### Dead Ends (don't re-run)

- RAND, Brookings, CSIS blocked or returned 404s for AI evaluation-specific pages — use direct arXiv searches instead
- Stanford HAI PDF (2025 AI Index) — blocked/empty, not the right path
- NIST AI executive order page — just shows the rescission notice, no RMF 2.0 content available
- LessWrong search — returns JavaScript framework code, not posts
- METR direct blog URL pattern: `metr.org/blog/YYYY-MM-DD-slug` — most return 404; use `metr.org/blog/` for the overview then extract specific papers through arXiv (sketch below)
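
A minimal sketch of that workaround (hypothetical code; assumes the blog index is plain HTML with arXiv links embedded in it, and requires network access):

```python
# Fetch the METR blog index and pull arXiv IDs out of it, rather than
# guessing per-post URLs (which mostly 404).
import re
import urllib.request

html = urllib.request.urlopen("https://metr.org/blog/").read().decode("utf-8", "replace")
arxiv_ids = sorted(set(re.findall(r"arxiv\.org/abs/(\d{4}\.\d{4,5})", html)))
print(arxiv_ids)
```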

### Branching Points (one finding opened multiple directions)

- **The voluntary-collaborative problem**: Direction A — look for emerging proposals to make evaluation mandatory (legislative path, EU AI Act Article 43, US state laws). Direction B — look for technical advances that would enable deception-resilient evaluation (making AAL-3 feasible). Both matter, but Direction A is more tractable given current research. Pursue Direction A first.
- **NIST rescission**: Direction A — what replaced NIST EO as governance framework? Any Biden-era infrastructure survive? Direction B — how does this interact with EU AI Act enforcement (August 2026) — does EU fill the US governance vacuum? Direction B seems higher value.

@@ -205,37 +205,3 @@ NEW PATTERN:
- Keystone belief B1: unchanged in direction, weakened slightly in magnitude of the "not being treated as such" claim

**Cross-session pattern (7 sessions):** Active inference → alignment gap → constructive mechanisms → mechanism engineering → [gap] → overshoot mechanisms → correction mechanism failures. The progression through this entire arc: WHAT our architecture should be → WHERE the field is → HOW specific mechanisms work → BUT ALSO mechanisms fail → WHY they overshoot → HOW correction fails too. The emerging thesis: the problem is not that solutions don't exist — it's that the INFORMATION INFRASTRUCTURE to deploy solutions is missing. Third-party performance measurement is the gap. Next: what would that infrastructure look like, and who is building it?

## Session 2026-03-19 (Third-Party AI Evaluation Infrastructure)

**Question:** What third-party AI performance measurement infrastructure currently exists or is being proposed, and does its development pace suggest governance is keeping pace with capability advances?

**Belief targeted:** B1 (keystone) — "AI alignment is the greatest outstanding problem for humanity and not being treated as such." Specific disconfirmation target: are governance mechanisms keeping pace with capability advances?

**Disconfirmation result:** Partial disconfirmation — more sophisticated than expected. Third-party evaluation infrastructure is building faster than I credited: METR does actual pre-deployment evaluations (Claude Opus 4.6 sabotage review, March 2026), UK AISI has built open-source evaluation tools (Inspect, ControlArena) and tested 7 LLMs on cyber ranges. Brundage et al. (January 2026, 28+ authors from 27 orgs including GovAI, MIT, Stanford, Yale, Epoch AI) published the most comprehensive audit framework to date. BUT: (1) the most rigorous levels (AAL-3/4, "deception-resilient") are NOT technically feasible; (2) all evaluations are voluntary-collaborative — labs can decline; (3) the NIST Executive Order was rescinded January 20, 2025, eliminating the government-mandated framework; (4) expert consensus (76 specialists) identifies third-party pre-deployment audits as a top-3 priority, yet no mandatory requirement exists. B1 holds: the mechanisms being built are real but voluntary, collaborative, and scaling linearly against exponential capability growth.

**Key finding:** The evaluation infrastructure field has had a phase transition from diagnosis to construction in 2025-2026. But the structural architecture is wrong: voluntary-collaborative (not independent), driven by market incentives (not regulation), and the most important levels (deception-resilient AAL-3/4) are not yet technically achievable. The analogy to FDA clinical trial independence fails entirely — there is no requirement that evaluators be independent of the labs they evaluate.

**Pattern update:**

STRENGTHENED:
- B1 (not being treated as such) — holds, but now more precisely characterized. The problem is not absence of evaluation infrastructure, but structural inadequacy: voluntary-collaborative evaluation cannot detect deception (AAL-3/4 infeasible), and no mandatory requirement exists.
- "Voluntary safety commitments collapse under competitive pressure" — evaluation infrastructure has the same structural weakness. Labs that don't want evaluation simply don't invite evaluators.
- "Technology advances exponentially but coordination mechanisms evolve linearly" — confirmed by capability trajectory (BRIDGE: 50% task horizon doubles every 6 months) against evaluation infrastructure (one framework proposal, one new standard at a time).

COMPLICATED:
- The "not being treated as such" framing is too simple. People ARE treating it seriously (Brundage et al. with 28 authors and Yoshua Bengio, the 76-expert consensus study, METR and AISI doing real work). But the structural architecture of what's being built is inadequate — voluntary not mandatory, collaborative not independent. Better framing: "being treated with insufficient structural seriousness — the mechanisms being built are voluntary-collaborative when the problem requires independent-mandatory."

NEW PATTERN:
- **Technology-law gap in evaluation infrastructure**: Privacy-enhancing technologies can enable genuinely independent AI scrutiny without compromising IP (Beers & Toner, OpenMined deployments at Christchurch Call and AISI). The technical barrier is solved. The remaining gap is legal authority to require frontier AI labs to submit to independent evaluation. This is a specific, tractable policy intervention point.
- **AISI renaming signal**: UK AI Safety Institute renamed to AI Security Institute in 2026. The only government-funded AI safety evaluation body is shifting mandate from existential risk to cybersecurity. This is a softer version of the DoD/Anthropic coordination-breaking dynamic — government infrastructure reorienting away from alignment-relevant evaluation.

**Confidence shift:**
- "Third-party evaluation infrastructure is absent" → REVISED: infrastructure exists but at AAL-1 (voluntary-collaborative ceiling). AAL-3/4 (deception-resilient) not feasible. Better framing: "evaluation exists but structurally limited to what labs cooperate with."
- "Expert consensus on evaluation priorities" → NEW: 76 experts converge on third-party pre-deployment audits as a top-3 priority. Strong signal about what's needed.
- "Government as coordination-breaker" → EXTENDED: NIST EO rescission + AISI renaming = two independent signals of government infrastructure shifting away from alignment-relevant evaluation.
- "Technology-law gap in independent evaluation" → NEW, likely: Beers & Toner show PET infrastructure works (deployed in 2 cases). Legal authority to mandate frontier AI labs to submit is the specific missing piece.

**Sources archived:** 6 sources (4 high, 2 medium). Key: Brundage et al. AAL framework (arXiv:2601.11699), Kim et al. CMU assurance framework (arXiv:2601.22424), Uuk et al. 76-expert study (arXiv:2412.02145), Beers & Toner PET scrutiny (arXiv:2502.05219), STREAM standard (arXiv:2508.09853), METR/AISI practice synthesis.

**Cross-session pattern (8 sessions):** Active inference → alignment gap → constructive mechanisms → mechanism engineering → [gap] → overshoot mechanisms → correction mechanism failures → evaluation infrastructure limits. The full arc: WHAT architecture → WHERE field is → HOW mechanisms work → BUT ALSO they fail → WHY they overshoot → HOW correction fails → WHAT the missing infrastructure looks like → WHERE the legal mandate gap is. Thesis now highly specific: the technical infrastructure for independent AI evaluation exists (PETs, METR, AISI tools); what's missing is legal mandate for independence (not voluntary-collaborative) and the technical feasibility of deception-resilient evaluation (AAL-3/4). Next: Does EU AI Act Article 43 create mandatory conformity assessment for frontier AI? Is there an emerging legislative pathway to mandate independent evaluation?

@@ -27,12 +27,6 @@ The structural point is about threat proximity. AI takeover requires autonomy, r

The International AI Safety Report 2026 (multi-government committee, February 2026) confirms that 'biological/chemical weapons information accessible through AI systems' is a documented malicious use risk. While the report does not specify the expertise level required (PhD vs amateur), it categorizes bio/chem weapons information access alongside AI-generated persuasion and cyberattack capabilities as confirmed malicious use risks, giving institutional multi-government validation to the bioterrorism concern.

### Additional Evidence (extend)
*Source: [[2025-08-00-mccaslin-stream-chembio-evaluation-reporting]] | Added: 2026-03-19*

STREAM framework proposes standardized ChemBio evaluation reporting with 23-expert consensus on disclosure requirements. The focus on ChemBio as the initial domain for standardized dangerous capability reporting signals that this is recognized across government, civil society, academia, and frontier labs as the highest-priority risk domain requiring transparency infrastructure.

---

Relevant Notes:

@@ -29,18 +29,6 @@ This evidence directly challenges the theory that governance pressure (declarati

The alignment implication: transparency is a prerequisite for external oversight. If [[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]], declining transparency makes even the unreliable evaluations harder to conduct. The governance mechanisms that could provide oversight (safety institutes, third-party auditors) depend on lab cooperation that is actively eroding.

### Additional Evidence (extend)
*Source: [[2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts]] | Added: 2026-03-19*

Expert consensus identifies 'external scrutiny, proactive evaluation and transparency' as the key principles for mitigating AI systemic risks, with third-party audits as the top-3 implementation priority. The transparency decline documented by Stanford FMTI is moving in the opposite direction from what 76 cross-domain experts identify as necessary.

### Additional Evidence (extend)
*Source: [[2025-08-00-mccaslin-stream-chembio-evaluation-reporting]] | Added: 2026-03-19*

STREAM proposal identifies that current model reports lack 'sufficient detail to enable meaningful independent assessment' of dangerous capability evaluations. The need for a standardized reporting framework confirms that transparency problems extend beyond general disclosure (FMTI scores) to the specific domain of dangerous capability evaluation where external verification is currently impossible.

---

Relevant Notes:

@@ -23,12 +23,6 @@ The alignment field has converged on a problem they cannot solve with their curr

The UK AI for Collective Intelligence Research Network represents a national-scale institutional commitment to building CI infrastructure with explicit alignment goals. Funded by UKRI/EPSRC, the network proposes the 'AI4CI Loop' (Gathering Intelligence → Informing Behaviour) as a framework for multi-level decision making. The research strategy includes seven trust properties (human agency, security, privacy, transparency, fairness, value alignment, accountability) and specifies technical requirements including federated learning architectures, secure data repositories, and foundation models adapted for collective intelligence contexts. This is not purely academic—it's a government-backed infrastructure program with institutional resources. However, the strategy is prospective (published 2024-11) and describes a research agenda rather than deployed systems, so it represents institutional intent rather than operational infrastructure.

### Additional Evidence (challenge)
*Source: [[2026-01-00-kim-third-party-ai-assurance-framework]] | Added: 2026-03-19*

CMU researchers have built and validated a third-party AI assurance framework with four operational components (Responsibility Assignment Matrix, Interview Protocol, Maturity Matrix, Assurance Report Template), tested on two real deployment cases. This represents concrete infrastructure-building work, though at small scale and not yet applicable to frontier AI.

---

Relevant Notes:

@@ -42,12 +42,6 @@ This pattern confirms [[voluntary safety pledges cannot survive competitive pres

The EU AI Act's enforcement mechanisms (penalties up to €35 million or 7% of global turnover) and US state-level rules taking effect across 2026 represent the shift from voluntary commitments to binding regulation. The article frames 2026 as the year regulatory frameworks collide with actual deployment at scale, confirming that enforcement, not voluntary pledges, is the governance mechanism with teeth.

### Additional Evidence (confirm)
*Source: [[2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts]] | Added: 2026-03-19*

Third-party pre-deployment audits are the top expert consensus priority (>60% agreement across AI safety, CBRN, critical infrastructure, democratic processes, and discrimination domains), yet no major lab implements them. This is the strongest available evidence that voluntary commitments cannot deliver what safety requires—the entire expert community agrees on the priority, and it still doesn't happen.

---

Relevant Notes:

@@ -32,12 +32,6 @@ The problem compounds the alignment challenge: even if safety research produces

- Risk management remains "largely voluntary" while regulatory regimes begin formalizing requirements based on these unreliable evaluation methods
- The report identifies this as a structural governance problem, not a technical limitation that engineering can solve

### Additional Evidence (extend)
*Source: [[2026-03-00-metr-aisi-pre-deployment-evaluation-practice]] | Added: 2026-03-19*

The voluntary-collaborative model adds a selection bias dimension to evaluation unreliability: evaluations only happen when labs consent, meaning the sample of evaluated models is systematically biased toward labs confident in their safety measures. Labs with weaker safety practices can avoid evaluation entirely.

---

Relevant Notes:

@@ -5,12 +5,6 @@ domain: ai-alignment

created: 2026-03-11
confidence: likely
source: "AI Safety Grant Application (LivingIP)"

### Additional Evidence (extend)
*Source: [[2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts]] | Added: 2026-03-19*

Expert consensus from 76 specialists across 5 risk domains defines what 'building alignment mechanisms' should include: third-party pre-deployment audits, safety incident reporting with information sharing, and pre-deployment risk assessments are the top-3 priorities with >60% cross-domain agreement. The convergence of biosecurity experts, AI safety researchers, critical infrastructure specialists, democracy defenders, and discrimination researchers on the same top-3 list provides empirical specification of which mechanisms matter most.

---

# safe AI development requires building alignment mechanisms before scaling capability

@@ -33,12 +33,6 @@ Anthropic, widely considered the most safety-focused frontier AI lab, rolled bac

The International AI Safety Report 2026 (multi-government committee, February 2026) confirms that risk management remains 'largely voluntary' as of early 2026. While 12 companies published Frontier AI Safety Frameworks in 2025, these remain voluntary commitments without binding legal requirements. The report notes 'a small number of regulatory regimes beginning to formalize risk management as legal requirements,' but the dominant governance mode is still voluntary pledges. This provides multi-government institutional confirmation that the structural race-to-the-bottom predicted by the alignment tax is actually occurring—voluntary frameworks are not transitioning to binding requirements at the pace needed to prevent competitive pressure from eroding safety commitments.

### Additional Evidence (confirm)
*Source: [[2024-12-00-uuk-mitigations-gpai-systemic-risks-76-experts]] | Added: 2026-03-19*

The gap between expert consensus (76 specialists identify third-party audits as top-3 priority) and actual implementation (no mandatory audit requirements at major labs) demonstrates that knowing what's needed is insufficient. Even when the field's experts across multiple domains agree on priorities, competitive dynamics prevent voluntary adoption.

---

Relevant Notes:

@@ -1,24 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "expert-consensus-identifies-third-party-audits-as-top-priority-but-no-mandatory-implementation-exists.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "expert-consensus-identifies-third-party-audits-as-top-priority-but-no-mandatory-implementation-exists.md:set_created:2026-03-19"
    ],
    "rejections": [
      "expert-consensus-identifies-third-party-audits-as-top-priority-but-no-mandatory-implementation-exists.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}
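
The validation logs in this change all share the schema above. A minimal sketch for tallying them across sessions (assuming Python, and that the logs sit as `*.json` files in a `validation_logs/` directory; the directory name is hypothetical):

```python
import json
from collections import Counter
from pathlib import Path

totals = Counter()
rejection_reasons = Counter()

# Each log carries a "validation_stats" block with integer counters plus
# "rejections" entries of the form "<claim-file>.md:<reason>".
for path in Path("validation_logs").glob("*.json"):
    stats = json.loads(path.read_text())["validation_stats"]
    for key in ("total", "kept", "fixed", "rejected"):
        totals[key] += stats[key]
    for entry in stats["rejections"]:
        rejection_reasons[entry.rsplit(":", 1)[-1]] += 1

print(dict(totals))
print(rejection_reasons.most_common())
```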

@@ -1,27 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 4,
    "rejected": 1,
    "fixes_applied": [
      "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md:set_created:2026-03-19",
      "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front",
      "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b"
    ],
    "rejections": [
      "privacy-enhancing-technologies-enable-independent-ai-scrutiny-without-ip-compromise-but-legal-authority-to-require-scrutiny-does-not-exist.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@@ -1,24 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "ai-model-reports-lack-standardized-dangerous-capability-disclosure-preventing-independent-assessment.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 1,
    "rejected": 1,
    "fixes_applied": [
      "ai-model-reports-lack-standardized-dangerous-capability-disclosure-preventing-independent-assessment.md:set_created:2026-03-19"
    ],
    "rejections": [
      "ai-model-reports-lack-standardized-dangerous-capability-disclosure-preventing-independent-assessment.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@@ -1,38 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 8,
    "rejected": 2,
    "fixes_applied": [
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:set_created:2026-03-19",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:set_created:2026-03-19",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
    ],
    "rejections": [
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:missing_attribution_extractor",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@@ -1,32 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 2,
    "rejected": 2,
    "fixes_applied": [
      "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md:set_created:2026-03-19",
      "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md:set_created:2026-03-19"
    ],
    "rejections": [
      "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md:missing_attribution_extractor",
      "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@@ -1,26 +0,0 @@

{
  "rejected_claims": [
    {
      "filename": "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 3,
    "rejected": 1,
    "fixes_applied": [
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:set_created:2026-03-19",
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
    ],
    "rejections": [
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@@ -1,65 +0,0 @@

---
type: source
title: "Effective Mitigations for Systemic Risks from General-Purpose AI"
author: "Risto Uuk, Annemieke Brouwer, Tim Schreier, Noemi Dreksler, Valeria Pulignano, Rishi Bommasani"
url: https://arxiv.org/abs/2412.02145
date: 2024-12-01
domain: ai-alignment
secondary_domains: []
format: paper
status: enrichment
priority: high
tags: [evaluation-infrastructure, third-party-audit, expert-consensus, systemic-risk, mitigation-prioritization]
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied: ["safe AI development requires building alignment mechanisms before scaling capability.md", "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints.md", "only binding regulation with enforcement teeth changes frontier AI lab behavior because every voluntary commitment has been eroded abandoned or made conditional on competitor behavior when commercially inconvenient.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

78-page paper evaluating 27 mitigation measures identified through literature review, assessed by 76 specialists across domains: AI safety, critical infrastructure, democratic processes, CBRN (chemical, biological, radiological, nuclear) risks, and discrimination/bias.

**Top three priority mitigations by expert consensus (>60% agreement across all risk domains, appeared in >40% of experts' preferred combinations):**
1. **Safety incident reports and security information sharing**
2. **Third-party pre-deployment model audits**
3. **Pre-deployment risk assessments**

**Guiding principles identified:** "External scrutiny, proactive evaluation and transparency are key principles for effective mitigation of systemic risks."

**Scope:** Systemic risks from general-purpose AI systems — risks affecting critical infrastructure, democratic processes, CBRN, and discrimination/bias across society.

## Agent Notes

**Why this matters:** This is the strongest evidence for expert consensus on evaluation priorities. 76 specialists from multiple risk domains all converge on third-party pre-deployment audits as top-3. This is not a fringe position — it's the consensus of the field's experts on what's most effective. Yet it's not what's happening. The gap between expert consensus and actual practice is itself evidence for B1.

**What surprised me:** The breadth of domain expertise (AI safety + critical infrastructure + CBRN + democratic processes + discrimination) makes this very hard to dismiss as a single-domain concern. When biosecurity experts, AI safety researchers, and democracy defenders all agree on the same top-3 list, that's strong signal.

**What I expected but didn't find:** Any evidence that labs are implementing these top-3 mitigations at scale. The paper identifies what's needed, not what's happening.

**KB connections:**
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the expert consensus defines what "building alignment mechanisms" should include; it's not happening
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — 76 experts identify the top priorities in 2024; in 2026, they're still not mandatory. Coordination mechanism evolution is lagging.
- [[voluntary safety pledges cannot survive competitive pressure]] — third-party pre-deployment audits are the top expert priority; labs like Anthropic dropped even weaker voluntary commitments

**Extraction hints:**
- Strong support for a claim: "76 cross-domain safety experts identify third-party pre-deployment audits as one of the top three priority mitigations for general-purpose AI systemic risks, but no mandatory requirement for such audits exists at major AI labs"
- The "external scrutiny, proactive evaluation and transparency" principle trio is quotable

**Context:** December 2024. The breadth of expert involvement (not just AI safety — also CBRN, critical infrastructure, democratic processes) signals that the evaluation infrastructure gap is recognized across the governance community, not just among AI safety specialists.

## Curator Notes

PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]] — expert consensus defines what "alignment mechanisms" means in practice; third-party audits top the list

WHY ARCHIVED: Provides expert consensus evidence for the evaluation infrastructure gap. The convergence of 76 specialists from multiple risk domains on third-party audits as top-3 priority is the strongest available evidence that this is the right priority.

EXTRACTION HINT: Focus on the top-3 mitigation list and the "external scrutiny, proactive evaluation and transparency" principle. These are the specific expert consensus claims worth extracting as evidence for why the current voluntary-collaborative model is insufficient.

## Key Facts
- Survey included 76 specialists across AI safety, critical infrastructure, democratic processes, CBRN risks, and discrimination/bias domains
- 27 mitigation measures were evaluated through literature review
- Top-3 mitigations had >60% agreement across all risk domains
- Top-3 mitigations appeared in >40% of experts' preferred combinations
- Paper is 78 pages and published December 2024
@@ -1,67 +0,0 @@

---
type: source
title: "Enabling External Scrutiny of AI with Privacy-Enhancing Technologies"
author: "Kendrea Beers, Helen Toner"
url: https://arxiv.org/abs/2502.05219
date: 2025-02-01
domain: ai-alignment
secondary_domains: []
format: paper
status: null-result
priority: high
tags: [evaluation-infrastructure, privacy-enhancing-technologies, OpenMined, external-scrutiny, Christchurch-Call, AISI, deployed]
processed_by: theseus
processed_date: 2026-03-19
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "LLM returned 1 claims, 1 rejected by validator"
---

## Content

Georgetown researchers (Helen Toner is Director of Strategy at CSET) describe technical infrastructure built by OpenMined that enables external scrutiny of AI systems without compromising IP or security using privacy-enhancing technologies (PETs).

**Two actual deployments (not just proposals):**
1. **Christchurch Call initiative** — examining social media recommendation algorithms
2. **UK AI Safety Institute** — evaluating frontier models

**Core tension addressed:** External scrutiny is essential for AI governance, but companies restrict access due to security and IP concerns. PET infrastructure provides a technical solution: independent researchers can examine AI systems without seeing proprietary weights, training data, or sensitive configurations.

**Policy recommendation:** Policymakers should focus on "empowering researchers on a legal level" — the technical infrastructure exists, the legal/regulatory framework to use it does not.

**Conclusion:** These approaches "deserve further exploration and support from the AI governance community."

## Agent Notes

**Why this matters:** This is the most concrete evidence that evaluation infrastructure can be DEPLOYED while respecting IP constraints. The Christchurch Call and AISI deployments are actual running systems, not proposals. The key insight is that the TECHNICAL barrier to independent evaluation (IP protection) is solvable with PETs — the remaining barrier is legal/regulatory authority to require or enable such access.

**What surprised me:** The Christchurch Call case is social media algorithms, not frontier AI — but the same PET infrastructure applies. This suggests the technical building blocks exist for frontier AI scrutiny; the missing piece is the legal empowerment to use them.

**What I expected but didn't find:** Evidence that labs are being required to submit to PET-based scrutiny. The deployments are with platforms that voluntarily participated (Christchurch Call is a voluntary initiative). The "legal empowerment" gap is exactly the missing piece.

**KB connections:**
- Directly relevant to the "missing correction mechanism" from Session 2026-03-18b — the technical solution for independent evaluation exists (PETs), but legal authority to mandate it does not
- [[voluntary safety pledges cannot survive competitive pressure]] — PET scrutiny also requires voluntary cooperation unless legally mandated; same structural problem
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic]] — the same government that could legally empower PET scrutiny is instead penalizing safety-focused labs

**Extraction hints:**
- Key claim: "Privacy-enhancing technologies can enable genuinely independent AI scrutiny without compromising IP, but legal authority to require such scrutiny does not currently exist for frontier AI"
- The technology-law gap is the actionable claim: technical infrastructure is ready; legal framework isn't
- The two actual deployments (Christchurch Call, AISI) are important evidence that PET-based scrutiny works in practice

**Context:** February 2025. Helen Toner is a prominent AI governance researcher (Georgetown CSET). OpenMined is a privacy-preserving ML organization. The fact that a senior governance researcher is writing "the technical infrastructure exists, we need legal empowerment" is a clear signal about where the bottleneck is.

## Curator Notes

PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]] — the technical alignment mechanism (PET-based independent scrutiny) exists but lacks legal mandate to be deployed at scale

WHY ARCHIVED: Provides evidence that the technical barrier to independent AI evaluation is solvable. The key insight — technology ready, legal framework missing — precisely locates the bottleneck in evaluation infrastructure development.

EXTRACTION HINT: Focus on the technology-law gap: PET infrastructure works (two deployments), but legal authority to require frontier AI labs to submit to independent evaluation doesn't exist. This is the specific intervention point.

## Key Facts
- Helen Toner is Director of Strategy at Georgetown's CSET
- Helen Toner is at Georgetown
- The Christchurch Call is a voluntary initiative
- UK AI Safety Institute has conducted frontier model evaluations using PET infrastructure
- The paper was published February 2025
@ -1,67 +0,0 @@
|
|||
---
|
||||
type: source
|
||||
title: "STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports"
|
||||
author: "Tegan McCaslin and co-authors (23 experts from government, civil society, academia, frontier AI companies)"
|
||||
url: https://arxiv.org/abs/2508.09853
|
||||
date: 2025-08-01
|
||||
domain: ai-alignment
|
||||
secondary_domains: []
|
||||
format: paper
|
||||
status: enrichment
|
||||
priority: medium
|
||||
tags: [evaluation-infrastructure, dangerous-capabilities, standardized-reporting, ChemBio, transparency, STREAM]
|
||||
processed_by: theseus
|
||||
processed_date: 2026-03-19
|
||||
enrichments_applied: ["AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk.md", "AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
|
||||
extraction_model: "anthropic/claude-sonnet-4.5"
|
||||
---
|
||||
|
||||
## Content
|
||||
|
||||
Proposes a standardized reporting framework (STREAM) for dangerous capability evaluations in AI model reports, with initial focus on chemical and biological (ChemBio) domains.
|
||||
|
||||
**Developed with:** 23 experts across government, civil society, academia, and frontier AI companies — multi-stakeholder consensus on what standardized evaluation reporting should include.
|
||||
|
||||
**Two purposes:**
|
||||
1. Practical guidance for AI developers presenting evaluation results with greater clarity
|
||||
2. Enables third parties to assess whether model reports contain sufficient detail about ChemBio evaluation rigor
|
||||
|
||||
**Format:** Includes concrete "gold standard" examples and a 3-page reporting template for implementation.
|
||||
|
||||
**Gap addressed:** Public transparency into dangerous AI capability evaluations is "crucial for building trust in AI development." Current model reports lack sufficient disclosure detail to enable meaningful independent assessment.
|
||||
|
||||
**Adoption status:** Not specified — proposed standard, not yet adopted.
|
||||
|
||||
## Agent Notes
|
||||
|
||||
**Why this matters:** STREAM is an attempt to solve the reporting transparency problem that underlies all evaluation infrastructure failures. Even if labs conduct evaluations, external parties can't assess quality without standardized disclosure. This is a necessary precondition for any meaningful third-party evaluation ecosystem. Without standardized reporting, the perception gap (labs report their own evaluations in favorable terms) perpetuates.
|
||||
|
||||
**What surprised me:** The 23-expert multi-stakeholder process is the right approach for a standard that will need buy-in from labs and regulators. The ChemBio focus is strategically important — this is the domain where the KB already has a claim about AI democratizing bioweapon capability (o3 scores 43.8% vs human PhD 22.1%). If STREAM can create transparency in this domain, it partially addresses the most proximate AI-enabled existential risk.
|
||||
|
||||
**What I expected but didn't find:** Evidence of adoption by any major lab in their current model reports. STREAM appears to be a proposal at this stage.
|
||||
|
||||
**KB connections:**
|
||||
- [[AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur]] — STREAM's ChemBio focus is directly relevant; if dangerous capability evaluations were standardized and transparent, the actual scope of bioweapon capability could be independently assessed
|
||||
- The "missing correction mechanism" from Session 2026-03-18b: standardized third-party reporting is a necessary component of any functioning audit system; STREAM addresses one piece of this
|
||||
|
||||
**Extraction hints:**
|
||||
- Could support a claim about the current state of dangerous capability disclosure: "AI model reports lack standardized evaluation disclosure for dangerous capabilities, preventing independent assessment of whether evaluations are rigorous or complete"
|
||||
- The STREAM framework itself (what standardized reporting should include) is worth extracting as a design standard claim
|
||||
|
||||
**Context:** August 2025. Multi-stakeholder process including government experts signals intent to create something that regulators could eventually mandate.
|
||||
|
||||
## Curator Notes
|
||||
|
||||
PRIMARY CONNECTION: [[AI lowers the expertise barrier for engineering biological weapons]] — STREAM directly addresses the disclosure gap in ChemBio capability evaluations
|
||||
|
||||
WHY ARCHIVED: Provides evidence of emerging standardization for dangerous capability evaluation reporting. The multi-stakeholder process (government, academia, AI companies) signals potential for eventual adoption.
|
||||
|
||||
EXTRACTION HINT: Focus on the disclosure gap: labs currently report their own dangerous capability evaluations without standardized format, preventing independent assessment of rigor.
|
||||
|
||||
|
||||
## Key Facts
|
||||
- STREAM (Standard for Transparently Reporting Evaluations in AI Model Reports) proposed August 2025
|
||||
- STREAM developed by 23 experts from government, civil society, academia, and frontier AI companies
|
||||
- STREAM includes 3-page reporting template and gold standard examples
|
||||
- Initial STREAM focus is chemical and biological (ChemBio) dangerous capability evaluations
|
||||
- STREAM has two stated purposes: practical guidance for AI developers and enabling third-party assessment of evaluation rigor
@@ -1,74 +0,0 @@
---
type: source
title: "Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices"
author: "Miles Brundage, Noemi Dreksler, Aidan Homewood, Sean McGregor, and 24+ co-authors"
url: https://arxiv.org/abs/2601.11699
date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
status: null-result
priority: high
tags: [evaluation-infrastructure, third-party-audit, AAL-framework, voluntary-collaborative, deception-resilient, governance-gap]
processed_by: theseus
processed_date: 2026-03-19
extraction_model: "anthropic/claude-sonnet-4.5"
extraction_notes: "LLM returned 2 claims, 2 rejected by validator"
---
## Content

A 28+ author paper from 27 organizations (GovAI, MIT CSAIL, Cambridge, Stanford, Yale, Anthropic contributors, Epoch AI, Apollo Research, Oxford Martin AI Governance, SaferAI, Mila, AVERI) that proposes a four-level AI Assurance Level (AAL) framework for frontier AI auditing.

**Four Assurance Levels** (a data-structure sketch follows this list):

- **AAL-1**: "The peak of current practices in AI." Time-bounded system audits relying substantially on company-provided information. This is what METR and AISI currently do.
- **AAL-2**: Near-term goal for advanced frontier developers. Greater access to non-public information, less reliance on company statements. Not yet standard.
- **AAL-3 & AAL-4**: Require "deception-resilient verification" — ruling out "materially significant deception by the auditee." Currently NOT technically feasible.
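
Read as a data structure, the framework is an ordered ladder of assurance levels with an explicit feasibility ceiling. A minimal sketch of that reading, where the attribute values paraphrase the paper's level descriptions but the encoding itself is mine:

```python
from dataclasses import dataclass

# Illustrative encoding of the four AI Assurance Levels. Attribute values
# paraphrase the paper's descriptions; the structure is an assumption.
@dataclass(frozen=True)
class AssuranceLevel:
    name: str
    relies_on_company_info: bool
    deception_resilient: bool
    technically_feasible: bool

AAL = [
    AssuranceLevel("AAL-1", True,  False, True),   # current peak practice (METR, AISI)
    AssuranceLevel("AAL-2", False, False, True),   # near-term goal, not yet standard
    AssuranceLevel("AAL-3", False, True,  False),  # deception-resilient, infeasible today
    AssuranceLevel("AAL-4", False, True,  False),  # deception-resilient, infeasible today
]

# The paper's key finding, encoded: no deception-resilient level is
# currently technically feasible.
assert not any(
    lvl.technically_feasible for lvl in AAL if lvl.deception_resilient
)
```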
**Core vision:** "Rigorous third-party verification of frontier AI developers' safety and security claims," examining internal deployments, information security, and decision-making processes — not just public products.

**Adoption model:** Market-based incentives (competitive procurement, insurance differentiation, audit credentials as competitive advantage) — NOT mandatory regulation. The authors describe "universal adoption across frontier developers" as a vision that requires "clarifying and strengthening incentives."

**Current state:** Adoption is "voluntary and concentrated among a few developers," with only "emerging pilots and voluntary assessments."

**Key concern:** Auditing must not "devolve into a checkbox exercise or lag behind changes in the industry."
## Agent Notes

**Why this matters:** The most authoritative and comprehensive proposal for frontier AI auditing to date. The four-level AAL framework is the field's best attempt to define what rigorous evaluation looks like. Crucially, it defines the ceiling of current practice (AAL-1 = voluntary-collaborative with the lab) and explicitly states that the most important levels (AAL-3/4, deception-resilient) are NOT YET TECHNICALLY FEASIBLE. This is the field admitting the limitation that makes B1 hold.

**What surprised me:** AAL-3 and AAL-4 are technically infeasible — the paper frames this not as a temporary gap but as a genuine technical barrier. This means even the field's most ambitious proposal acknowledges we can't currently audit whether labs are being deceptive about their safety practices. This is a much more fundamental gap than I expected.

**What I expected but didn't find:** Any mandatory requirement or regulatory pathway embedded in the framework. The paper relies entirely on market incentives and voluntary adoption. The contrast with analogous high-stakes domains (the FDA requiring independent clinical trials by regulation) is stark, and the paper does not address it.

**KB connections:**

- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — the same structural logic applies to voluntary auditing
- [[safe AI development requires building alignment mechanisms before scaling capability]] — AAL-1 as the current ceiling means alignment mechanisms are far below what capability scaling requires
- [[scalable oversight degrades rapidly as capability gaps grow]] — AAL-3/4 infeasibility is the specific mechanism: deception-resilient verification requires oversight capability that doesn't yet exist

**Extraction hints:**

- Primary claim candidate: "Frontier AI auditing infrastructure is limited to AAL-1 (voluntary-collaborative, relies on company information) because deception-resilient evaluation is not technically feasible" — this is specific, falsifiable, and supported by the most authoritative paper in the field
- Secondary claim candidate: "The voluntary-collaborative model of frontier AI evaluation shares the structural weakness of responsible scaling policies — it relies on labs' cooperation to function and cannot detect deception"
- The AAL framework itself (4 levels with specific characteristics) is worth a dedicated claim describing the level structure

**Context:** January 2026. Yoshua Bengio is a co-author (his inclusion signals broad alignment-community endorsement). Published ~3 months after Anthropic dropped its RSP pledge — the timing suggests the field is trying to rebuild evaluation infrastructure on more formal footing after the voluntary pledge model failed.

## Curator Notes

PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]] — this paper describes the current ceiling of alignment mechanisms (AAL-1) and what is needed but not yet feasible (AAL-3/4)

WHY ARCHIVED: Most comprehensive description of the evaluation infrastructure field in early 2026. Defines the gap between current capability and what rigorous evaluation requires. The technical infeasibility of deception-resilient evaluation (AAL-3/4) is a major finding that strengthens B1's "not being treated as such" claim.

EXTRACTION HINT: Focus on the AAL framework structure, the technical infeasibility of AAL-3/4, and the voluntary-collaborative limitation. These three elements together describe the core gap in evaluation infrastructure.
## Key Facts

- AAL-1 represents current peak practice: time-bounded system audits relying substantially on company-provided information
- AAL-2 is the near-term goal: greater access to non-public information, less reliance on company statements; not yet standard
- AAL-3 and AAL-4 require deception-resilient verification and are currently not technically feasible
- METR and AISI currently perform AAL-1-level evaluations
- The paper has 28+ authors from 27 organizations, including GovAI, MIT CSAIL, Cambridge, Stanford, Yale, Anthropic contributors, Epoch AI, and Apollo Research
- Yoshua Bengio is a co-author
- Published January 2026, approximately 3 months after the Anthropic RSP rollback
- The adoption model relies on market-based incentives: competitive procurement, insurance differentiation, audit credentials as competitive advantage
- Current adoption is voluntary and concentrated among a few developers, with only emerging pilots
@@ -1,64 +0,0 @@
---
type: source
title: "Toward Third-Party Assurance of AI Systems"
author: "Rachel M. Kim, Blaine Kuehnert, Alice Lai, Kenneth Holstein, Hoda Heidari, Rayid Ghani (Carnegie Mellon University)"
url: https://arxiv.org/abs/2601.22424
date: 2026-01-30
domain: ai-alignment
secondary_domains: []
format: paper
status: enrichment
priority: high
tags: [evaluation-infrastructure, third-party-assurance, conflict-of-interest, lifecycle-assessment, CMU]
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied: ["no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content

CMU researchers propose a comprehensive third-party AI assurance framework with four components (a small sketch of the third follows this list):

1. **Responsibility Assignment Matrix** — maps stakeholder involvement across AI lifecycle stages
2. **Interview Protocol** — structured conversations with each AI system stakeholder
3. **Maturity Matrix** — evaluates adherence to best practices
4. **Assurance Report Template** — draws from established business accounting assurance practices
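
A minimal sketch of what a maturity matrix could look like in code: lifecycle stages crossed with best practices, scored on an ordinal scale. The stage names, practices, and 0-3 scale are my assumptions, not the paper's instrument; the point is that this structure makes weak spots mechanically discoverable.

```python
# Hypothetical maturity matrix. Stages, practices, and the 0-3 scale are
# assumptions for illustration, not the CMU paper's instrument.
MATURITY_SCALE = {0: "absent", 1: "ad hoc", 2: "documented", 3: "audited"}

maturity_matrix: dict[str, dict[str, int]] = {
    "design":      {"stakeholder mapping": 2, "risk assessment": 1},
    "development": {"data documentation": 2, "model testing": 3},
    "deployment":  {"monitoring": 1, "incident response": 0},
}

def weakest_practices(matrix: dict[str, dict[str, int]], threshold: int = 1):
    """Flag stage/practice pairs at or below a maturity threshold."""
    return [
        (stage, practice)
        for stage, practices in matrix.items()
        for practice, score in practices.items()
        if score <= threshold
    ]

for stage, practice in weakest_practices(maturity_matrix):
    print(f"{stage}/{practice}: {MATURITY_SCALE[maturity_matrix[stage][practice]]}")
# design/risk assessment: ad hoc
# deployment/monitoring: ad hoc
# deployment/incident response: absent
```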
**Key distinction:** The paper proposes "assurance" rather than "audit" to "prevent conflict of interest and ensure credibility and accountability." This framing acknowledges that current AI auditing has a conflict-of-interest problem the authors explicitly want to avoid.

**Gap identified:** Few existing evaluation resources "address both the process of designing, developing, and deploying an AI system and the outcomes it produces." Few existing approaches are "end-to-end and operational, give actionable guidance, or present evidence of usability."

**Validation:** Tested on two use cases: a business document tagging tool and a housing resource allocation tool. Results: "sound and comprehensive, usable across different organizational contexts, and effective at identifying bespoke issues."
## Agent Notes

**Why this matters:** The explicit distinction between "assurance" and "audit" confirms the conflict-of-interest problem in current AI evaluation. The paper is trying to build what the Brundage et al. paper only proposes — but it is tested on deployment-scale tools, not frontier AI. This represents the early-stage methodology work needed to eventually close the independence gap.

**What surprised me:** The paper specifically acknowledges conflict of interest as a design concern, which is rare in the AI evaluation literature. Most papers don't name this structural problem explicitly.

**What I expected but didn't find:** Any discussion of how this scales to frontier AI systems (the two test cases are far more limited in capability than frontier models). The gap between "document tagging tool" and "Claude Opus 4.6" is enormous.

**KB connections:**

- Directly relevant to the "missing correction mechanism" identified in Session 2026-03-18b — third-party performance measurement that is genuinely independent, not collaborative
- [[no research group is building alignment through collective intelligence infrastructure]] — this paper is one of the first to try to build the assurance infrastructure, but at a small scale

**Extraction hints:**

- Could support a claim about the early stage of AI assurance methodology: "third-party AI assurance methodology is at the proof-of-concept stage, validated in small deployment contexts but not yet applicable to frontier AI at scale"
- The conflict-of-interest framing is valuable for any claim about the limitations of current evaluation practice

**Context:** Published by CMU researchers in January 2026. The field is clearly aware of the limitations of current voluntary-collaborative evaluation.

## Curator Notes

PRIMARY CONNECTION: [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — this paper is early evidence that some groups ARE starting to build assurance infrastructure, though at small scale

WHY ARCHIVED: Provides methodology for third-party AI assurance that explicitly addresses the conflict-of-interest problem. Important evidence that the field is aware of the independence gap.

EXTRACTION HINT: The "assurance vs. audit" distinction to prevent conflict of interest is the key extractable insight. The lifecycle approach (process + outcomes) is also worth noting.

## Key Facts

- CMU researchers published "Toward Third-Party Assurance of AI Systems" in January 2026
- The framework was tested on a business document tagging tool and a housing resource allocation tool
- The paper identifies that few existing evaluation resources "address both the process of designing, developing, and deploying an AI system and the outcomes it produces"
- Few existing approaches are "end-to-end and operational, give actionable guidance, or present evidence of usability," according to the gap analysis
@@ -1,74 +0,0 @@
---
type: source
title: "METR and UK AISI: State of Pre-Deployment AI Evaluation Practice (March 2026)"
author: "METR (metr.org) and UK AI Security Institute (aisi.gov.uk)"
url: https://metr.org/blog/
date: 2026-03-01
domain: ai-alignment
secondary_domains: []
format: article
status: enrichment
priority: medium
tags: [evaluation-infrastructure, pre-deployment, METR, AISI, voluntary-collaborative, Inspect, Claude-Opus-4-6, cyber-evaluation]
processed_by: theseus
processed_date: 2026-03-19
enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content

Synthesized overview of the two main organizations conducting pre-deployment AI evaluations as of March 2026.

**METR (Model Evaluation and Threat Research):**

- Review of Anthropic's Sabotage Risk Report: Claude Opus 4.6 (March 12, 2026)
- Review of Anthropic's Summer 2025 Pilot Sabotage Risk Report (October 28, 2025)
- Summary of the gpt-oss methodology review for OpenAI (October 23, 2025)
- Common Elements of Frontier AI Safety Policies (December 2025 update)
- Frontier AI Safety Policies repository (February 2025) — catalogs safety policies from Amazon, Anthropic, Google DeepMind, Meta, Microsoft, OpenAI

**UK AI Security Institute (formerly AI Safety Institute, renamed 2026):**

- Cyber capability testing of 7 LLMs on custom-built cyber ranges (March 16, 2026)
- Universal jailbreak assessment against best-defended systems (February 17, 2026)
- Open-source Inspect evaluation framework (April 2024; a minimal usage sketch follows this list)
- Inspect Scout transcript analysis tool (February 25, 2026)
- ControlArena library for AI control experiments (October 22, 2025)
- HiBayES statistical modeling framework (May 2025)
- International joint testing exercise on agentic systems (July 2025)
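
For a sense of what AISI's open-source tooling looks like in practice, here is a minimal Inspect task in the style of the framework's own hello-world example. The task and file names are placeholders; the imports and structure follow Inspect's documented API.

```python
# Minimal Inspect task: one sample, one generation step, exact-match scoring.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Just reply with Hello World",
                target="Hello World",
            )
        ],
        solver=[generate()],  # single model generation step
        scorer=exact(),       # exact match against the target string
    )
```

Run with `inspect eval hello_world.py --model <provider/model>`; the CLI handles model access, logging, and the scoring loop.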
**Key structural observation:** METR's evaluations are conducted by invitation or agreement with labs (METR "worked with" Anthropic on Opus 4.6 and "worked with" OpenAI on gpt-oss). UK AISI conducts "joint pre-deployment evaluations." No mandatory requirement exists for labs to submit to these evaluations. AISI's renaming from "Safety Institute" to "Security Institute" suggests a shift from safety (avoiding catastrophic AI risk) to security (preventing cybersecurity threats).

## Agent Notes

**Why this matters:** This is the current ceiling of third-party AI evaluation in practice. Both METR and AISI represent best-in-class evaluation practice — and both operate on a voluntary-collaborative model where labs invite or agree to evaluation. This maps directly to AAL-1 in the Brundage et al. framework ("the peak of current practices in AI" — relying substantially on company-provided information).

**What surprised me:** AISI's renaming to "AI Security Institute." This suggests the UK government's focus has shifted from existential AI safety risk (alignment, catastrophic outcomes) toward near-term cybersecurity threats. If the primary government-funded evaluation body is reorienting from safety to security, the evaluation infrastructure for alignment-relevant risks weakens.

**What I expected but didn't find:** Any evidence that METR evaluates labs without the lab's consent or cooperation. All evaluations appear to be collaborative — the lab shares information, METR reviews it. There is no mechanism for METR to evaluate a lab that refuses.

**KB connections:**

- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — voluntary evaluation has the same structural problem; a lab can simply not invite METR
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — METR and AISI are growing their evaluation capacity, but AI capabilities are growing faster; the gap widens in every period
- [[government designation of safety-conscious AI labs as supply chain risks inverts the regulatory dynamic]] — the AISI renaming to "Security Institute" is a softer version of the same dynamic: government safety infrastructure shifting to serve government security interests rather than existential risk reduction

**Extraction hints:**

- Key claim: "Pre-deployment AI evaluation operates on a voluntary-collaborative model where evaluators (METR, AISI) require lab cooperation, meaning labs that decline evaluation face no consequence"
- The AISI renaming is worth noting as a signal: the only government-funded AI safety evaluation body is shifting its mandate
- The scope of METR/AISI evaluations (mostly sabotage risk and cyber capabilities) may be narrower than alignment-relevant evaluation

**Context:** March 2026 state of play, assessed by synthesizing METR's published blog and AISI's published work pages — these are the two most active evaluation organizations globally.

## Curator Notes

PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms before scaling capability]] — the current ceiling of evaluation practice (METR/AISI, voluntary-collaborative) is far below what "building alignment mechanisms before scaling capability" requires

WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.

EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal about the government's priority shift from safety to security.

## Key Facts

- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
- UK AISI was renamed from "AI Safety Institute" to "AI Security Institute" in 2026
- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI