extract: 2026-03-00-metr-aisi-pre-deployment-evaluation-practice #1361
8 changed files with 149 additions and 3 deletions
@ -23,6 +23,12 @@ The alignment field has converged on a problem they cannot solve with their curr
The UK AI for Collective Intelligence Research Network represents a national-scale institutional commitment to building CI infrastructure with explicit alignment goals. Funded by UKRI/EPSRC, the network proposes the 'AI4CI Loop' (Gathering Intelligence → Informing Behaviour) as a framework for multi-level decision making. The research strategy includes seven trust properties (human agency, security, privacy, transparency, fairness, value alignment, accountability) and specifies technical requirements including federated learning architectures, secure data repositories, and foundation models adapted for collective intelligence contexts. This is not purely academic—it's a government-backed infrastructure program with institutional resources. However, the strategy is prospective (published 2024-11) and describes a research agenda rather than deployed systems, so it represents institutional intent rather than operational infrastructure.

### Additional Evidence (challenge)

*Source: [[2026-01-00-kim-third-party-ai-assurance-framework]] | Added: 2026-03-19*

CMU researchers have built and validated a third-party AI assurance framework with four operational components (Responsibility Assignment Matrix, Interview Protocol, Maturity Matrix, Assurance Report Template), tested on two real deployment cases. This represents concrete infrastructure-building work, though at small scale and not yet applicable to frontier AI.
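
A rough sketch of how the four components might be carried as one record in assurance tooling; the field names, types, and example values below are illustrative assumptions, not the schema from the CMU paper.

```python
from dataclasses import dataclass, field

@dataclass
class AssuranceEngagement:
    """Illustrative container for one third-party assurance engagement.

    The four component names come from the CMU framework; the field layout
    and example values are assumptions, not the paper's schema.
    """
    system_name: str
    responsibility_matrix: dict = field(default_factory=dict)  # activity -> accountable party
    interview_findings: list = field(default_factory=list)     # notes gathered via the Interview Protocol
    maturity_scores: dict = field(default_factory=dict)        # practice area -> assumed 1-5 rating
    report_sections: list = field(default_factory=list)        # drafted against the Assurance Report Template

engagement = AssuranceEngagement(
    system_name="housing resource allocation tool",
    responsibility_matrix={"training data quality": "deploying agency", "model updates": "vendor"},
    maturity_scores={"monitoring": 2, "documentation": 3},
)
```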

---

Relevant Notes:

@ -32,6 +32,12 @@ The problem compounds the alignment challenge: even if safety research produces
- Risk management remains "largely voluntary" while regulatory regimes begin formalizing requirements based on these unreliable evaluation methods
- The report identifies this as a structural governance problem, not a technical limitation that engineering can solve

### Additional Evidence (extend)

*Source: [[2026-03-00-metr-aisi-pre-deployment-evaluation-practice]] | Added: 2026-03-19*

The voluntary-collaborative model adds a selection bias dimension to evaluation unreliability: evaluations only happen when labs consent, meaning the sample of evaluated models is systematically biased toward labs confident in their safety measures. Labs with weaker safety practices can avoid evaluation entirely.
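
A toy simulation with entirely made-up numbers illustrates the selection effect: if only labs above some confidence threshold volunteer, the evaluated sample looks systematically safer than the full population of labs.

```python
import random

random.seed(0)

# Assumed toy model: each lab has a latent "safety practice" score in [0, 1],
# and only labs above an arbitrary threshold consent to pre-deployment evaluation.
labs = [random.random() for _ in range(1_000)]
volunteers = [score for score in labs if score > 0.6]

print(f"population mean safety score: {sum(labs) / len(labs):.2f}")
print(f"evaluated-sample mean score:  {sum(volunteers) / len(volunteers):.2f}")
print(f"share of labs never evaluated: {1 - len(volunteers) / len(labs):.0%}")
# The evaluated sample over-represents safety-confident labs, so aggregate
# evaluation results say little about the labs that declined.
```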

---

Relevant Notes:

@ -0,0 +1,38 @@
{
  "rejected_claims": [
    {
      "filename": "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 8,
    "rejected": 2,
    "fixes_applied": [
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:set_created:2026-03-19",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:safe-AI-development-requires-building-alignment-mechanisms-b",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:stripped_wiki_link:AI-transparency-is-declining-not-improving-because-Stanford-",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:set_created:2026-03-19",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:Anthropics-RSP-rollback-under-commercial-pressure-is-the-fir",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
    ],
    "rejections": [
      "frontier-ai-auditing-limited-to-voluntary-collaborative-model-because-deception-resilient-verification-not-technically-feasible.md:missing_attribution_extractor",
      "voluntary-collaborative-auditing-shares-structural-weakness-of-responsible-scaling-policies-requiring-lab-cooperation-to-function.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@ -0,0 +1,32 @@
{
  "rejected_claims": [
    {
      "filename": "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    },
    {
      "filename": "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 2,
    "kept": 0,
    "fixed": 2,
    "rejected": 2,
    "fixes_applied": [
      "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md:set_created:2026-03-19",
      "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md:set_created:2026-03-19"
    ],
    "rejections": [
      "third-party-ai-assurance-methodology-is-at-proof-of-concept-stage-validated-in-small-deployment-contexts-but-not-yet-applicable-to-frontier-ai-at-scale.md:missing_attribution_extractor",
      "ai-assurance-explicitly-distinguishes-itself-from-audit-to-prevent-conflict-of-interest-and-ensure-credibility-which-acknowledges-current-evaluation-has-a-structural-independence-problem.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}

@ -0,0 +1,26 @@
{
  "rejected_claims": [
    {
      "filename": "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md",
      "issues": [
        "missing_attribution_extractor"
      ]
    }
  ],
  "validation_stats": {
    "total": 1,
    "kept": 0,
    "fixed": 3,
    "rejected": 1,
    "fixes_applied": [
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:set_created:2026-03-19",
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:voluntary-safety-pledges-cannot-survive-competitive-pressure",
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:stripped_wiki_link:only-binding-regulation-with-enforcement-teeth-changes-front"
    ],
    "rejections": [
      "pre-deployment-ai-evaluation-operates-on-voluntary-collaborative-model-where-labs-can-decline-without-consequence.md:missing_attribution_extractor"
    ]
  },
  "model": "anthropic/claude-sonnet-4.5",
  "date": "2026-03-19"
}
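
The three manifests share one shape, so a small helper could tally them; this is a hypothetical convenience script (the glob pattern and file layout are assumptions), not part of the repository's tooling.

```python
import glob
import json
from collections import Counter


def summarize_manifests(pattern="**/validation.json"):
    """Tally validation_stats and rejection reasons across extraction manifests.

    Hypothetical helper: the field names mirror the manifests above, but the
    file layout (glob pattern) is an assumption.
    """
    totals = Counter()
    reasons = Counter()
    for path in glob.glob(pattern, recursive=True):
        with open(path) as handle:
            manifest = json.load(handle)
        stats = manifest.get("validation_stats", {})
        for key in ("total", "kept", "fixed", "rejected"):
            totals[key] += stats.get(key, 0)
        for claim in manifest.get("rejected_claims", []):
            reasons.update(claim.get("issues", []))
    return totals, reasons


if __name__ == "__main__":
    totals, reasons = summarize_manifests()
    print(dict(totals))   # for the three manifests above: {'total': 5, 'kept': 0, 'fixed': 13, 'rejected': 5}
    print(dict(reasons))  # for the three manifests above: {'missing_attribution_extractor': 5}
```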
@ -7,9 +7,13 @@ date: 2026-01-01
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: null-result
priority: high
tags: [evaluation-infrastructure, third-party-audit, AAL-framework, voluntary-collaborative, deception-resilient, governance-gap]
+processed_by: theseus
+processed_date: 2026-03-19
+extraction_model: "anthropic/claude-sonnet-4.5"
+extraction_notes: "LLM returned 2 claims, 2 rejected by validator"
---

## Content

@ -56,3 +60,15 @@ PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms
WHY ARCHIVED: Most comprehensive description of the evaluation infrastructure field in early 2026. Defines the gap between current capability and what rigorous evaluation requires. The technical infeasibility of deception-resilient evaluation (AAL-3/4) is a major finding that strengthens B1's "not being treated as such" claim.

EXTRACTION HINT: Focus on the AAL framework structure, the technical infeasibility of AAL-3/4, and the voluntary-collaborative limitation. These three elements together describe the core gap in evaluation infrastructure.

## Key Facts

- AAL-1 represents current peak practice: time-bounded system audits relying substantially on company-provided information
- AAL-2 is the near-term goal: greater access to non-public information, less reliance on company statements, not yet standard
- AAL-3 and AAL-4 require deception-resilient verification and are currently not technically feasible (see the sketch after this list)
- METR and AISI currently perform AAL-1 level evaluations
- Paper has 28+ authors from 27 organizations including GovAI, MIT CSAIL, Cambridge, Stanford, Yale, Anthropic contributors, Epoch AI, Apollo Research
- Yoshua Bengio is a co-author
- Published January 2026, approximately 3 months after the Anthropic RSP rollback
- Adoption model relies on market-based incentives: competitive procurement, insurance differentiation, audit credentials as competitive advantage
- Current adoption is voluntary and concentrated among a few developers with only emerging pilots
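
A compact way to hold the four assurance levels as data, paraphrasing the bullets above; the class and field names are assumptions, not the paper's formal definitions, and since the note does not distinguish AAL-3 from AAL-4 beyond requiring deception-resilient verification, both get the same summary here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssuranceLevel:
    name: str
    status: str  # paraphrased from the key facts above


# Illustrative representation only; descriptions paraphrase the bullets above.
AAL_LEVELS = (
    AssuranceLevel("AAL-1", "current peak practice: time-bounded audits relying substantially on company-provided information"),
    AssuranceLevel("AAL-2", "near-term goal: more access to non-public information, less reliance on company statements; not yet standard"),
    AssuranceLevel("AAL-3", "requires deception-resilient verification; not currently technically feasible"),
    AssuranceLevel("AAL-4", "requires deception-resilient verification; not currently technically feasible"),
)

current_practice = [level for level in AAL_LEVELS if level.name == "AAL-1"]  # where METR and AISI operate today
```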
@ -7,9 +7,13 @@ date: 2026-01-30
domain: ai-alignment
secondary_domains: []
format: paper
-status: unprocessed
+status: enrichment
priority: high
tags: [evaluation-infrastructure, third-party-assurance, conflict-of-interest, lifecycle-assessment, CMU]
+processed_by: theseus
+processed_date: 2026-03-19
+enrichments_applied: ["no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@ -51,3 +55,10 @@ PRIMARY CONNECTION: [[no research group is building alignment through collective
WHY ARCHIVED: Provides methodology for third-party AI assurance that explicitly addresses the conflict of interest problem. Important evidence that the field is aware of the independence gap.

EXTRACTION HINT: The "assurance vs audit" distinction to prevent conflict of interest is the key extractable insight. The lifecycle approach (process + outcomes) is also worth noting.

## Key Facts

- CMU researchers published 'Toward Third-Party Assurance of AI Systems' in January 2026
- The framework was tested on a business document tagging tool and a housing resource allocation tool
- The paper identifies that few existing evaluation resources 'address both the process of designing, developing, and deploying an AI system and the outcomes it produces'
- Few existing approaches are 'end-to-end and operational, give actionable guidance, or present evidence of usability' according to the gap analysis

@ -7,9 +7,13 @@ date: 2026-03-01
domain: ai-alignment
secondary_domains: []
format: article
-status: unprocessed
+status: enrichment
priority: medium
tags: [evaluation-infrastructure, pre-deployment, METR, AISI, voluntary-collaborative, Inspect, Claude-Opus-4-6, cyber-evaluation]
+processed_by: theseus
+processed_date: 2026-03-19
+enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
+extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

@ -61,3 +65,10 @@ PRIMARY CONNECTION: [[safe AI development requires building alignment mechanisms
WHY ARCHIVED: Documents the actual state of pre-deployment AI evaluation practice in early 2026. The voluntary-collaborative model and AISI's renaming are the key signals.

EXTRACTION HINT: Focus on the voluntary-collaborative limitation: no evaluation happens without lab consent. Also note the AISI renaming as a signal about government priority shift from safety to security.

## Key Facts

- METR reviewed Anthropic's Claude Opus 4.6 sabotage risk report on March 12, 2026
- UK AISI was renamed from 'AI Safety Institute' to 'AI Security Institute' in 2026
- UK AISI tested 7 LLMs on custom cyber ranges as of March 16, 2026
- METR maintains a Frontier AI Safety Policies repository covering Amazon, Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI