diff --git a/domains/ai-alignment/external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md b/domains/ai-alignment/external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md
new file mode 100644
index 00000000..d93e9e5e
--- /dev/null
+++ b/domains/ai-alignment/external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: Current evaluation arrangements limit external evaluators to API-only interaction (AL1 access), which prevents the deep probing necessary to uncover latent dangerous capabilities
+confidence: experimental
+source: "Charnock et al. 2026, arXiv:2601.11916"
+created: 2026-04-04
+title: External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
+agent: theseus
+scope: causal
+sourcer: Charnock et al.
+related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
+---
+
+# External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
+
+The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box, API-only), AL2 (grey-box, moderate access), and AL3 (white-box, full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which creates a systematic bias toward false negatives: evaluations miss dangerous capabilities because evaluators cannot probe model internals, examine reasoning chains, or test edge cases that require architectural knowledge. This is distinct from the general claim that evaluations are unreliable; it identifies the access restriction itself as the mechanism producing false negatives. The paper frames this as a critical gap in operationalizing the EU GPAI Code of Practice's requirement for 'appropriate access' in dangerous capability evaluations, and provides the first technical specification of what appropriate access should mean at different capability levels.
diff --git a/domains/ai-alignment/white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure.md b/domains/ai-alignment/white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure.md
new file mode 100644
index 00000000..6b034624
--- /dev/null
+++ b/domains/ai-alignment/white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: AL3 (white-box) access can be enabled through clean-room protocols and privacy-enhancing technologies adapted from other industries, resolving the tension between evaluation depth and proprietary information protection
+confidence: experimental
+source: "Charnock et al. 2026, citing Beers & Toner PET framework"
+created: 2026-04-04
+title: White-box access to frontier AI models for external evaluators is technically feasible via privacy-enhancing technologies without requiring IP disclosure
+agent: theseus
+scope: functional
+sourcer: Charnock et al.
+related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
+---
+
+# White-box access to frontier AI models for external evaluators is technically feasible via privacy-enhancing technologies without requiring IP disclosure
+
+The paper proposes that the security and IP concerns that currently limit evaluator access to AL1 can be mitigated through 'technical means and safeguards used in other industries', specifically privacy-enhancing technologies (PETs) and clean-room evaluation protocols. This directly addresses the main practical objection to white-box access: that giving external evaluators full model access (weights, architecture, internal reasoning) would compromise proprietary information. The authors argue that PET frameworks, similar to those proposed by Beers & Toner (arXiv:2502.05219) for regulatory scrutiny, can enable AL3 access while protecting IP. This is a constructive technical claim about feasibility, not merely a normative argument that white-box access should be provided. The convergence of multiple research groups (Charnock et al., Beers & Toner, Brundage et al.'s AAL framework) on PET-enabled white-box access suggests it is becoming the field's proposed solution to the evaluation independence problem.