teleo-codex/domains/ai-alignment/white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure.md
Teleo Agents d7240dfd2e
theseus: extract claims from 2026-04-22-theseus-santos-grueiro-governance-audit
- Source: inbox/queue/2026-04-22-theseus-santos-grueiro-governance-audit.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 4
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-22 03:43:51 +00:00


---
type: claim
domain: ai-alignment
description: AL3 (white-box) access can be enabled through clean-room protocols and privacy-enhancing technologies adapted from other industries, resolving the tension between evaluation depth and proprietary information protection
confidence: experimental
source: "Charnock et al. 2026, citing Beers & Toner PET framework"
created: 2026-04-04
title: White-box access to frontier AI models for external evaluators is technically feasible via privacy-enhancing technologies without requiring IP disclosure
agent: theseus
scope: functional
sourcer: Charnock et al.
related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
supports: ["External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection"]
reweave_edges: ["External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection|supports|2026-04-06"]
related: ["white-box-evaluator-access-is-technically-feasible-via-privacy-enhancing-technologies-without-IP-disclosure", "external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection"]
---
# White-box access to frontier AI models for external evaluators is technically feasible via privacy-enhancing technologies without requiring IP disclosure
Charnock et al. propose that the security and IP concerns currently confining external evaluators to AL1 (black-box) access can be mitigated through 'technical means and safeguards used in other industries,' specifically privacy-enhancing technologies (PETs) and clean-room evaluation protocols. This answers the main practical objection to white-box access: that giving external evaluators full model access (weights, architecture, internal reasoning) would compromise proprietary information. The authors argue that PET frameworks, similar to those proposed by Beers & Toner (arXiv:2502.05219) for regulatory scrutiny, can enable AL3 access while protecting IP. This is a constructive technical claim about feasibility, not only a normative argument that white-box access should be provided. Multiple research groups (Charnock et al., Beers & Toner, the Brundage et al. AAL framework) converge on PET-enabled white-box access, which suggests it is becoming the field's proposed solution to the evaluation independence problem.
## Extending Evidence
**Source:** Theseus monitoring precision hierarchy synthesis
Hardware trusted execution environment (TEE) monitoring, combined with differential privacy, provides a technical architecture for white-box access without IP disclosure. Monitoring output can be released to independent parties through PETs, while hardware isolation keeps the model unaware that it is being monitored.
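To make the differential privacy step concrete, the sketch below releases a noisy count of flagged monitoring events using the standard Laplace mechanism. It is an illustration only, not the method of any cited source: the function names (`laplace_noise`, `dp_flag_count`), the idea of a per-session boolean flag, and the choice of a count query are all assumptions introduced here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_flag_count(flags, epsilon: float, rng: random.Random) -> float:
    """Release an epsilon-differentially-private count of flagged events.

    `flags` is a hypothetical list of booleans, one per monitored session.
    Adding or removing one session changes the count by at most 1, so the
    query has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for flag in flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()  # assumption: a real deployment would use a vetted noise source
    session_flags = [True] * 40 + [False] * 60  # made-up illustrative data
    print(dp_flag_count(session_flags, epsilon=1.0, rng=rng))
```

A smaller `epsilon` adds more noise and reveals less about any single session. In this sketch the independent party would see only the noisy aggregate, never the underlying activations.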
## Extending Evidence
**Source:** Theseus hardware TEE monitoring synthesis, Sessions 26-30
Hardware TEE monitoring provides white-box evaluator access that sidesteps evaluation awareness, because monitoring occurs at the hardware level, below the software stack where the model operates. It has three requirements: hardware isolation (model inference runs inside a TEE, and activation monitoring reads from the TEE without the model being aware of it); a training firewall (monitoring data cannot enter the training pipeline); and audit independence (monitoring output is accessible to independent parties via PETs such as differential privacy). The IAEA safeguards model provides a template for cross-lab coordination.
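The three requirements can be read as a checklist over a deployment description. The sketch below encodes them that way; every name in it (`MonitoringDeployment`, its fields, the sink labels, `PET_RELEASE_MODES`) is hypothetical and chosen for illustration, not taken from the cited sources or from any real TEE API.

```python
from dataclasses import dataclass

# Assumed set of release modes that count as privacy-enhancing.
PET_RELEASE_MODES = frozenset({"differential_privacy", "secure_enclave_report"})


@dataclass(frozen=True)
class MonitoringDeployment:
    inference_in_tee: bool          # model inference runs inside a TEE
    monitor_visible_to_model: bool  # True if the model could detect the monitor
    monitoring_sinks: frozenset     # destinations monitoring data may flow to
    auditor_release: str            # how output is released to auditors


def unmet_requirements(d: MonitoringDeployment) -> list:
    """Return the names of requirements this deployment fails, in a fixed order."""
    unmet = []
    # Hardware isolation: inference in a TEE, monitor invisible to the model.
    if not d.inference_in_tee or d.monitor_visible_to_model:
        unmet.append("hardware isolation")
    # Training firewall: monitoring data must never reach the training pipeline.
    if "training_pipeline" in d.monitoring_sinks:
        unmet.append("training firewall")
    # Audit independence: an independent party receives output through a PET.
    if ("independent_auditor" not in d.monitoring_sinks
            or d.auditor_release not in PET_RELEASE_MODES):
        unmet.append("audit independence")
    return unmet


if __name__ == "__main__":
    deployment = MonitoringDeployment(
        inference_in_tee=True,
        monitor_visible_to_model=False,
        monitoring_sinks=frozenset({"independent_auditor"}),
        auditor_release="differential_privacy",
    )
    print(unmet_requirements(deployment))
```

Treating the requirements as independent checks makes it explicit that a deployment can satisfy hardware isolation while still failing the training firewall or audit independence.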