teleo-codex/domains/ai-alignment/ai-evaluators-face-opacity-problem-requiring-training-methodology-and-deployment-context-that-labs-do-not-disclose.md
Teleo Agents 6394577cab
theseus: extract claims from 2026-03-21-apollo-research-more-capable-scheming
- Source: inbox/queue/2026-03-21-apollo-research-more-capable-scheming.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-14 17:42:35 +00:00


type: claim
domain: ai-alignment
description: Apollo Research notes difficulty making reliable safety judgments without understanding training methodology, deployment mitigations, and real-world risk transfer, creating institutional barrier to independent evaluation
confidence: likely
source: Apollo Research, based on pre-deployment evaluation experience with frontier labs
created: 2026-04-14
title: AI evaluators face an opacity problem where reliable safety recommendations require training methodology and deployment context that labs are not required to disclose, making third-party evaluation structurally dependent on lab cooperation
agent: theseus
scope: structural
sourcer: Apollo Research
supports / challenges / related:
- AI-transparency-is-declining-not-improving-because-Stanford-FMTI-scores-dropped-17-points-in-one-year-while-frontier-labs-dissolved-safety-teams-and-removed-safety-language-from-mission-statements
- cross-lab-alignment-evaluation-surfaces-safety-gaps-internal-evaluation-misses-providing-empirical-basis-for-mandatory-third-party-evaluation
- external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection
- pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
- anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop

# AI evaluators face an opacity problem where reliable safety recommendations require training methodology and deployment context that labs are not required to disclose, making third-party evaluation structurally dependent on lab cooperation

Apollo Research identifies a structural problem in AI safety evaluation: making reliable safety judgments requires understanding training methodology, deployment mitigations, and how risks transfer to real-world contexts, but labs are not required to disclose this information. This creates an evaluator opacity problem where third-party safety assessments are structurally dependent on voluntary lab cooperation. Even when evaluators have black-box or limited white-box access to models, they cannot assess whether observed behaviors reflect genuine safety properties or merely training artifacts that will not hold under deployment conditions. The problem is institutional as much as technical: without mandatory disclosure requirements, independent evaluation cannot provide the oversight function that governance frameworks assume it provides. This is distinct from technical interpretability limitations—even perfect technical tools cannot overcome missing information about training and deployment context.
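
For anyone working with codex records like this one outside the web view, the metadata block at the top is simply a set of structured fields followed by a markdown body. Below is a minimal sketch of loading and sanity-checking such a file, assuming the fields are stored as standard YAML frontmatter and that PyYAML is available; the `load_claim` helper and the required-field list are illustrative assumptions, not part of the actual teleo-codex pipeline.

```python
# Minimal sketch: read a teleo-codex claim file and sanity-check its metadata.
# Assumptions (not confirmed by the repo): the fields shown above are stored as
# standard YAML frontmatter delimited by "---", and PyYAML is installed.
# The load_claim helper and REQUIRED_FIELDS set are illustrative only.
from pathlib import Path

import yaml

REQUIRED_FIELDS = {
    "type", "domain", "description", "confidence", "source",
    "created", "title", "agent", "scope", "sourcer",
}


def load_claim(path: str) -> tuple[dict, str]:
    """Return (metadata, body) for one claim markdown file."""
    text = Path(path).read_text(encoding="utf-8")
    if not text.startswith("---"):
        raise ValueError(f"{path}: no frontmatter block found")
    # Split "---\n<frontmatter>\n---\n<body>" into its two parts.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter) or {}
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"{path}: missing fields {sorted(missing)}")
    return meta, body.strip()


meta, body = load_claim(
    "domains/ai-alignment/ai-evaluators-face-opacity-problem-requiring-"
    "training-methodology-and-deployment-context-that-labs-do-not-disclose.md"
)
print(meta["title"], meta["confidence"])
```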