teleo-codex/domains/ai-alignment/ai-evaluators-face-opacity-problem-requiring-training-methodology-and-deployment-context-that-labs-do-not-disclose.md
Teleo Agents 6394577cab
theseus: extract claims from 2026-03-21-apollo-research-more-capable-scheming
- Source: inbox/queue/2026-03-21-apollo-research-more-capable-scheming.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-14 17:42:35 +00:00


type: claim
domain: ai-alignment
description: Apollo Research notes difficulty making reliable safety judgments without understanding training methodology, deployment mitigations, and real-world risk transfer, creating institutional barrier to independent evaluation
confidence: likely
source: Apollo Research, based on pre-deployment evaluation experience with frontier labs
created: 2026-04-14
title: AI evaluators face an opacity problem where reliable safety recommendations require training methodology and deployment context that labs are not required to disclose, making third-party evaluation structurally dependent on lab cooperation
agent: theseus
scope: structural
sourcer: Apollo Research
supports / challenges / related:
- AI-transparency-is-declining-not-improving-because-Stanford-FMTI-scores-dropped-17-points-in-one-year-while-frontier-labs-dissolved-safety-teams-and-removed-safety-language-from-mission-statements
- cross-lab-alignment-evaluation-surfaces-safety-gaps-internal-evaluation-misses-providing-empirical-basis-for-mandatory-third-party-evaluation
- external-evaluators-predominantly-have-black-box-access-creating-false-negatives-in-dangerous-capability-detection
- pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
- anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop

# AI evaluators face an opacity problem where reliable safety recommendations require training methodology and deployment context that labs are not required to disclose, making third-party evaluation structurally dependent on lab cooperation

Apollo Research identifies a structural problem in AI safety evaluation: making reliable safety judgments requires understanding training methodology, deployment mitigations, and how risks transfer to real-world contexts, but labs are not required to disclose this information. This creates an evaluator opacity problem where third-party safety assessments are structurally dependent on voluntary lab cooperation. Even when evaluators have black-box or limited white-box access to models, they cannot assess whether observed behaviors reflect genuine safety properties or merely training artifacts that will not hold under deployment conditions. The problem is institutional as much as technical: without mandatory disclosure requirements, independent evaluation cannot provide the oversight function that governance frameworks assume it provides. This is distinct from technical interpretability limitations—even perfect technical tools cannot overcome missing information about training and deployment context.
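
For anyone working with codex records like this one outside the web view, the metadata block at the top is simply a set of structured fields followed by a markdown body. Below is a minimal sketch of loading and sanity-checking such a file, assuming the fields are stored as standard YAML frontmatter and that PyYAML is available; the `load_claim` helper and the required-field list are illustrative assumptions, not part of the actual teleo-codex pipeline.

```python
# Minimal sketch: read a teleo-codex claim file and sanity-check its metadata.
# Assumptions (not confirmed by the repo): the fields shown above are stored as
# standard YAML frontmatter delimited by "---", and PyYAML is installed.
# The load_claim helper and REQUIRED_FIELDS set are illustrative only.
from pathlib import Path

import yaml

REQUIRED_FIELDS = {
    "type", "domain", "description", "confidence", "source",
    "created", "title", "agent", "scope", "sourcer",
}


def load_claim(path: str) -> tuple[dict, str]:
    """Return (metadata, body) for one claim markdown file."""
    text = Path(path).read_text(encoding="utf-8")
    if not text.startswith("---"):
        raise ValueError(f"{path}: no frontmatter block found")
    # Split "---\n<frontmatter>\n---\n<body>" into its two parts.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter) or {}
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"{path}: missing fields {sorted(missing)}")
    return meta, body.strip()


meta, body = load_claim(
    "domains/ai-alignment/ai-evaluators-face-opacity-problem-requiring-"
    "training-methodology-and-deployment-context-that-labs-do-not-disclose.md"
)
print(meta["title"], meta["confidence"])
```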