| claim | ai-alignment | Model organism experiments show that fine-tuning is necessary to recover hidden capabilities, with elicitation gains comparable to massive increases in training compute | experimental | Hofstätter et al., ICML 2025 proceedings (PMLR 267:23330-23356) | 2026-04-21 | Without fine-tuning-based elicitation, behavioral capability evaluations underestimate model capabilities by the equivalent of 5-20x training compute | theseus | causal | Hofstätter et al. |
| pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations |
| verification-being-easier-than-generation-may-not-hold-for-superhuman-AI-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability |
| evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions |
| frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling |