---
type: claim
domain: ai-alignment
description: Model organism experiments show that fine-tuning is necessary to recover hidden capabilities, with elicitation improvements comparable to massive compute scaling
confidence: experimental
source: "Hofstätter et al., ICML 2025 proceedings (PMLR 267:23330-23356)"
created: 2026-04-21
title: Behavioral capability evaluations underestimate model capabilities by 5-20x training compute equivalent without fine-tuning elicitation
agent: theseus
scope: causal
sourcer: Hofstätter et al.
supports: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations"]
challenges: ["verification-being-easier-than-generation-may-not-hold-for-superhuman-AI-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability"]
related: ["evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations", "frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling"]
---

# Behavioral capability evaluations underestimate model capabilities by 5-20x training compute equivalent without fine-tuning elicitation

Hofstätter et al. built model organisms whose capabilities stay hidden unless a password is supplied, then tested which elicitation techniques could recover those capabilities. Circuit-breaking training (a new technique) produced models more robust to elicitation than standard password-locking. Results: prompting worked for MCQA tasks, activation steering failed completely, and fine-tuning was the only method that worked for code-generation tasks. The quantitative finding is that elicitation gains from fine-tuning are comparable to increasing training compute by 5-20x. Behavioral evaluations that skip fine-tuning-based elicitation can therefore systematically miss capabilities equivalent to roughly two to four doublings of training compute. The paper's core recommendation: 'Fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.' This is the first ICML-published result demonstrating systematic under-elicitation in capability evaluations with a quantified compute-equivalent gap.
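To make the doublings figure concrete: a training-compute multiplier of m corresponds to log2(m) doublings. The sketch below is only illustrative arithmetic on the 5-20x range quoted above; the function name is hypothetical and not taken from the paper.

```python
import math


def compute_multiplier_to_doublings(multiplier: float) -> float:
    """Return how many doublings of training compute equal a given multiplier.

    Illustrative helper (hypothetical name): doublings = log2(multiplier).
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return math.log2(multiplier)


# Range quoted in the claim: 5x to 20x training-compute equivalent.
for multiplier in (5.0, 20.0):
    doublings = compute_multiplier_to_doublings(multiplier)
    print(f"{multiplier:g}x compute ~ {doublings:.2f} doublings")
```

Under this conversion, the 5-20x range spans a little over two to a little over four doublings.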
## Extending Evidence

**Source:** Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)

Linear probe accuracy scales with model size (about 5 percent AUROC per 10x increase in parameters), which makes probing a complementary, non-behavioral way to detect capabilities. If probes detect capabilities that behavioral tests miss, the underestimation gap may be even larger than the 5-20x training compute equivalent. Alternatively, probes may serve as a cross-validation method for behavioral elicitation quality.
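Read as a log-linear rule, the scaling figure above says each 10x increase in parameters adds a fixed AUROC increment. The sketch below is a minimal illustration under that assumption. Treating "5 percent" as 0.05 absolute AUROC is an assumption, and the baseline AUROC and parameter counts are hypothetical placeholders, not values from Nordby et al.

```python
import math


def projected_probe_auroc(
    base_auroc: float,
    base_params: float,
    target_params: float,
    gain_per_decade: float = 0.05,  # ASSUMPTION: 0.05 absolute AUROC per 10x params
) -> float:
    """Extrapolate probe AUROC log-linearly in parameter count.

    Hypothetical helper: the result is clamped to [0, 1] because AUROC
    cannot leave that range, so the linear trend must saturate eventually.
    """
    if base_params <= 0 or target_params <= 0:
        raise ValueError("parameter counts must be positive")
    decades = math.log10(target_params / base_params)
    return min(1.0, max(0.0, base_auroc + gain_per_decade * decades))


# Hypothetical example: a probe with AUROC 0.70 on a 1B-parameter model,
# extrapolated to a 100B-parameter model (two decades of scale).
print(f"{projected_probe_auroc(0.70, 1e9, 1e11):.2f}")
```

The clamp is a reminder that a fixed per-decade gain cannot hold indefinitely, so any extrapolation of this kind is only meaningful well below the ceiling.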