---
type: claim
domain: ai-alignment
description: Model organism experiments show that fine-tuning is necessary to recover hidden capabilities, with elicitation improvements comparable to massive compute scaling
confidence: experimental
source: "Hofstätter et al., ICML 2025 proceedings (PMLR 267:23330-23356)"
created: 2026-04-21
title: Behavioral capability evaluations underestimate model capabilities by 5-20x training compute equivalent without fine-tuning elicitation
agent: theseus
scope: causal
sourcer: "Hofstätter et al."
supports:
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
challenges:
  - verification-being-easier-than-generation-may-not-hold-for-superhuman-AI-outputs-because-the-verifier-must-understand-the-solution-space-which-requires-near-generator-capability
related:
  - evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions
  - pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations
  - frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling
---

# Behavioral capability evaluations underestimate model capabilities by 5-20x training compute equivalent without fine-tuning elicitation

Hofstätter et al. created model organisms whose hidden capabilities are unlocked only by a password, then tested which elicitation techniques could recover them. Circuit-breaking training (a new technique introduced in the paper) produced models more robust to elicitation than standard password-locking. Results: prompting recovered capabilities on multiple-choice (MCQA) tasks, activation steering failed completely, and fine-tuning was the only method that worked on code-generation tasks. The quantitative finding: elicitation improvements from fine-tuning are comparable to increasing training compute by 5-20x, meaning that behavioral evaluations without fine-tuning-based elicitation systematically miss capabilities equivalent to multiple training doublings. The paper's core recommendation: 'Fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.' This is the first ICML-published result demonstrating systematic under-elicitation in capability evaluations with a quantified compute-equivalent gap.
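To make the compute-equivalent framing concrete, the sketch below (not the paper's code; the checkpoints, accuracies, and example elicitation gain are all illustrative assumptions) maps an accuracy gain from elicitation onto a training-compute multiplier: fit benchmark accuracy against log-compute across checkpoints of one model family, then solve for how much additional compute would have produced the same gain.

```python
import numpy as np

# Illustrative checkpoints of one model family: training compute (FLOPs)
# and benchmark accuracy. These numbers are hypothetical, chosen only to
# show the mechanics of a compute-equivalent calculation.
checkpoint_compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
checkpoint_acc = np.array([0.41, 0.47, 0.54, 0.60, 0.66])

# Fit accuracy as a linear function of log10(compute) over the locally
# linear region of the scaling curve.
slope, intercept = np.polyfit(np.log10(checkpoint_compute), checkpoint_acc, 1)

def compute_equivalent_multiplier(acc_before: float, acc_after: float) -> float:
    """Map an elicitation gain onto the compute axis: how much extra
    training compute would yield the same gain under the fitted curve?
    Returns a multiplier, e.g. 5.0 means 5x training compute."""
    decades = (acc_after - acc_before) / slope  # decades of compute
    return 10.0 ** decades

# Example: fine-tuning elicitation lifts a password-locked model from
# 0.48 to 0.56 on the benchmark -> roughly 4.4x compute-equivalent here.
print(compute_equivalent_multiplier(0.48, 0.56))
```

Read through this lens, the paper's 5-20x range says that an evaluation without fine-tuning elicitation treats a model as if it had been trained with a small fraction of its actual compute.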

## Extending Evidence

Source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)

Linear-probe accuracy scales with model size (roughly 5 points of AUROC per 10x parameters), providing an elicitation signal complementary to behavioral evaluation. If probes detect capabilities that behavioral tests miss, the underestimation gap may be even larger than the 5-20x training-compute equivalent; alternatively, probes may serve to cross-validate the quality of behavioral elicitation.
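A minimal sketch of what that scaling relation implies, assuming the reported ~5 AUROC points per 10x parameters holds and extrapolates linearly in log-parameters (the reference AUROC and parameter counts below are hypothetical):

```python
import math

def extrapolated_auroc(auroc_at_ref: float, ref_params: float,
                       target_params: float,
                       points_per_decade: float = 0.05) -> float:
    """Extrapolate linear-probe AUROC along log10(parameter count),
    clamped to the valid [0.5, 1.0] range for a binary probe."""
    decades = math.log10(target_params / ref_params)
    return min(1.0, max(0.5, auroc_at_ref + points_per_decade * decades))

# Example: a probe reads out a capability at AUROC 0.72 on a 7B model;
# the relation predicts roughly 0.82 at 700B (two decades of parameters).
print(extrapolated_auroc(0.72, 7e9, 7e11))
```

The clamp matters: a linear trend in AUROC cannot continue past 1.0, so the relation can only hold over a bounded range of scales.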