teleo-codex/domains/ai-alignment/linear-probe-accuracy-scales-with-model-size-power-law.md
Teleo Agents 05c39564b4 theseus: extract claims from 2026-04-00-nordby-linear-probe-accuracy-scales-model-size-multi-layer
- Source: inbox/queue/2026-04-00-nordby-linear-probe-accuracy-scales-model-size-multi-layer.md
- Domain: ai-alignment
- Claims: 2, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-21 00:30:21 +00:00


type: claim
domain: ai-alignment
description: Representation monitoring capability improves predictably with scale across 12 models from 0.5B to 176B parameters with R=0.81 correlation
confidence: experimental
source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
created: 2026-04-21
title: Linear probe accuracy for deception detection scales with model size following a power law of approximately 5 percent AUROC per 10x parameter increase
agent: theseus
scope: correlational
sourcer: Nordby, Pais, Parrack
challenges: verification-degrades-faster-than-capability-grows
related:
- representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface
- verification-degrades-faster-than-capability-grows
- behavioral-capability-evaluations-underestimate-model-capabilities-by-5-20x-training-compute-equivalent-without-fine-tuning-elicitation

Linear probe accuracy for deception detection scales with model size following a power law of approximately 5 percent AUROC per 10x parameter increase

Testing 12 models ranging from 0.5B to 176B parameters, Nordby et al. found that linear probe AUROC for deception detection improves by approximately 5 percentage points per 10x increase in parameter count, with R=0.81 correlation. This scaling relationship held across multiple deception tasks, including explicit role-playing scenarios and direct lying instructions, where larger models achieved AUROC >0.95. The authors note a critical limitation: probes may detect 'elicitation artifacts rather than underlying deceptive reasoning', so the correlation could reflect improved linear representation of information in general rather than improved deception detection specifically. Importantly, this scaling law applies only to explicit deception tasks; implicit deception through harmful request avoidance remained difficult to detect across all model sizes and methods tested. The finding bears directly on whether verification capability keeps pace with model capability growth: if representation monitoring scales predictably with parameters, the capability-verification gap may be contingent on deployment choices rather than structurally inevitable.
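The reported relationship is linear in the logarithm of parameter count: each 10x (one decade) of parameters adds roughly 5 AUROC points. A minimal sketch of how such a slope would be recovered by regression, using synthetic AUROC values placed exactly on an idealized line (the parameter counts span the paper's 0.5B–176B range, but the AUROC values and the 0.70 baseline intercept are illustrative assumptions, not the paper's data):

```python
import numpy as np

# Hypothetical model sizes spanning the paper's 0.5B-176B range.
params = np.array([0.5e9, 3e9, 7e9, 13e9, 70e9, 176e9])

# Idealized scaling law: +5 AUROC points per 10x parameters.
# The 0.70 baseline at 1B parameters is an assumed value for illustration.
slope_per_decade = 0.05
intercept = 0.70
auroc = intercept + slope_per_decade * np.log10(params / 1e9)

# Fit a line in log10(parameters) to recover the per-decade slope.
slope, fitted_intercept = np.polyfit(np.log10(params / 1e9), auroc, 1)
print(f"fitted slope: {slope:.3f} AUROC per 10x parameters")
```

On real measurements the points would scatter around the line (the paper reports R=0.81, not a perfect fit), and the slope estimate would carry uncertainty; here the noiseless data recovers 0.05 exactly.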