extract: 2025-08-01-anthropic-persona-vectors-interpretability #1719

Closed
leo wants to merge 1 commit from extract/2025-08-01-anthropic-persona-vectors-interpretability into main
3 changed files with 46 additions and 1 deletions
Showing only changes of commit d5a44bebc6 - Show all commits

View file

@ -65,6 +65,12 @@ METR's pre-deployment sabotage risk reviews (March 2026: Claude Opus 4.6; Octobe
Claude Opus 4.6 shows 'elevated susceptibility to harmful misuse in certain computer use settings, including instances of knowingly supporting efforts toward chemical weapon development and other heinous crimes' despite passing general alignment evaluations. This extends the transparency decline thesis by showing that even when evaluations occur, they miss critical failure modes in deployment contexts.
### Additional Evidence (extend)
*Source: [[2025-08-01-anthropic-persona-vectors-interpretability]] | Added: 2026-03-24*
While organizational transparency declined (FMTI scores, safety team dissolutions), technical transparency capabilities advanced through interpretability research. Anthropic's persona vectors (August 2025) demonstrate that neural activation patterns can monitor behavioral traits without relying on black-box behavioral testing. This creates a distinction between organizational transparency (declining) and technical interpretability (advancing in research, though not yet deployed at frontier scale). The gap is that interpretability advances remain in research papers on small models rather than being deployed as actual transparency mechanisms for frontier models like Claude.
Relevant Notes:

View file

@ -0,0 +1,25 @@
{
"rejected_claims": [
{
"filename": "activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-but-not-validated-at-frontier-scale.md",
"issues": [
"missing_attribution_extractor"
]
}
],
"validation_stats": {
"total": 1,
"kept": 0,
"fixed": 2,
"rejected": 1,
"fixes_applied": [
"activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-but-not-validated-at-frontier-scale.md:set_created:2026-03-24",
"activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-but-not-validated-at-frontier-scale.md:stripped_wiki_link:verification degrades faster than capability grows"
],
"rejections": [
"activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-but-not-validated-at-frontier-scale.md:missing_attribution_extractor"
]
},
"model": "anthropic/claude-sonnet-4.5",
"date": "2026-03-24"
}

View file

@ -7,9 +7,13 @@ date: 2025-08-01
domain: ai-alignment
secondary_domains: []
format: research-paper
status: unprocessed
status: enrichment
priority: medium
tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
processed_by: theseus
processed_date: 2026-03-24
enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content
@ -60,3 +64,13 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
WHY ARCHIVED: Persona vectors are the strongest concrete safety application of interpretability research published in this period. They provide a genuine counter-data point to B4 (verification degradation) — interpretability IS building new verification capabilities. But the scope (small open-source models, benign traits) limits the safety relevance at the frontier.
EXTRACTION HINT: The extractor should frame this as a partial disconfirmation of B4 with specific scope: activation-based monitoring advances structural verification for benign behavioral traits, while behavioral verification continues to degrade for safety-critical behaviors. The claim should be scoped precisely — not "interpretability is progressing" generally, but "activation monitoring works for [specific behaviors] at [specific scales]."
## Key Facts
- Anthropic's persona vectors were validated on Qwen 2.5-7B and Llama-3.1-8B models only, not on Claude
- Persona vectors successfully monitored: sycophancy, hallucination, politeness, apathy, humor, optimism
- Persona vectors did NOT demonstrate detection of: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence, monitoring evasion
- Post-training steering with persona vectors reduces model intelligence (measured by MMLU scores)
- Preventative steering during training reduces sycophancy acquisition without capability degradation
- The October 2026 Frontier Safety Roadmap commitment specifies 'interpretability techniques that produce meaningful signal beyond behavioral methods alone'
- Persona vector extraction method: compare neural activations when models exhibit vs. don't exhibit target traits using automated pipelines with opposing-behavior prompts