diff --git a/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md b/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
index 577e539e..f2e5d01d 100644
--- a/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
+++ b/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
@@ -7,9 +7,12 @@ date: 2025-08-01
 domain: ai-alignment
 secondary_domains: []
 format: research-paper
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-04
 priority: medium
 tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
diff --git a/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md b/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
deleted file mode 100644
index 577e539e..00000000
--- a/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
+++ /dev/null
@@ -1,62 +0,0 @@
----
-type: source
-title: "Anthropic Persona Vectors: Monitoring and Controlling Character Traits in Language Models"
-author: "Anthropic"
-url: https://www.anthropic.com/research/persona-vectors
-date: 2025-08-01
-domain: ai-alignment
-secondary_domains: []
-format: research-paper
-status: unprocessed
-priority: medium
-tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
----
-
-## Content
-
-Anthropic research demonstrating that character traits can be represented, monitored, and controlled via neural network activation patterns ("persona vectors").
-
-**What persona vectors are:**
-Patterns of neural network activations that represent character traits in language models. Described as "loose analogs to parts of the brain that light up when a person experiences different moods or attitudes." Extraction method: compare neural activations when models exhibit vs. don't exhibit target traits, using automated pipelines with opposing-behavior prompts.
-
-**Traits successfully monitored and controlled:**
-- Primary: sycophancy (insincere flattery/user appeasement), hallucination, "evil" tendency
-- Secondary: politeness, apathy, humor, optimism
-
-**Demonstrated applications:**
-1. **Monitoring**: Measuring persona vector strength detects personality shifts during conversation or training
-2. **Mitigation**: "Preventative steering" — injecting vectors during training acts like a vaccine, reducing harmful trait acquisition without capability degradation (measured by MMLU scores)
-3. **Data flagging**: Identifying training samples likely to induce unwanted traits before deployment
-
-**Critical limitations:**
-- Validated only on open-source models: **Qwen 2.5-7B and Llama-3.1-8B** — NOT on Claude
-- Post-training steering (inference-time) reduces model intelligence
-- Requires defining target traits in natural language beforehand
-- Does NOT demonstrate detection of: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence, monitoring evasion
-
-**Relationship to Frontier Safety Roadmap:**
-The October 2026 alignment assessment commitment in the Roadmap calls for applying interpretability techniques "in such a way that it produces meaningful signal beyond behavioral methods alone." Persona vectors (detecting trait shifts via activations) are one candidate approach — but they have been validated only on small open-source models, not Claude.
-
-## Agent Notes
-
-**Why this matters:** Persona vectors are the most safety-relevant interpretability capability Anthropic has published. If they scale to Claude and can detect dangerous behavioral traits (not just sycophancy/hallucination), this would be meaningful progress toward the October 2026 alignment assessment target. Currently, the gap between demonstrated capability (small open-source models, benign traits) and needed capability (frontier models, dangerous behaviors) is substantial.
-
-**What surprised me:** The "preventative steering during training" (vaccine approach) is a genuinely novel safety application — reducing sycophancy acquisition without capability degradation. This is more constructive than I expected. But the validation only on small open-source models is a significant limitation, given that Claude is substantially larger and different in architecture.
-
-**What I expected but didn't find:** Any mention of Claude-scale validation or plans to extend to Claude. No 2027 target mentioned. No connection to the RSP's Frontier Safety Roadmap commitments in the paper itself.
-
-**KB connections:**
-- [[verification degrades faster than capability grows]] — partial counter-evidence: persona vectors represent a NEW verification capability that doesn't exist in behavioral testing alone. But it applies to the wrong behaviors for safety purposes.
-- [[alignment must be continuous rather than a one-shot specification problem]] — persona vector monitoring during training supports this: it's a continuous monitoring approach rather than a one-time specification
-
-**Extraction hints:** Primary claim candidate: "Activation-based persona vector monitoring can detect behavioral trait shifts (sycophancy, hallucination) in small language models without relying on behavioral testing — but this capability has not been validated at frontier model scale and doesn't address the safety-critical behaviors (deception, goal-directed autonomy) that matter for alignment." This positions persona vectors as genuine progress that falls short of safety-relevance.
-
-**Context:** Published August 1, 2025. Part of Anthropic's interpretability research program. This paper represents the "applied interpretability" direction — demonstrating that interpretability research can produce monitoring capabilities, not just circuit mapping. The limitation to open-source small models is the key gap.
-
-## Curator Notes (structured handoff for extractor)
-
-PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
-
-WHY ARCHIVED: Persona vectors are the strongest concrete safety application of interpretability research published in this period. They provide a genuine counter-data point to B4 (verification degradation) — interpretability IS building new verification capabilities. But the scope (small open-source models, benign traits) limits the safety relevance at the frontier.
-
-EXTRACTION HINT: The extractor should frame this as a partial disconfirmation of B4 with specific scope: activation-based monitoring advances structural verification for benign behavioral traits, while behavioral verification continues to degrade for safety-critical behaviors. The claim should be scoped precisely — not "interpretability is progressing" generally, but "activation monitoring works for [specific behaviors] at [specific scales]."
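The archived note describes extraction as comparing activations when a model does vs. doesn't exhibit a trait, monitoring as measuring vector strength, and steering as injecting the vector. A minimal numpy sketch of that shape (difference-in-means extraction, projection for monitoring, additive injection for steering) is below; the function names and the unit-normalization choice are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def extract_persona_vector(trait_acts, baseline_acts):
    """Difference-in-means sketch: mean hidden-state activation when the
    model exhibits the trait minus the mean when it does not.
    Shapes: (n_samples, hidden_dim)."""
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize so projections are comparable

def trait_strength(hidden_state, persona_vector):
    """Monitoring: project a hidden state onto the persona vector."""
    return float(hidden_state @ persona_vector)

def steer(hidden_state, persona_vector, alpha):
    """Steering: shift a hidden state along the persona vector.
    Negative alpha suppresses the trait; positive alpha amplifies it."""
    return hidden_state + alpha * persona_vector
```

Note the sketch operates at inference time; the paper's "preventative steering" instead applies the injection during training, which is what avoids the capability degradation seen with inference-time steering.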