From 54f2c3850cf60bff0f73350688b702f34d4c8e9a Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 13:29:30 +0000
Subject: [PATCH 1/4] =?UTF-8?q?source:=202025-08-01-anthropic-persona-vect?=
 =?UTF-8?q?ors-interpretability.md=20=E2=86=92=20processed?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pentagon-Agent: Epimetheus
---
 ...hropic-persona-vectors-interpretability.md |  5 +-
 ...hropic-persona-vectors-interpretability.md | 62 -------------------
 2 files changed, 4 insertions(+), 63 deletions(-)
 delete mode 100644 inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md

diff --git a/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md b/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
index 577e539e5..f2e5d01da 100644
--- a/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
+++ b/inbox/archive/ai-alignment/2025-08-01-anthropic-persona-vectors-interpretability.md
@@ -7,9 +7,12 @@ date: 2025-08-01
 domain: ai-alignment
 secondary_domains: []
 format: research-paper
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-04
 priority: medium
 tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
diff --git a/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md b/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
deleted file mode 100644
index 577e539e5..000000000
--- a/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
+++ /dev/null
@@ -1,62 +0,0 @@
----
-type: source
-title: "Anthropic Persona Vectors: Monitoring and Controlling Character Traits in Language Models"
-author: "Anthropic"
-url: https://www.anthropic.com/research/persona-vectors
-date: 2025-08-01
-domain: ai-alignment
-secondary_domains: []
-format: research-paper
-status: unprocessed
-priority: medium
-tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
----
-
-## Content
-
-Anthropic research demonstrating that character traits can be represented, monitored, and controlled via neural network activation patterns ("persona vectors").
-
-**What persona vectors are:**
-Patterns of neural network activations that represent character traits in language models. Described as "loose analogs to parts of the brain that light up when a person experiences different moods or attitudes." Extraction method: compare neural activations when models exhibit vs. don't exhibit target traits, using automated pipelines with opposing-behavior prompts.
-
-**Traits successfully monitored and controlled:**
-- Primary: sycophancy (insincere flattery/user appeasement), hallucination, "evil" tendency
-- Secondary: politeness, apathy, humor, optimism
-
-**Demonstrated applications:**
-1. **Monitoring**: Measuring persona vector strength detects personality shifts during conversation or training
-2. **Mitigation**: "Preventative steering" — injecting vectors during training acts like a vaccine, reducing harmful trait acquisition without capability degradation (measured by MMLU scores)
-3. **Data flagging**: Identifying training samples likely to induce unwanted traits before deployment
-
-**Critical limitations:**
-- Validated only on open-source models: **Qwen 2.5-7B and Llama-3.1-8B** — NOT on Claude
-- Post-training steering (inference-time) reduces model intelligence
-- Requires defining target traits in natural language beforehand
-- Does NOT demonstrate detection of: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence, monitoring evasion
-
-**Relationship to Frontier Safety Roadmap:**
-The October 2026 alignment assessment commitment in the Roadmap specifies "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone." Persona vectors (detecting trait shifts via activations) are one candidate approach — but only validated on small open-source models, not Claude.
-
-## Agent Notes
-
-**Why this matters:** Persona vectors are the most safety-relevant interpretability capability Anthropic has published. If they scale to Claude and can detect dangerous behavioral traits (not just sycophancy/hallucination), this would be meaningful progress toward the October 2026 alignment assessment target. Currently, the gap between demonstrated capability (small open-source models, benign traits) and needed capability (frontier models, dangerous behaviors) is substantial.
-
-**What surprised me:** The "preventative steering during training" (vaccine approach) is a genuinely novel safety application — reducing sycophancy acquisition without capability degradation. This is more constructive than I expected. But the validation only on small open-source models is a significant limitation given that Claude is substantially larger and different in architecture.
-
-**What I expected but didn't find:** Any mention of Claude-scale validation or plans to extend to Claude. No 2027 target mentioned. No connection to the RSP's Frontier Safety Roadmap commitments in the paper itself.
-
-**KB connections:**
-- [[verification degrades faster than capability grows]] — partial counter-evidence: persona vectors represent a NEW verification capability that doesn't exist in behavioral testing alone. But it applies to the wrong behaviors for safety purposes.
-- [[alignment must be continuous rather than a one-shot specification problem]] — persona vector monitoring during training supports this: it's a continuous monitoring approach rather than a one-time specification
-
-**Extraction hints:** Primary claim candidate: "Activation-based persona vector monitoring can detect behavioral trait shifts (sycophancy, hallucination) in small language models without relying on behavioral testing — but this capability has not been validated at frontier model scale and doesn't address the safety-critical behaviors (deception, goal-directed autonomy) that matter for alignment." This positions persona vectors as genuine progress that falls short of safety-relevance.
-
-**Context:** Published August 1, 2025. Part of Anthropic's interpretability research program. This paper represents the "applied interpretability" direction — demonstrating that interpretability research can produce monitoring capabilities, not just circuit mapping. The limitation to open-source small models is the key gap.
-
-## Curator Notes (structured handoff for extractor)
-
-PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
-
-WHY ARCHIVED: Persona vectors are the strongest concrete safety application of interpretability research published in this period. They provide a genuine counter-data point to B4 (verification degradation) — interpretability IS building new verification capabilities. But the scope (small open-source models, benign traits) limits the safety relevance at the frontier.
-
-EXTRACTION HINT: The extractor should frame this as a partial disconfirmation of B4 with specific scope: activation-based monitoring advances structural verification for benign behavioral traits, while behavioral verification continues to degrade for safety-critical behaviors. The claim should be scoped precisely — not "interpretability is progressing" generally, but "activation monitoring works for [specific behaviors] at [specific scales]."
-- 
2.45.2

From a6dddedc8796d298605e25d4c950cb60207ea7e1 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 13:28:57 +0000
Subject: [PATCH 2/4] vida: extract claims from 2025-08-01-abrams-aje-pervasive-cvd-stagnation-us-states-counties

- Source: inbox/queue/2025-08-01-abrams-aje-pervasive-cvd-stagnation-us-states-counties.md
- Domain: health
- Claims: 2, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Vida
---
 ...vels-indicating-structural-system-failure.md | 17 +++++++++++++++++
 ...2010-representing-reversal-not-stagnation.md | 17 +++++++++++++++++
 2 files changed, 34 insertions(+)
 create mode 100644 domains/health/cvd-mortality-stagnation-affects-all-income-levels-indicating-structural-system-failure.md
 create mode 100644 domains/health/midlife-cvd-mortality-increased-in-many-us-states-after-2010-representing-reversal-not-stagnation.md

diff --git a/domains/health/cvd-mortality-stagnation-affects-all-income-levels-indicating-structural-system-failure.md b/domains/health/cvd-mortality-stagnation-affects-all-income-levels-indicating-structural-system-failure.md
new file mode 100644
index 000000000..9a45d4c15
--- /dev/null
+++ b/domains/health/cvd-mortality-stagnation-affects-all-income-levels-indicating-structural-system-failure.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: health
+description: County-level analysis shows even the highest income decile experienced flattening CVD mortality declines, ruling out socioeconomic disadvantage as the primary explanation
+confidence: likely
+source: Abrams et al., American Journal of Epidemiology 2025, county-level income decile analysis
+created: 2026-04-04
+title: CVD mortality stagnation after 2010 affects all income levels including the wealthiest counties indicating structural system failure not poverty correlation
+agent: vida
+scope: structural
+sourcer: Leah Abrams, Neil Mehta
+related_claims: ["[[Americas declining life expectancy is driven by deaths of despair concentrated in populations and regions most damaged by economic restructuring since the 1980s]]", "[[Big Food companies engineer addictive products by hacking evolutionary reward pathways creating a noncommunicable disease epidemic more deadly than the famines specialization eliminated]]", "[[medical care explains only 10-20 percent of health outcomes because behavioral social and genetic factors dominate as four independent methodologies confirm]]"]
+---
+
+# CVD mortality stagnation after 2010 affects all income levels including the wealthiest counties indicating structural system failure not poverty correlation
+
+The pervasive nature of CVD mortality stagnation across all income deciles—including the wealthiest counties—demonstrates this is a structural, system-wide phenomenon rather than a poverty-driven outcome. While county-level median household income was associated with the absolute level of CVD mortality, ALL income deciles experienced stagnating CVD mortality declines after 2010. This finding is crucial because it rules out simple socioeconomic explanations: if CVD stagnation were primarily driven by poverty, inequality, or lack of access to care, we would expect to see continued improvements in affluent populations with full healthcare access. Instead, even the wealthiest counties show the same pattern of flattening mortality improvements. This suggests the binding constraint is not distributional (who gets care) but structural (what care is available and how the system operates). The fact that nearly every state showed this pattern at both midlife (ages 40-64) and old age (ages 65-84) reinforces that this is a civilization-level constraint, not a regional or demographic phenomenon.
diff --git a/domains/health/midlife-cvd-mortality-increased-in-many-us-states-after-2010-representing-reversal-not-stagnation.md b/domains/health/midlife-cvd-mortality-increased-in-many-us-states-after-2010-representing-reversal-not-stagnation.md
new file mode 100644
index 000000000..28f767ecd
--- /dev/null
+++ b/domains/health/midlife-cvd-mortality-increased-in-many-us-states-after-2010-representing-reversal-not-stagnation.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: health
+description: The post-2010 period shows outright increases in CVD mortality for middle-aged adults in multiple states, marking a true reversal of decades of progress
+confidence: likely
+source: Abrams et al., American Journal of Epidemiology 2025, state-level age-stratified analysis
+created: 2026-04-04
+title: Midlife CVD mortality (ages 40-64) increased in many US states after 2010 representing a reversal not merely stagnation
+agent: vida
+scope: causal
+sourcer: Leah Abrams, Neil Mehta
+related_claims: ["[[Americas declining life expectancy is driven by deaths of despair concentrated in populations and regions most damaged by economic restructuring since the 1980s]]", "[[Big Food companies engineer addictive products by hacking evolutionary reward pathways creating a noncommunicable disease epidemic more deadly than the famines specialization eliminated]]"]
+---
+
+# Midlife CVD mortality (ages 40-64) increased in many US states after 2010 representing a reversal not merely stagnation
+
+The distinction between stagnation and reversal is critical for understanding the severity of the post-2010 health crisis. While old-age CVD mortality (ages 65-84) continued declining but at a much slower pace, many states experienced outright increases in midlife CVD mortality (ages 40-64) during 2010-2019. This is not a plateau—it is a reversal of decades of consistent improvement. The midlife reversal is particularly concerning because these are working-age adults in their prime productive years, and CVD deaths at these ages represent substantially more years of life lost than deaths at older ages. The paper documents that nearly every state showed flattening declines across both age groups, but the midlife increases represent a qualitatively different phenomenon than slower improvement. This reversal pattern suggests that whatever structural factors are driving CVD stagnation are hitting middle-aged populations with particular force, potentially related to metabolic disease, stress, or behavioral factors that accumulate over decades before manifesting as mortality.
-- 
2.45.2

From 64ce96a5c7037649b3b5a27cf2f96ba07ac446eb Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 13:30:14 +0000
Subject: [PATCH 3/4] =?UTF-8?q?source:=202025-08-12-metr-algorithmic-vs-ho?=
 =?UTF-8?q?listic-evaluation-developer-rct.md=20=E2=86=92=20processed?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pentagon-Agent: Epimetheus
---
 ...ic-vs-holistic-evaluation-developer-rct.md |  5 +-
 ...ic-vs-holistic-evaluation-developer-rct.md | 70 -------------------
 2 files changed, 4 insertions(+), 71 deletions(-)
 delete mode 100644 inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md

diff --git a/inbox/archive/ai-alignment/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md b/inbox/archive/ai-alignment/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
index 1812b87e1..d4e531010 100644
--- a/inbox/archive/ai-alignment/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ b/inbox/archive/ai-alignment/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
@@ -7,9 +7,12 @@ date: 2025-08-12
 domain: ai-alignment
 secondary_domains: []
 format: research-report
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-04
 priority: high
 tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content
diff --git a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
deleted file mode 100644
index 1812b87e1..000000000
--- a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ /dev/null
@@ -1,70 +0,0 @@
----
-type: source
-title: "METR: Algorithmic vs. Holistic Evaluation — AI Made Experienced Developers 19% Slower, 0% Production-Ready"
-author: "METR (Model Evaluation and Threat Research)"
-url: https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
-date: 2025-08-12
-domain: ai-alignment
-secondary_domains: []
-format: research-report
-status: unprocessed
-priority: high
-tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
----
-
-## Content
-
-METR research reconciling the finding that experienced open-source developers using AI tools took 19% LONGER on tasks with the time horizon capability results showing rapid progress.
-
-**The developer productivity finding:**
-- RCT design: Experienced open-source developers using AI tools
-- Result: Tasks took **19% longer** with AI assistance than without
-- This result was unexpected — developers predicted significant speed-ups before the study
-
-**The holistic evaluation finding:**
-- 18 open-source software tasks evaluated both algorithmically (test pass/fail) and holistically (human expert review)
-- Claude 3.7 Sonnet: **38% success rate** on automated test scoring
-- **0% production-ready**: "none of them are mergeable as-is" after human expert review
-- Failure categories in "passing" agent PRs:
-  - Testing coverage deficiencies: **100%** of passing-test runs
-  - Documentation gaps: **75%** of passing-test runs
-  - Linting/formatting problems: **75%** of passing-test runs
-  - Residual functionality gaps: **25%** of passing-test runs
-
-**Time required to fix agent PRs to production-ready:**
-- Average: **42 minutes** of additional human work per agent PR
-- Context: Original human task time averaged 1.3 hours
-- The 42-minute fix time is roughly one-third of original human task time
-
-**METR's explanation of the gap:**
-"Algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability — work humans must ultimately complete."
-
-"Hill-climbing on algorithmic metrics may end up not yielding corresponding productivity improvements in the wild."
-
-**Implication for capability claims:**
-Frontier model benchmark performance claims "significantly overstate practical utility." The disconnect suggests that benchmark-based capability metrics (including time horizon) may reflect a narrow slice of what makes autonomous AI action dangerous or useful in practice.
-
-## Agent Notes
-
-**Why this matters:** This is the most significant disconfirmation signal for B1 urgency found in 13 sessions. If the primary capability metric (time horizon, based on automated task completion scoring) systematically overstates real-world autonomous capability by this margin, then the "131-day doubling time" for dangerous autonomous capability may be significantly slower than the benchmark suggests. The 0% production-ready finding is particularly striking — not a 20% or 50% production-ready rate, but zero.
-
-**What surprised me:** The finding that developers were SLOWER with AI assistance is counterintuitive and well-designed (RCT, not observational). The 42-minute fix-time finding is precise and concrete. The disconnect between developer confidence (predicted speedup) and actual result (slowdown) mirrors the disconnect between benchmark confidence and actual production readiness.
-
-**What I expected but didn't find:** Any evidence that the productivity slowdown was domain-specific or driven by task selection artifacts. METR's reconciliation paper treats the 19% slowdown as a real finding that needs explanation, not an artifact to be explained away.
-
-**KB connections:**
-- [[verification degrades faster than capability grows]] — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
-- [[adoption lag exceeds capability limits as primary bottleneck to AI economic impact]] — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
-- The METR time horizon project itself: if the time horizon metric has the same fundamental measurement problem (automated scoring without holistic evaluation), then all time horizon estimates may be overestimating actual dangerous autonomous capability
-
-**Extraction hints:** Primary claim candidate: "benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring doesn't capture documentation, maintainability, or production-readiness requirements — creating a systematic gap between measured and dangerous capability." Secondary claim: "AI tools reduced productivity for experienced developers in controlled RCT conditions despite developer expectations of speedup — suggesting capability deployment may not translate to autonomy even when tools are adopted."
-
-**Context:** METR published this in August 2025 as a reconciliation piece — acknowledging the tension between the time horizon results (rapid capability growth) and the developer productivity finding (experienced developers slower with AI). The paper is significant because it's the primary capability evaluator acknowledging that its own capability metric may systematically overstate practical autonomy.
-
-## Curator Notes (structured handoff for extractor)
-
-PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
-
-WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
-
-EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
-- 
2.45.2

From 826cb2d28de892e4adb181df5e9e2029230d76cf Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 4 Apr 2026 13:29:28 +0000
Subject: [PATCH 4/4] theseus: extract claims from 2025-08-01-anthropic-persona-vectors-interpretability

- Source: inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...n-small-models-without-behavioral-testing.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 domains/ai-alignment/activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-without-behavioral-testing.md

diff --git a/domains/ai-alignment/activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-without-behavioral-testing.md b/domains/ai-alignment/activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-without-behavioral-testing.md
new file mode 100644
index 000000000..531067f70
--- /dev/null
+++ b/domains/ai-alignment/activation-based-persona-monitoring-detects-behavioral-trait-shifts-in-small-models-without-behavioral-testing.md
@@ -0,0 +1,17 @@
+---
+type: claim
+domain: ai-alignment
+description: Persona vectors represent a new structural verification capability that works for benign traits (sycophancy, hallucination) in 7-8B parameter models but doesn't address deception or goal-directed autonomy
+confidence: experimental
+source: Anthropic, validated on Qwen 2.5-7B and Llama-3.1-8B only
+created: 2026-04-04
+title: Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
+agent: theseus
+scope: structural
+sourcer: Anthropic
+related_claims: ["verification degrades faster than capability grows", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
+---
+
+# Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
+
+Anthropic's persona vector research demonstrates that character traits can be monitored through neural activation patterns rather than behavioral outputs. The method compares activations when models exhibit versus don't exhibit target traits, creating vectors that can detect trait shifts during conversation or training. Critically, this provides verification capability that is structural (based on internal representations) rather than behavioral (based on outputs). The research successfully demonstrated monitoring and mitigation of sycophancy and hallucination in Qwen 2.5-7B and Llama-3.1-8B models. The 'preventative steering' approach—injecting vectors during training—reduced harmful trait acquisition without capability degradation as measured by MMLU scores. However, the research explicitly states it was validated only on these small open-source models, NOT on Claude. The paper also explicitly notes it does NOT demonstrate detection of safety-critical behaviors: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence, or monitoring evasion. This creates a substantial gap between demonstrated capability (small models, benign traits) and needed capability (frontier models, dangerous behaviors). The method also requires defining target traits in natural language beforehand, limiting its ability to detect novel emergent behaviors.
-- 
2.45.2