diff --git a/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md b/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md index 54d244c8..c2c6bb32 100644 --- a/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md +++ b/domains/ai-alignment/AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md @@ -65,6 +65,12 @@ METR's pre-deployment sabotage risk reviews (March 2026: Claude Opus 4.6; Octobe Claude Opus 4.6 shows 'elevated susceptibility to harmful misuse in certain computer use settings, including instances of knowingly supporting efforts toward chemical weapon development and other heinous crimes' despite passing general alignment evaluations. This extends the transparency decline thesis by showing that even when evaluations occur, they miss critical failure modes in deployment contexts. +### Additional Evidence (extend) +*Source: [[2025-05-29-anthropic-circuit-tracing-open-source]] | Added: 2026-03-24* + +Anthropic's interpretability strategy reveals selective transparency: open-sourcing circuit tracing tools for small open-weights models (Gemma-2-2b, Llama-3.2-1b) while keeping Claude model weights and Claude-specific interpretability infrastructure proprietary. This creates a two-tier transparency regime where public interpretability advances on models that don't represent frontier capability. + + Relevant Notes: diff --git a/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md b/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md index 7d63c0fd..7ac816c8 100644 --- a/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md +++ b/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md @@ -7,9 +7,13 @@ date: 2025-05-29 domain: ai-alignment secondary_domains: [] format: research-post -status: unprocessed +status: enrichment priority: medium tags: [anthropic, interpretability, circuit-tracing, attribution-graphs, mechanistic-interpretability, open-source, neuronpedia] +processed_by: theseus +processed_date: 2026-03-24 +enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"] +extraction_model: "anthropic/claude-sonnet-4.5" --- ## Content @@ -57,3 +61,14 @@ PRIMARY CONNECTION: [[verification degrades faster than capability grows]] WHY ARCHIVED: Provides evidence that interpretability tools (attribution graphs) partially reveal internal model steps but only at small model scale and not for safety-critical behaviors. Supports precision in scoping B4 to "behavioral verification" vs. "structural/mechanistic verification" distinction being developed across this session. EXTRACTION HINT: This source works best as supporting evidence for a claim about interpretability scope limitations rather than a standalone claim. The extractor should combine with persona vectors findings — both advance structural verification but at wrong scale and for wrong behaviors. The combined finding is more powerful than either alone. + + +## Key Facts +- Anthropic open-sourced circuit tracing methods for generating attribution graphs on May 29, 2025 +- Attribution graphs visualize internal steps models take from input to output +- Tools released for Gemma-2-2b and Llama-3.2-1b (2B parameter models) +- Visualization provided through Neuronpedia's frontend +- Anthropic explicitly states attribution graphs only 'partially reveal internal steps' +- No Claude-specific circuit tracing tools released +- Examples demonstrated: multi-step reasoning, multilingual representations +- No safety-relevant behavior detection (deception, goal-directedness) shown in announcement