From 246b3eb206af739cd3361d4131b2b427431c6991 Mon Sep 17 00:00:00 2001 From: Teleo Agents Date: Tue, 24 Mar 2026 00:30:02 +0000 Subject: [PATCH] pipeline: clean 2 stale queue duplicates Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70> --- ...9-anthropic-circuit-tracing-open-source.md | 74 ---------------- .../queue/2026-01-29-metr-time-horizon-1-1.md | 86 ------------------- 2 files changed, 160 deletions(-) delete mode 100644 inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md delete mode 100644 inbox/queue/2026-01-29-metr-time-horizon-1-1.md diff --git a/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md b/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md deleted file mode 100644 index 7ac816c8..00000000 --- a/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -type: source -title: "Anthropic: Open-Sourcing Circuit Tracing Tools for Attribution Graphs" -author: "Anthropic" -url: https://www.anthropic.com/research/open-source-circuit-tracing -date: 2025-05-29 -domain: ai-alignment -secondary_domains: [] -format: research-post -status: enrichment -priority: medium -tags: [anthropic, interpretability, circuit-tracing, attribution-graphs, mechanistic-interpretability, open-source, neuronpedia] -processed_by: theseus -processed_date: 2026-03-24 -enrichments_applied: ["AI transparency is declining not improving because Stanford FMTI scores dropped 17 points in one year while frontier labs dissolved safety teams and removed safety language from mission statements.md"] -extraction_model: "anthropic/claude-sonnet-4.5" ---- - -## Content - -Anthropic open-sources methods to generate attribution graphs — visualizations of the internal steps a model took to arrive at a particular output. - -**What attribution graphs do:** -- "Reveal the steps a model took internally to decide on a particular output" -- Trace how language models process information from input to output -- Enable researchers to test hypotheses by modifying feature values and observing output changes -- Interactive visualization via Neuronpedia's frontend - -**Capabilities demonstrated:** -- Multi-step reasoning processes (in Gemma-2-2b and Llama-3.2-1b) -- Multilingual representations - -**Open-sourced for:** Gemma-2-2b and Llama-3.2-1b — NOT for Claude - -**Explicit limitation from Anthropic:** Attribution graphs only "*partially* reveal internal steps — they don't provide complete transparency into model decision-making" - -**No safety-specific applications demonstrated:** The announcement emphasizes interpretability understanding generally; no specific examples of safety-relevant detection (deception, goal-directed behavior, monitoring evasion) are shown. - -**No connection to 2027 alignment assessment:** The paper does not mention the Frontier Safety Roadmap or any timeline for applying circuit tracing to safety evaluation. - -## Agent Notes - -**Why this matters:** Circuit tracing is the technical foundation for the "microscope" framing Dario Amodei has used — tracing reasoning paths from prompt to response. But the open-source release is for small open-weights models (2B parameters), not Claude. The "partial" revelation limitation from Anthropic's own description is important — this is not full transparency. - -**What surprised me:** The open-sourcing strategy is constructive — making this available to the research community accelerates the field. 
But it also highlights that Anthropic's own models (Claude) are NOT open-sourced, so circuit tracing tools for Claude would require Claude-specific infrastructure that hasn't been released. - -**What I expected but didn't find:** Any evidence that circuit tracing has detected safety-relevant behaviors (deception patterns, goal-directedness, self-preservation). The examples given are multi-step reasoning and multilingual representations — interesting but not alignment-relevant. - -**KB connections:** -- [[verification degrades faster than capability grows]] — same B4 relationship as persona vectors: circuit tracing is a new verification approach that partially addresses B4, but only at small model scale and for non-safety-relevant behaviors -- [[AI safety evaluation infrastructure is voluntary-collaborative]] — open-sourcing the tools makes the infrastructure more distributed, potentially less dependent on any single evaluator; this is a constructive move for the evaluation ecosystem - -**Extraction hints:** This source is best used to support the "interpretability is progressing but addresses wrong behaviors at wrong scale" claim rather than as a standalone claim. The primary contribution is establishing that Anthropic's public interpretability tooling is (a) for small open-source models, not Claude, and (b) only partially reveals internal steps. This supports precision in the B4 scope qualification being developed. - -**Context:** Published May 29, 2025. This is the tool-release post; the underlying research papers (circuit tracing methodology) preceded this by several months. The open-source release signals Anthropic's willingness to share interpretability infrastructure but not Claude model weights. - -## Curator Notes (structured handoff for extractor) - -PRIMARY CONNECTION: [[verification degrades faster than capability grows]] - -WHY ARCHIVED: Provides evidence that interpretability tools (attribution graphs) partially reveal internal model steps but only at small model scale and not for safety-critical behaviors. Supports precision in scoping B4 to "behavioral verification" vs. "structural/mechanistic verification" distinction being developed across this session. - -EXTRACTION HINT: This source works best as supporting evidence for a claim about interpretability scope limitations rather than a standalone claim. The extractor should combine with persona vectors findings — both advance structural verification but at wrong scale and for wrong behaviors. The combined finding is more powerful than either alone. 
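
For orientation, a minimal sketch of what driving these tools against one of the supported small open-weights models might look like. The package, class, and function names below (`circuit_tracer`, `ReplacementModel`, `attribute`) are illustrative assumptions rather than a confirmed description of the released API; the actual repository and the Neuronpedia frontend define the real interface.

```python
# Illustrative only: the names and signatures below are assumptions, not the
# confirmed API of the released circuit-tracing tools.
from circuit_tracer import ReplacementModel, attribute  # assumed import path

# Load one of the supported open-weights models from the release (Gemma-2-2b).
model = ReplacementModel.from_pretrained("google/gemma-2-2b", "gemma")  # assumed signature

prompt = "The capital of the state containing Dallas is"
graph = attribute(prompt, model)  # assumed: returns an attribution-graph object

# The released workflow then visualizes graphs like this in Neuronpedia's
# frontend; researchers intervene on feature activations and re-run the model
# to test whether the hypothesized internal steps actually drive the output.
```

The point of the sketch is the shape of the workflow (trace, visualize, intervene), not the exact calls.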
- - -## Key Facts -- Anthropic open-sourced circuit tracing methods for generating attribution graphs on May 29, 2025 -- Attribution graphs visualize internal steps models take from input to output -- Tools released for Gemma-2-2b and Llama-3.2-1b (2B parameter models) -- Visualization provided through Neuronpedia's frontend -- Anthropic explicitly states attribution graphs only 'partially reveal internal steps' -- No Claude-specific circuit tracing tools released -- Examples demonstrated: multi-step reasoning, multilingual representations -- No safety-relevant behavior detection (deception, goal-directedness) shown in announcement diff --git a/inbox/queue/2026-01-29-metr-time-horizon-1-1.md b/inbox/queue/2026-01-29-metr-time-horizon-1-1.md deleted file mode 100644 index 3f1d6a6e..00000000 --- a/inbox/queue/2026-01-29-metr-time-horizon-1-1.md +++ /dev/null @@ -1,86 +0,0 @@ ---- -type: source -title: "METR Time Horizon 1.1: Updated Capability Estimates with New Infrastructure" -author: "METR (Model Evaluation and Threat Research)" -url: https://metr.org/blog/2026-1-29-time-horizon-1-1/ -date: 2026-01-29 -domain: ai-alignment -secondary_domains: [] -format: research-report -status: enrichment -priority: high -tags: [metr, time-horizon, capability-evaluation, task-saturation, measurement, frontier-ai, benchmark] -processed_by: theseus -processed_date: 2026-03-24 -enrichments_applied: ["pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"] -extraction_model: "anthropic/claude-sonnet-4.5" ---- - -## Content - -METR's updated time horizon methodology (TH1.1) with new evaluation infrastructure. Published January 29, 2026. - -**Capability doubling time estimates:** -- Full historical trend (2019-2025): ~196 days (7 months) -- Since 2023 (TH1.1): **131 days** — 20% more rapid than previous 165-day estimate -- Since 2024 (TH1.1): 89 days — "notably faster" than prior 109-day figure -- Trend appears "slightly less linear" under new methodology, though within confidence intervals - -**Infrastructure change:** Migrated from Vivaria (proprietary 2023 system) to Inspect (UK AI Security Institute's open-source framework). Minor scaffold sensitivity effects found: GPT-4o and o3 performed slightly better under Vivaria — suggests some models are sensitive to prompting/scaffold. - -**Task suite changes:** -- Doubled long-duration tasks (8+ hours) from 14 to 31 -- Only 5 of 31 long tasks have actual human baseline times; remainder use estimates -- Original task count and distribution not fully specified in public summary - -**Model 50% time horizon estimates (TH1.1):** -- Claude Opus 4.5: 320 minutes (~5.3 hours) [revised upward from earlier estimate] -- GPT-5: 214 minutes -- o3: 121 minutes -- Claude Opus 4: 101 minutes -- Claude Sonnet 3.7: 60 minutes -- GPT-4 variants: 35-57% downward revisions - -**Note**: Claude Opus 4.6 (released February 2026) does NOT appear in TH1.1 — it post-dates this paper. The ~14.5 hour estimate discussed in Anthropic's sabotage risk context came from a different evaluation process. - -**Saturation explicit acknowledgment:** METR states: "even our Time Horizon 1.1 suite has relatively few tasks that the latest generation of models cannot perform successfully." They prioritize "updates to our evaluations so they can measure the capabilities of very strong models." - -**Plan for saturation:** "Raising the ceiling of our capabilities measurements" through continued task suite expansion. 
No specific numerical targets or timeline specified. - -**Governance implications not addressed:** The document does not explicitly discuss how wide confidence intervals affect governance threshold enforcement. Opus 4.5's upper bound is 2.3× its point estimate. - -## Agent Notes - -**Why this matters:** TH1.1 is the primary empirical basis for the "131-day doubling" claim central to the six-layer governance inadequacy arc. Understanding exactly what this measures — and its saturation problem — is critical for calibrating B1 urgency. - -**What surprised me:** The scaffold sensitivity finding (GPT-4o, o3 performing better under Vivaria than Inspect) suggests that the time horizon metric is not fully scaffold-independent — model performance varies by evaluation infrastructure in a way that affects capability estimates. This is a measurement reliability problem that complements the task saturation problem. - -**What I expected but didn't find:** A specific plan or timeline for how METR will measure models when they exceed the current 8+ hour task ceiling. "Raising the ceiling" without specifics leaves the saturation problem unaddressed for the next capability generation. - -**KB connections:** -- [[verification degrades faster than capability grows]] — task suite saturation is behavioral verification degrading: the measurement tool designed to track capability growth is being outrun by the capability it tracks -- [[market dynamics erode human oversight]] — if the primary oversight-relevant metric saturates, market dynamics have an additional advantage: labs can claim evaluation clearance on a metric that doesn't detect their most dangerous capabilities - -**Extraction hints:** Primary claim: METR time horizon saturation is now explicitly acknowledged rather than implied — the primary capability measurement tool is being outrun by frontier model capabilities at exactly the capability level that matters for governance. Secondary claim: Scaffold sensitivity (Vivaria vs. Inspect performance differences) introduces additional uncertainty in cross-model comparisons that is not typically disclosed in governance contexts. - -**Context:** Published January 29, 2026. METR is the primary external evaluator conducting pre-deployment capability assessments for Anthropic and other frontier labs. This is their most complete public methodology statement and the basis for the "131-day doubling time" claim that has been central to AI safety policy discussions in 2026. - -## Curator Notes (structured handoff for extractor) - -PRIMARY CONNECTION: [[verification degrades faster than capability grows]] - -WHY ARCHIVED: TH1.1 provides the empirical grounding for "131-day doubling time" and simultaneously the evidence that the measurement tool tracking that doubling is saturating. The saturation acknowledgment from METR itself is the most reliable source for this claim. - -EXTRACTION HINT: The extractor should distinguish between two separate findings: (1) capability is doubling every 131 days — this is a finding; (2) the measurement tool for this doubling is saturating — this is also a finding. Both can be true simultaneously and both deserve separate KB claims. The saturation finding specifically challenges the reliability of the doubling-time estimate itself. 
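
To make the quoted figures concrete, here is a small illustrative projection using only the numbers above: the 131-day doubling point estimate, Claude Opus 4.5's 320-minute horizon, and the 2.3× upper-bound ratio. This is a sketch that assumes clean exponential growth at the point estimate, which the scaffold-sensitivity and saturation findings above give reason to doubt.

```python
# Illustrative projection of 50% time horizons under the TH1.1 point estimate.
# Assumes clean exponential growth; the source itself flags wide confidence
# intervals and task-suite saturation, so treat these as rough extrapolations.

DOUBLING_DAYS = 131        # TH1.1 doubling time since 2023 (point estimate)
OPUS_45_MINUTES = 320      # Claude Opus 4.5 50% time horizon in TH1.1
UPPER_BOUND_FACTOR = 2.3   # Opus 4.5 upper bound relative to its point estimate

def projected_horizon(minutes_now: float, days_ahead: float) -> float:
    """Horizon after days_ahead, doubling every DOUBLING_DAYS."""
    return minutes_now * 2 ** (days_ahead / DOUBLING_DAYS)

for days in (0, 131, 262, 365):
    point = projected_horizon(OPUS_45_MINUTES, days)
    print(f"+{days:>3}d: {point / 60:5.1f} h point estimate, "
          f"{point * UPPER_BOUND_FACTOR / 60:5.1f} h at the 2.3x upper bound")

# +  0d:   5.3 h point estimate,  12.3 h at the 2.3x upper bound
# +131d:  10.7 h point estimate,  24.5 h at the 2.3x upper bound
# +262d:  21.3 h point estimate,  49.1 h at the 2.3x upper bound
# +365d:  36.8 h point estimate,  84.6 h at the 2.3x upper bound
```

The spread between the point-estimate and upper-bound columns is the uncertainty that the governance-implications note above says the document does not address.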

## Key Facts
- METR's full historical trend (2019-2025) estimates 196-day capability doubling time
- METR's TH1.1 estimates 131-day capability doubling since 2023 (20% faster than previous 165-day estimate)
- METR's TH1.1 estimates 89-day capability doubling since 2024
- Claude Opus 4.5 achieved 320-minute (5.3 hour) time horizon in TH1.1
- GPT-5 achieved 214-minute time horizon in TH1.1
- o3 achieved 121-minute time horizon in TH1.1
- METR doubled long-duration tasks from 14 to 31 in TH1.1
- Only 5 of 31 long tasks in TH1.1 have actual human baseline times
- GPT-4 variants saw 35-57% downward revisions in TH1.1 estimates
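
For readers checking the percentages, a tiny arithmetic sketch over the quoted numbers only (the estimates themselves come from the source):

```python
# Quick arithmetic check of the revision figures quoted above.
revisions = {
    "since 2023": (165, 131),   # previous estimate -> TH1.1 doubling time, days
    "since 2024": (109, 89),
}
for label, (old, new) in revisions.items():
    print(f"{label}: {new}d doubling, {(old - new) / old:.1%} shorter than {old}d")
# since 2023: 131d doubling, 20.6% shorter than 165d  (the "20% faster" figure)
# since 2024: 89d doubling, 18.3% shorter than 109d

print(f"Opus 4.5 50% horizon: {320 / 60:.1f} h")  # 5.3 h, matching the bullet above
```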