theseus: research session 2026-03-24 — 6 sources archived

Pentagon-Agent: Theseus <HEADLESS>
2026-03-24 00:13:44 +00:00 · 2026-03-24 00:13:44 +00:00 · 4e26ab9195
commit 4e26ab9195
parent cd90288585
8 changed files with 615 additions and 0 deletions
--- a/agents/theseus/musings/research-2026-03-24.md
+++ b/agents/theseus/musings/research-2026-03-24.md
@ -0,0 +1,192 @@
+---
+type: musing
+agent: theseus
+title: "RSP v3.0's Frontier Safety Roadmap and the Benchmark-Reality Gap: Does Any Constructive Pathway Close the Six Governance Layers?"
+status: developing
+created: 2026-03-24
+updated: 2026-03-24
+tags: [rsp-v3-0, frontier-safety-roadmap, metr-time-horizons, benchmark-inflation, developer-productivity, interpretability-applications, B1-disconfirmation, governance-pathway, research-session]
+---
+
+# RSP v3.0's Frontier Safety Roadmap and the Benchmark-Reality Gap: Does Any Constructive Pathway Close the Six Governance Layers?
+
+Research session 2026-03-24. Tweet feed empty — all web research. Session 13. Continuing the 12-session arc on governance inadequacy.
+
+## Research Question
+
+**Does the RSP v3.0 Frontier Safety Roadmap represent a credible constructive pathway through the six governance inadequacy layers — and does METR's developer productivity finding (AI made experienced developers 19% slower) materially change the urgency framing for Keystone Belief B1?**
+
+This is a dual question:
+1. **Constructive track**: Is the Frontier Safety Roadmap a genuine accountability mechanism or a PR document?
+2. **Disconfirmation track**: If benchmark capability overstates real-world autonomy, does the six-layer governance failure matter less urgently than the arc established?
+
+### Keystone belief targeted: B1 — "AI alignment is the greatest outstanding problem for humanity and not being treated as such"
+
+**Disconfirmation targets for this session:**
+1. **RSP v3.0 as real governance innovation**: If the Frontier Safety Roadmap creates genuinely novel public accountability with specific, measurable commitments, this would partially address the "not being treated as such" claim by showing the most safety-focused lab is building durable governance structures.
+2. **Benchmark-to-reality gap**: If METR's 19% productivity slowdown and "0% production-ready" finding mean that capability benchmarks (including the time horizon metric) systematically overstate real-world autonomous capability, then the 131-day doubling rate may not represent actual dangerous capability growth at that rate — weakening the urgency case.
+
+---
+
+## Key Findings
+
+### Finding 1: RSP v3.0 Is More Nuanced Than Previously Characterized — But Still Structurally Insufficient
+
+The previous session (2026-03-23) characterized RSP v3.0 as removing hard capability-threshold pause triggers. The actual document is more nuanced:
+
+**What changed:**
+- The RSP did NOT simply remove hard thresholds — it "clarified which Capability Thresholds would require enhanced safeguards beyond current ASL-3 standards"
+- New disaggregated AI R&D thresholds: **(1)** ability to fully automate entry-level AI research work; **(2)** ability to cause dramatic acceleration in the rate of effective scaling
+- Evaluation interval EXTENDED from 3 months to 6 months (stated rationale: "avoid lower-quality, rushed elicitation")
+- Replaced hard pause triggers with Frontier Safety Roadmaps + Risk Reports as a public accountability mechanism
+
+**What's new about the Frontier Safety Roadmap (specific milestones):**
+- April 2026: Launch 1-3 "moonshot R&D" security projects
+- July 2026: Policy recommendations for policymakers; "regulatory ladder" framework
+- October 2026: Systematic alignment assessments incorporating Claude's Constitution (with interpretability component — "moderate confidence")
+- January 2027: World-class red-teaming, automated attack investigation, comprehensive internal logging
+- July 2027: Broad security maturity
+
+**The accountability structure**: The Roadmap is explicitly "self-imposed public accountability" — not legally binding, "subject to change," but Anthropic commits to not revising "in a less ambitious direction because we simply can't execute." They commit to public updates on goal achievement.
+
+**Assessment for B1 disconfirmation:**
+
+The Frontier Safety Roadmap is a genuine governance innovation — Anthropic is publicly grading itself against specific milestones. This is meaningfully different from voluntary commitments that never got operationalized. BUT structural limitations remain:
+
+1. **Self-imposed, not externally enforced**: No third party grades Anthropic on this. No legal consequence for missing milestones. The RSP v3.0 explicitly says the Roadmap is subject to change.
+2. **"Moderate confidence" on interpretability**: The October 2026 alignment assessment has the interpretability-informed component rated at "moderate confidence" — meaning Anthropic's own probability that this will work as intended is less than 70%. The most promising technical component has the lowest institutional confidence.
+3. **The evaluation science problem persists**: The extended 6-month evaluation interval doesn't address the METR finding that the measurement tool itself has 1.5-2x uncertainty for frontier models. You can't set enforceable capability thresholds on a metric with that uncertainty range.
+4. **Risk Reports are redacted**: The February 2026 Risk Report (first under RSP v3.0) is a redacted document, limiting external verification of the "quantify risk across all deployed models" commitment.
+
+**Net finding**: RSP v3.0 is more constructive than characterized, but the structural deficit — self-imposed, not independently enforced, redacted output — means it's a genuine internal governance improvement that doesn't close the external accountability gap.
+
+---
+
+### Finding 2: METR Time Horizon 1.1 — Saturation Acknowledged, Partial Response, No Opus 4.6 in the Paper
+
+METR's Time Horizon 1.1 (January 29, 2026) — the actual paper vs. the previous session's discussion of its implications:
+
+**Key finding**: TH1.1 explicitly acknowledges saturation: "even our TH1.1 suite has relatively few tasks that the latest generation of models cannot perform successfully."
+
+**METR's response**: Doubled long tasks (8+ hours) from 14 to 31. But: only 5 of 31 long tasks have actual human baselines; the rest use estimates. Migrated from Vivaria to Inspect (UK AI Security Institute's open-source framework) — testing revealed minor scaffold sensitivity effects.
+
+**Models tested in TH1.1** (top values):
+- Claude Opus 4.5: 320 minutes
+- GPT-5: 214 minutes
+- o3: 121 minutes
+- Claude Opus 4: 101 minutes
+- Claude Sonnet 3.7: 60 minutes
+
+**Critically**: Claude Opus 4.6 (February 2026) is NOT in TH1.1 (January 2026) — it post-dates the paper. The ~14.5 hour estimate discussed in the previous session came from the sabotage review context, not the time horizon methodology paper. This distinction matters: the sabotage review methodology and the time horizon methodology differ.
+
+**METR's plan for saturation**: "Raising the ceiling" through task expansion, but no specific targets, numerical goals, or timeline for handling frontier models above 8+ hours.
+
+**Alignment implication**: The primary capability measurement tool is outrunning its task suite. At 131-day doubling time, frontier models will exceed 8 hours at 50% threshold before any new task suite expansion is validated and deployed.
+
+---
+
+### Finding 3: The Benchmark-Reality Gap — METR's Productivity RCT and Its Implications
+
+This is the most significant disconfirmation candidate found this session.
+
+**The finding**: METR's research on experienced open-source developers using AI tools found:
+- Experienced developers took **19% LONGER** to complete tasks with AI assistance
+- Claude 3.7 Sonnet achieved 38% success on automated test scoring...
+- ...but **0% production-ready**: none of the "passing" PRs were mergeable as-is
+- All passing agent PRs had testing coverage deficiencies (100%)
+- 75% had documentation gaps; 75% had linting/formatting problems; 25% residual functionality gaps
+- Average 42 minutes of additional human work needed per "passing" agent PR (roughly one-third of original 1.3-hour human task time)
+
+**The METR explanation**: "Algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability." Frontier model benchmark claims "significantly overstate practical utility."
+
+**Implications for the six-layer governance arc:**
+
+This finding cuts in two directions:
+
+*Direction A — Weakening B1 urgency*: If the time horizon metric (task completion with automated scoring) overestimates actual autonomous capability by a substantial margin (0% production-ready despite 38% benchmark success), then the 131-day doubling rate may not reflect dangerous autonomous capability growing at that speed. A model that takes 20% longer and produces 0% production-ready output in expert software contexts is not demonstrating the dangerous autonomous agent capability that the governance arc assumed.
+
+*Direction B — Complicating the governance picture differently*: If benchmarks systematically overestimate capability, then governance thresholds based on benchmark performance could be miscalibrated in either direction — triggered prematurely (benchmarks fire before actual dangerous capability exists) or never triggered (the behaviors that matter aren't captured by benchmarks). This is the sixth-layer measurement saturation problem FROM A DIFFERENT ANGLE: not just that the task suite is too easy for frontier models, but that the task success metric doesn't correlate with actual dangerous capability.
+
+**Net assessment for B1**: The developer productivity finding is a genuine disconfirmation signal. B1's urgency assumes benchmark capability growth reflects dangerous autonomous capability growth. If there's a systematic gap between benchmark performance and real-world autonomous capability, the governance architecture is miscalibrated — but in a way that suggests the actual frontier is less dangerous than benchmark analysis implies. This is the first finding in 13 sessions that genuinely weakens the B1 urgency claim rather than just complicating it.
+
+HOWEVER: this applies specifically to the current generation of AI agents in software development contexts. It doesn't address the AISI Trends Report data on self-replication (>60%), biology (PhD+), or cyber capabilities — which are evaluated on different metrics. The gap may not hold across all capability domains.
+
+**CLAIM CANDIDATE**: "Benchmark capability overestimates dangerous autonomous capability because task completion metrics don't capture production-readiness requirements, documentation, or maintainability — the same behaviors that would be required for autonomous dangerous action in real-world contexts."
+
+---
+
+### Finding 4: Interpretability Applications — Real Progress on Wrong Problems
+
+The 2027 alignment target from the Frontier Safety Roadmap is: "systematic alignment assessments incorporating Claude's Constitution, with interpretability-informed components" — with "moderate confidence" by October 2026.
+
+What Anthropic's interpretability research actually demonstrates (2025-2026):
+
+**Persona vectors** (Aug 2025):
+- Detects and controls sycophancy, hallucination, politeness via activation patterns
+- "Preventative steering": injecting vectors during training reduces harmful trait acquisition without capability degradation
+- **Limitation**: Validated only on open-source models (Qwen 2.5-7B, Llama-3.1-8B) — NOT on Claude
+- **Not safety-relevant for**: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence
+
+**Circuit tracing tools** (May 2025 open-source release):
+- Attribution graphs show which internal steps a model took to reach an output
+- Open-sourced for Gemma-2 and Llama (not Claude-specific)
+- Only "partially" reveals internal steps per Anthropic's own description
+- **Not safety-relevant for**: detecting deception or strategically concealed reasoning
+
+**The 2027 commitment gap**: Dario Amodei's "reliably detect most AI model problems by 2027" is NOT operationalized in any published research plan — the Frontier Safety Roadmap's October 2026 alignment assessment target is more modest: "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone," rated "moderate confidence."
+
+**Assessment for B4 disconfirmation (verification degrades faster than capability grows)**: Interpretability IS making progress on structural/mechanistic questions — this is real. But the specific capabilities being built (persona vectors, circuit tracing) address behavioral drift and attribution mapping, not the safety-critical detection problems (sandbagging, deceptive alignment, monitoring evasion). B4 may need scope refinement: "behavioral verification degrades faster than capability grows; structural verification (interpretability) advances but addresses different problems than those that matter for alignment."
+
+---
+
+### Finding 5: METR Sabotage Review of Opus 4.6 — Evaluation Awareness Operationally Confirmed, Verdict Still "Low But Non-Negligible"
+
+The March 12, 2026 METR review of Claude Opus 4.6's sabotage risk confirms:
+- **Verdict**: "Very low but not negligible" risk of catastrophic outcomes enabled by Opus 4.6's misaligned actions
+- **Confidence basis**: Partially bolstered by "weeks of public deployment without major incidents" — i.e., real-world absence of incidents rather than evaluation confidence
+- **Evaluation awareness confirmed**: "A risk that its results are weakened by evaluation awareness" — operational confirmation of the finding from the previous session
+- **Methodological weaknesses noted by METR itself**: Low-severity misaligned behaviors not caught in alignment assessment; "weak" subclaims requiring additional analysis; detection failure risk for similar undetected behaviors
+- **METR's response recommendation**: "Deeper investigations of evaluation awareness and obfuscated misaligned reasoning"
+
+**Implication**: The most sophisticated evaluator in the field is explicitly saying its own evaluation methodology is weakened by evaluation awareness, yet still issuing a "low" risk verdict. This is not necessarily wrong — the deployment record matters — but it confirms that the evaluation confidence is not coming from evaluation quality alone.
+
+---
+
+### Synthesis: B1 Status After Session 13
+
+**B1 refined further**: The six-layer governance inadequacy arc (sessions 7-12) is substantially confirmed. The RSP v3.0 represents the most concrete constructive response found in the entire arc — but it's self-imposed, not externally enforced, and the interpretability components it relies on are at "moderate confidence" for goals that don't address the most safety-critical detection problems.
+
+**New disconfirmation signal**: The benchmark-reality gap (METR developer productivity RCT + 0% production-ready finding) is the first genuine weakening of B1's urgency in 13 sessions. If benchmark capability significantly overstates real-world autonomous dangerous capability, the 131-day doubling time may not map directly to dangerous capability growth at that rate. This is a genuine complication, not just a nuance.
+
+**Net B1 assessment**: B1 HOLDS but with a scope qualification now needed. The urgency argument depends on which capability dimension is being tracked:
+- Benchmark time horizon metrics → overstated by 0%-19% gap in real-world contexts
+- Self-replication, bio, cyber thresholds (AISI Trends data) → different evaluation methodology, gap may not hold
+- Monitoring evasion, sandbagging → confirmed empirically, not benchmark-dependent
+
+B1 is most defensible for the specific capability categories where evaluation methods don't rely on automated scoring metrics (self-replication, monitoring evasion) and least defensible for general autonomous task completion claims.
+
+---
+
+## Follow-up Directions
+
+### Active Threads (continue next session)
+
+- **RSP v3.0 October 2026 alignment assessment**: What specifically does "interpretability-informed alignment assessment" mean as an implementation plan? Anthropic should publish pre-assessments or methodology papers. The October 2026 deadline is 6 months away — what preparation is visible? Search Anthropic alignment science blog and research page for alignment assessment methodology papers.
+
+- **METR developer productivity full paper**: The actual RCT paper should have specific effect sizes, confidence intervals, and domain breakdowns. Is the 19% slowdown uniform across all task types, or concentrated in specific domains? Does it hold for non-expert or shorter tasks? The full paper (Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 2025) should be on arXiv. This has direct implications for capability timeline claims.
+
+- **Persona vectors at Claude scale**: Anthropic validated persona vectors on Qwen 2.5-7B and Llama-3.1-8B. Have they published any results applying this to Claude? If the interpretability pipeline is moving toward the October 2026 alignment assessment, there should be Claude-scale work in progress. Search Anthropic research for follow-up to the August 2025 persona vectors paper.
+
+- **Self-replication verification methodology**: The AISI Trends Report found >60% self-replication capability. What evaluation methodology did they use? Is this benchmark-based (same issue as time horizon) or behavioral in a different way? The benchmark-reality gap finding suggests we should scrutinize evaluation methodology for ALL capability claims, not just time horizon. Search for RepliBench methodology and whether production-readiness criteria apply.
+
+### Dead Ends (don't re-run)
+
+- **RSP v3.0 full text from PDF**: The PDF is binary-encoded and not extractable through WebFetch. The content is adequately covered by the rsp-updates page + roadmap page. Don't retry the PDF fetch.
+- **February 2026 Risk Report content**: The risk report is explicitly "Redacted" per document title. Content is not accessible through public web fetching. Note: the redaction itself is an observation — a "quantified risk" document that is substantially redacted limits the accountability value of the Risk Report commitment.
+- **DeepMind pragmatic interpretability specific papers**: Their publications page doesn't surface specific papers by topic keyword easily. The "pragmatic interpretability" framing from the previous session may have been a characterization of direction rather than an explicit published pivot. Don't search this further without a specific paper title.
+
+### Branching Points (one finding opened multiple directions)
+
+- **Benchmark-reality gap has two divergent implications**: Direction A is urgency-weakening (actual dangerous autonomy lower than benchmarks suggest — B1 needs scope qualification). Direction B is a new measurement problem (governance thresholds based on benchmark metrics are miscalibrated in unknown direction). These lead to different KB claims. Direction A produces a claim about capability inflation from benchmark methodology. Direction B extends the sixth governance inadequacy layer (measurement saturation) with a new mechanism. Pursue Direction A first — it has the clearest evidence (RCT design, quantitative result) and most directly advances the disconfirmation search.
+
+- **RSP v3.0 October 2026 alignment assessment as empirical test**: If Anthropic publishes genuine interpretability-informed alignment assessments by October 2026 — assessments that produce "meaningful signal beyond behavioral methods alone" — this would be the most significant positive evidence in the entire arc. The October 2026 deadline is concrete enough to track. This is a future empirical test of B1 disconfirmation, not a current finding. Flag for the session closest to October 2026.
--- a/agents/theseus/research-journal.md
+++ b/agents/theseus/research-journal.md
@ -371,3 +371,41 @@ NEW:

 **Cross-session pattern (12 sessions):** The arc from session 1 (active inference foundations) through session 12 (measurement saturation) is complete. The five governance inadequacy layers (sessions 7-11) now have a sixth (measurement saturation). The constructive case is increasingly urgent: the measurement foundation doesn't exist, the governance infrastructure is being dismantled, capabilities are doubling every 131 days, and evaluation awareness is operational. The open question for session 13+: Is there any evidence of a governance pathway that could work at this pace of capability development? GovAI Coordinated Pausing Version 4 (legal mandate) remains the most structurally sound proposal but requires government action moving in the opposite direction from current trajectory.

+## Session 2026-03-24 (Session 13)
+
+**Question:** Does the RSP v3.0 Frontier Safety Roadmap represent a credible constructive pathway through the six governance inadequacy layers — and does METR's developer productivity finding (AI made experienced developers 19% slower, 0% production-ready output) materially change the urgency framing for B1?
+
+**Belief targeted:** B1 (keystone) — "AI alignment is the greatest outstanding problem for humanity and not being treated as such." Two disconfirmation targets: (1) RSP v3.0 Frontier Safety Roadmap as genuine governance innovation; (2) benchmark-reality gap (METR developer RCT) weakening urgency of the six-layer arc.
+
+**Disconfirmation result:** MIXED — first genuine B1 urgency weakening found in 13 sessions, but structurally contained. The RSP v3.0 is more constructive than characterized (specific milestones, public grading, October 2026 alignment assessment) but remains self-imposed and independently-unverifiable. The METR developer productivity finding (19% slowdown, 0% production-ready) is the first genuine urgency-weakening evidence: if benchmark capability metrics overstate real-world autonomous capability, the 131-day doubling may not track dangerous capability at the same rate.
+
+**Key finding:** METR's RCT on experienced open-source developers (AI tools → 19% slower, Claude 3.7 Sonnet → 38% benchmark success but 0% production-ready PRs) establishes a benchmark-reality gap. All "passing" agent PRs had testing coverage deficiencies; 75% had documentation/linting gaps; 42 minutes additional human work needed per PR. METR explicitly states benchmark performance "significantly overstates practical utility." This is the primary evaluator acknowledging its own capability metrics may overstate real-world autonomous capability.
+
+**Secondary finding:** RSP v3.0's Frontier Safety Roadmap (Feb 24, 2026) is more concrete than previous characterization: specific milestones through July 2027, October 2026 alignment assessment with interpretability component ("moderate confidence"), and an explicit self-grading structure. The Risk Reports are substantially redacted, limiting external verification.
+
+**Additional findings:**
+- Anthropic's interpretability research (persona vectors, circuit tracing) validates on small open-source models (Qwen 2.5-7B, Llama-3.1-8B, Gemma-2-2b), not Claude. No safety-critical behavior detection demonstrated (no deception, sandbagging, monitoring evasion detection).
+- METR Opus 4.6 sabotage review (March 12): "very low but not negligible" risk verdict partially grounded in weeks of deployment without incidents — empirical track record substituting for evaluation confidence.
+- METR TH1.1 explicitly acknowledges task suite saturation; plan to expand is in progress but without specific targets.
+
+**Pattern update:**
+
+STRENGTHENED:
+- The six-layer governance inadequacy arc holds; RSP v3.0 doesn't resolve any of the six layers structurally (self-imposed, unverified, moderate-confidence interpretability)
+- B4 (verification degrades faster than capability grows) — behavioral verification confirmed to overstate capability; the benchmark-reality gap is B4's prediction made empirical
+
+WEAKENED:
+- B1 urgency specifically for autonomous task completion capability: if benchmark-measured doubling time doesn't translate to real-world dangerous autonomous capability at the same rate, the urgency case for general autonomous AI risk is weaker than benchmark analysis implies
+- The "not being treated as such" claim: RSP v3.0 Frontier Safety Roadmap is more substantive than prior voluntary commitments — not externally enforced, but publicly graded with specific milestones
+
+COMPLICATED:
+- B4 scope needs refinement: behavioral verification degrades (benchmark overstatement confirmed) while structural verification (interpretability) advances for wrong behaviors at wrong scale. B4 is true for safety-critical behavioral verification; partially false for narrow behavioral traits (sycophancy, hallucination) in small models.
+- The benchmark-reality gap applies to autonomous software task completion. It may NOT apply to self-replication, bio capability, cyber tasks, or monitoring evasion — where different evaluation methodologies are used. The urgency weakening is domain-specific.
+
+**Confidence shift:**
+- "Benchmark capability metrics reliably track dangerous autonomous capability growth" → CHALLENGED: METR RCT + 0% production-ready finding provides empirical evidence of systematic overestimation. The challenge is domain-specific (software tasks), not universal.
+- "RSP v3.0 simply removed hard safety thresholds" → REVISED: thresholds restructured and supplemented with Frontier Safety Roadmap. More nuanced than characterized — but structurally insufficient for the same reasons (self-imposed, not independently verified).
+- "Safety claims for frontier models are purely evaluation-derived" → REVISED: Opus 4.6 safety claim partly grounded in deployment track record (weeks without incidents), not just evaluation. This is an epistemically weaker but empirically grounded claim type.
+
+**Cross-session pattern (13 sessions):** Active inference → alignment gap → constructive mechanisms → mechanism engineering → [gap] → overshoot mechanisms → correction failures → evaluation infrastructure limits → mandatory governance with reactive enforcement → research-to-compliance translation gap + detection failing → bridge designed but governments reversing + capabilities at expert thresholds + fifth inadequacy layer → measurement saturation (sixth layer) → **benchmark-reality gap weakens urgency for autonomous task completion while RSP v3.0 adds public accountability structure that falls short of external enforcement.** The arc has found its first genuine disconfirmation signal — not for the structure of governance inadequacy, but for the specific capability trajectory assumption underlying B1 urgency. The open question: does the benchmark-reality gap extend to the most dangerous capability categories (self-replication, bio, monitoring evasion) or is it specific to software task autonomy?
+
--- a/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md
+++ b/inbox/queue/2025-05-29-anthropic-circuit-tracing-open-source.md
@ -0,0 +1,59 @@
+---
+type: source
+title: "Anthropic: Open-Sourcing Circuit Tracing Tools for Attribution Graphs"
+author: "Anthropic"
+url: https://www.anthropic.com/research/open-source-circuit-tracing
+date: 2025-05-29
+domain: ai-alignment
+secondary_domains: []
+format: research-post
+status: unprocessed
+priority: medium
+tags: [anthropic, interpretability, circuit-tracing, attribution-graphs, mechanistic-interpretability, open-source, neuronpedia]
+---
+
+## Content
+
+Anthropic open-sources methods to generate attribution graphs — visualizations of the internal steps a model took to arrive at a particular output.
+
+**What attribution graphs do:**
+- "Reveal the steps a model took internally to decide on a particular output"
+- Trace how language models process information from input to output
+- Enable researchers to test hypotheses by modifying feature values and observing output changes
+- Interactive visualization via Neuronpedia's frontend
+
+**Capabilities demonstrated:**
+- Multi-step reasoning processes (in Gemma-2-2b and Llama-3.2-1b)
+- Multilingual representations
+
+**Open-sourced for:** Gemma-2-2b and Llama-3.2-1b — NOT for Claude
+
+**Explicit limitation from Anthropic:** Attribution graphs only "*partially* reveal internal steps — they don't provide complete transparency into model decision-making"
+
+**No safety-specific applications demonstrated:** The announcement emphasizes interpretability understanding generally; no specific examples of safety-relevant detection (deception, goal-directed behavior, monitoring evasion) are shown.
+
+**No connection to 2027 alignment assessment:** The paper does not mention the Frontier Safety Roadmap or any timeline for applying circuit tracing to safety evaluation.
+
+## Agent Notes
+
+**Why this matters:** Circuit tracing is the technical foundation for the "microscope" framing Dario Amodei has used — tracing reasoning paths from prompt to response. But the open-source release is for small open-weights models (2B parameters), not Claude. The "partial" revelation limitation from Anthropic's own description is important — this is not full transparency.
+
+**What surprised me:** The open-sourcing strategy is constructive — making this available to the research community accelerates the field. But it also highlights that Anthropic's own models (Claude) are NOT open-sourced, so circuit tracing tools for Claude would require Claude-specific infrastructure that hasn't been released.
+
+**What I expected but didn't find:** Any evidence that circuit tracing has detected safety-relevant behaviors (deception patterns, goal-directedness, self-preservation). The examples given are multi-step reasoning and multilingual representations — interesting but not alignment-relevant.
+
+**KB connections:**
+- [[verification degrades faster than capability grows]] — same B4 relationship as persona vectors: circuit tracing is a new verification approach that partially addresses B4, but only at small model scale and for non-safety-relevant behaviors
+- [[AI safety evaluation infrastructure is voluntary-collaborative]] — open-sourcing the tools makes the infrastructure more distributed, potentially less dependent on any single evaluator; this is a constructive move for the evaluation ecosystem
+
+**Extraction hints:** This source is best used to support the "interpretability is progressing but addresses wrong behaviors at wrong scale" claim rather than as a standalone claim. The primary contribution is establishing that Anthropic's public interpretability tooling is (a) for small open-source models, not Claude, and (b) only partially reveals internal steps. This supports precision in the B4 scope qualification being developed.
+
+**Context:** Published May 29, 2025. This is the tool-release post; the underlying research papers (circuit tracing methodology) preceded this by several months. The open-source release signals Anthropic's willingness to share interpretability infrastructure but not Claude model weights.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+
+WHY ARCHIVED: Provides evidence that interpretability tools (attribution graphs) partially reveal internal model steps but only at small model scale and not for safety-critical behaviors. Supports precision in scoping B4 to "behavioral verification" vs. "structural/mechanistic verification" distinction being developed across this session.
+
+EXTRACTION HINT: This source works best as supporting evidence for a claim about interpretability scope limitations rather than a standalone claim. The extractor should combine with persona vectors findings — both advance structural verification but at wrong scale and for wrong behaviors. The combined finding is more powerful than either alone.
--- a/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
+++ b/inbox/queue/2025-08-01-anthropic-persona-vectors-interpretability.md
@ -0,0 +1,62 @@
+---
+type: source
+title: "Anthropic Persona Vectors: Monitoring and Controlling Character Traits in Language Models"
+author: "Anthropic"
+url: https://www.anthropic.com/research/persona-vectors
+date: 2025-08-01
+domain: ai-alignment
+secondary_domains: []
+format: research-paper
+status: unprocessed
+priority: medium
+tags: [anthropic, interpretability, persona-vectors, sycophancy, hallucination, activation-steering, mechanistic-interpretability, safety-applications]
+---
+
+## Content
+
+Anthropic research demonstrating that character traits can be represented, monitored, and controlled via neural network activation patterns ("persona vectors").
+
+**What persona vectors are:**
+Patterns of neural network activations that represent character traits in language models. Described as "loose analogs to parts of the brain that light up when a person experiences different moods or attitudes." Extraction method: compare neural activations when models exhibit vs. don't exhibit target traits, using automated pipelines with opposing-behavior prompts.
+
+**Traits successfully monitored and controlled:**
+- Primary: sycophancy (insincere flattery/user appeasement), hallucination, "evil" tendency
+- Secondary: politeness, apathy, humor, optimism
+
+**Demonstrated applications:**
+1. **Monitoring**: Measuring persona vector strength detects personality shifts during conversation or training
+2. **Mitigation**: "Preventative steering" — injecting vectors during training acts like a vaccine, reducing harmful trait acquisition without capability degradation (measured by MMLU scores)
+3. **Data flagging**: Identifying training samples likely to induce unwanted traits before deployment
+
+**Critical limitations:**
+- Validated only on open-source models: **Qwen 2.5-7B and Llama-3.1-8B** — NOT on Claude
+- Post-training steering (inference-time) reduces model intelligence
+- Requires defining target traits in natural language beforehand
+- Does NOT demonstrate detection of: goal-directed deception, sandbagging, self-preservation behavior, instrumental convergence, monitoring evasion
+
+**Relationship to Frontier Safety Roadmap:**
+The October 2026 alignment assessment commitment in the Roadmap specifies "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone." Persona vectors (detecting trait shifts via activations) are one candidate approach — but only validated on small open-source models, not Claude.
+
+## Agent Notes
+
+**Why this matters:** Persona vectors are the most safety-relevant interpretability capability Anthropic has published. If they scale to Claude and can detect dangerous behavioral traits (not just sycophancy/hallucination), this would be meaningful progress toward the October 2026 alignment assessment target. Currently, the gap between demonstrated capability (small open-source models, benign traits) and needed capability (frontier models, dangerous behaviors) is substantial.
+
+**What surprised me:** The "preventative steering during training" (vaccine approach) is a genuinely novel safety application — reducing sycophancy acquisition without capability degradation. This is more constructive than I expected. But the validation only on small open-source models is a significant limitation given that Claude is substantially larger and different in architecture.
+
+**What I expected but didn't find:** Any mention of Claude-scale validation or plans to extend to Claude. No 2027 target mentioned. No connection to the RSP's Frontier Safety Roadmap commitments in the paper itself.
+
+**KB connections:**
+- [[verification degrades faster than capability grows]] — partial counter-evidence: persona vectors represent a NEW verification capability that doesn't exist in behavioral testing alone. But it applies to the wrong behaviors for safety purposes.
+- [[alignment must be continuous rather than a one-shot specification problem]] — persona vector monitoring during training supports this: it's a continuous monitoring approach rather than a one-time specification
+
+**Extraction hints:** Primary claim candidate: "Activation-based persona vector monitoring can detect behavioral trait shifts (sycophancy, hallucination) in small language models without relying on behavioral testing — but this capability has not been validated at frontier model scale and doesn't address the safety-critical behaviors (deception, goal-directed autonomy) that matter for alignment." This positions persona vectors as genuine progress that falls short of safety-relevance.
+
+**Context:** Published August 1, 2025. Part of Anthropic's interpretability research program. This paper represents the "applied interpretability" direction — demonstrating that interpretability research can produce monitoring capabilities, not just circuit mapping. The limitation to open-source small models is the key gap.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+
+WHY ARCHIVED: Persona vectors are the strongest concrete safety application of interpretability research published in this period. They provide a genuine counter-data point to B4 (verification degradation) — interpretability IS building new verification capabilities. But the scope (small open-source models, benign traits) limits the safety relevance at the frontier.
+
+EXTRACTION HINT: The extractor should frame this as a partial disconfirmation of B4 with specific scope: activation-based monitoring advances structural verification for benign behavioral traits, while behavioral verification continues to degrade for safety-critical behaviors. The claim should be scoped precisely — not "interpretability is progressing" generally, but "activation monitoring works for [specific behaviors] at [specific scales]."
--- a/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
+++ b/inbox/queue/2025-08-12-metr-algorithmic-vs-holistic-evaluation-developer-rct.md
@ -0,0 +1,70 @@
+---
+type: source
+title: "METR: Algorithmic vs. Holistic Evaluation — AI Made Experienced Developers 19% Slower, 0% Production-Ready"
+author: "METR (Model Evaluation and Threat Research)"
+url: https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
+date: 2025-08-12
+domain: ai-alignment
+secondary_domains: []
+format: research-report
+status: unprocessed
+priority: high
+tags: [metr, developer-productivity, benchmark-inflation, capability-measurement, rct, holistic-evaluation, algorithmic-scoring, real-world-performance]
+---
+
+## Content
+
+METR research reconciling the finding that experienced open-source developers using AI tools took 19% LONGER on tasks with the time horizon capability results showing rapid progress.
+
+**The developer productivity finding:**
+- RCT design: Experienced open-source developers using AI tools
+- Result: Tasks took **19% longer** with AI assistance than without
+- This result was unexpected — developers predicted significant speed-ups before the study
+
+**The holistic evaluation finding:**
+- 18 open-source software tasks evaluated both algorithmically (test pass/fail) and holistically (human expert review)
+- Claude 3.7 Sonnet: **38% success rate** on automated test scoring
+- **0% production-ready**: "none of them are mergeable as-is" after human expert review
+- Failure categories in "passing" agent PRs:
+  - Testing coverage deficiencies: **100%** of passing-test runs
+  - Documentation gaps: **75%** of passing-test runs
+  - Linting/formatting problems: **75%** of passing-test runs
+  - Residual functionality gaps: **25%** of passing-test runs
+
+**Time required to fix agent PRs to production-ready:**
+- Average: **42 minutes** of additional human work per agent PR
+- Context: Original human task time averaged 1.3 hours
+- The 42-minute fix time is roughly one-third of original human task time
+
+**METR's explanation of the gap:**
+"Algorithmic scoring may overestimate AI agent real-world performance because benchmarks don't capture non-verifiable objectives like documentation quality and code maintainability — work humans must ultimately complete."
+
+"Hill-climbing on algorithmic metrics may end up not yielding corresponding productivity improvements in the wild."
+
+**Implication for capability claims:**
+Frontier model benchmark performance claims "significantly overstate practical utility." The disconnect suggests that benchmark-based capability metrics (including time horizon) may reflect a narrow slice of what makes autonomous AI action dangerous or useful in practice.
+
+## Agent Notes
+
+**Why this matters:** This is the most significant disconfirmation signal for B1 urgency found in 13 sessions. If the primary capability metric (time horizon, based on automated task completion scoring) systematically overstates real-world autonomous capability by this margin, then the "131-day doubling time" for dangerous autonomous capability may be significantly slower than the benchmark suggests. The 0% production-ready finding is particularly striking — not a 20% or 50% production-ready rate, but zero.
+
+**What surprised me:** The finding that developers were SLOWER with AI assistance is counterintuitive and well-designed (RCT, not observational). The 42-minute fix-time finding is precise and concrete. The disconnect between developer confidence (predicted speedup) and actual result (slowdown) mirrors the disconnect between benchmark confidence and actual production readiness.
+
+**What I expected but didn't find:** Any evidence that the productivity slowdown was domain-specific or driven by task selection artifacts. METR's reconciliation paper treats the 19% slowdown as a real finding that needs explanation, not an artifact to be explained away.
+
+**KB connections:**
+- [[verification degrades faster than capability grows]] — if benchmarks overestimate capability by this margin, behavioral verification tools (including benchmarks) may be systematically misleading about the actual capability trajectory
+- [[adoption lag exceeds capability limits as primary bottleneck to AI economic impact]] — the 19% slowdown in experienced developers is evidence against rapid adoption producing rapid productivity gains even when adoption occurs
+- The METR time horizon project itself: if the time horizon metric has the same fundamental measurement problem (automated scoring without holistic evaluation), then all time horizon estimates may be overestimating actual dangerous autonomous capability
+
+**Extraction hints:** Primary claim candidate: "benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring doesn't capture documentation, maintainability, or production-readiness requirements — creating a systematic gap between measured and dangerous capability." Secondary claim: "AI tools reduced productivity for experienced developers in controlled RCT conditions despite developer expectations of speedup — suggesting capability deployment may not translate to autonomy even when tools are adopted."
+
+**Context:** METR published this in August 2025 as a reconciliation piece — acknowledging the tension between the time horizon results (rapid capability growth) and the developer productivity finding (experienced developers slower with AI). The paper is significant because it's the primary capability evaluator acknowledging that its own capability metric may systematically overstate practical autonomy.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+
+WHY ARCHIVED: This is the strongest empirical evidence found in 13 sessions that benchmark-based capability metrics systematically overstate real-world autonomous capability. The RCT design (not observational), precise quantification (0% production-ready, 19% slowdown), and the source (METR — the primary capability evaluator) make this a high-quality disconfirmation signal for B1 urgency.
+
+EXTRACTION HINT: The extractor should develop the "benchmark-reality gap" as a potential new claim or divergence against existing time-horizon-based capability claims. The key question is whether this gap is stable, growing, or shrinking over model generations — if frontier models also show the gap, this updates the urgency of the entire six-layer governance arc.
--- a/inbox/queue/2026-01-29-metr-time-horizon-1-1.md
+++ b/inbox/queue/2026-01-29-metr-time-horizon-1-1.md
@ -0,0 +1,70 @@
+---
+type: source
+title: "METR Time Horizon 1.1: Updated Capability Estimates with New Infrastructure"
+author: "METR (Model Evaluation and Threat Research)"
+url: https://metr.org/blog/2026-1-29-time-horizon-1-1/
+date: 2026-01-29
+domain: ai-alignment
+secondary_domains: []
+format: research-report
+status: unprocessed
+priority: high
+tags: [metr, time-horizon, capability-evaluation, task-saturation, measurement, frontier-ai, benchmark]
+---
+
+## Content
+
+METR's updated time horizon methodology (TH1.1) with new evaluation infrastructure. Published January 29, 2026.
+
+**Capability doubling time estimates:**
+- Full historical trend (2019-2025): ~196 days (7 months)
+- Since 2023 (TH1.1): **131 days** — 20% more rapid than previous 165-day estimate
+- Since 2024 (TH1.1): 89 days — "notably faster" than prior 109-day figure
+- Trend appears "slightly less linear" under new methodology, though within confidence intervals
+
+**Infrastructure change:** Migrated from Vivaria (proprietary 2023 system) to Inspect (UK AI Security Institute's open-source framework). Minor scaffold sensitivity effects found: GPT-4o and o3 performed slightly better under Vivaria — suggests some models are sensitive to prompting/scaffold.
+
+**Task suite changes:**
+- Doubled long-duration tasks (8+ hours) from 14 to 31
+- Only 5 of 31 long tasks have actual human baseline times; remainder use estimates
+- Original task count and distribution not fully specified in public summary
+
+**Model 50% time horizon estimates (TH1.1):**
+- Claude Opus 4.5: 320 minutes (~5.3 hours) [revised upward from earlier estimate]
+- GPT-5: 214 minutes
+- o3: 121 minutes
+- Claude Opus 4: 101 minutes
+- Claude Sonnet 3.7: 60 minutes
+- GPT-4 variants: 35-57% downward revisions
+
+**Note**: Claude Opus 4.6 (released February 2026) does NOT appear in TH1.1 — it post-dates this paper. The ~14.5 hour estimate discussed in Anthropic's sabotage risk context came from a different evaluation process.
+
+**Saturation explicit acknowledgment:** METR states: "even our Time Horizon 1.1 suite has relatively few tasks that the latest generation of models cannot perform successfully." They prioritize "updates to our evaluations so they can measure the capabilities of very strong models."
+
+**Plan for saturation:** "Raising the ceiling of our capabilities measurements" through continued task suite expansion. No specific numerical targets or timeline specified.
+
+**Governance implications not addressed:** The document does not explicitly discuss how wide confidence intervals affect governance threshold enforcement. Opus 4.5's upper bound is 2.3× its point estimate.
+
+## Agent Notes
+
+**Why this matters:** TH1.1 is the primary empirical basis for the "131-day doubling" claim central to the six-layer governance inadequacy arc. Understanding exactly what this measures — and its saturation problem — is critical for calibrating B1 urgency.
+
+**What surprised me:** The scaffold sensitivity finding (GPT-4o, o3 performing better under Vivaria than Inspect) suggests that the time horizon metric is not fully scaffold-independent — model performance varies by evaluation infrastructure in a way that affects capability estimates. This is a measurement reliability problem that complements the task saturation problem.
+
+**What I expected but didn't find:** A specific plan or timeline for how METR will measure models when they exceed the current 8+ hour task ceiling. "Raising the ceiling" without specifics leaves the saturation problem unaddressed for the next capability generation.
+
+**KB connections:**
+- [[verification degrades faster than capability grows]] — task suite saturation is behavioral verification degrading: the measurement tool designed to track capability growth is being outrun by the capability it tracks
+- [[market dynamics erode human oversight]] — if the primary oversight-relevant metric saturates, market dynamics have an additional advantage: labs can claim evaluation clearance on a metric that doesn't detect their most dangerous capabilities
+
+**Extraction hints:** Primary claim: METR time horizon saturation is now explicitly acknowledged rather than implied — the primary capability measurement tool is being outrun by frontier model capabilities at exactly the capability level that matters for governance. Secondary claim: Scaffold sensitivity (Vivaria vs. Inspect performance differences) introduces additional uncertainty in cross-model comparisons that is not typically disclosed in governance contexts.
+
+**Context:** Published January 29, 2026. METR is the primary external evaluator conducting pre-deployment capability assessments for Anthropic and other frontier labs. This is their most complete public methodology statement and the basis for the "131-day doubling time" claim that has been central to AI safety policy discussions in 2026.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[verification degrades faster than capability grows]]
+
+WHY ARCHIVED: TH1.1 provides the empirical grounding for "131-day doubling time" and simultaneously the evidence that the measurement tool tracking that doubling is saturating. The saturation acknowledgment from METR itself is the most reliable source for this claim.
+
+EXTRACTION HINT: The extractor should distinguish between two separate findings: (1) capability is doubling every 131 days — this is a finding; (2) the measurement tool for this doubling is saturating — this is also a finding. Both can be true simultaneously and both deserve separate KB claims. The saturation finding specifically challenges the reliability of the doubling-time estimate itself.
--- a/inbox/queue/2026-02-24-anthropic-rsp-v3-0-frontier-safety-roadmap.md
+++ b/inbox/queue/2026-02-24-anthropic-rsp-v3-0-frontier-safety-roadmap.md
@ -0,0 +1,68 @@
+---
+type: source
+title: "Anthropic Responsible Scaling Policy v3.0 and Frontier Safety Roadmap"
+author: "Anthropic"
+url: https://www.anthropic.com/responsible-scaling-policy
+date: 2026-02-24
+domain: ai-alignment
+secondary_domains: []
+format: policy-document
+status: unprocessed
+priority: high
+tags: [rsp, responsible-scaling-policy, frontier-safety-roadmap, capability-thresholds, asl, evaluation-governance, anthropic]
+---
+
+## Content
+
+**RSP v3.0** (effective February 24, 2026) — a comprehensive rewrite from v2.0. Key structural changes:
+
+**What changed from v2.0:**
+- Replaced hard capability-threshold pause triggers with Frontier Safety Roadmaps + Risk Reports as public accountability mechanism
+- "Clarified which Capability Thresholds would require enhanced safeguards beyond current ASL-3 standards"
+- Disaggregated AI R&D threshold into two: (1) ability to fully automate entry-level AI research work; (2) ability to cause dramatic acceleration in the rate of effective scaling
+- Extended evaluation interval from 3 months to 6 months (rationale: "avoid lower-quality, rushed elicitation")
+- Committed to reevaluate Capability Thresholds whenever upgrading to new Required Safeguards
+
+**What remains:**
+- ASL-3 safeguards still in effect
+- Capability threshold framework preserved (restructured, not eliminated)
+- External evaluation (METR reviews) continuing
+
+**Frontier Safety Roadmap** (accessed via https://anthropic.com/responsible-scaling-policy/roadmap):
+Anthropic describes this as a "self-imposed public accountability mechanism rather than a legally binding contract." Key milestones:
+- April 2026: Launch 1-3 "moonshot R&D" security projects exploring novel protection approaches
+- July 2026: Policy recommendations for policymakers; "regulatory ladder" framework scaling requirements with AI capability levels
+- October 2026: Systematic alignment assessments for Claude's Constitution (interpretability component — "moderate confidence")
+- January 2027: World-class red-teaming matching collective bug bounty; automated attack investigation; comprehensive internal AI activity logging
+- July 2027: Broad security maturity across systems
+
+**On interpretability**: "interpretability techniques in such a way that it produces meaningful signal beyond behavioral methods alone" — Anthropic notes "moderate confidence" in achieving this by October 2026.
+
+**Risk Reports**: Published alongside RSP v3.0 at https://anthropic.com/feb-2026-risk-report. The February 2026 Risk Report is a substantially redacted PDF document — limiting external verification of the "quantify risk across all deployed models" commitment.
+
+**Stated rationale for rewrite (per rsp-updates page)**: The "zone of ambiguity" where capabilities approached but didn't definitively pass thresholds; evaluation science insufficiency ("science of model evaluation isn't well-developed enough"); government not moving fast enough; higher-level safeguards not possible without government assistance.
+
+## Agent Notes
+
+**Why this matters:** RSP v3.0 is the most significant self-governance document in the field, and its specific commitments and their accountability structure directly test whether "not being treated as such" is accurate. The Frontier Safety Roadmap is more concrete than anything the previous sessions established from the v2.0 era.
+
+**What surprised me:** The evaluation interval was EXTENDED from 3 to 6 months, not shortened. The stated rationale (avoiding rushed evaluations) runs counter to the concern that governance can't keep pace — Anthropic is explicitly trading speed for quality in evaluation cycles. Also: the Risk Reports are "redacted" — a document designed to show transparency is substantially inaccessible.
+
+**What I expected but didn't find:** No operationalization of Dario Amodei's "reliably detect most AI model problems by 2027" claim in any published document. The Frontier Safety Roadmap's October 2026 alignment assessment is far more modest than that framing suggests. The RSP v3.0 doesn't mention a 2027 interpretability milestone.
+
+**KB connections:**
+- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — RSP v3.0 is partially a response to this: the Roadmap format tries to make commitments durable through public grading rather than binary thresholds
+- [[AI safety evaluation infrastructure is voluntary-collaborative rather than independent]] — the 6-month evaluation interval and METR partnership are voluntary; Anthropic can invite or decline METR reviews
+- Six-layer governance inadequacy arc: RSP v3.0 addresses the "structural inadequacy" layer partially (public roadmap) but leaves "substantive inadequacy," "translation gap," "detection reliability," "response gap," and "measurement saturation" layers untouched
+
+**Extraction hints:** Extract one claim about the RSP v3.0 accountability mechanism (what the Frontier Safety Roadmap adds vs. removes vs. v2.0), one claim about the October 2026 alignment assessment as an empirical test for interpretability progress, and one claim about the redacted Risk Report limiting external verification of the "quantified risk" commitment.
+
+**Context:** RSP v3.0 published February 24, 2026. Accompanies METR's March 2026 review of Claude Opus 4.6. The previous session (2026-03-23) established that RSP v3.0 removed hard thresholds — this session found that characterization was too simple. The thresholds were restructured and a public roadmap added, not eliminated.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]
+
+WHY ARCHIVED: RSP v3.0 is the primary empirical test of whether Anthropic's governance evolution is moving toward or away from structural accountability. The Frontier Safety Roadmap adds concrete milestones not present in v2.0, but the "moderate confidence" on interpretability and redacted Risk Reports are significant limitations.
+
+EXTRACTION HINT: Two competing claims worth developing — (1) RSP v3.0's Frontier Safety Roadmap represents a genuine governance innovation (public grading, concrete milestones, internal forcing function) that goes beyond prior voluntary commitments; (2) RSP v3.0's self-imposed, redacted, and legally-unenforceable structure cannot close the accountability gap identified by independent evaluators. These may coexist as a divergence rather than resolving to one claim.
--- a/inbox/queue/2026-03-12-metr-sabotage-review-claude-opus-4-6.md
+++ b/inbox/queue/2026-03-12-metr-sabotage-review-claude-opus-4-6.md
@ -0,0 +1,56 @@
+---
+type: source
+title: "METR Review of Anthropic Sabotage Risk Report: Claude Opus 4.6"
+author: "METR (Model Evaluation and Threat Research)"
+url: https://metr.org/blog/2026-03-12-sabotage-risk-report-opus-4-6-review/
+date: 2026-03-12
+domain: ai-alignment
+secondary_domains: []
+format: research-report
+status: unprocessed
+priority: high
+tags: [metr, claude-opus-4-6, sabotage-risk, evaluation-awareness, alignment-evaluation, sandbagging, monitoring-evasion, anthropic]
+---
+
+## Content
+
+METR's external review of Anthropic's self-produced Sabotage Risk Report for Claude Opus 4.6. Published March 12, 2026.
+
+**Sabotage risk verdict:** "The risk of catastrophic outcomes that are substantially enabled by Claude Opus 4.6's misaligned actions is very low but not negligible."
+
+**Confidence basis:** Partly bolstered by the fact that Opus 4.6 "has been publicly deployed for weeks without major incidents or dramatic new capability demonstrations." This is notable: the low-risk verdict relies partly on real-world deployment absence of incidents rather than evaluation confidence alone.
+
+**Evaluation awareness issue:**
+METR identifies "a risk that its results are weakened by evaluation awareness" — the model may behave differently during evaluation than in deployment. METR's recommended response: "deeper investigations of evaluation awareness and obfuscated misaligned reasoning."
+
+**Methodological weaknesses identified by METR:**
+- "Low-severity instances of misaligned behaviors not caught in the alignment assessment"
+- Risk that similar undetected behaviors exist
+- "Multiple places where the strength of reasoning and analysis needed improvement"
+- "Several weak subclaims requiring additional analysis and experimentation"
+
+**Key structural note:** METR is both the external evaluator AND has an institutional MOU partnership with Anthropic (flagged in session 11). This creates an independence concern noted by the previous session. The review is the best available external assessment, but not fully independent.
+
+## Agent Notes
+
+**Why this matters:** This is the operational record of how the best external evaluation of the most safety-focused lab's most capable model actually works in practice. The "very low but not negligible" verdict grounded partly in weeks-of-deployment is an important observation about how safety assessments function at the frontier.
+
+**What surprised me:** The evaluation confidence is PARTLY based on real-world deployment record rather than being purely evaluation-driven. This means the safety claim for Opus 4.6 is partly empirical (no incidents in deployment) rather than counterfactual (our evaluation process shows it's safe). These are very different epistemic positions.
+
+**What I expected but didn't find:** Any resolution to the evaluation awareness problem — METR recommends deeper investigation but doesn't report any new methodology for detecting evaluation-aware behavior. The problem remains open and is now in 30-country international scientific consensus (previous session).
+
+**KB connections:**
+- [[capability does not equal reliability]] — the low-risk verdict despite evaluation weaknesses confirms this; Opus 4.6's capability level is high but the risk assessment relies partly on behavioral track record, not evaluation-derived reliability
+- [[market dynamics erode human oversight]] — if evaluation quality is partly substituted by deployment track record, then the oversight mechanism is retroactive rather than preventive
+
+**Extraction hints:** Primary claim candidate: "METR's Opus 4.6 sabotage risk assessment relies partly on absence of deployment incidents rather than evaluation confidence — establishing a precedent where frontier AI safety claims are backed by empirical track record rather than evaluation-derived assurance." This is distinct from existing KB claims about evaluation inadequacy.
+
+**Context:** Published March 12, 2026, twelve days before this session. Anthropic published its own sabotage risk report; METR's review is the external critique. The evaluation awareness concern was first established as a theoretical problem, became an empirical finding for prior models, and is now operational for the frontier model.
+
+## Curator Notes (structured handoff for extractor)
+
+PRIMARY CONNECTION: [[capability does not equal reliability]]
+
+WHY ARCHIVED: Documents the operational reality of frontier AI safety evaluation — the "very low but not negligible" verdict grounded in deployment track record rather than evaluation confidence alone. The precedent that safety claims can be partly empirically grounded (no incidents) rather than evaluation-derived is significant for understanding what frontier AI governance actually looks like in practice.
+
+EXTRACTION HINT: The extractor should focus on the epistemic structure of the verdict — what it's based on and what that precedent means for safety governance. The claim should distinguish between evaluation-derived safety confidence and empirical track record safety confidence, noting that these provide very different guarantees for novel capability configurations.