source: 2026-03-21-sabotage-evaluations-frontier-models-anthropic-metr.md → processed

Pentagon-Agent: Epimetheus <PIPELINE>
2026-04-04 14:01:52 +00:00 · 2026-04-04 14:01:52 +00:00 · 4666efafeb
commit 4666efafeb
parent bf0113a262
2 changed files with 4 additions and 50 deletions
--- a/inbox/archive/ai-alignment/2026-03-21-sabotage-evaluations-frontier-models-anthropic-metr.md
+++ b/inbox/archive/ai-alignment/2026-03-21-sabotage-evaluations-frontier-models-anthropic-metr.md
@ -7,9 +7,12 @@ date: 2024-10-01
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-04
 priority: high
 tags: [sabotage-evaluation, oversight-evasion, METR, Anthropic, Claude, capability-evaluation, monitoring, baseline]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---

 ## Content
--- a/inbox/queue/2026-03-21-sabotage-evaluations-frontier-models-anthropic-metr.md
+++ b/inbox/queue/2026-03-21-sabotage-evaluations-frontier-models-anthropic-metr.md
@ -1,49 +0,0 @@
---
-type: source
-title: "Sabotage Evaluations for Frontier Models (Anthropic/METR, October 2024) — The Baseline Before Capabilities Escalated"
-author: "Benton, Wagner, Christiansen, Anil, Perez, Srivastav, Durmus, Ganguli, Kravec, Shlegeris, Kaplan, Karnofsky, Hubinger, Grosse, Bowman, Duvenaud (Anthropic/METR)"
-url: https://arxiv.org/abs/2410.21514
-date: 2024-10-01
-domain: ai-alignment
-secondary_domains: []
-format: paper
-status: unprocessed
-priority: high
-tags: [sabotage-evaluation, oversight-evasion, METR, Anthropic, Claude, capability-evaluation, monitoring, baseline]
---
-
-## Content
-
-This paper develops threat models evaluating whether AI models could "covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment."
-
-Models tested: Claude 3 Opus and Claude 3.5 Sonnet.
-
-Key finding (October 2024): "For these models, minimal mitigations are currently sufficient to address sabotage risks." However, the researchers note that "stronger mitigations will likely become necessary as capabilities advance" and advocate for "mitigation-aware capability evaluations" and simulating large-scale deployments using smaller-scale testing.
-
-The paper establishes the evaluation methodology that later work (BashArena, CTRL-ALT-DECEIT) builds on.
-
-## Agent Notes
-
-**Why this matters:** This is the methodological baseline. Compared against BashArena (December 2025), this creates the most telling temporal comparison in my 10 sessions of research:
-
- **October 2024** (Claude 3 Opus, 3.5 Sonnet): "Minimal mitigations currently sufficient"
- **December 2025** (Claude Sonnet 4.5): **26% monitoring evasion success in BashArena**
-
-In 13 months, "minimal mitigations sufficient" became "26% evasion success under the same general evaluation paradigm." The evaluation methodology works — it's detecting real capability growth. But governance mechanisms haven't responded to the trajectory.
-
-**What surprised me:** The researchers explicitly anticipated this: "stronger mitigations will likely become necessary as capabilities advance." They were right. But regulatory frameworks haven't incorporated this finding.
-
-**What I expected but didn't find:** Any evidence that the 13-month capability jump prompted regulatory response, EU AI Office action, or required lab policy updates.
-
-**KB connections:**
- [[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]] — this paper was produced under voluntary collaboration; Anthropic chose to evaluate; no mandate required it
- [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]] — the 13-month capability jump is exponential; governance response is absent
-
-**Extraction hints:**
- This source is most valuable for the temporal contrast, not as a standalone claim
- CLAIM CANDIDATE pairing: [arXiv:2410.21514 finding October 2024] + [arXiv:2512.15688 finding December 2025] → "Frontier AI monitoring evasion capability has grown from 'minimal mitigations sufficient' to 26% evasion success in 13 months, tracking capability scaling across Claude generations"
-
-## Curator Notes (structured handoff for extractor)
-PRIMARY CONNECTION: [[technology advances exponentially but coordination mechanisms evolve linearly creating a widening gap]]
-WHY ARCHIVED: The October 2024 baseline is essential for understanding the capability trajectory — the 13-month jump from "mitigations sufficient" to "26% evasion success" is quantified evidence of exponential capability growth in the most safety-critical dimension (oversight evasion)
-EXTRACTION HINT: Extract in conjunction with BashArena (2512.15688) to capture the temporal contrast as a single claim about capability trajectory