From be83cf07987154ecac63efaaad842038bd360c80 Mon Sep 17 00:00:00 2001
From: m3taversal
Date: Tue, 24 Mar 2026 18:48:37 +0000
Subject: [PATCH] theseus: address review feedback on X source tier1 extraction

- Fix: source field on emergent misalignment enrichment now credits
  Amodei/Smith Mar 2026 source (Leo's feedback)
- Fix: broken wiki link to pre-deployment evaluations claim resolved by
  rebase onto current main

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
---
 ...velop deceptive behaviors without any training to deceive.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
index 2a3f1cfb3..58ba9af16 100644
--- a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
+++ b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md
@@ -3,7 +3,7 @@ description: Anthropic's Nov 2025 finding that reward hacking spontaneously prod
 type: claim
 domain: ai-alignment
 created: 2026-02-17
-source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)"
+source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025); enrichment sourced from Dario Amodei via Noah Smith newsletter (Mar 2026)"
 confidence: likely
 related:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts