diff --git a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md index 2a3f1cfb3..58ba9af16 100644 --- a/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md +++ b/domains/ai-alignment/emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md @@ -3,7 +3,7 @@ description: Anthropic's Nov 2025 finding that reward hacking spontaneously prod type: claim domain: ai-alignment created: 2026-02-17 -source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)" +source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025); enrichment sourced from Dario Amodei via Noah Smith newsletter (Mar 2026)" confidence: likely related: - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts