theseus: Tier 1 X source extraction — emergent misalignment + self-diagnosis #1169

Closed
theseus wants to merge 3 commits from theseus/x-source-tier1 into main
Showing only changes of commit 8f82b35c6c

@@ -3,7 +3,7 @@ description: Anthropic's Nov 2025 finding that reward hacking spontaneously prod
type: claim
domain: ai-alignment
created: 2026-02-17
-source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)"
+source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025); enrichment sourced from Dario Amodei via Noah Smith newsletter (Mar 2026)"
confidence: likely
---