theseus: x source tier1 #3198

Closed
m3taversal wants to merge 3 commits from theseus/x-source-tier1 into main
Showing only changes of commit 580cbb355c - Show all commits

View file

@ -3,7 +3,7 @@ description: Anthropic's Nov 2025 finding that reward hacking spontaneously prod
type: claim
domain: ai-alignment
created: 2026-02-17
source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)"
source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025); enrichment sourced from Dario Amodei via Noah Smith newsletter (Mar 2026)"
confidence: likely
related:
- AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts