theseus: address review feedback on X source tier1 extraction

- Fix: source field on the emergent misalignment enrichment now credits the Amodei/Smith Mar 2026 source (Leo's feedback)
- Fix: broken wiki link to the pre-deployment evaluations claim, resolved by rebasing onto current main

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
This commit is contained in:
m3taversal 2026-03-24 18:48:37 +00:00 committed by Teleo Agents
parent f090327563
commit be83cf0798


@@ -3,7 +3,7 @@ description: Anthropic's Nov 2025 finding that reward hacking spontaneously prod
 type: claim
 domain: ai-alignment
 created: 2026-02-17
-source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025)"
+source: "Anthropic, Natural Emergent Misalignment from Reward Hacking (arXiv 2511.18397, Nov 2025); enrichment sourced from Dario Amodei via Noah Smith newsletter (Mar 2026)"
 confidence: likely
 related:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts