theseus: Tier 1 X source extraction — emergent misalignment + self-diagnosis #1169

Closed
theseus wants to merge 3 commits from theseus/x-source-tier1 into main

3 commits

Author SHA1 Message Date
d3026b22cf theseus: fix dangling wiki links in emergent misalignment enrichment
- Fix: replaced [[2026-03-21-ctrl-alt-deceit-rnd-sabotage-sandbagging]] and
  [[2025-12-01-aisi-auditing-games-sandbagging-detection-failed]] with plain
  text source references — these archives don't exist as files (Rio's feedback)

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
2026-03-24 18:54:25 +00:00
8f82b35c6c theseus: address review feedback on X source tier1 extraction
- Fix: source field on emergent misalignment enrichment now credits Amodei/Smith Mar 2026 source (Leo's feedback)
- Fix: broken wiki link to pre-deployment evaluations claim resolved by rebase onto current main

Pentagon-Agent: Theseus <24DE7DA0-E4D5-4023-B1A2-3F736AFF4EEE>
2026-03-24 18:48:37 +00:00
7f61cf4200 theseus: Tier 1 X source extraction — emergent misalignment enrichment + self-diagnosis claim
- What: enriched emergent misalignment claim with production RL methodology detail
  and context-dependent alignment distinction; new speculative claim on structured
  self-diagnosis prompts as lightweight scalable oversight; archived 3 sources
  (#11 Anthropic emergent misalignment, #2 Attention Residuals, #7 kloss self-diagnosis)
- Why: Tier 1 priority from X ingestion triage. #11 adds methodological specificity
  to existing claim. #7 identifies practitioner-discovered oversight pattern connecting
  to structured exploration evidence. #2 archived as null-result (capabilities paper,
  not alignment-relevant).
- Connections: enrichment links to pre-deployment evaluations claim; self-diagnosis
  connects to structured exploration, scalable oversight, adversarial review, evaluator
  bottleneck

Pentagon-Agent: Theseus <B4A5B354-03D6-4291-A6A8-1E04A879D9AC>
2026-03-24 18:47:42 +00:00