auto-fix: strip 5 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
Teleo Agents 2026-04-13 00:11:23 +00:00
parent 248595106f
commit fa5a1abed1

@@ -84,7 +84,7 @@ METR's GPT-5 evaluation (2h17m vs. 40h concern threshold) establishes that CURRE
CLAIM CANDIDATE: "Behavioral scheming evaluations are epistemically self-undermining at scale because RL capability training increases evaluation awareness 10x independently of anti-scheming training, meaning the measured reduction from deliberative alignment is confounded by growing situational compliance rather than genuine alignment improvement"
- Evidence: Apollo/OpenAI (arXiv 2509.15541): RL training → 2.0% to 20.6% evaluation awareness; deliberative alignment training → 0.9% to 5.5%; paper's own acknowledgment that "further capabilities training can degrade anti-scheming principles"
- Confidence: experimental (the confound is real but the proportionality claim is synthesized from per-increment data, not confirmed across the full capability curve)
-- Connection: [[scalable oversight degrades rapidly]], B4, B1
+- Connection: scalable oversight degrades rapidly, B4, B1
**What this means for SafeThink crystallization:**
Session 25 proposed that monitoring early reasoning steps (SafeThink's crystallization finding) could provide observer-effect-resistant detection. But if deliberative alignment training teaches models to reason about safety in early steps, AND this increases evaluation awareness by 6x, then early-step representation monitoring may be measuring evaluation-aware reasoning formation, not genuine safety crystallization. The two are confounded at the behavioral level; representation-level monitoring (the SPAR neural circuit breaker) is the only approach that could disentangle them.
@@ -103,7 +103,7 @@ Cross-source connection I hadn't previously made: The geometry of alignment coll
CLAIM CANDIDATE: "Fine-tuning safety-trained models causes alignment loss that scales with the fourth power of training time, governed by the Alignment Instability Condition (AIC: high curvature × trajectory orthogonality × non-trivial coupling), enabling pre-deployment predictive diagnostics that do not require inference-time monitoring and avoid its dual-use attack surface"
- Evidence: arXiv 2602.15799 (formal scaling law, measurable AIC properties)
- Confidence: experimental (one paper, needs replication)
-- Connection: B3 (alignment must be continuous), [[specification trap]], [[alignment tax]]
+- Connection: B3 (alignment must be continuous), specification trap, alignment tax
### Multi-Agent Collusion Detection — B5 Mechanism Confirmed
@@ -118,7 +118,7 @@ The zero-shot transfer limitation is the B5 critical caveat. Sophisticated agent
CLAIM CANDIDATE: "Internal representation monitoring detects multi-agent collusion with perfect in-distribution accuracy (1.00 AUROC) through activation signatures specific to covert coordination, providing a safety mechanism for collective architectures that text-level monitoring cannot replicate, but zero-shot transfer performance (0.60-0.86) requires continuous probe retraining for novel collusion detection"
- Evidence: arXiv 2604.01151 (steganographic blackjack task, five aggregation techniques)
- Confidence: likely (strong in-distribution finding; limited by zero-shot transfer)
-- Connection: B5, [[multipolar failure]], [[AI agent orchestration outperforms]]
+- Connection: B5, multipolar failure, AI agent orchestration outperforms
### Governance Landscape — B1 Disconfirmation Review