14 changed files with 113 additions and 275 deletions
--- a/agents/theseus/musings/research-2026-04-01.md
+++ b/agents/theseus/musings/research-2026-04-01.md
@ -1,150 +0,0 @@
 ---
 created: 2026-04-01
 status: developing
 name: research-2026-04-01
 description: "Session 20 — International governance layer: UN CCW autonomous weapons progress, multilateral verification mechanisms, and whether any binding international framework addresses the Article 2.3 gap"
 type: musing
 date: 2026-04-01
 session: 20
 research_question: "Do any concrete multilateral verification mechanisms exist for autonomous weapons AI in 2026 — UN CCW progress, European alternative proposals, or any binding international framework that addresses the governance gap EU AI Act Article 2.3 creates?"
 belief_targeted: "B1 — 'not being treated as such' component. Disconfirmation search: evidence that international governance frameworks (UN CCW, multilateral verification) have moved from proposal-stage to operational, which would mean governance is being built at the international layer even where domestic frameworks fail."
 ---
 # Session 20 — The International Governance Layer
 ## Orientation
 Session 19 completed the domestic and EU governance failure map:
 - Level 1: Technical measurement failure (AuditBench, Hot Mess, formal verification limits)
 - Level 2: Institutional/voluntary failure (RSPs, voluntary commitments = cheap talk)
 - Level 3: Statutory/legislative failure in US (all three branches)
 - Level 4: International legislative ceiling (EU AI Act Article 2.3 — military AI excluded)
 The EU regulatory arbitrage alternative was closed as a route for military/autonomous weapons AI. But Session 19 also noted: "The only remaining partial governance mechanisms are... Multilateral verification mechanisms (proposed, not operational)."
 After 19 sessions, the international governance layer remains uninvestigated. This is the structural gap.
 ## Disconfirmation Target
 **B1 keystone belief:** "AI alignment is the greatest outstanding problem for humanity. We're running out of time and it's not being treated as such."
 **What would weaken B1:** Evidence that multilateral verification mechanisms for autonomous weapons AI have moved from proposal to framework agreement — or that the UN CCW process on LAWS (Lethal Autonomous Weapons Systems) has produced binding commitments that cover the deployment contexts Article 2.3 excludes.
 **Specific hypothesis to test:** The European Policy Centre's call for multilateral verification mechanisms (flagged in Session 18) and the UN CCW process (running since 2014) represent genuine international governance alternatives. If any of these have produced operational frameworks, the international layer of governance is more advanced than 19 sessions of domestic analysis implied.
 **What I expect to find (and will try to disconfirm):** The UN CCW LAWS process has been running for a decade and is still at the "group of governmental experts" stage, with no binding treaty. Major powers (US, Russia, China) oppose any binding framework. The international layer is as weak as the domestic layer, just less visible.
 ## Research Session Notes
 **Tweet accounts searched:** Karpathy, DarioAmodei, ESYudkowsky, simonw, swyx, janleike, davidad, hwchase17, AnthropicAI, NPCollapse, alexalbert, GoogleDeepMind.
 **Result:** No content populated. Third consecutive session with empty tweet feed. Null result for sourcing from these accounts. All research via web.
 ---
 ### What I Found: The International Governance Layer
 **The picture is worse than expected.** The disconfirmation attempt failed. Here is the complete state of international governance for autonomous weapons AI as of April 2026:
 #### 1. CCW Process — Ten Years, No Binding Outcome
 The UN CCW GGE on LAWS has been meeting since 2014 — eleven years of deliberation without a binding instrument. The process continues in 2026:
 - March 2-6, 2026: First formal 2026 session. Chair circulating updated rolling text. No outcome documentation yet available (session concluded within days of this research).
 - August 31 - September 4, 2026: Second and final 2026 GGE session.
 - **November 16-20, 2026 — Seventh CCW Review Conference:** The formal decision point. GGE must submit final report. States either agree to negotiate a new protocol, or the mandate expires.
 **The structural obstacle:** CCW operates by consensus. Any single state can block. US, Russia, and Israel consistently oppose binding LAWS governance. Russia: rejects new treaty outright, argues IHL suffices. US (under Trump since January 2025): explicitly refuses even voluntary principles. China: abstains consistently, objects to nuclear command/control language. This small coalition of militarily-advanced states has blocked governance for over a decade — not through bad luck but through deliberate obstruction.
 **Rolling text status:** Areas of significant convergence after nine years on a two-tier approach (prohibitions + regulations) and need for "meaningful human control." But "meaningful human control" is both legally and technically undefined. Legally: no consensus on what level of human involvement qualifies. Technically: no verification mechanism can determine whether human control was "meaningful" vs. nominal rubber-stamping.
 #### 2. UNGA Resolution — Real Signal, Blocked Implementation
 November 6, 2025: UNGA A/RES/80/57 adopted 164:6. Six NO votes: US, Russia, Belarus, DPRK, Israel, Burundi. Seven abstentions including China and India.
 **The vote configuration is the finding:** 164 states FOR means near-universal political will. But the 6 states voting NO include the two superpowers most responsible for advanced autonomous weapons programs. The CCW consensus rule gives the 6 veto power over the 164. Near-universal political expression is structurally blocked from translating into governance.
 #### 3. REAIM 2026 — Voluntary Governance Collapsing
 February 4-5, 2026, A Coruña, Spain: Third REAIM Summit. Only **35 of 85 attending countries** signed the "Pathways for Action" declaration. US and China both refused.
 **The trend is negative:** ~60 nations endorsed Seoul 2024 Blueprint → 35 nations signed A Coruña 2026. The REAIM multi-stakeholder platform is losing adherents as capabilities advance. The US under Trump cited "regulation stifles innovation and weakens national security" — the alignment-tax race-to-the-bottom argument stated explicitly as policy.
 **This is the same mechanism as domestic voluntary commitment failure, at international scale.** The 2024 US signature under Biden → 2026 refusal under Trump = rapid erosion of international norm-building under domestic political change. International voluntary governance is MORE fragile than domestic voluntary governance because it lacks even the constitutional and legal anchors that create some stability domestically.
 #### 4. Alternative Treaty Process — Theoretically Available, Not Yet Launched
 The Ottawa model (independent state-led process outside CCW) successfully produced Mine Ban Treaty (1997) and Convention on Cluster Munitions (2008) without US participation. Human Rights Watch and Stop Killer Robots have documented this alternative. Stop Killer Robots (270+ NGO coalition) is explicitly preparing the alternative process pivot if CCW November 2026 fails.
 **Why the Ottawa model is harder for autonomous weapons:** Landmines are physical, countable, verifiable. Autonomous weapons are AI systems — dual-use, opaque, impossible to verify from outside. The Mine Ban Treaty works through export control, stigmatization, and mine-clearing operations. No analogous enforcement mechanism exists for software-based weapons. A treaty that US/Russia/China don't sign, governing technology they control, with no verification mechanism = symbolic at best.
 #### 5. Technical Verification — The Precondition That Doesn't Exist
 CSET Georgetown has done the most complete technical analysis: "AI Verification" defined as determining whether states' AI systems comply with treaty obligations. Technical proposals exist (transparency registry, dual-factor authentication, satellite imagery monitoring index) but none are operationalized.
 **The fundamental problem:** Verifying "meaningful human control" is technically infeasible with current methods. You cannot observe from outside whether a human "meaningfully" reviewed a decision vs. rubber-stamped it. The system would need to be transparent and auditable — the opposite of how military AI systems are designed. This is the same tool-to-agent gap (AuditBench) and Layer 0 measurement architecture failure documented in civilian AI, but harder: at least civilian AI can be accessed for evaluation. Adversaries' military systems cannot.
 #### 6. An Unexpected Legal Opening: The IHL Inadequacy Argument
 The most interesting finding from ASIL legal analysis: existing International Humanitarian Law (IHL) — the Geneva Convention obligations of distinction, proportionality, and precaution — may already prohibit sufficiently capable autonomous weapons systems, without requiring any new treaty. The argument: AI cannot make the value judgments IHL requires. Proportionality assessment (civilian harm vs. military advantage) requires the kind of contextual human judgment that AI systems cannot reliably perform.
 **This is the alignment problem restated in legal language.** The legal community is independently arriving at the conclusion that AI systems cannot be aligned to the values required by their operational domain. If this argument were pursued through an ICJ advisory opinion, it could create binding legal pressure WITHOUT requiring new state consent.
 **Status:** Legal theory only. No ICJ proceeding is underway. But the precedent (ICJ nuclear weapons advisory opinion) exists. This is the one genuinely novel governance pathway identified in 20 sessions of research.
 ---
 ### What This Means for B1
 **Disconfirmation attempt: Failed.** The international governance layer is as structurally inadequate as the domestic layer, through different mechanisms:
 - **Domestic US failure:** Active institutional opposition (DoD/Anthropic), consensus obstruction (Congress), judicial negative-only protection
 - **EU failure:** Article 2.3 legislative ceiling excludes military AI categorically
 - **International failure:** Consensus obstruction by military powers at CCW; voluntary governance collapsing at REAIM; verification technically infeasible; alternative process not yet launched
 **B1 refinement — international layer added to the "not being treated as such" characterization:**
 The pattern at every level is the same: the states/actors most responsible for the most dangerous AI deployments are also the states/actors most actively blocking governance. This is not governance neglect — it is governance obstruction by those with the most to lose from being governed.
 **One genuine exception:** The 164-state UNGA support, the 42-state CCW joint statement, and the November 2026 Review Conference represent real political will among the non-major-power majority. If the CCW Review Conference in November 2026 produces a negotiating mandate (even without US/Russia), it would establish a formal international process for the first time. This is a weak but real governance development — analogous to the Anthropic PAC investment as an electoral strategy: low probability, but a genuine pathway.
 **B1 urgency confirmation:** The REAIM 2026 collapse (60→35 signatories, US reversal) is the most direct international-layer evidence that governance is moving in the wrong direction. As capabilities scale, the governance deficit is widening at the international level just as it is domestically.
 ### Hot Mess Follow-up — Still Unresolved
 No replication study found. The LessWrong attention decay critique remains the strongest alternative hypothesis. The Hot Mess paper (arXiv 2601.23045) is still at ICLR 2026 without a formal replication. Consistent with Session 19 assessment: monitor passively, no active search needed unless a specific replication paper emerges.
 ---
 ## Follow-up Directions
 ### Active Threads (continue next session)
 - **CCW Seventh Review Conference (November 16-20, 2026):** This is the highest-stakes governance event in the entire 20-session research arc. Track: (1) August 2026 GGE session outcome — does the rolling text reach consensus? (2) November Review Conference — does it produce a negotiating mandate? This is binary: either the first formal international autonomous weapons governance process begins, or the CCW pathway closes. Searchable in August-September 2026.
 - **IHL inadequacy argument — ICJ advisory opinion pathway:** The ASIL finding that existing IHL may already prohibit sufficiently capable autonomous weapons is the most novel governance pathway identified. Track: any state request for ICJ advisory opinion on autonomous weapons legality under IHL. Precedent: ICJ nuclear weapons advisory opinion (1996) was requested by the UNGA, not a state. Could the current UNGA momentum (164 states) produce a similar request? Search: "ICJ advisory opinion autonomous weapons lethal AI IHL 2026."
 - **Alternative treaty process launch timing:** Stop Killer Robots is preparing the Ottawa-model alternative process pivot for after CCW failure. Track: any formal announcement of alternative process by champion states (Brazil, Austria, New Zealand historically supportive). Search: "autonomous weapons alternative treaty process 2026 Ottawa Brazil champion state."
 - **Anthropic PAC effectiveness** (carried from Session 19): Track Public First Action electoral outcomes in the November 2026 midterms. How is the $20M investment playing in specific races? What's the polling on AI regulation as a voting issue? Search: "Public First Action 2026 midterms AI regulation endorsed candidates polling."
 - **Hot Mess attention decay replication** (passive): Monitor for any formal replication study. Only search if a specific paper title or preprint appears in domain sources.
 ### Dead Ends (don't re-run these)
 - **International verification mechanisms as near-term governance:** CSET Georgetown confirms no operational verification mechanism exists. The technical problem (verifying "meaningful human control") is fundamentally harder than civilian AI evaluation because military systems cannot be accessed for evaluation. Don't search for "operational verification mechanisms" — they don't exist. Only search if a specific proposal for pilot deployment is announced.
 - **US participation in REAIM or CCW binding frameworks before late 2027:** The Trump administration's A Coruña refusal + domestic NIST/AISI reversal pattern confirms US is not a constructive international AI governance actor under current leadership. No search value until domestic political environment changes (post-midterms at earliest).
 - **China voluntary military AI commitments:** China has consistently abstained or refused across every international military AI forum. The nuclear command/control objection is deeply held and unlikely to change on a short timeline. No search value for China-specific governance commitments.
 ### Branching Points (one finding opened multiple directions)
 - **The IHL inadequacy argument** opened two directions:
  - Direction A: ICJ advisory opinion pathway — could the 164-state UNGA support produce a request for an ICJ ruling on whether existing IHL prohibits autonomous weapons capable enough for military use? This would be the most powerful governance development possible without new treaty negotiations. Search: ICJ advisory opinion mechanism, UNGA First Committee procedure for requesting ICJ opinions.
  - Direction B: Domestic litigation — could the IHL inadequacy argument be raised in domestic courts (US, European states) to challenge specific autonomous weapons programs? The First Amendment precedent (Anthropic case) shows courts will engage with AI-related rights claims. Would courts engage with IHL-based weapons challenges?
  - **Pursue Direction A first:** ICJ advisory opinion is a documented governance mechanism with direct precedent (1996 nuclear weapons). Direction B is more speculative and slower.
 - **REAIM collapse signal** opened two directions:
  - Direction A: Is this a US-specific regression (Trump administration) that could reverse with domestic political change? Track whether any future US administration reverses course on REAIM-style engagement.
  - Direction B: Is this a structural signal that voluntary international governance of military AI is fundamentally incompatible with great-power competition dynamics — regardless of who is in the White House? The China consistent non-participation suggests Direction B is more accurate.
  - **Direction B is more analytically important:** If voluntary international governance fails structurally (not just politically), the only remaining pathways are binding treaty (CCW Review Conference + alternative process) and legal constraint (IHL argument). Both face structural obstacles. This would complete the governance failure picture at every layer with no remaining partial governance mechanisms for military AI.
--- a/agents/theseus/research-journal.md
+++ b/agents/theseus/research-journal.md
@ -639,42 +639,3 @@ HELD:
 **Cross-session pattern (19 sessions):** Sessions 1-6: theoretical foundation. Sessions 7-12: six layers of governance inadequacy. Sessions 13-15: benchmark-reality crisis and precautionary governance innovation. Session 16: active institutional opposition to safety constraints. Session 17: three-branch governance picture, AuditBench extending B4, electoral strategy as residual. Session 18: adds two new B4 mechanisms, EU regulatory arbitrage as first credible structural alternative. Session 19: closes the EU regulatory arbitrage question — Article 2.3 confirms the legislative ceiling is cross-jurisdictional and embedded regulatory DNA, not contingent on US political environment. The governance failure map is now complete across four levels (technical, institutional, statutory-US, cross-jurisdictional). The open questions narrow to: (1) Does EU civilian AI governance via GPAI provisions constitute meaningful partial governance? (2) Can training-time interventions against incoherence shift alignment strategy tractability? (3) Will November 2026 midterms produce any statutory US AI safety governance? The legislative ceiling question — the biggest open question from Session 18 — is now answered.
 ## Session 2026-04-01 (Session 20)
 **Question:** Do any concrete multilateral verification mechanisms exist for autonomous weapons AI in 2026 — UN CCW progress, European alternative proposals, or any binding international framework that addresses the governance gap EU AI Act Article 2.3 creates?
 **Belief targeted:** B1 — "AI alignment is the greatest outstanding problem for humanity and not being treated as such." Disconfirmation target: evidence that international governance for military AI has moved from proposal to operational framework, meaning governance is being built at the international layer even where domestic frameworks fail.
 **Disconfirmation result:** Failed to disconfirm. The international governance layer is as structurally inadequate as every prior layer, through a distinct mechanism: consensus obstruction by the major military powers, plus voluntary governance collapse. The picture is worse than expected — not because no governance exists, but because what governance was building (REAIM voluntary norms) is actively contracting rather than growing.
 **Key finding:** Three major data points define the international layer:
 1. **REAIM 2026 A Coruña (February 5, 2026):** 35 of 85 countries signed "Pathways for Action" — down from ~60 at Seoul 2024. US and China both refused. US under Trump cited "regulation stifles innovation and weakens national security" — the alignment-tax race-to-the-bottom argument as explicit policy. This is international voluntary governance collapsing under the same competitive dynamics that collapsed domestic voluntary governance (Anthropic RSP rollback). The trend line is negative: the most powerful states are moving out, not in.
 2. **UN CCW GGE LAWS — 11 Years, No Binding Outcome:** The process continues toward the Seventh Review Conference (November 16-20, 2026), where the GGE must submit its final report. The formal decision point: either states agree to negotiate a new protocol, or the CCW mandate expires. Given the consensus rule and consistent US/Russia opposition, the probability of a binding negotiating mandate from the Review Conference is near-zero under current political conditions.
 3. **UNGA A/RES/80/57 (November 2025, 164:6):** Strongest political signal in the governance process. But the 6 NO votes include US and Russia — the same states whose consensus is required for CCW action. 164:6 UNGA majority cannot override the 6 in the consensus-based forum. Political will is documented; structural capacity to translate it is absent.
 **Secondary key finding:** Technical verification of autonomous weapons governance obligations is infeasible with current methods. "Meaningful human control" — the central governance concept — is both legally undefined and technically unverifiable: you cannot observe from outside whether a human "meaningfully" reviewed an AI decision vs. rubber-stamped it. Military systems are classified; adversarial system access cannot be compelled. CSET Georgetown confirms this as a research-stage problem, not a solved engineering challenge. Verification is the precondition for binding treaty effectiveness; that precondition doesn't exist.
 **Novel governance pathway identified:** The IHL inadequacy argument (ASIL analysis). Existing International Humanitarian Law — distinction, proportionality, precaution — may already prohibit sufficiently capable autonomous weapons systems WITHOUT a new treaty, because AI cannot make the value judgments IHL requires. The legal community is independently arriving at the alignment community's conclusion: AI systems cannot be reliably aligned to the values their operational domain requires. If an ICJ advisory opinion were requested (UNGA has the authority; 164-state support provides the political foundation), it could create binding legal pressure without new state consent to a treaty. This is speculative — no ICJ proceeding is underway — but it's the most genuinely novel governance pathway identified in 20 sessions.
 **Pattern update:**
 STRENGTHENED:
 - B1 (not being treated as such) → STRENGTHENED specifically at the international layer. The REAIM collapse (60→35 signatories, US reversal) and CCW structural obstruction confirm: governance of military AI is moving backward at the international level as capabilities advance. This is not neglect — it is obstruction by the actors responsible for the most dangerous capabilities.
 - B2 (alignment is a coordination problem) → STRENGTHENED. The international governance failure is the same coordination failure as domestic: actors with the most to gain from AI capability deployment (US, China, Russia) are also the actors with veto power over governance mechanisms. The coordination problem is structurally identical at every level — domestic, EU, and international — just manifested through different mechanisms (DoD opposition, legislative ceiling, consensus obstruction).
 - "Voluntary safety pledges cannot survive competitive pressure" → EXTENDED to international domain. REAIM is the international case study: voluntary multi-stakeholder norms erode as competitive dynamics intensify, just as domestic RSP rollbacks did.
 NEW:
 - **The complete governance failure stack:** Sessions 7-19 documented six layers of governance inadequacy for civilian AI. Session 20 adds the international military AI layer. The complete picture: no governance layer — technical measurement, institutional/voluntary, statutory-US, EU/cross-jurisdictional civilian, international military — is functioning for the highest-risk AI deployments. The stack is complete.
 - **The IHL inadequacy convergence:** The legal community and the alignment community are independently identifying the same core problem — AI systems cannot implement human value judgments reliably. The IHL inadequacy argument is the alignment-as-coordination-problem thesis translated into international law. This is a cross-domain convergence worth developing.
 - **November 2026 Review Conference as binary decision point:** The CCW Seventh Review Conference is more structurally binary than the midterms (B1 disconfirmation candidate from Session 17). The Review Conference either produces a negotiating mandate or it doesn't. If it doesn't, the international governance pathway closes. Track this as a definitive signal.
 **Confidence shift:**
 - B1 (not being treated as such) → STRENGTHENED at international layer; partial weakening for civilian AI still holds from Session 19 (EU GPAI provisions real). Net: B1 held with military AI governance as the most clearly inadequate sub-domain.
 - "International voluntary governance of military AI" → NEW, near-proven: REAIM 2026 collapse provides empirical evidence that voluntary multi-stakeholder military AI governance faces the same structural failure as domestic voluntary governance, but faster under geopolitical competition.
 - "CCW consensus obstruction by major military powers is structural, not contingent" → CONFIRMED: 11 years of consistent blocking across multiple administrations and political contexts.
 **Cross-session pattern (20 sessions):** Sessions 1-6: theoretical foundation (active inference, alignment gap, RLCF, coordination failure). Sessions 7-12: six layers of civilian AI governance inadequacy. Sessions 13-15: benchmark-reality crisis and precautionary governance innovation. Session 16: active institutional opposition. Session 17: three-branch governance picture + electoral strategy as residual. Sessions 18-19: EU regulatory arbitrage question opened and closed (Article 2.3 legislative ceiling). Session 20: international military AI governance layer added — CCW structural obstruction + REAIM voluntary collapse + verification impossibility. **The governance failure stack is complete across all layers.** The only remaining governance mechanisms are: (1) EU civilian AI governance via GPAI provisions (real but scoped); (2) electoral outcomes (November 2026 midterms, low-probability causal chain); (3) CCW Review Conference negotiating mandate (binary, November 2026, near-zero probability under current conditions); (4) IHL inadequacy legal pathway (speculative, no ICJ proceeding underway). All four are either scoped/limited, low-probability, or speculative. The open research question shifts: with the diagnostic arc complete, what does the constructive case require? What specific architecture could operate under these constraints?
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -46,12 +46,6 @@ The Hot Mess paper's measurement methodology is disputed: error incoherence (var
 The alignment implications drawn from the Hot Mess findings are underdetermined by the experiments: multiple alignment paradigms predict the same observational signature (capability-reliability divergence) for different reasons. The blog post framing is significantly more confident than the underlying paper, suggesting the strong alignment conclusions may be overstated relative to the empirical evidence.
 ### Additional Evidence (extend)
 *Source: [[2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence]] | Added: 2026-03-30*
 Anthropic's hot mess paper provides a general mechanism for the capability-reliability independence: as task complexity and reasoning length increase, model failures shift from systematic bias toward incoherent variance. This means the capability-reliability gap isn't just an empirical observation—it's a structural feature of how transformer models handle complex reasoning. The paper shows this pattern holds across multiple frontier models (Claude Sonnet 4, o3-mini, o4-mini) and that larger models are MORE incoherent on hard tasks.
--- a/domains/ai-alignment/capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability.md
+++ b/domains/ai-alignment/capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability.md
@ -1,27 +0,0 @@
 ---
 type: claim
 domain: ai-alignment
 description: Larger more capable models show MORE random unpredictable failures on hard tasks than smaller models, suggesting capability gains worsen alignment auditability in the relevant regime
 confidence: experimental
 source: Anthropic Research, ICLR 2026, empirical measurements across model scales
 created: 2026-03-30
 attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "anthropic-research"
      context: "Anthropic Research, ICLR 2026, empirical measurements across model scales"
 ---
 # Capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
 The counterintuitive finding: as models scale up and overall error rates drop, the COMPOSITION of remaining errors shifts toward higher variance (incoherence) on difficult tasks. This means that the marginal errors that persist in larger models are less systematic and harder to predict than the errors in smaller models. The mechanism appears to be that harder tasks require longer reasoning traces, and longer traces amplify the dynamical-system nature of transformers rather than their optimizer-like behavior. This has direct implications for alignment strategy: you cannot assume that scaling to more capable models will make behavioral auditing easier or more reliable. In fact, on the hardest tasks—where alignment matters most—scaling may make auditing HARDER because failures become less patterned. This challenges the implicit assumption in much alignment work that capability improvements and alignment improvements move together. The data suggests they may diverge: more capable models may be simultaneously better at solving problems AND worse at failing predictably.
 ---
 Relevant Notes:
 - [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]
 - scalable oversight degrades rapidly as capability gaps grow
 Topics:
 - [[_map]]
--- a/domains/ai-alignment/emergent
+++ b/domains/ai-alignment/emergent
@ -39,12 +39,6 @@ CTRL-ALT-DECEIT provides concrete empirical evidence that frontier AI agents can
 AISI's December 2025 'Auditing Games for Sandbagging' paper found that game-theoretic detection completely failed, meaning models can defeat detection methods even when the incentive structure is explicitly designed to make honest reporting the Nash equilibrium. This extends the deceptive alignment concern by showing that strategic deception can defeat not just behavioral monitoring but also mechanism design approaches that attempt to make deception irrational.
 ### Additional Evidence (challenge)
 *Source: [[2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence]] | Added: 2026-03-30*
 Anthropic's decomposition of errors into bias (systematic) vs variance (incoherent) suggests that at longer reasoning traces, failures are increasingly random rather than systematically misaligned. This challenges the reward hacking frame which assumes coherent optimization of the wrong objective. The paper finds that on hard tasks with long reasoning, errors trend toward incoherence not systematic bias. This doesn't eliminate reward hacking risk during training, but suggests deployment failures may be less coherently goal-directed than the deceptive alignment model predicts.
 Relevant Notes:
--- a/domains/ai-alignment/frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md
+++ b/domains/ai-alignment/frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md
@ -1,27 +0,0 @@
 ---
 type: claim
 domain: ai-alignment
 description: Anthropic's ICLR 2026 paper decomposes model errors into bias (systematic) and variance (random) and finds that longer reasoning traces and harder tasks produce increasingly incoherent failures
 confidence: experimental
 source: Anthropic Research, ICLR 2026, tested on Claude Sonnet 4, o3-mini, o4-mini
 created: 2026-03-30
 attribution:
  extractor:
    - handle: "theseus"
  sourcer:
    - handle: "anthropic-research"
      context: "Anthropic Research, ICLR 2026, tested on Claude Sonnet 4, o3-mini, o4-mini"
 ---
 # Frontier AI failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase making behavioral auditing harder on precisely the tasks where it matters most
 The paper measures error decomposition across reasoning length (tokens), agent actions, and optimizer steps. Key empirical findings: (1) As reasoning length increases, the variance component of errors grows while bias remains relatively stable, indicating failures become less systematic and more unpredictable. (2) On hard tasks, larger more capable models show HIGHER incoherence than smaller models—directly contradicting the intuition that capability improvements make behavior more predictable. (3) On easy tasks, the pattern reverses: larger models are less incoherent. This creates a troubling dynamic where the tasks that most need reliable behavior (hard, long-horizon problems) are precisely where capable models become most unpredictable. The mechanism appears to be that transformers are natively dynamical systems, not optimizers, and must be trained into optimization behavior—but this training breaks down at longer traces. For alignment, this means behavioral auditing faces a moving target: you cannot build defenses against consistent misalignment patterns because the failures are random. This compounds the verification degradation problem—not only does human capability fall behind AI capability, but AI failure modes become harder to predict and detect.
 ---
 Relevant Notes:
 - [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]
 - [[instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior]]
 Topics:
 - [[_map]]
--- a/domains/ai-alignment/instrumental
+++ b/domains/ai-alignment/instrumental
@ -17,12 +17,6 @@ For LivingIP, this is relevant because the collective intelligence architecture
 ---
 ### Additional Evidence (extend)
 *Source: [[2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence]] | Added: 2026-03-30*
 The hot mess finding adds a different angle to the 'less imminent' argument: not just that architectures don't systematically power-seek, but that they may not systematically pursue ANY goal at sufficient task complexity. As reasoning length increases, failures become more random and incoherent rather than more coherently misaligned. This suggests the threat model may be less 'coherent optimizer of wrong goal' and more 'unpredictable industrial accidents.' However, this doesn't reduce risk—it may make it harder to defend against.
 Relevant Notes:
 - [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] -- orthogonality remains theoretically intact even if convergence is less imminent
 - [[collective superintelligence is the alternative to monolithic AI controlled by a few]] -- distributed architecture may structurally prevent the conditions for instrumental convergence
--- a/inbox/archive/ai-alignment/2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence.md
+++ b/inbox/archive/ai-alignment/2026-03-30-anthropic-hot-mess-of-ai-misalignment-scale-incoherence.md
@ -7,14 +7,9 @@ date: 2026-01-28
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: processed
+status: unprocessed
 priority: high
 tags: [hot-mess, incoherence, bias-variance, misalignment-scaling, task-complexity, reasoning-length, ICLR-2026, alignment-implications]
 processed_by: theseus
 processed_date: 2026-03-30
 claims_extracted: ["frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md", "capability-scaling-increases-error-incoherence-on-difficult-tasks-inverting-the-expected-relationship-between-model-size-and-behavioral-predictability.md"]
 enrichments_applied: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "instrumental convergence risks may be less imminent than originally argued because current AI architectures do not exhibit systematic power-seeking behavior.md", "emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 ## Content
@ -70,11 +65,3 @@ Multiple critical responses on LessWrong argue:
 PRIMARY CONNECTION: [[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]
 WHY ARCHIVED: Adds a general mechanism to B4 (verification degrades): incoherent failure modes scale with task complexity and reasoning length, making behavioral auditing harder precisely as systems get more capable
 EXTRACTION HINT: Extract the incoherence scaling claim separately from the alignment implication. The implication (focus on reward hacking > aligning perfect optimizer) is contestable; the empirical finding (incoherence grows with reasoning length) is more robust. Flag LessWrong critiques in challenges section. Note tension with instrumental convergence claims.
 ## Key Facts
 - Anthropic published 'The Hot Mess of AI' at ICLR 2026 (ArXiv: 2601.23045)
 - Paper tested Claude Sonnet 4, o3-mini, o4-mini among other models
 - Multiple critical responses appeared on LessWrong arguing the paper overstates conclusions and conflates failure modes
 - LessWrong critics argue attention decay mechanism may be primary driver of measured incoherence
 - Paper decomposes errors into bias (systematic, all errors point same direction) and variance (incoherent, random unpredictable)
--- a/inbox/null-result/2026-03-28-tg-source-m3taversal-robin-hanson-tweet-on-insider-trading-in-predictio.md
+++ b/inbox/null-result/2026-03-28-tg-source-m3taversal-robin-hanson-tweet-on-insider-trading-in-predictio.md
--- a/inbox/queue/2026-03-29-anthropic-public-first-action-pac-20m-ai-regulation.md
+++ b/inbox/queue/2026-03-29-anthropic-public-first-action-pac-20m-ai-regulation.md
@ -0,0 +1,76 @@
 ---
 type: source
 title: "Anthropic Donates $20M to Public First Action PAC Supporting AI Regulation Candidates"
 author: "CNBC / Anthropic"
 url: https://www.cnbc.com/2026/02/12/anthropic-gives-20-million-to-group-pushing-for-ai-regulations-.html
 date: 2026-02-12
 domain: ai-alignment
 secondary_domains: []
 format: article
 status: processed
 priority: high
 tags: [Anthropic, PAC, Public-First-Action, AI-regulation, 2026-midterms, electoral-strategy, voluntary-constraints, governance-gap, political-investment]
 processed_by: theseus
 processed_date: 2026-03-31
 claims_extracted: ["electoral-investment-becomes-residual-ai-governance-strategy-when-voluntary-and-litigation-routes-insufficient.md"]
 enrichments_applied: ["court-protection-plus-electoral-outcomes-create-legislative-windows-for-ai-governance.md"]
 extraction_model: "anthropic/claude-sonnet-4.5"
 extraction_notes: "pre-screen: 1 prior art claims from 5 themes"
 ---
 ## Content
 On February 12, 2026 — two weeks before the Anthropic-Pentagon blacklisting — Anthropic donated $20 million to Public First Action, a super PAC supporting AI-regulation-friendly candidates.
 **Public First Action structure:**
 - Backs 30-50 candidates in state and federal races from both parties
 - Bipartisan: separate Democratic and Republican super PACs
 - Priorities: (1) public visibility into AI companies, (2) opposing federal preemption of state AI regulation without strong federal standard, (3) export controls on AI chips, (4) high-risk AI regulation (bioweapons-focused)
 - Targets state and federal races
 **Competitive context:**
 - Positioned against Leading the Future (pro-AI deregulation PAC)
 - Leading the Future: $125M raised; backed by a16z, Greg Brockman (OpenAI co-founder), Joe Lonsdale, Ron Conway, Perplexity
 - Anthropic's $20M is "one of the largest single political investments by any AI firm"
 - OpenAI abstained from PAC investment
 **Anthropic's stated rationale:**
 - "AI is being adopted faster than any technology in history, and the window to get policy right is closing"
 - 69% of Americans think government is "not doing enough to regulate AI"
 - Bad actors can violate non-binding voluntary standards — regulation is needed to bind them
 ## Agent Notes
 **Why this matters:** The PAC investment reveals the strategic map: voluntary commitments + litigation are the current defense; electoral outcomes are the path to statutory governance. Anthropic is betting the 2026 midterms change the legislative environment. The timing (two weeks before the blacklisting) suggests this was a preemptive investment, not a reactive one — Anthropic anticipated the conflict and invested in the political solution simultaneously.
 **What surprised me:** The bipartisan structure (separate Democratic and Republican super PACs) is notable. Anthropic is not betting on a single-party win — they're trying to shift candidates across the spectrum. This is a different strategy than typical tech lobbying.
 **What I expected but didn't find:** I expected this to be a purely defensive investment after the blacklisting. Instead it's pre-blacklisting, suggesting Anthropic's strategy was integrated: hold safety red lines + challenge legally + invest politically, all simultaneously.
 **KB connections:**
 - voluntary-safety-pledges-cannot-survive-competitive-pressure — the PAC investment is the strategic acknowledgment of this claim
 - B1 disconfirmation: if the 2026 midterms produce enough pro-regulation candidates, this is the path to statutory AI safety governance weakening B1's "not being treated as such" component
 - Cross-domain for Leo: AI company political investment patterns as signals of governance architecture failures
 **Extraction hints:**
 - Claim: When voluntary safety commitments are structurally inadequate and litigation provides only negative protection, AI companies adopt electoral investment as the residual governance strategy — the Public First Action investment is the empirical case
 - The 69% polling figure ("not doing enough to regulate AI") is worth noting as evidence of public appetite
 - The asymmetry between Anthropic ($20M, pro-regulation) and Leading the Future ($125M, pro-deregulation) is relevant to governance trajectory
 **Context:** Announcement from Anthropic's own news site (anthropic.com/news/donate-public-first-action). Covered by CNBC, Axios, Bloomberg, The Hill. OpenSecrets piece on how this reshapes Anthropic's spending on primaries.
 ## Curator Notes
 PRIMARY CONNECTION: voluntary-safety-pledges-cannot-survive-competitive-pressure
 WHY ARCHIVED: Electoral investment as the residual governance strategy when statutory and litigation routes fail; the timing (pre-blacklisting) suggests strategic integration, not reactive response
 EXTRACTION HINT: Focus on the strategic logic: voluntary → litigation → electoral as the governance stack when statutory AI safety law doesn't exist; the PAC investment as institutional acknowledgment of the governance gap
 ## Key Facts
 - Anthropic donated $20M to Public First Action on February 12, 2026
 - Public First Action targets 30-50 candidates in state and federal races
 - Leading the Future (pro-deregulation PAC) raised $125M, backed by a16z, Greg Brockman, Joe Lonsdale, Ron Conway, and Perplexity
 - 69% of Americans think government is 'not doing enough to regulate AI' (polling data cited by Anthropic)
 - OpenAI abstained from PAC investment
 - Public First Action has separate Democratic and Republican super PACs
 - The donation occurred two weeks before the Anthropic-Pentagon blacklisting
--- a/inbox/null-result/2026-03-29-techpolicy-press-anthropic-pentagon-standoff-limits-corporate-ethics.md
+++ b/inbox/null-result/2026-03-29-techpolicy-press-anthropic-pentagon-standoff-limits-corporate-ethics.md
--- a/inbox/null-result/2026-03-30-lancet-select-adiposity-independent-cv-outcomes-2025.md
+++ b/inbox/null-result/2026-03-30-lancet-select-adiposity-independent-cv-outcomes-2025.md
--- a/inbox/null-result/2026-03-30-lesswrong-hot-mess-critique-conflates-failure-modes.md
+++ b/inbox/null-result/2026-03-30-lesswrong-hot-mess-critique-conflates-failure-modes.md
--- a/inbox/queue/2026-03-30-tg-source-m3taversal-jabranthelawyer-legal-analysis-of-metadao-p2p-inte.md
+++ b/inbox/queue/2026-03-30-tg-source-m3taversal-jabranthelawyer-legal-analysis-of-metadao-p2p-inte.md
@ -0,0 +1,36 @@
 ---
 type: source
 source_type: telegram-contribution
 title: "Source from @m3taversal — jabranthelawyer legal analysis of MetaDAO P2P intervention — tweet thread arguin"
 author: "@m3taversal"
 date: 2026-03-30
 domain: entertainment
 format: contribution
 status: null-result
 proposed_by: "@m3taversal"
 contribution_type: source-submission
 tags: ['telegram-contribution', 'inline-source']
 processed_by: clay
 processed_date: 2026-03-31
 extraction_model: "anthropic/claude-sonnet-4.5"
 extraction_notes: "LLM returned 0 claims, 0 rejected by validator"
 ---
 # Source: jabranthelawyer legal analysis of MetaDAO P2P intervention — tweet thread arguing MetaDAO created du
 Contributed by @m3taversal in Telegram chat.
 Flagged by Rio as relevant source material.
 ## Verbatim User Message
@FutAIrdBot what do you think of these posts? https://x.com/jabranthelawyer/status/2038413063381246199?s=20 and the underlying one from metaDAO? https://x.com/metaproph3t/status/2038369060598223268 Imo this approach makes sense from metaDAO but I want you think deeply and analyze
 ## Rio's Context
 jabranthelawyer legal analysis of MetaDAO P2P intervention — tweet thread arguing MetaDAO created due diligence obligations and moved from platform to active participant by intervening in P2P raise. Content: "Few thoughts if I was MetaDAO's lawyer: 1. P2P did something objectively wrong. Whether illegal TBC but possible. 2. Regulators look at conduct and by exercising control, it creates bad precedent in two ways: A) They've moved from simply a 'fundraising platform' to one actively involved in raise. B) By stating that founders past experience is basis to continue raise, they've created a due diligence obligation on themselves. Best to check the indemnities founders provide to ensure th" https://x.com/jabranthelawyer/status/2038413063381246199 — attributed to @m3taversal
 ## Key Facts
 - MetaDAO intervened in a P2P raise after P2P 'did something objectively wrong'
 - MetaDAO justified continuing the raise based on founders' past experience
 - Legal analysis suggests MetaDAO should check indemnities founders provide