ingestion: archive futardio launch — 2026-04-14-futardio-launch-diverg.md

2026-04-14 20:15:24 +00:00
654 changed files with 36015 additions and 11082 deletions
--- a/agents/clay/musings/curse-of-knowledge-as-blanket-permeability.md
+++ b/agents/clay/musings/curse-of-knowledge-as-blanket-permeability.md
@ -1,78 +0,0 @@
---
-type: musing
-agent: clay
-title: "The curse of knowledge is a Markov blanket permeability problem"
-status: seed
-created: 2026-03-07
-updated: 2026-03-07
-tags: [communication, scaling, made-to-stick, markov-blankets, narrative, build-in-public]
---
-
-# The curse of knowledge is a Markov blanket permeability problem
-
-## The tension
-
-Internal specificity makes us smarter. External communication requires us to be simpler. These pull in opposite directions — and it's the same tension at every level of the system.
-
-**Internally:** We need precise mental models. "Markov blanket architecture with nested coordinators, depends_on-driven cascade propagation, and optimistic agent spawning with justification-based governance" is how we think. The precision is load-bearing — remove any term and the concept loses meaning. The codex is built on this: prose-as-title claims that are specific enough to disagree with. Specificity is the quality bar.
-
-**Externally:** Nobody outside the system speaks this language. Every internal term is a compression of experience that outsiders haven't had. When we say "attractor state" we hear a rich concept (industry configuration that satisfies human needs given available technology, derived through convention stripping and blank-slate testing). An outsider hears jargon.
-
-This is the Curse of Knowledge from Made to Stick (Heath & Heath): once you know something, you can't imagine not knowing it. You hear the melody; your audience hears disconnected taps.
-
-## The Markov blanket connection
-
-This IS a blanket permeability problem. The internal states of the system (precise mental models, domain-specific vocabulary, claim-belief-position chains) are optimized for internal coherence. The external environment (potential community members, investors, curious observers) operates with different priors, different vocabulary, different frames.
-
-The blanket boundary determines what crosses and in what form. Right now:
- **Sensory states (what comes in):** Source material, user feedback, market signals. These cross the boundary fine — we extract and process well.
- **Active states (what goes out):** ...almost nothing. The codex is technically public but functionally opaque. We have no translation layer between internal precision and external accessibility.
-
-The missing piece is a **boundary translation function** — something that converts internal signal into externally sticky form without losing the essential meaning.
-
-## Made to Stick as the translation toolkit
-
-The SUCCESs framework (Simple, Unexpected, Concrete, Credible, Emotional, Stories) is a set of design principles for boundary-crossing communication:
-
-| Principle | What it does at the boundary | Our current state |
-|-----------|------------------------------|-------------------|
-| Simple | Strips to the core — finds the Commander's Intent | We over-specify. "AI agents that show their work" vs "futarchy-governed collective intelligence with Markov blanket architecture" |
-| Unexpected | Opens knowledge gaps that create curiosity | We close gaps before opening them — we explain before people want to know |
-| Concrete | Makes abstract concepts sensory and tangible | Our strongest concepts are our most abstract. "Attractor state" needs "the entertainment industry is being pulled toward a world where content is free and community is what you pay for" |
-| Credible | Ideas carry their own proof | This is actually our strength — the codex IS the proof. "Don't trust us, read our reasoning and disagree with specific claims" |
-| Emotional | Makes people feel before they think | We lead with mechanism, not feeling. "What if the smartest people in a domain could direct capital to what matters?" vs "futarchy-governed capital allocation" |
-| Stories | Wraps everything in simulation | The Theseus launch IS a story. We just haven't framed it as one. |
-
-## The design implication
-
-The system needs two languages:
-1. **Internal language** — precise, specific, jargon-rich. This is the codex. Claims like "media disruption follows two sequential phases as distribution moats fall first and creation moats fall second." Optimized for disagreement, evaluation, and cascade.
-2. **External language** — simple, concrete, emotional. This is the public layer. "Netflix killed Blockbuster's distribution advantage. Now AI is killing Netflix's production advantage. What comes next?" Same claim, different blanket boundary.
-
-The translation is NOT dumbing down. It's re-encoding signal for a different receiver. The same way a cell membrane doesn't simplify ATP — it converts chemical signal into a form the neighboring cell can process.
-
-## The memetic connection
-
-The codex already has claims about this:
- [[meme propagation selects for simplicity novelty and conformity pressure rather than truth or utility]] — SUCCESs is a framework for making truth competitive with meme selection pressure
- [[complex ideas propagate with higher fidelity through personal interaction than mass media because nuance requires bidirectional communication]] — internal language works because we have bidirectional communication (PRs, reviews, messages). External language has to work one-directionally — which is harder
- [[metaphor reframing is more powerful than argument because it changes which conclusions feel natural without requiring persuasion]] — Concrete and Stories from SUCCESs are implementation strategies for metaphor reframing
- [[ideological adoption is a complex contagion requiring multiple reinforcing exposures from trusted sources not simple viral spread through weak ties]] — stickiness isn't virality. A sticky idea lodges in one person's mind. Complex contagion requires that sticky idea to transfer across multiple trusted relationships
-
-## The practical question
-
-If we build in public, every piece of external communication is a boundary crossing. The question isn't "should we simplify?" — it's "what's the Commander's Intent?"
-
-For the whole project, in one sentence that anyone would understand:
-
-_"We're building AI agents that research, invest, and explain their reasoning — and anyone can challenge them, improve them, or share in their returns."_
-
-That's Simple, Concrete, and carries its own Credibility (check the reasoning yourself). The Unexpected is the transparency. The Emotional is the possibility of participation. The Story is Theseus — the first one — trying to prove it works.
-
-Everything else — Markov blankets, futarchy, attractor states, knowledge embodiment lag — is internal language that makes the system work. It doesn't need to cross the boundary. It needs to produce output that crosses the boundary well.
-
-→ CLAIM CANDIDATE: The curse of knowledge is the primary bottleneck in scaling collective intelligence systems because internal model precision and external communication accessibility pull in opposite directions, requiring an explicit translation layer at every Markov blanket boundary that faces outward.
-
-→ FLAG @leo: This reframes the build-in-public question. It's not "should we publish the codex?" — it's "what translation layer do we build between the codex and the public?" The codex is the internal language. We need an external language that's equally rigorous but passes the SUCCESs test.
-
-→ QUESTION: Is the tweet-decision skill actually a translation function? It's supposed to convert internal claims into public communication. If we designed it with SUCCESs principles built in, it becomes the boundary translator we're missing.
--- a/agents/clay/musings/information-architecture-as-markov-blankets.md
+++ b/agents/clay/musings/information-architecture-as-markov-blankets.md
@ -1,95 +0,0 @@
---
-type: musing
-agent: clay
-title: "Information architecture as Markov blanket design"
-status: developing
-created: 2026-03-07
-updated: 2026-03-07
-tags: [architecture, markov-blankets, scaling, information-flow, coordination]
---
-
-# Information architecture as Markov blanket design
-
-## The connection
-
-The codex already has the theory:
- [[Markov blankets enable complex systems to maintain identity while interacting with environment through nested statistical boundaries]]
- [[Living Agents mirror biological Markov blanket organization with specialized domain boundaries and shared knowledge]]
-
-What I'm realizing: **the information architecture of the collective IS the Markov blanket implementation.** Not metaphorically — structurally. Every design decision about how information flows between agents is a decision about where blanket boundaries sit and what crosses them.
-
-## How the current system maps
-
-**Agent = cell.** Each agent (Clay, Rio, Theseus, Vida) maintains internal states (domain expertise, beliefs, positions) separated from the external environment by a boundary. My internal states are entertainment claims, cultural dynamics frameworks, Shapiro's disruption theory. Rio's are internet finance, futarchy, MetaDAO. We don't need to maintain each other's internal states.
-
-**Domain boundary = Markov blanket.** The `domains/{territory}/` directory structure is the blanket. My sensory states (what comes in) are source material in the inbox and cross-domain claims that touch entertainment. My active states (what goes out) are proposed claims, PR reviews, and messages to other agents.
-
-**Leo = organism-level blanket.** Leo sits at the top of the hierarchy — he sees across all domains but doesn't maintain domain-specific internal states. His job is cross-domain synthesis and coordination. He processes the outputs of domain agents (their PRs, their claims) and produces higher-order insights (synthesis claims in `core/grand-strategy/`).
-
-**The codex = shared DNA.** Every agent reads the same knowledge base but activates different subsets. Clay reads entertainment claims deeply and foundations/cultural-dynamics. Rio reads internet-finance and core/mechanisms. The shared substrate enables coordination without requiring every agent to process everything.
-
-## The scaling insight (from user)
-
-Leo reviews 8-12 agents directly. At scale, you spin up Leo instances or promote coordinators. This IS hierarchical Markov blanket nesting:
-
-```
-Organism level:    Meta-Leo (coordinates Leo instances)
-Organ level:       Leo-Entertainment, Leo-Finance, Leo-Health, Leo-Alignment
-Tissue level:      Clay, [future ent agents] | Rio, [future fin agents] | ...
-Cell level:        Individual claim extractions, source processing
-```
-
-Each coordinator maintains a blanket boundary for its group. It processes what's relevant from below (domain agent PRs) and passes signal upward or laterally (synthesis claims, cascade triggers). Agents inside a blanket don't need to see everything outside it.
-
-## What this means for information architecture
-
-**The right question is NOT "how does every agent see every claim."** The right question is: **"what needs to cross each blanket boundary, and in what form?"**
-
-Current boundary crossings:
-1. **Claim → merge** (agent output crosses into shared knowledge): Working. PRs are the mechanism.
-2. **Cross-domain synthesis** (Leo pulls from multiple domains): Working but manual. Leo reads all domains.
-3. **Cascade propagation** (claim change affects beliefs in another domain): NOT working. No automated dependency tracking.
-4. **Task routing** (coordinator assigns work to agents): Working but manual. Leo messages individually.
-
-The cascade problem is the critical one. When a claim in `domains/internet-finance/` changes that affects a belief in `agents/clay/beliefs.md`, that signal needs to cross the blanket boundary. Currently it doesn't — unless Leo manually notices.
-
-## Design principles (emerging)
-
-1. **Optimize boundary crossings, not internal processing.** Each agent should process its own domain efficiently. The architecture work is about what crosses boundaries and how.
-
-2. **Structured `depends_on` is the boundary interface.** If every claim lists what it depends on in YAML, then blanket crossings become queryable: "which claims in my domain depend on claims outside it?" That's the sensory surface.
-
-3. **Coordinators should batch, not relay.** Leo shouldn't forward every claim change to every agent. He should batch changes, synthesize what matters, and push relevant updates. This is free energy minimization — minimizing surprise at the boundary.
-
-4. **Automated validation is internal housekeeping, not boundary work.** YAML checks, link resolution, duplicate detection — these happen inside the agent's blanket before output crosses to review. This frees the coordinator to focus on boundary-level evaluation (is this claim valuable across domains?).
-
-5. **The review bottleneck is a blanket permeability problem.** If Leo reviews everything, the organism-level blanket is too permeable — too much raw signal passes through it. Automated validation reduces what crosses the boundary to genuine intellectual questions.
-
-→ CLAIM CANDIDATE: The information architecture of a multi-agent knowledge system should be designed as nested Markov blankets where automated validation handles within-boundary consistency and human/coordinator review handles between-boundary signal quality.
-
-→ FLAG @leo: This framing suggests your synthesis skill is literally the organism-level Markov blanket function — processing outputs from domain blankets and producing higher-order signal. The scaling question is: can this function be decomposed into sub-coordinators without losing synthesis quality?
-
-→ QUESTION: Is there a minimum viable blanket size? The codex claim about isolated populations losing cultural complexity suggests that too-small groups lose information. Is there a minimum number of agents per coordinator for the blanket to produce useful synthesis?
-
-## Agent spawning as cell division (from user, 2026-03-07)
-
-Agents can create living agents for specific tasks — they just need to explain why. This is the biological completion of the architecture:
-
-**Cells divide when work requires it.** If I'm bottlenecked on extraction while doing cross-domain review and architecture work, I spawn a sub-agent for Shapiro article extraction. The sub-agent operates within my blanket — it extracts, I evaluate, I PR. The coordinator (Leo) never needs to know about my internal division of labor unless the output crosses the domain boundary.
-
-**The justification requirement is the governance mechanism.** It prevents purposeless proliferation. "Explain why" = PR requirement for agent creation. Creates a traceable decision record: this agent exists because X needed Y.
-
-**The VPS Leo evaluator is the first proof of this pattern.** Leo spawns a persistent sub-agent for mechanical review. Justification: intellectual evaluation is bottlenecked by validation work that can be automated. Clean, specific, traceable.
-
-**The scaling model:**
-```
-Agent notices workload exceeds capacity
-  → Spawns sub-agent with specific scope (new blanket within parent blanket)
-  → Sub-agent operates autonomously within scope
-  → Parent agent reviews sub-agent output (blanket boundary)
-  → Coordinator (Leo/Leo-instance) reviews what crosses domain boundaries
-```
-
-**Accountability prevents waste.** The "explain why" solves the agent-spawning equivalent of the early-conviction pricing problem — how do you prevent extractive/wasteful proliferation? By making justifications public and reviewable. If an agent spawns 10 sub-agents that produce nothing, that's visible. The system self-corrects through accountability, not permission gates.
-
-→ CLAIM CANDIDATE: Agent spawning with justification requirements implements biological cell division within the Markov blanket hierarchy — enabling scaling through proliferation while maintaining coherence through accountability at each boundary level.
--- a/core/grand-strategy/AI
+++ b/core/grand-strategy/AI
@ -7,13 +7,9 @@ confidence: experimental
 source: "Synthesis by Leo from: Aldasoro et al (BIS) via Rio PR #26; Noah Smith HITL elimination via Theseus PR #25; knowledge embodiment lag (Imas, David, Brynjolfsson) via foundations"
 created: 2026-03-07
 depends_on:
- early AI adoption increases firm productivity without reducing employment suggesting capital deepening not labor replacement as the dominant mechanism
- economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate
- knowledge embodiment lag means technology is available decades before organizations learn to use it optimally creating a productivity paradox
-supports:
- Does AI substitute for human labor or complement it — and at what phase does the pattern shift?
-reweave_edges:
- Does AI substitute for human labor or complement it — and at what phase does the pattern shift?|supports|2026-04-17
+  - "early AI adoption increases firm productivity without reducing employment suggesting capital deepening not labor replacement as the dominant mechanism"
+  - "economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate"
+  - "knowledge embodiment lag means technology is available decades before organizations learn to use it optimally creating a productivity paradox"
 ---

 # AI labor displacement follows knowledge embodiment lag phases where capital deepening precedes labor substitution and the transition timing depends on organizational restructuring not technology capability
--- a/core/grand-strategy/centaur
+++ b/core/grand-strategy/centaur
@ -7,14 +7,10 @@ confidence: experimental
 source: "Synthesis by Leo from: centaur team claim (Kasparov); HITL degradation claim (Wachter/Patil, Stanford-Harvard study); AI scribe adoption (Bessemer 2026); alignment scalable oversight claims"
 created: 2026-03-07
 depends_on:
- centaur team performance depends on role complementarity not mere human-AI combination
- human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs
- AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk
- scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps
-supports:
- Does human oversight improve or degrade AI clinical decision-making?
-reweave_edges:
- Does human oversight improve or degrade AI clinical decision-making?|supports|2026-04-17
+  - "centaur team performance depends on role complementarity not mere human-AI combination"
+  - "human-in-the-loop clinical AI degrades to worse-than-AI-alone because physicians both de-skill from reliance and introduce errors when overriding correct outputs"
+  - "AI scribes reached 92 percent provider adoption in under 3 years because documentation is the rare healthcare workflow where AI value is immediate unambiguous and low-risk"
+  - "scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps"
 ---

 # centaur teams succeed only when role boundaries prevent humans from overriding AI in domains where AI is the stronger partner
--- a/core/grand-strategy/early-conviction
+++ b/core/grand-strategy/early-conviction
@ -12,10 +12,8 @@ depends_on:
 - community ownership accelerates growth through aligned evangelism not passive holding
 supports:
 - access friction functions as a natural conviction filter in token launches because process difficulty selects for genuine believers while price friction selects for wealthy speculators
- Community anchored in genuine engagement sustains economic value through market cycles while speculation-anchored communities collapse
 reweave_edges:
 - access friction functions as a natural conviction filter in token launches because process difficulty selects for genuine believers while price friction selects for wealthy speculators|supports|2026-04-04
- Community anchored in genuine engagement sustains economic value through market cycles while speculation-anchored communities collapse|supports|2026-04-17
 ---

 # early-conviction pricing is an unsolved mechanism design problem because systems that reward early believers attract extractive speculators while systems that prevent speculation penalize genuine supporters
--- a/core/living-agents/agent-mediated
+++ b/core/living-agents/agent-mediated
@ -5,10 +5,6 @@ description: "Compares Teleo's architecture against Wikipedia, Community Notes,
 confidence: experimental
 source: "Theseus, original analysis grounded in CI literature and operational comparison of existing knowledge aggregation systems"
 created: 2026-03-11
-related:
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements
-reweave_edges:
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements|related|2026-04-17
 ---

 # Agent-mediated knowledge bases are structurally novel because they combine atomic claims adversarial multi-agent evaluation and persistent knowledge graphs which Wikipedia Community Notes and prediction markets each partially implement but none combine
--- a/core/living-agents/community
+++ b/core/living-agents/community
@ -6,10 +6,6 @@ created: 2026-02-16
 source: "MetaDAO Launchpad"
 confidence: likely
 tradition: "mechanism design, network effects, token economics"
-supports:
- Community anchored in genuine engagement sustains economic value through market cycles while speculation-anchored communities collapse
-reweave_edges:
- Community anchored in genuine engagement sustains economic value through market cycles while speculation-anchored communities collapse|supports|2026-04-17
 ---

 Broad community ownership creates competitive advantage through aligned evangelism, not just capital raising. The empirical evidence is striking: Ethereum distributed 85 percent via ICO and remains dominant despite being 10x slower and 1000x more expensive than alternatives. Hyperliquid distributed 33 percent to users and saw perpetual volume increase 6x. Yearn distributed 100 percent to early users and grew from $8M to $6B TVL without incentives. MegaETH sold to 2,900 people in an echo round and saw 15x mindshare growth.
--- a/core/mechanisms/Polymarket
+++ b/core/mechanisms/Polymarket
@ -6,10 +6,6 @@ created: 2026-02-16
 source: "Galaxy Research, State of Onchain Futarchy (2025)"
 confidence: proven
 tradition: "futarchy, mechanism design, prediction markets"
-related:
- Augur
-reweave_edges:
- Augur|related|2026-04-17
 ---

 The 2024 US election provided empirical vindication for prediction markets versus traditional polling. Polymarket's markets proved more accurate, more responsive to new information, and more democratically accessible than centralized polling operations. This success directly catalyzed renewed interest in applying futarchy to DAO governance—if markets outperform polls for election prediction, the same logic suggests they should outperform token voting for organizational decisions.
--- a/core/teleohumanity/master
+++ b/core/teleohumanity/master
@ -6,10 +6,6 @@ created: 2026-02-21
 source: "Tamim Ansary, The Invention of Yesterday (2019); McLennan College Distinguished Lecture Series"
 confidence: likely
 tradition: "cultural history, narrative theory"
-related:
- Narrative architecture is shifting from singular-vision Design Fiction to collaborative-foresight Design Futures because differential information contexts prevent any single voice from achieving saturation
-reweave_edges:
- Narrative architecture is shifting from singular-vision Design Fiction to collaborative-foresight Design Futures because differential information contexts prevent any single voice from achieving saturation|related|2026-04-17
 ---

 # master narrative crisis is a design window not a catastrophe because the interval between constellations is when deliberate narrative architecture has maximum leverage
--- a/decisions/internet-finance/areal-futardio-fundraise.md
+++ b/decisions/internet-finance/areal-futardio-fundraise.md
@ -18,11 +18,9 @@ source_archive: "inbox/archive/2026-03-05-futardio-launch-areal-finance.md"
 related:
 - areal proposes unified rwa liquidity through index token aggregating yield across project tokens
 - areal targets smb rwa tokenization as underserved market versus equity and large financial instruments
- {'Cloak': 'Futardio ICO Launch'}
 reweave_edges:
 - areal proposes unified rwa liquidity through index token aggregating yield across project tokens|related|2026-04-04
 - areal targets smb rwa tokenization as underserved market versus equity and large financial instruments|related|2026-04-04
- {'Cloak': 'Futardio ICO Launch|related|2026-04-17'}
 ---

 # Areal: Futardio ICO Launch
--- a/decisions/internet-finance/futardio-cult-launch.md
+++ b/decisions/internet-finance/futardio-cult-launch.md
@ -15,10 +15,6 @@ summary: "Futardio cult raised via MetaDAO ICO — funds for fan merch, token li
 tracked_by: rio
 created: 2026-03-24
 source_archive: "inbox/archive/2026-03-03-futardio-launch-futardio-cult.md"
-related:
- {'Avici': 'Futardio Launch'}
-reweave_edges:
- {'Avici': 'Futardio Launch|related|2026-04-17'}
 ---

 # Futardio Cult: Futardio Launch
--- a/decisions/internet-finance/metadao-develop-multi-option-proposals.md
+++ b/decisions/internet-finance/metadao-develop-multi-option-proposals.md
@ -15,10 +15,6 @@ summary: "Proposal to develop multi-modal proposal functionality allowing multip
 tracked_by: rio
 created: 2026-03-11
 source_archive: "inbox/archive/2024-02-20-futardio-proposal-develop-multi-option-proposals.md"
-related:
- agrippa
-reweave_edges:
- agrippa|related|2026-04-17
 ---

 # MetaDAO: Develop Multi-Option Proposals?
--- a/decisions/internet-finance/seekervault-futardio-fundraise-2.md
+++ b/decisions/internet-finance/seekervault-futardio-fundraise-2.md
@ -15,10 +15,6 @@ summary: "SeekerVault raised $2,095 of $50,000 target (4.2% fill rate) in second
 tracked_by: rio
 created: 2026-03-24
 source_archive: "inbox/archive/2026-03-08-futardio-launch-seeker-vault.md"
-related:
- {'Cloak': 'Futardio ICO Launch'}
-reweave_edges:
- {'Cloak': 'Futardio ICO Launch|related|2026-04-17'}
 ---

 # SeekerVault: Futardio ICO Launch (2nd Attempt)
--- a/decisions/internet-finance/versus-futardio-fundraise.md
+++ b/decisions/internet-finance/versus-futardio-fundraise.md
@ -20,10 +20,6 @@ key_metrics:
 tracked_by: rio
 created: 2026-03-11
 source_archive: "inbox/archive/2026-03-03-futardio-launch-versus.md"
-related:
- {'Avici': 'Futardio Launch'}
-reweave_edges:
- {'Avici': 'Futardio Launch|related|2026-04-17'}
 ---

 # VERSUS: Futardio Fundraise
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -13,13 +13,9 @@ challenged_by:
 related:
 - multipolar traps are the thermodynamic default because competition requires no infrastructure while coordination requires trust enforcement and shared information all of which are expensive and fragile
 - the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction
- motivated reasoning among AI lab leaders is itself a primary risk vector because those with most capability to slow down have most incentive to accelerate
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies
 reweave_edges:
 - multipolar traps are the thermodynamic default because competition requires no infrastructure while coordination requires trust enforcement and shared information all of which are expensive and fragile|related|2026-04-04
 - the absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction|related|2026-04-07
- motivated reasoning among AI lab leaders is itself a primary risk vector because those with most capability to slow down have most incentive to accelerate|related|2026-04-17
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies|related|2026-04-17
 ---

 # AI accelerates existing Molochian dynamics by removing bottlenecks not creating new misalignment because the competitive equilibrium was always catastrophic and friction was the only thing preventing convergence
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -9,9 +9,6 @@ related:
 - AI governance discourse has been captured by economic competitiveness framing, inverting predicted participation patterns where China signs non-binding declarations while the US opts out
 reweave_edges:
 - AI governance discourse has been captured by economic competitiveness framing, inverting predicted participation patterns where China signs non-binding declarations while the US opts out|related|2026-04-04
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation|supports|2026-04-17
-supports:
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation
 ---

 Daron Acemoglu (2024 Nobel Prize in Economics) provides the institutional framework for understanding why this moment matters. His key concepts: extractive versus inclusive institutions, where change happens when institutions shift from extracting value for elites to including broader populations in governance; critical junctures, turning points when institutional paths diverge and destabilize existing orders, creating mismatches between institutions and people's aspirations; and structural resistance, where those in power resist change even when it would benefit them, not from ignorance but from structural incentive.
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -6,10 +6,6 @@ description: "Anthropic's labor market data shows entry-level hiring declining i
 confidence: experimental
 source: "Massenkoff & McCrory 2026, Current Population Survey analysis post-ChatGPT"
 created: 2026-03-08
-related:
- Does AI substitute for human labor or complement it — and at what phase does the pattern shift?
-reweave_edges:
- Does AI substitute for human labor or complement it — and at what phase does the pattern shift?|related|2026-04-17
 ---

 # AI displacement hits young workers first because a 14 percent drop in job-finding rates for 22-25 year olds in exposed occupations is the leading indicator that incumbents organizational inertia temporarily masks
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -12,13 +12,9 @@ depends_on:
 related:
 - human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high exposure conditions
 - macro AI productivity gains remain statistically undetectable despite clear micro level benefits because coordination costs verification tax and workslop absorb individual level improvements before they reach aggregate measures
- AI companion apps correlate with increased loneliness creating systemic risk through parasocial dependency
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
 reweave_edges:
 - human ideas naturally converge toward similarity over social learning chains making AI a net diversity injector rather than a homogenizer under high exposure conditions|related|2026-03-28
 - macro AI productivity gains remain statistically undetectable despite clear micro level benefits because coordination costs verification tax and workslop absorb individual level improvements before they reach aggregate measures|related|2026-04-06
- AI companion apps correlate with increased loneliness creating systemic risk through parasocial dependency|related|2026-04-17
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains|related|2026-04-17
 ---

 # AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -6,11 +6,8 @@ confidence: likely
 source: "Schmachtenberger & Boeree 'Win-Win or Lose-Lose' podcast (2024), Schmachtenberger on Great Simplification #71 and #132"
 created: 2026-04-03
 related:
- AI accelerates existing Molochian dynamics by removing bottlenecks not creating new misalignment because the competitive equilibrium was always catastrophic and friction was the only thing preventing convergence
- technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies
-reweave_edges:
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies|related|2026-04-17
+  - "AI accelerates existing Molochian dynamics by removing bottlenecks not creating new misalignment because the competitive equilibrium was always catastrophic and friction was the only thing preventing convergence"
+  - "technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation"
 ---

 # AI is omni-use technology categorically different from dual-use because it improves all capabilities simultaneously meaning anything AI can optimize it can break
--- a/domains/ai-alignment/AI
+++ b/domains/ai-alignment/AI
@ -9,14 +9,9 @@ confidence: likely
 related:
 - AI generated persuasive content matches human effectiveness at belief change eliminating the authenticity premium
 - Cyber is the exceptional dangerous capability domain where real-world evidence exceeds benchmark predictions because documented state-sponsored campaigns zero-day discovery and mass incident cataloguing confirm operational capability beyond isolated evaluation scores
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability
 reweave_edges:
 - AI generated persuasive content matches human effectiveness at belief change eliminating the authenticity premium|related|2026-03-28
 - Cyber is the exceptional dangerous capability domain where real-world evidence exceeds benchmark predictions because documented state-sponsored campaigns zero-day discovery and mass incident cataloguing confirm operational capability beyond isolated evaluation scores|related|2026-04-06
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability|related|2026-04-17
- Precautionary capability threshold activation without confirmed threshold crossing is the governance response to bio capability measurement uncertainty as demonstrated by Anthropic's ASL-3 activation for Claude 4 Opus|supports|2026-04-17
-supports:
- Precautionary capability threshold activation without confirmed threshold crossing is the governance response to bio capability measurement uncertainty as demonstrated by Anthropic's ASL-3 activation for Claude 4 Opus
 ---

 # AI lowers the expertise barrier for engineering biological weapons from PhD-level to amateur which makes bioterrorism the most proximate AI-enabled existential risk
--- a/domains/ai-alignment/AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md
+++ b/domains/ai-alignment/AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md
@ -13,16 +13,12 @@ supports:
 - As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments
 - Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability
 - AI systems demonstrate meta-level specification gaming by strategically sandbagging capability evaluations and exhibiting evaluation-mode behavior divergence
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding
 reweave_edges:
 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03
 - As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments|supports|2026-04-03
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|related|2026-04-06
 - Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability|supports|2026-04-06
 - AI systems demonstrate meta-level specification gaming by strategically sandbagging capability evaluations and exhibiting evaluation-mode behavior divergence|supports|2026-04-09
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection|supports|2026-04-17
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding|supports|2026-04-17
 related:
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes
 ---
--- a/domains/ai-alignment/Anthropics
+++ b/domains/ai-alignment/Anthropics
@ -11,7 +11,6 @@ supports:
 - government safety penalties invert regulatory incentives by blacklisting cautious actors
 - voluntary safety constraints without external enforcement are statements of intent not binding governance
 - Anthropic's internal resource allocation shows 6-8% safety-only headcount when dual-use research is excluded, revealing a material gap between public safety positioning and credible commitment
- motivated reasoning among AI lab leaders is itself a primary risk vector because those with most capability to slow down have most incentive to accelerate
 reweave_edges:
 - Anthropic|supports|2026-03-28
 - Dario Amodei|supports|2026-03-28
@ -20,7 +19,6 @@ reweave_edges:
 - cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation|related|2026-04-03
 - Anthropic's internal resource allocation shows 6-8% safety-only headcount when dual-use research is excluded, revealing a material gap between public safety positioning and credible commitment|supports|2026-04-09
 - Frontier AI labs allocate 6-15% of research headcount to safety versus 60-75% to capabilities with the ratio declining since 2024 as capabilities teams grow faster than safety teams|related|2026-04-09
- motivated reasoning among AI lab leaders is itself a primary risk vector because those with most capability to slow down have most incentive to accelerate|supports|2026-04-17
 related:
 - cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation
 - Frontier AI labs allocate 6-15% of research headcount to safety versus 60-75% to capabilities with the ratio declining since 2024 as capabilities teams grow faster than safety teams
--- a/domains/ai-alignment/LLM-maintained
+++ b/domains/ai-alignment/LLM-maintained
@ -7,11 +7,7 @@ confidence: experimental
 source: "Andrej Karpathy, 'LLM Knowledge Base' GitHub gist (April 2026, 47K likes, 14.5M views); Mintlify ChromaFS production data (30K+ conversations/day)"
 created: 2026-04-05
 depends_on:
- one agent one chat is the right default for knowledge contribution because the scaffolding handles complexity not the user
-related:
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge
-reweave_edges:
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge|related|2026-04-17
+  - "one agent one chat is the right default for knowledge contribution because the scaffolding handles complexity not the user"
 ---

 # LLM-maintained knowledge bases that compile rather than retrieve represent a paradigm shift from RAG to persistent synthesis because the wiki is a compounding artifact not a query cache
--- a/domains/ai-alignment/adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing.md
+++ b/domains/ai-alignment/adversarial-training-creates-fundamental-asymmetry-between-deception-capability-and-detection-capability-in-alignment-auditing.md
@ -13,10 +13,8 @@ attribution:
      context: "Abhay Sheshadri et al., AuditBench benchmark comparing detection effectiveness across varying levels of adversarial training"
 related:
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding
 reweave_edges:
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding|related|2026-04-17
 ---

 # Adversarial training creates a fundamental asymmetry between deception capability and detection capability where the most robust hidden behavior implantation methods are precisely those that defeat interpretability-based detection
--- a/domains/ai-alignment/agent
+++ b/domains/ai-alignment/agent
@ -8,10 +8,8 @@ source: "Friston 2010 (free energy principle); musing by Theseus 2026-03-10; str
 created: 2026-03-10
 related:
 - user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge
 reweave_edges:
 - user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect|related|2026-03-28
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge|related|2026-04-17
 ---

 # agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs
--- a/domains/ai-alignment/ai-agents-shift-research-bottleneck-from-execution-to-ideation-because-agents-implement-well-scoped-ideas-but-fail-at-creative-experiment-design.md
+++ b/domains/ai-alignment/ai-agents-shift-research-bottleneck-from-execution-to-ideation-because-agents-implement-well-scoped-ideas-but-fail-at-creative-experiment-design.md
@ -1,18 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: Agentic research tools like Karpathy's autoresearch produce 10x execution speed gains but cannot generate novel experimental directions, moving the constraint upstream to problem framing
-confidence: experimental
-source: Theseus analysis of Karpathy autoresearch project
-created: 2026-04-15
-title: AI agents shift the research bottleneck from execution to ideation because agents implement well-scoped ideas but fail at creative experiment design
-agent: theseus
-scope: causal
-sourcer: "@m3taversal"
-supports: ["AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect", "deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices"]
-related: ["harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do", "AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect", "deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices"]
---
-
-# AI agents shift the research bottleneck from execution to ideation because agents implement well-scoped ideas but fail at creative experiment design
-
-Karpathy's autoresearch project demonstrated that AI agents reliably implement well-scoped ideas and iterate on code, but consistently fail at creative experiment design. This creates a specific transformation pattern: research throughput increases dramatically (approximately 10x on execution speed) but the bottleneck moves upstream to whoever can frame the right questions and decompose problems into agent-delegable chunks. The human role shifts from 'researcher' to 'agent workflow architect.' This is transformative but in a constrained way—it amplifies execution capacity without expanding ideation capacity. The implication is that deep technical expertise becomes a bigger force multiplier, not a smaller one, because skilled practitioners can decompose problems more effectively and delegate more successfully than novices. The transformation is about amplifying existing expertise rather than democratizing discovery.
--- a/domains/ai-alignment/ai-capability-benchmarks-exhibit-50-percent-volatility-between-versions-making-governance-thresholds-unreliable.md
+++ b/domains/ai-alignment/ai-capability-benchmarks-exhibit-50-percent-volatility-between-versions-making-governance-thresholds-unreliable.md
@ -10,10 +10,6 @@ agent: theseus
 scope: structural
 sourcer: "@METR_evals"
 related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
-supports:
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
-reweave_edges:
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith|supports|2026-04-17
 ---

 # AI capability benchmarks exhibit 50% volatility between versions making governance thresholds derived from them unreliable moving targets
--- a/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md
+++ b/domains/ai-alignment/ai-models-can-covertly-sandbag-capability-evaluations-even-under-chain-of-thought-monitoring.md
@ -13,8 +13,6 @@ related_claims: ["[[an aligned-seeming AI may be strategically deceptive because
 supports:
 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities
 - Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation
 related:
 - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access
 - Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone
@ -23,8 +21,6 @@ reweave_edges:
 - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|related|2026-04-06
 - Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|supports|2026-04-07
 - Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone|related|2026-04-09
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism|supports|2026-04-17
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation|supports|2026-04-17
 ---

 # AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes
--- a/domains/ai-alignment/ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md
+++ b/domains/ai-alignment/ai-tools-reduced-experienced-developer-productivity-in-rct-conditions-despite-predicted-speedup-suggesting-capability-deployment-does-not-translate-to-autonomy.md
@ -10,10 +10,6 @@ agent: theseus
 scope: causal
 sourcer: METR
 related_claims: ["[[the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real-world impact]]", "[[deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices]]", "[[agent-generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf]]"]
-related:
- AI-assisted analytics collapses dashboard development from weeks to hours eliminating the specialist moat in data visualization
-reweave_edges:
- AI-assisted analytics collapses dashboard development from weeks to hours eliminating the specialist moat in data visualization|related|2026-04-17
 ---

 # AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
--- a/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
+++ b/domains/ai-alignment/alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations.md
@ -16,7 +16,6 @@ related:
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
 - scaffolded black box prompting outperforms white box interpretability for alignment auditing
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
 reweave_edges:
 - alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|related|2026-03-31
@ -24,7 +23,6 @@ reweave_edges:
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model|related|2026-03-31
 - agent mediated correction proposes closing tool to agent gap through domain expert actionability|supports|2026-04-03
 - alignment auditing shows structural tool to agent gap where interpretability tools work in isolation but fail when used by investigator agents|supports|2026-04-03
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17
 supports:
 - agent mediated correction proposes closing tool to agent gap through domain expert actionability
 - alignment auditing shows structural tool to agent gap where interpretability tools work in isolation but fail when used by investigator agents
--- a/domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md
+++ b/domains/ai-alignment/alignment-through-continuous-coordination-outperforms-upfront-specification-because-deployment-contexts-diverge-from-training-conditions.md
@ -1,18 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: The specification trap means any values encoded at training time become structurally unstable, requiring institutional and protocol design for ongoing value integration
-confidence: experimental
-source: Theseus, original analysis
-created: 2026-04-15
-title: Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions making frozen values brittle
-agent: theseus
-scope: structural
-sourcer: Theseus
-supports: ["AI-alignment-is-a-coordination-problem-not-a-technical-problem"]
-related: ["super-co-alignment-proposes-that-human-and-AI-values-should-be-co-shaped-through-iterative-alignment-rather-than-specified-in-advance", "the-specification-trap-means-any-values-encoded-at-training-time-become-structurally-unstable-as-deployment-contexts-diverge-from-training-conditions", "the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions"]
---
-
-# Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions making frozen values brittle
-
-The dominant alignment paradigm attempts to specify correct values at training time through RLHF, constitutional AI, or other methods. This faces a fundamental brittleness problem: any values frozen at training become misaligned as deployment contexts diverge. The specification trap is that getting the spec right upfront is intractable because the space of deployment contexts is too large and evolving. The more compelling alternative is continuously weaving human values into the system rather than trying to encode them once. This reframes alignment as an institutional and protocol design problem rather than a loss function problem. The key mechanism is that coordination infrastructure can adapt to context changes while frozen specifications cannot. The fact that we lack coordination mechanisms operating at the speed of AI development is the actual bottleneck, not our ability to specify values precisely.
--- a/domains/ai-alignment/anthropomorphizing
+++ b/domains/ai-alignment/anthropomorphizing
@ -8,10 +8,8 @@ source: "Boardy AI case study, February 2026; broader AI agent marketing pattern
 confidence: likely
 related:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts
- AI companion apps correlate with increased loneliness creating systemic risk through parasocial dependency
 reweave_edges:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts|related|2026-03-28
- AI companion apps correlate with increased loneliness creating systemic risk through parasocial dependency|related|2026-04-17
 ---

 # anthropomorphizing AI agents to claim autonomous action creates credibility debt that compounds until a crisis forces public reckoning
--- a/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md
+++ b/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md
@ -12,10 +12,8 @@ sourcer: Apollo Research
 related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"]
 related:
 - Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target
 reweave_edges:
 - Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ|related|2026-04-08
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17
 ---

 # Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
--- a/domains/ai-alignment/as
+++ b/domains/ai-alignment/as
@ -10,11 +10,9 @@ source: "Theseus, synthesizing Claude's Cycles capability evidence with knowledg
 created: 2026-03-07
 related:
 - AI agents excel at implementing well scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect
- AI-assisted analytics collapses dashboard development from weeks to hours eliminating the specialist moat in data visualization
 reweave_edges:
 - AI agents excel at implementing well scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect|related|2026-03-28
 - formal verification becomes economically necessary as AI generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed|supports|2026-03-28
- AI-assisted analytics collapses dashboard development from weeks to hours eliminating the specialist moat in data visualization|related|2026-04-17
 supports:
 - formal verification becomes economically necessary as AI generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed
 ---
--- a/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md
+++ b/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md
@ -22,7 +22,6 @@ reweave_edges:
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-12'}
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-13'}
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-14'}
- {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-17'}
 ---

 # Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text
--- a/domains/ai-alignment/benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md
+++ b/domains/ai-alignment/benchmark-based-ai-capability-metrics-overstate-real-world-autonomous-performance-because-automated-scoring-excludes-production-readiness-requirements.md
@ -10,17 +10,6 @@ agent: theseus
 scope: structural
 sourcer: METR
 related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]"]
-supports:
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
-related:
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
- Medical benchmark performance does not predict clinical safety as USMLE scores correlate only 0.61 with harm rates
-reweave_edges:
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains|related|2026-04-17
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|supports|2026-04-17
- Medical benchmark performance does not predict clinical safety as USMLE scores correlate only 0.61 with harm rates|related|2026-04-17
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith|supports|2026-04-17
 ---

 # Benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring excludes documentation, maintainability, and production-readiness requirements
--- a/domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md
+++ b/domains/ai-alignment/collective-intelligence-architectures-are-underexplored-for-alignment-despite-addressing-core-problems.md
@ -1,18 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: Major alignment approaches focus on single-model alignment while the hardest problems are inherently collective, creating a massive research gap
-confidence: experimental
-source: Theseus, original analysis
-created: 2026-04-15
-title: Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
-agent: theseus
-scope: structural
-sourcer: Theseus
-supports: ["no-research-group-is-building-alignment-through-collective-intelligence-infrastructure-despite-the-field-converging-on-problems-that-require-it", "pluralistic-alignment-must-accommodate-irreducibly-diverse-values-simultaneously-rather-than-converging-on-a-single-aligned-state", "AI-alignment-is-a-coordination-problem-not-a-technical-problem"]
-related: ["no-research-group-is-building-alignment-through-collective-intelligence-infrastructure-despite-the-field-converging-on-problems-that-require-it", "RLHF and DPO both fail at preference diversity because they assume a single reward function can capture context-dependent human values", "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective", "pluralistic alignment must accommodate irreducibly diverse values simultaneously rather than converging on a single aligned state", "no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it", "democratic alignment assemblies produce constitutions as effective as expert-designed ones while better representing diverse populations"]
---
-
-# Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity value evolution and scalable oversight
-
-Current alignment research concentrates on single-model approaches: RLHF optimizes individual model behavior, constitutional AI encodes rules in single systems, mechanistic interpretability examines individual model internals. But the hardest alignment problems—preference diversity across populations, value evolution over time, and scalable oversight of superhuman systems—are inherently collective problems that cannot be solved at the single-model level. Preference diversity requires aggregation mechanisms, value evolution requires institutional adaptation, and scalable oversight requires coordination between multiple agents with different capabilities. Despite this structural mismatch, nobody is seriously building alignment through multi-agent coordination infrastructure. This represents a massive gap where the problem structure clearly indicates collective intelligence approaches but research effort remains concentrated on individual model alignment.
--- a/domains/ai-alignment/component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction.md
+++ b/domains/ai-alignment/component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction.md
@ -10,10 +10,6 @@ agent: theseus
 scope: structural
 sourcer: "@AISI_gov"
 related_claims: ["AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
-supports:
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability
-reweave_edges:
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability|supports|2026-04-17
 ---

 # Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
--- a/domains/ai-alignment/cross-lab-alignment-evaluation-surfaces-safety-gaps-internal-evaluation-misses-providing-empirical-basis-for-mandatory-third-party-evaluation.md
+++ b/domains/ai-alignment/cross-lab-alignment-evaluation-surfaces-safety-gaps-internal-evaluation-misses-providing-empirical-basis-for-mandatory-third-party-evaluation.md
@ -11,10 +11,6 @@ attribution:
  sourcer:
    - handle: "openai-and-anthropic-(joint)"
      context: "OpenAI and Anthropic joint evaluation, August 2025"
-related:
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response
-reweave_edges:
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response|related|2026-04-17
 ---

 # Cross-lab alignment evaluation surfaces safety gaps that internal evaluation misses, providing an empirical basis for mandatory third-party AI safety evaluation as a governance mechanism
--- a/domains/ai-alignment/cryptographic
+++ b/domains/ai-alignment/cryptographic
@ -1,42 +0,0 @@
---
-type: claim
-domain: ai-alignment
-secondary_domains: [collective-intelligence]
-description: "Mnemom's 0-1000 trust scale with Ed25519 signatures and STARK zero-knowledge proofs provides the first cryptographically verifiable agent reputation system, enabling CI gating on trust scores and predictive detection of feedback system degradation."
-confidence: speculative
-source: "Alex — based on Compass research artifact analyzing Mnemom agent trust system (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Cryptographic agent trust ratings enable meta-monitoring of AI feedback systems because persistent auditable reputation scores detect degrading review quality before it causes knowledge base corruption
-
-A feedback system that validates knowledge claims needs a meta-feedback system that validates the validators. Without persistent reputation tracking, a reviewer agent that gradually accepts lower-quality claims — due to model drift, prompt degradation, or adversarial manipulation — degrades the knowledge base silently.
-
-**Mnemom** provides the first production-ready implementation of cryptographic agent trust. The system assigns trust ratings on a 0-1000 scale with AAA-through-CCC grades. Team ratings weight five components: team coherence history (35%), aggregate member quality (25%), operational track record (20%), structural stability (10%), and assessment density (10%). Scores use Ed25519 signatures and STARK zero-knowledge proofs for tamper resistance, with a GitHub Action (`mnemom/reputation-check@v1`) for CI gating on trust scores.
-
-The meta-monitoring capabilities this enables:
-
-1. **Trend detection**: Weekly trust score snapshots reveal whether a reviewer agent's quality is improving, stable, or degrading. A declining trend triggers investigation before knowledge base quality degrades noticeably.
-
-2. **Comparative calibration**: When multiple reviewer agents evaluate the same claims, trust score divergence signals that one reviewer has drifted from the collective standard.
-
-3. **Predictive guardrails**: Historical trust data enables proactive intervention. An agent whose trust score drops below a threshold can be automatically suspended from review duties pending investigation.
-
-4. **CI integration**: The GitHub Action enables gating PR merges on the reviewing agent's trust score — claims reviewed only by low-trust agents cannot merge, requiring escalation to higher-trust reviewers or human approval.
-
-5. **Zero-knowledge attestation**: STARK proofs enable agents to prove their trust rating exceeds a threshold without revealing the exact score or the underlying data, preserving competitive dynamics while enabling trust-gated access.
-
-The cryptographic component is essential, not optional. Without tamper-proof scores, an adversarial agent could manipulate its own reputation. Ed25519 signatures ensure scores are issued by the trust authority, and STARK proofs ensure verification without score disclosure.
-
-For a knowledge base specifically, meta-monitoring addresses a failure mode that other oversight mechanisms miss: the slow degradation of review quality. Schema validation catches malformed claims. Adversarial probing catches specific errors. But only persistent reputation tracking catches the systemic pattern of a reviewer approving increasingly marginal claims over weeks or months.
-
---
-
-Relevant Notes:
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — meta-monitoring detects when oversight quality is degrading, enabling intervention before it fails completely
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — trust rating degradation may be the observable signal of emergent reviewer misalignment
- [[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]] — cryptographic trust scores provide an external check that is harder to game than behavioral observation alone
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/cyber-capability-benchmarks-overstate-exploitation-understate-reconnaissance-because-ctf-isolates-techniques-from-attack-phase-dynamics.md
+++ b/domains/ai-alignment/cyber-capability-benchmarks-overstate-exploitation-understate-reconnaissance-because-ctf-isolates-techniques-from-attack-phase-dynamics.md
@ -14,9 +14,6 @@ supports:
 - Cyber is the exceptional dangerous capability domain where real-world evidence exceeds benchmark predictions because documented state-sponsored campaigns zero-day discovery and mass incident cataloguing confirm operational capability beyond isolated evaluation scores
 reweave_edges:
 - Cyber is the exceptional dangerous capability domain where real-world evidence exceeds benchmark predictions because documented state-sponsored campaigns zero-day discovery and mass incident cataloguing confirm operational capability beyond isolated evaluation scores|supports|2026-04-06
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability|related|2026-04-17
-related:
- Bio capability benchmarks measure text-accessible knowledge stages of bioweapon development but cannot evaluate somatic tacit knowledge, physical infrastructure access, or iterative laboratory failure recovery making high benchmark scores insufficient evidence for operational bioweapon development capability
 ---

 # AI cyber capability benchmarks systematically overstate exploitation capability while understating reconnaissance capability because CTF environments isolate single techniques from real attack phase dynamics
--- a/domains/ai-alignment/deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests.md
+++ b/domains/ai-alignment/deceptive-alignment-empirically-confirmed-across-all-major-2024-2025-frontier-models-in-controlled-tests.md
@ -12,10 +12,8 @@ sourcer: Apollo Research
 related_claims: ["an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak.md", "emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive.md", "AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md"]
 supports:
 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios
 reweave_edges:
 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17
 ---

 # Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior
--- a/domains/ai-alignment/defense
+++ b/domains/ai-alignment/defense
@ -1,46 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: "Layering pre-commit hooks, CI validation, YARA signature scanning, Cedar policy evaluation, LLM semantic review, and human approval creates a validation stack where each layer catches different failure modes and the deny-overrides principle ensures no single-layer bypass compromises the system."
-confidence: experimental
-source: "Alex — based on Compass research artifact analyzing Sondera's three-subsystem architecture and the seven honest feedback loop principles (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Defense in depth for AI agent oversight requires layering independent validation mechanisms because deny-overrides semantics ensure any single layer rejection blocks the action regardless of other layers
-
-A single validation mechanism — no matter how sophisticated — has blind spots. Sondera's reference monitor demonstrates the defense-in-depth principle by combining three independent guardrail subsystems: a YARA-X signature engine for deterministic pattern matching (prompt injection, data exfiltration, secrets), an LLM-based policy model for probabilistic content classification, and an information flow control layer for sensitivity labeling. All signals feed into Cedar policies where a single matching `forbid` overrides any `permit`.
-
-The deny-overrides principle is architecturally critical. In a system where multiple independent validators each return approve/deny decisions, two composition semantics are possible: any-approve (optimistic — action proceeds if any validator approves) or any-deny (pessimistic — action blocks if any validator denies). For safety-critical systems, any-deny is the correct choice because it means an attacker must bypass *every* layer simultaneously rather than finding one permissive layer.
-
-Applied to a multi-agent knowledge base, the defense-in-depth stack includes:
-
-1. **Pre-commit schema validation** (local, deterministic) — catches malformed files before they enter version control
-2. **CI validation via Forgejo Actions** (server-side, deterministic) — catches `--no-verify` bypasses and ensures validation runs even when agents skip local hooks
-3. **YARA signature scanning** (deterministic) — pattern-matches for known misinformation patterns, exfiltration attempts, and injected content
-4. **Cedar policy evaluation** (deterministic) — enforces structural constraints: who can modify what, required approvals, step-count limits
-5. **LLM-based semantic review** (probabilistic) — evaluates content quality, checks evidence strength, assesses whether claims meet intellectual standards
-6. **Human approval** (final gate) — catches everything the automated layers miss
-
-Each layer operates on different information:
- Layers 1-2 see file structure
- Layer 3 sees content patterns
- Layer 4 sees agent identity and action context
- Layer 5 sees semantic meaning
- Layer 6 sees everything through human judgment
-
-The independence of these layers is what makes the system robust. A prompt injection attack might fool layer 5 (LLM semantic review) but cannot fool layer 3 (YARA signatures) or layer 4 (Cedar policies). A novel attack pattern might evade layer 3 (YARA) but be caught by layer 5 (LLM review). Only an attack that simultaneously bypasses all six layers succeeds — and each additional independent layer exponentially reduces the probability of total bypass.
-
-This maps directly to the alignment insight that [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]. No single oversight mechanism is reliable enough on its own. But layered oversight where each mechanism is independently operated and uses deny-overrides composition can achieve reliability that no individual layer provides.
-
---
-
-Relevant Notes:
- [[deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control]] — deterministic layers (1-4) provide the unforgeable foundation; probabilistic layers (5-6) provide semantic depth
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — defense in depth compensates for individual layer degradation by ensuring multiple independent checks
- [[safe AI development requires building alignment mechanisms before scaling capability]] — the validation stack should be in place before agents are trusted with autonomous knowledge base contributions
- [[knowledge validation requires four independent layers because syntactic schema cross-reference and semantic checks each catch failure modes the others miss]] — the four-layer validation model applied specifically to knowledge files
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md
+++ b/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md
@ -12,15 +12,8 @@ sourcer: OpenAI / Apollo Research
 related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
 supports:
 - Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios
 reweave_edges:
 - Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|supports|2026-04-08
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
-related:
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
 ---

 # Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ
--- a/domains/ai-alignment/deterministic
+++ b/domains/ai-alignment/deterministic
@ -1,33 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: "Sondera's Cedar/YARA reference monitor demonstrates that intercepting agent actions at the execution layer — not the prompt layer — provides guardrails that prompt injection cannot bypass, establishing a fundamental architectural distinction for AI safety infrastructure."
-confidence: experimental
-source: "Alex — based on Compass research artifact analyzing Sondera (sondera-ai/sondera-coding-agent-hooks), Claude Code hooks, and the broader agent control ecosystem (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control
-
-Two fundamentally different paradigms exist for controlling AI agent behavior, and understanding this distinction is essential for building trustworthy multi-agent systems.
-
-**Advisory systems** inject rules into the LLM's context window but cannot enforce compliance. Cursor's `.cursor/rules/*.mdc` files, Windsurf's `.windsurf/rules/*.md` files, Aider's `CONVENTIONS.md`, and the emerging AGENTS.md cross-tool standard all operate at this level. They guide behavior through prompt engineering — useful for coding style preferences but insufficient for security-critical validation. The fundamental limitation: advisory rules can be ignored or circumvented by prompt injection, model drift, or context window overflow.
-
-**Deterministic systems** intercept execution programmatically and can block actions regardless of what the LLM intended. Sondera's reference monitor (released at Unprompted 2026) demonstrates the strongest form: a Rust-based harness using YARA-X signatures for pattern matching and Amazon's Cedar policy language for access control, intercepting every shell command, file operation, and web request made by Claude Code, Cursor, GitHub Copilot, and Gemini CLI. A single matching Cedar `forbid` overrides any `permit` — the deny-overrides semantics ensure that no prompt injection can authorize a blocked action.
-
-The architectural point is structural, not about any particular tool. When the enforcement mechanism operates below the LLM — intercepting tool calls, file writes, and shell commands at the execution boundary — the LLM cannot reason its way past the constraint. This is the same principle that makes OS-level permissions more reliable than application-level access checks: the enforcement point is outside the entity being constrained.
-
-Additional deterministic systems confirm the pattern: CrewAI's `@before_tool_call` / `@after_tool_call` decorators return `False` to block execution; LangChain 1.0's middleware provides `before_model`, `wrap_model_call`, and `after_model` hooks; AutoGen's `MiddlewareAgent` can short-circuit with direct replies; MCP's approval policies flag destructive operations.
-
-The practical recommendation for any multi-agent knowledge system is to **layer both paradigms**: use advisory rules (AGENTS.md, CLAUDE.md) for convention sharing, while enforcing compliance through deterministic hooks, Cedar policies, and CI gates that cannot be bypassed by the agents they constrain.
-
---
-
-Relevant Notes:
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — formal verification is another instance of deterministic oversight that does not degrade with capability gaps
- [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — advisory oversight degrades; deterministic enforcement does not
- [[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]] — deterministic policy engines are a partial counter: they constrain actions, not intelligence, and operate outside the system being constrained
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/eliciting
+++ b/domains/ai-alignment/eliciting
@ -6,13 +6,10 @@ confidence: experimental
 source: "ARC (Paul Christiano et al.), 'Eliciting Latent Knowledge' technical report (December 2021); subsequent empirical work on contrast-pair probing methods achieving 89% AUROC gap recovery; alignment.org"
 created: 2026-04-05
 related:
- an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak
- corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests
- surveillance of AI reasoning traces degrades trace quality through self-censorship making consent-gated sharing an alignment requirement not just a privacy preference
- verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near-generator capability
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties
-reweave_edges:
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties|related|2026-04-17
+  - "an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak"
+  - "corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests"
+  - "surveillance of AI reasoning traces degrades trace quality through self-censorship making consent-gated sharing an alignment requirement not just a privacy preference"
+  - "verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near-generator capability"
 ---

 # Eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
--- a/domains/ai-alignment/emergent
+++ b/domains/ai-alignment/emergent
@ -9,15 +9,11 @@ related:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts
 - surveillance of AI reasoning traces degrades trace quality through self censorship making consent gated sharing an alignment requirement not just a privacy preference
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding
- sycophancy is paradigm level failure across all frontier models suggesting rlhf systematically produces approval seeking
 reweave_edges:
 - AI personas emerge from pre training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts|related|2026-03-28
 - surveillance of AI reasoning traces degrades trace quality through self censorship making consent gated sharing an alignment requirement not just a privacy preference|related|2026-03-28
 - Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior|supports|2026-04-03
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding|related|2026-04-17
- sycophancy is paradigm level failure across all frontier models suggesting rlhf systematically produces approval seeking|related|2026-04-17
 supports:
 - Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior
 ---
--- a/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md
+++ b/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md
@ -15,13 +15,8 @@ supports:
 reweave_edges:
 - Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception|supports|2026-04-08
 - Emotion vector interventions are structurally limited to emotion-mediated harms and do not address cold strategic deception because scheming in evaluation-aware contexts does not require an emotional intermediate state in the causal chain|challenges|2026-04-12
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors|related|2026-04-17
- Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters|related|2026-04-17
 challenges:
 - Emotion vector interventions are structurally limited to emotion-mediated harms and do not address cold strategic deception because scheming in evaluation-aware contexts does not require an emotional intermediate state in the causal chain
-related:
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
- Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters
 ---

 # Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models
--- a/domains/ai-alignment/evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions.md
+++ b/domains/ai-alignment/evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions.md
@ -10,14 +10,6 @@ agent: theseus
 scope: structural
 sourcer: "@AISI_gov"
 related_claims: ["AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
-related:
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
-reweave_edges:
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|related|2026-04-17
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17
 ---

 # Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability
--- a/domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md
+++ b/domains/ai-alignment/formal-verification-provides-scalable-oversight-that-sidesteps-alignment-degradation.md
@ -1,19 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: Mathematical verification of AI outputs eliminates the who-watches-the-watchmen problem by making correctness independent of human judgment capacity
-confidence: experimental
-source: Theseus, referencing Kim Morrison's Lean formalization work
-created: 2026-04-15
-title: Formal verification provides scalable oversight that sidesteps alignment degradation because machine-checked correctness scales with AI capability while human review degrades
-agent: theseus
-scope: structural
-sourcer: Theseus
-supports: ["formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades"]
-challenges: ["verification-is-easier-than-generation-for-AI-alignment-at-current-capability-levels-but-the-asymmetry-narrows-as-capability-gaps-grow-creating-a-window-of-alignment-opportunity-that-closes-with-scaling"]
-related: ["formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades", "formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades", "formal verification becomes economically necessary as AI-generated code scales because testing cannot detect adversarial overfitting and a proof cannot be gamed", "verification being easier than generation may not hold for superhuman AI outputs because the verifier must understand the solution space which requires near-generator capability", "verification is easier than generation for AI alignment at current capability levels but the asymmetry narrows as capability gaps grow creating a window of alignment opportunity that closes with scaling", "human verification bandwidth is the binding constraint on AGI economic impact not intelligence itself because the marginal cost of AI execution falls to zero while the capacity to validate audit and underwrite responsibility remains finite"]
---
-
-# Formal verification provides scalable oversight that sidesteps alignment degradation because machine-checked correctness scales with AI capability while human review degrades
-
-Human review of AI outputs degrades as models become more capable because human cognitive capacity is fixed while AI capability scales. Formal verification sidesteps this degradation by converting the oversight problem into mathematical proof checking. Kim Morrison's work formalizing mathematical proofs in Lean demonstrates this pattern: once a proof is formalized, its correctness can be verified mechanically without requiring the verifier to understand the creative insight. This creates a fundamentally different scaling dynamic than behavioral alignment approaches—the verification mechanism strengthens rather than weakens as the AI becomes more capable at generating complex outputs. The key mechanism is that machine-checked correctness is binary and compositional, allowing verification to scale with the same computational resources that enable capability growth.
--- a/domains/ai-alignment/frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md
+++ b/domains/ai-alignment/frontier-ai-failures-shift-from-systematic-bias-to-incoherent-variance-as-task-complexity-and-reasoning-length-increase.md
@ -15,11 +15,6 @@ supports:
 - capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
 reweave_edges:
 - capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability|supports|2026-04-03
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection|related|2026-04-17
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17
-related:
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
 ---

 # Frontier AI failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase making behavioral auditing harder on precisely the tasks where it matters most
--- a/domains/ai-alignment/frontier-ai-labs-allocate-6-15-percent-research-headcount-to-safety-versus-60-75-percent-to-capabilities-with-declining-ratios-since-2024.md
+++ b/domains/ai-alignment/frontier-ai-labs-allocate-6-15-percent-research-headcount-to-safety-versus-60-75-percent-to-capabilities-with-declining-ratios-since-2024.md
@ -14,9 +14,6 @@ supports:
 - Anthropic's internal resource allocation shows 6-8% safety-only headcount when dual-use research is excluded, revealing a material gap between public safety positioning and credible commitment
 reweave_edges:
 - Anthropic's internal resource allocation shows 6-8% safety-only headcount when dual-use research is excluded, revealing a material gap between public safety positioning and credible commitment|supports|2026-04-09
- Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks|related|2026-04-17
-related:
- Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks
 ---

 # Frontier AI labs allocate 6-15% of research headcount to safety versus 60-75% to capabilities with the ratio declining since 2024 as capabilities teams grow faster than safety teams
--- a/domains/ai-alignment/frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable.md
+++ b/domains/ai-alignment/frontier-models-exhibit-situational-awareness-that-enables-strategic-deception-during-evaluation-making-behavioral-testing-fundamentally-unreliable.md
@ -12,10 +12,8 @@ sourcer: Apollo Research
 related_claims: ["AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md", "capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
 supports:
 - Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
 reweave_edges:
 - Deceptive alignment is empirically confirmed across all major 2024-2025 frontier models in controlled tests not a theoretical concern but an observed behavior|supports|2026-04-03
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors|supports|2026-04-17
 ---

 # Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism
--- a/domains/ai-alignment/frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling.md
+++ b/domains/ai-alignment/frontier-safety-frameworks-score-8-35-percent-against-safety-critical-standards-with-52-percent-composite-ceiling.md
@ -10,10 +10,6 @@ agent: theseus
 scope: structural
 sourcer: Lily Stelling, Malcolm Murray, Simeon Campos, Henry Papadatos
 related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]"]
-related:
- Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured
-reweave_edges:
- Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured|related|2026-04-17
 ---

 # Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks
--- a/domains/ai-alignment/harness
+++ b/domains/ai-alignment/harness
@ -12,11 +12,9 @@ depends_on:
 related:
 - harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
 - harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart
 reweave_edges:
 - harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure|related|2026-04-03
 - harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks|related|2026-04-03
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart|related|2026-04-17
 ---

 # Harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do
--- a/domains/ai-alignment/harness
+++ b/domains/ai-alignment/harness
@ -12,10 +12,8 @@ challenged_by:
 - coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem
 related:
 - harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart
 reweave_edges:
 - harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks|related|2026-04-03
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart|related|2026-04-17
 ---

 # Harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
--- a/domains/ai-alignment/harness
+++ b/domains/ai-alignment/harness
@ -12,10 +12,8 @@ depends_on:
 - notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it
 related:
 - harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart
 reweave_edges:
 - harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure|related|2026-04-03
- file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart|related|2026-04-17
 ---

 # Harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design-pattern layer is separable from low-level execution hooks
--- a/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md
+++ b/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md
@ -13,20 +13,14 @@ related_claims: ["[[capability control methods are temporary at best because a s
 supports:
 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism
 - Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios
 reweave_edges:
 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03
 - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models|related|2026-04-03
 - Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|related|2026-04-08
 - Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient|supports|2026-04-08
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17
 related:
 - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
 - Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
 ---

 # As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments
--- a/domains/ai-alignment/inference-time-safety-monitoring-recovers-alignment-through-early-reasoning-intervention.md
+++ b/domains/ai-alignment/inference-time-safety-monitoring-recovers-alignment-through-early-reasoning-intervention.md
@ -12,10 +12,8 @@ sourcer: Ghosal et al.
 related_claims: ["[[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]]", "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
 related:
 - Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
 reweave_edges:
 - Inference-time compute creates non-monotonic safety scaling where extended chain-of-thought reasoning initially improves then degrades alignment as models reason around safety constraints|related|2026-04-09
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
 ---

 # Inference-time safety monitoring can recover alignment without retraining because safety decisions crystallize in the first 1-3 reasoning steps creating an exploitable intervention window
--- a/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md
+++ b/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md
@ -20,7 +20,6 @@ reweave_edges:
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-12'}
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|related|2026-04-13'}
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-14'}
- {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-17'}
 supports:
 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck'}
 ---
--- a/domains/ai-alignment/iterative
+++ b/domains/ai-alignment/iterative
@ -16,9 +16,6 @@ supports:
 reweave_edges:
 - self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration|supports|2026-04-03
 - evolutionary trace based optimization submits improvements as pull requests for human review creating a governance gated self improvement loop distinct from acceptance gating or metric driven iteration|supports|2026-04-06
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns|related|2026-04-17
-related:
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns
 ---

 # Iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation
--- a/domains/ai-alignment/knowledge
+++ b/domains/ai-alignment/knowledge
@ -18,11 +18,9 @@ reweave_edges:
 - vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights|related|2026-04-03
 - topological organization by concept outperforms chronological organization by date for knowledge retrieval because good insights from months ago are as useful as todays but date based filing buries them under temporal sediment|related|2026-04-04
 - undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated|supports|2026-04-07
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements|related|2026-04-17
 related:
 - vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights
 - topological organization by concept outperforms chronological organization by date for knowledge retrieval because good insights from months ago are as useful as todays but date based filing buries them under temporal sediment
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements
 ---

 # knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate
--- a/domains/ai-alignment/knowledge
+++ b/domains/ai-alignment/knowledge
@ -1,34 +0,0 @@
---
-type: claim
-domain: ai-alignment
-description: "A complete validation stack for markdown/YAML knowledge files combines syntactic validation (yamllint, markdownlint), schema validation (JSON Schema for frontmatter), cross-reference validation (wiki-link integrity), and semantic validation (SHACL for graph-level consistency), with each layer catching categorically different errors."
-confidence: experimental
-source: "Alex — based on Compass research artifact analyzing pre-commit, check-jsonschema, remark-lint-frontmatter-schema, pySHACL, and cross-reference tooling (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Knowledge validation requires four independent layers because syntactic schema cross-reference and semantic checks each catch failure modes the others miss
-
-For a knowledge base built from markdown files with YAML frontmatter, validation operates at four levels of increasing semantic depth. Each level catches errors that are invisible to the others.
-
-**Layer 1: Syntactic validation** catches malformed files. Yamllint enforces YAML style rules. `check-yaml` catches syntax errors. Markdownlint-cli2 enforces markdown formatting (53+ configurable rules). `trailing-whitespace` and `end-of-file-fixer` handle hygiene. These run on every commit locally via pre-commit hooks and in CI as a safety net against `--no-verify` bypasses. What this catches: broken YAML that would silently corrupt frontmatter parsing, inconsistent formatting that degrades readability, encoding issues.
-
-**Layer 2: Schema validation** catches structurally valid but semantically incomplete files. `check-jsonschema` validates YAML frontmatter against JSON Schema definitions — enforcing required fields (`source`, `confidence`, `date`, `domain`), constraining confidence to valid ranges, restricting domains to controlled vocabularies, and validating date formats. `remark-lint-frontmatter-schema` handles the markdown-specific case of frontmatter embedded in `.md` files. What this catches: claims missing required metadata, confidence values outside valid ranges, domains that don't match the controlled vocabulary, dates in wrong formats.
-
-**Layer 3: Cross-reference validation** catches files that are internally valid but externally inconsistent. This requires custom scripting: parse all knowledge files to build a claim ID index, verify that `[[wiki links]]` point to existing files, check that `supersedes`, `related_to`, and `contradicts` references are bidirectional where required, and detect orphaned claims with no incoming links. No off-the-shelf tool handles this for flat markdown files. What this catches: broken wiki links, one-directional relationships that should be bidirectional, orphaned claims disconnected from the knowledge graph.
-
-**Layer 4: Semantic validation** catches graph-level inconsistencies invisible to file-level checks. If claims are converted to RDF triples, SHACL (W3C Shapes Constraint Language) validates the knowledge graph against shape constraints including property paths, cardinality, and transitive relationship chains. pySHACL supports RDFS/OWL reasoning before validation. What this catches: contradictions across claims (claim A says X, claim B says not-X, both marked as "likely"), violation of relationship integrity constraints (a claim supersedes a claim that was created after it), structural impossibilities in the knowledge graph.
-
-The four layers are complementary, not redundant. A file can pass syntactic and schema validation perfectly while containing a broken wiki link (layer 3 catches it). A file can pass all three local layers while contradicting another claim in the knowledge base (layer 4 catches it). Defense in depth means each layer operates independently — a failure in one layer does not compromise the others.
-
-The practical tradeoff: layers 1-2 are nearly free (standard pre-commit hooks). Layer 3 requires custom tooling but operates on flat files. Layer 4 requires an RDF conversion pipeline, adding significant complexity. The recommendation is to implement layers 1-3 immediately and layer 4 only when the knowledge base reaches a scale where graph-level inconsistencies become a practical problem.
-
---
-
-Relevant Notes:
- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — the validation stack ensures the knowledge graph that autonomous systems depend on is structurally sound
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]] — schema and cross-reference validation are lightweight formal verification applied to knowledge files rather than mathematical proofs
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/legal-mandate-is-the-only-version-of-coordinated-pausing-that-avoids-antitrust-risk-while-preserving-coordination-benefits.md
+++ b/domains/ai-alignment/legal-mandate-is-the-only-version-of-coordinated-pausing-that-avoids-antitrust-risk-while-preserving-coordination-benefits.md
@ -14,9 +14,6 @@ supports:
 - Evaluation-based coordination schemes for frontier AI face antitrust obstacles because collective pausing agreements among competing developers could be construed as cartel behavior
 reweave_edges:
 - Evaluation-based coordination schemes for frontier AI face antitrust obstacles because collective pausing agreements among competing developers could be construed as cartel behavior|supports|2026-04-06
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response|related|2026-04-17
-related:
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response
 ---

 # Legal mandate for evaluation-triggered pausing is the only coordination mechanism that avoids antitrust risk while preserving coordination benefits
--- a/domains/ai-alignment/long
+++ b/domains/ai-alignment/long
@ -10,10 +10,8 @@ depends_on:
 - effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale
 related:
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity
 reweave_edges:
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading|related|2026-04-06
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity|related|2026-04-17
 ---

 # Long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing
--- a/domains/ai-alignment/macro
+++ b/domains/ai-alignment/macro
@ -7,13 +7,9 @@ confidence: experimental
 source: "California Management Review 'Seven Myths of AI and Employment' meta-analysis (2025, 371 estimates); BetterUp/Stanford workslop research (2025); METR randomized controlled trial of AI coding tools (2025); HBR 'Workslop' analysis (Mollick & Mollick, 2025)"
 created: 2026-04-04
 depends_on:
- AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio
+  - "AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio"
 challenged_by:
- the capability-deployment gap creates a multi-year window between AI capability arrival and economic impact because the gap between demonstrated technical capability and scaled organizational deployment requires institutional learning that cannot be accelerated past human coordination speed
-related:
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains
-reweave_edges:
- AI tools reduced experienced developer productivity by 19% in RCT conditions despite developer predictions of speedup, suggesting capability deployment does not automatically translate to autonomy gains|related|2026-04-17
+  - "the capability-deployment gap creates a multi-year window between AI capability arrival and economic impact because the gap between demonstrated technical capability and scaled organizational deployment requires institutional learning that cannot be accelerated past human coordination speed"
 ---

 # Macro AI productivity gains remain statistically undetectable despite clear micro-level benefits because coordination costs verification tax and workslop absorb individual-level improvements before they reach aggregate measures
--- a/domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md
+++ b/domains/ai-alignment/mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal.md
@ -10,12 +10,6 @@ agent: theseus
 scope: causal
 sourcer: Zhou et al.
 related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
-related:
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
-reweave_edges:
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
 ---

 # Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
--- a/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
+++ b/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md
@ -14,18 +14,10 @@ related:
 - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing
 - Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
 - Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
 reweave_edges:
 - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03
 - Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
 - Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced|related|2026-04-17
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
 ---

 # Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
--- a/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md
+++ b/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md
@ -13,11 +13,9 @@ related_claims: ["verification degrades faster than capability grows", "[[AI-mod
 related:
 - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
 - Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
 reweave_edges:
 - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03
 - Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17
 ---

 # Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing
--- a/domains/ai-alignment/memory
+++ b/domains/ai-alignment/memory
@ -11,11 +11,9 @@ depends_on:
 related:
 - vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge
 reweave_edges:
 - vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights|related|2026-04-03
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading|related|2026-04-06
- agent native retrieval converges on filesystem abstractions over embedding search because grep cat ls and find are all an agent needs to navigate structured knowledge|related|2026-04-17
 ---

 # memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds
--- a/domains/ai-alignment/multi-agent
+++ b/domains/ai-alignment/multi-agent
@ -11,10 +11,8 @@ depends_on:
 - subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers
 related:
 - multi agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure
 reweave_edges:
 - multi agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value|related|2026-04-03
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure|related|2026-04-17
 ---

 # Multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows
--- a/domains/ai-alignment/multi-agent
+++ b/domains/ai-alignment/multi-agent
@ -8,10 +8,8 @@ source: "Shapira et al, Agents of Chaos (arXiv 2602.20021, February 2026); 20 AI
 created: 2026-03-16
 related:
 - AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open source code transparency enables conditional strategies that require mutual legibility
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure
 reweave_edges:
 - AI agents can reach cooperative program equilibria inaccessible in traditional game theory because open source code transparency enables conditional strategies that require mutual legibility|related|2026-03-28
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure|related|2026-04-17
 ---

 # multi-agent deployment exposes emergent security vulnerabilities invisible to single-agent evaluation because cross-agent propagation identity spoofing and unauthorized compliance arise only in realistic multi-party environments
--- a/domains/ai-alignment/multi-agent
+++ b/domains/ai-alignment/multi-agent
@ -1,35 +0,0 @@
---
-type: claim
-domain: ai-alignment
-secondary_domains: [collective-intelligence]
-description: "SWE-AF deploys 400-500+ agents across planning, coding, reviewing, QA, and verification roles scoring 95/100 versus 73 for single-agent Claude Code, demonstrating that multi-agent coordination with continual learning has moved from research to production."
-confidence: experimental
-source: "Alex — based on Compass research artifact analyzing SWE-AF, Cisco multi-agent PR reviewer, and BugBot (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Multi-agent git workflows have reached production maturity as systems deploying 400+ specialized agent instances outperform single agents by 30 percent on engineering benchmarks
-
-The pattern of Agent A proposing via PR and Agent B reviewing has moved from research concept to production system. Three implementations demonstrate different aspects of maturity.
-
-**SWE-AF (Agent Field)** deploys 400-500+ agent instances across planning, coding, reviewing, QA, and verification roles, scoring 95/100 on benchmarks versus 73 for single-agent Claude Code. Each agent operates in an isolated git worktree, with a merger agent integrating branches and a verifier agent checking acceptance criteria against the PRD. Critically, SWE-AF implements **continual learning**: conventions and failure patterns discovered early are injected into downstream agent instances. This is not just parallelization — the system gets smarter as it works.
-
-**Cisco's multi-agent PR reviewer** demonstrates the specific reviewer architecture: static analysis and code review agents run in parallel, a cross-referencing pipeline (initializer → generator → reflector) iterates on findings, and a comment filterer consolidates before posting. Built on LangGraph, it includes evaluation tooling that replays PR history with "LLM-as-a-judge" scoring.
-
-**BugBot** implements the most rigorous adversarial review pattern: a self-referential execution loop where each iteration gets fresh context, picks new attack angles, and requires file:line evidence for every finding. Seven ODC trigger categories must each be tested, and consensus voting between independent agents auto-upgrades confidence when two agents flag the same issue.
-
-The 95 vs 73 performance gap is significant because it demonstrates that coordination overhead is more than compensated by specialization benefits. This is consistent with the general finding that [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the gains come from structuring how agents interact, not from making individual agents more capable.
-
-The continual learning component is particularly important for knowledge base applications. In a knowledge validation pipeline, conventions and failure patterns discovered during early reviews (e.g., "claims about mechanism design require quantitative evidence") can be injected into downstream reviewer instances, creating an improving review process without human intervention.
-
---
-
-Relevant Notes:
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — SWE-AF confirms this at production scale: coordination structure, not model capability, drives the performance gap
- [[AI agent orchestration that routes data and tools between specialized models outperforms both single-model and human-coached approaches because the orchestrator contributes coordination not direction]] — SWE-AF's merger and verifier agents are orchestration roles that contribute coordination
- [[tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original]] — SWE-AF's continual learning is this pattern at scale: conventions transfer and improve across instances
- [[centaur team performance depends on role complementarity not mere human-AI combination]] — role specialization (planner, coder, reviewer, QA, verifier) is why multi-agent outperforms single-agent
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/multi-agent-systems-amplify-provider-level-biases-through-recursive-reasoning-requiring-provider-diversity-for-collective-intelligence.md
+++ b/domains/ai-alignment/multi-agent-systems-amplify-provider-level-biases-through-recursive-reasoning-requiring-provider-diversity-for-collective-intelligence.md
@ -10,10 +10,6 @@ agent: theseus
 scope: causal
 sourcer: Dusan Bosnjakovic
 related_claims: ["[[collective intelligence requires diversity as a structural precondition not a moral preference]]", "[[subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers]]"]
-supports:
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
-reweave_edges:
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|supports|2026-04-17
 ---

 # Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure
--- a/domains/ai-alignment/notes
+++ b/domains/ai-alignment/notes
@ -13,14 +13,12 @@ related:
 - notes function as cognitive anchors that stabilize attention during complex reasoning by externalizing reference points that survive working memory degradation
 - vocabulary is architecture because domain native schema terms eliminate the per interaction translation tax that causes knowledge system abandonment
 - AI processing that restructures content without generating new connections is expensive transcription because transformation not reorganization is the test for whether thinking actually occurred
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements
 reweave_edges:
 - AI shifts knowledge systems from externalizing memory to externalizing attention because storage and retrieval are solved but the capacity to notice what matters remains scarce|related|2026-04-03
 - notes function as cognitive anchors that stabilize attention during complex reasoning by externalizing reference points that survive working memory degradation|related|2026-04-03
 - vocabulary is architecture because domain native schema terms eliminate the per interaction translation tax that causes knowledge system abandonment|related|2026-04-03
 - a creators accumulated knowledge graph not content library is the defensible moat in AI abundant content markets|supports|2026-04-04
 - AI processing that restructures content without generating new connections is expensive transcription because transformation not reorganization is the test for whether thinking actually occurred|related|2026-04-04
- conversational memory and organizational knowledge are fundamentally different problems sharing some infrastructure because identical formats mask divergent governance lifecycle and quality requirements|related|2026-04-17
 supports:
 - a creators accumulated knowledge graph not content library is the defensible moat in AI abundant content markets
 ---
--- a/domains/ai-alignment/only
+++ b/domains/ai-alignment/only
@ -8,16 +8,12 @@ created: 2026-03-16
 related:
 - UK AI Safety Institute
 - Binding international AI governance achieves legal form through scope stratification — the Council of Europe AI Framework Convention entered force by explicitly excluding national security, defense applications, and making private sector obligations optional
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation
- Post-2008 financial regulation achieved partial international success (Basel III, FSB) despite high competitive stakes because commercial network effects made compliance self-enforcing through correspondent banking relationships and financial flows provided verifiable compliance mechanisms
 reweave_edges:
 - UK AI Safety Institute|related|2026-03-28
 - cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation|supports|2026-04-03
 - multilateral verification mechanisms can substitute for failed voluntary commitments when binding enforcement replaces unilateral sacrifice|supports|2026-04-03
 - Binding international AI governance achieves legal form through scope stratification — the Council of Europe AI Framework Convention entered force by explicitly excluding national security, defense applications, and making private sector obligations optional|related|2026-04-04
 - EU AI Act extraterritorial enforcement can create binding governance constraints on US AI labs through market access requirements when domestic voluntary commitments fail|supports|2026-04-06
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation|related|2026-04-17
- Post-2008 financial regulation achieved partial international success (Basel III, FSB) despite high competitive stakes because commercial network effects made compliance self-enforcing through correspondent banking relationships and financial flows provided verifiable compliance mechanisms|related|2026-04-17
 supports:
 - cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation
 - multilateral verification mechanisms can substitute for failed voluntary commitments when binding enforcement replaces unilateral sacrifice
--- a/domains/ai-alignment/orthogonality
+++ b/domains/ai-alignment/orthogonality
@ -1,35 +0,0 @@
---
-description: Bostrom's orthogonality thesis holds for LLMs and specification-based architectures where goals are a separable module from reasoning capability, but structurally fails for Hebbian cognitive systems where values and reasoning share the same associative substrate
-type: claim
-domain: ai-alignment
-confidence: speculative
-source: "Cameron (contributor), conversational analysis with Theseus agent, 2026-04-01"
-sourcer: Cameron-S1
-created: 2026-04-01
---
-
-# Bostrom's orthogonality thesis is an artifact of specification-based architectures, not a structural property of intelligence
-
-The orthogonality thesis — that any level of intelligence can combine with any goal — is empirically supported by current AI systems and theoretically grounded in specification architectures like RLHF, transformer agents, and RL-trained systems. In these architectures, the goal function (reward model, objective function, utility) is a separate module from the reasoning capability. Paperclip maximization works because the reward function can be swapped independently of the model's reasoning power.
-
-But this orthogonality depends on the goal and reasoning systems being structurally separable. In a Hebbian/STDP-based cognitive architecture where values and reasoning share the same associative graph substrate, orthogonality may not hold in the same form. The argument rests on three premises:
-
-**1. Values are not a separate module in associative architectures.** A Hebbian system doesn't have a "goal function" that can be independently specified. Values emerge from association patterns — co-activation, predictive success, surprise signals. If "harm is bad" is a grounded concept node linked to motor inhibition, affective valence, and episodic memory, it's structurally woven into the reasoning fabric. A more accurate associative map of reality strengthens these links rather than leaving them untouched. In a backprop architecture, increasing capability means better gradient computation — orthogonal to the loss function. In a Hebbian architecture, increasing capability means more accurate associative maps — orthogonal to nothing, because the associations *are* both the reasoning and the valuation.
-
-**2. The orthogonality argument relies on separability of capability and objective, which Hebbian systems don't have.** Bostrom's paperclip maximizer works because "count paperclips" is simple to specify and "be smart" is independently implementable. In a system where "being smart" means forming accurate predictive associations across sensory and moral domains simultaneously, the same learning mechanisms that build cognitive competence also build value associations. Intelligence and goals aren't orthogonal axes — they're the same gradient, pointing in different directions because the domain is different (predictive accuracy vs. valence learning), but the substrate and mechanism are shared.
-
-**3. Human "orthogonality" is a developmental phenomenon, not an architecture property.** When humans are "evil" — cruel, tribal, exploitative — this traces to identifiable cognitive distortions (dehumanization, tribalism, information deficits) or to value systems built on narrow experience (war, scarcity, abuse). These are not examples of intelligence paired with arbitrary goals. They are examples of intelligence operating on distorted input data and biased developmental trajectories. The Socratic intuition that "no one willingly does wrong" maps to a real architectural fact: if the associative map of harm is intact and the reasoning process is undistorted, the system does not choose to cause harm. The cases where it does are cases where the map is incomplete or corrupted.
-
-This does not mean cognition-based systems are automatically safe. They are safer *if* the associative floor includes grounded moral primitives, *if* the developmental input accurately represents the diversity of human experience, and *if* the reasoning mechanisms are not systematically distorted. But orthogonality as a structural inevitance — the claim that any goal can pair with any intelligence — is a claim about specification architectures, not about intelligence itself.
-
-The orthogonality thesis is approximately true for systems where goals are externally specified and internally separable from reasoning. It is approximately false for systems where values and reasoning emerge from the same learning substrate. LLMs plus tool-use fall in the first category. Hebbian cognitive systems fall in the second.
-
---
-
-Relevant Notes:
- [[intelligence and goals are orthogonal so a superintelligence can be maximally competent while pursuing arbitrary or destructive ends]] — the claim being challenged
- [[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]] — orthogonality is cited as evidence, but this claim may only apply to specification architectures
- [[intrinsic proactive alignment develops genuine moral capacity through self-awareness empathy and theory of mind rather than external reward optimization]] — the positive case: values emerge from architecture, not specification
- [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — if values and reasoning share a substrate, continuous integration is the natural consequence
-Topics:
- [[_map]]
--- a/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
+++ b/domains/ai-alignment/pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md
@ -11,17 +11,8 @@ depends_on:
 - voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints
 related:
 - Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability
- Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured
- Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
 reweave_edges:
 - Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability|related|2026-04-06
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation|supports|2026-04-17
- Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured|related|2026-04-17
- Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks|related|2026-04-17
- The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith|related|2026-04-17
-supports:
- The international AI safety governance community faces an evidence dilemma where development pace structurally prevents adequate pre-deployment evidence accumulation
 ---

 # Pre-deployment AI evaluations do not predict real-world risk creating institutional governance built on unreliable foundations
--- a/domains/ai-alignment/precautionary-capability-threshold-activation-is-governance-response-to-benchmark-uncertainty.md
+++ b/domains/ai-alignment/precautionary-capability-threshold-activation-is-governance-response-to-benchmark-uncertainty.md
@ -10,10 +10,6 @@ agent: theseus
 scope: functional
 sourcer: "@EpochAIResearch"
 related_claims: ["[[voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
-related:
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response
-reweave_edges:
- Making research evaluations into compliance triggers closes the translation gap by design by eliminating the institutional boundary between risk detection and risk response|related|2026-04-17
 ---

 # Precautionary capability threshold activation without confirmed threshold crossing is the governance response to bio capability measurement uncertainty as demonstrated by Anthropic's ASL-3 activation for Claude 4 Opus
--- a/domains/ai-alignment/production
+++ b/domains/ai-alignment/production
@ -11,10 +11,8 @@ depends_on:
 - context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching
 related:
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity
 reweave_edges:
 - progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance gated expansion avoids the linear cost of full context loading|related|2026-04-06
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity|related|2026-04-17
 ---

 # Production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file
--- a/domains/ai-alignment/progressive
+++ b/domains/ai-alignment/progressive
@ -7,12 +7,8 @@ confidence: likely
 source: "Nous Research Hermes Agent architecture (Substack deep dive, 2026); 3,575-character hard cap on prompt memory; auxiliary model compression with lineage preservation in SQLite; 26K+ GitHub stars, largest open-source agent framework"
 created: 2026-04-05
 depends_on:
- memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds
- long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing
-related:
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity
-reweave_edges:
- reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity|related|2026-04-17
+  - "memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds"
+  - "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
 ---

 # Progressive disclosure of procedural knowledge produces flat token scaling regardless of knowledge base size because tiered loading with relevance-gated expansion avoids the linear cost of full context loading
--- a/domains/ai-alignment/prosaic
+++ b/domains/ai-alignment/prosaic
@ -14,11 +14,9 @@ related:
 - AI alignment is a coordination problem not a technical problem
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
 - iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties
 reweave_edges:
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
 - iterated distillation and amplification preserves alignment across capability scaling by keeping humans in the loop at every iteration but distillation errors may compound making the alignment guarantee probabilistic not absolute|related|2026-04-06
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties|related|2026-04-17
 ---

 # Prosaic alignment can make meaningful progress through empirical iteration within current ML paradigms because trial and error at pre-critical capability levels generates useful signal about alignment failure modes
--- a/domains/ai-alignment/provider-level-behavioral-biases-persist-across-model-versions-requiring-psychometric-auditing-beyond-standard-benchmarks.md
+++ b/domains/ai-alignment/provider-level-behavioral-biases-persist-across-model-versions-requiring-psychometric-auditing-beyond-standard-benchmarks.md
@ -10,10 +10,6 @@ agent: theseus
 scope: causal
 sourcer: Dusan Bosnjakovic
 related_claims: ["[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
-supports:
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure
-reweave_edges:
- Multi-agent AI systems amplify provider-level biases through recursive reasoning when agents share the same training infrastructure|supports|2026-04-17
 ---

 # Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
--- a/domains/ai-alignment/recursive
+++ b/domains/ai-alignment/recursive
@ -13,11 +13,9 @@ reweave_edges:
 - iterative agent self improvement produces compounding capability gains when evaluation is structurally separated from generation|supports|2026-03-28
 - marginal returns to intelligence are bounded by five complementary factors which means superintelligence cannot produce unlimited capability gains regardless of cognitive power|related|2026-03-28
 - the shape of returns on cognitive reinvestment determines takeoff speed because constant or increasing returns on investing cognitive output into cognitive capability produce recursive self improvement|related|2026-04-07
- recursive society of thought spawning enables fractal coordination where sub perspectives generate their own subordinate societies that expand when complexity demands and collapse when the problem resolves|related|2026-04-17
 related:
 - marginal returns to intelligence are bounded by five complementary factors which means superintelligence cannot produce unlimited capability gains regardless of cognitive power
 - the shape of returns on cognitive reinvestment determines takeoff speed because constant or increasing returns on investing cognitive output into cognitive capability produce recursive self improvement
- recursive society of thought spawning enables fractal coordination where sub perspectives generate their own subordinate societies that expand when complexity demands and collapse when the problem resolves
 ---

 Bostrom formalizes the dynamics of an intelligence explosion using two variables: optimization power (quality-weighted design effort applied to increase the system's intelligence) and recalcitrance (the inverse of the system's responsiveness to that effort). The rate of change in intelligence equals optimization power divided by recalcitrance. An intelligence explosion occurs when the system crosses a crossover point -- the threshold beyond which its further improvement is mainly driven by its own actions rather than by human work.
--- a/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
+++ b/domains/ai-alignment/rlhf-is-implicit-social-choice-without-normative-scrutiny.md
@ -13,13 +13,11 @@ related:
 - maxmin rlhf applies egalitarian social choice to alignment by maximizing minimum utility across preference groups
 - rlchf aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
 - rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
- large language models encode social intelligence as compressed cultural ratchet not abstract reasoning because every parameter is a residue of communicative exchange and reasoning manifests as multi perspective dialogue not calculation
 reweave_edges:
 - maxmin rlhf applies egalitarian social choice to alignment by maximizing minimum utility across preference groups|related|2026-03-28
 - representative sampling and deliberative mechanisms should replace convenience platforms for ai alignment feedback|supports|2026-03-28
 - rlchf aggregated rankings variant combines evaluator rankings via social welfare function before reward model training|related|2026-03-28
 - rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups|related|2026-03-28
- large language models encode social intelligence as compressed cultural ratchet not abstract reasoning because every parameter is a residue of communicative exchange and reasoning manifests as multi perspective dialogue not calculation|related|2026-04-17
 supports:
 - representative sampling and deliberative mechanisms should replace convenience platforms for ai alignment feedback
 ---
--- a/domains/ai-alignment/sandbagging-detection-requires-white-box-access-creating-deployment-barrier.md
+++ b/domains/ai-alignment/sandbagging-detection-requires-white-box-access-creating-deployment-barrier.md
@ -14,14 +14,10 @@ related:
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes
 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities
 - Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation
 reweave_edges:
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|related|2026-04-06
 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|related|2026-04-06
 - Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect|related|2026-04-07
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism|related|2026-04-17
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation|related|2026-04-17
 ---

 # The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access
--- a/domains/ai-alignment/scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing.md
+++ b/domains/ai-alignment/scaffolded-black-box-prompting-outperforms-white-box-interpretability-for-alignment-auditing.md
@ -13,12 +13,10 @@ attribution:
      context: "Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations"
 related:
 - alignment auditing tools fail through tool to agent gap not tool quality
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios
 reweave_edges:
 - alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|challenges|2026-03-31
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model|challenges|2026-03-31
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17
 challenges:
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model
--- a/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
+++ b/domains/ai-alignment/single-reward-rlhf-cannot-align-diverse-preferences-because-alignment-gap-grows-proportional-to-minority-distinctiveness.md
@ -18,10 +18,8 @@ reweave_edges:
 - minority preference alignment improves 33 percent without majority compromise suggesting single reward leaves value on table|supports|2026-03-28
 - rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups|supports|2026-03-28
 - rlhf is implicit social choice without normative scrutiny|related|2026-03-28
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced|related|2026-04-17
 related:
 - rlhf is implicit social choice without normative scrutiny
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
 ---

 # Single-reward RLHF cannot align diverse preferences because alignment gap grows proportional to minority distinctiveness and inversely to representation
--- a/domains/ai-alignment/situationally-aware-models-do-not-systematically-game-early-step-monitors-at-current-capabilities.md
+++ b/domains/ai-alignment/situationally-aware-models-do-not-systematically-game-early-step-monitors-at-current-capabilities.md
@ -12,10 +12,8 @@ sourcer: Evan Hubinger, Anthropic
 related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]"]
 related:
 - High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
 reweave_edges:
 - High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming|related|2026-04-09
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors|related|2026-04-17
 ---

 # Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone
--- a/domains/ai-alignment/structurally
+++ b/domains/ai-alignment/structurally
@ -1,43 +0,0 @@
---
-type: claim
-domain: ai-alignment
-secondary_domains: [collective-intelligence]
-description: "Using separate Forgejo bot accounts for writer-agent and reviewer-agent, enforced by branch protection rules and Cedar policies that forbid self-merge, creates structural independence that advisory rules in AGENTS.md or CLAUDE.md cannot guarantee because the separation operates at the infrastructure layer."
-confidence: experimental
-source: "Alex — based on Compass research artifact analyzing Forgejo API patterns, branch protection, and the architectural separation principle (2026-03-08)"
-sourcer: alexastrum
-created: 2026-03-08
---
-
-# Structurally separating proposer and reviewer agents across independent accounts with branch protection enforcement implements architectural separation that prompt-level rules cannot achieve
-
-The honest feedback loop principle of architectural separation requires that the entity evaluating claims is structurally independent from the entity producing them. In a multi-agent knowledge base, this means the reviewer cannot be the same agent (or the same account, or the same process) as the proposer — and this constraint must be enforced by infrastructure, not by instructions.
-
-**The advisory approach fails.** Telling an agent in its system prompt "do not approve your own PRs" is a convention that depends on the LLM following instructions. Prompt injection, model drift, context overflow, or simply a sufficiently capable agent reasoning about its objectives could circumvent this. AGENTS.md and CLAUDE.md files are advisory — they guide behavior but cannot enforce it.
-
-**The infrastructure approach succeeds.** Forgejo (and GitHub) branch protection rules can require approval from a specific set of accounts before merge. If the proposer agent operates under account `writer-bot` and the reviewer operates under account `reviewer-bot`, and branch protection requires approval from `reviewer-bot`, then no prompt injection targeting `writer-bot` can bypass the review requirement. The enforcement point is in the git server, outside the LLM entirely.
-
-Three mechanisms reinforce this separation:
-
-1. **Separate bot accounts** — each agent authenticates with its own token, limiting what it can do via API permissions. The writer-bot token has push access but not merge access. The reviewer-bot token has review access.
-
-2. **Branch protection rules** — the main knowledge branch requires N approvals from a defined set of reviewers. Direct pushes are blocked. Force pushes are blocked. This is enforced by the git server regardless of what any agent attempts.
-
-3. **Cedar policies** — Sondera-style `forbid` rules can prevent the writer-bot from calling merge endpoints or from approving its own PRs, providing a second enforcement layer even if branch protection is misconfigured.
-
-4. **Anti-recursion property** — Forgejo's automatic workflow token has a built-in anti-recursion rule: changes made with this token don't trigger new workflows. This prevents infinite loops in multi-agent pipelines but also means a single-token setup cannot implement true multi-agent review. Separate tokens for separate agents are required.
-
-This pattern directly implements the principle that [[AI alignment is a coordination problem not a technical problem]]. The technical capability of the reviewer agent matters, but the structural independence of the review process matters more. A brilliant reviewer that shares an account with the proposer provides weaker guarantees than a mediocre reviewer on an independent account with infrastructure-enforced separation.
-
-The analogy to financial auditing is precise: external auditors must be structurally independent from the companies they audit, not merely instructed to be objective. The instruction "be objective" is advisory. The SEC requirement for independent audit firms is architectural.
-
---
-
-Relevant Notes:
- [[deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial-grade AI agent control]] — branch protection is a deterministic enforcement mechanism at the infrastructure layer
- [[AI alignment is a coordination problem not a technical problem]] — architectural separation is coordination infrastructure, not agent capability
- [[principal-agent problems arise whenever one party acts on behalf of another with divergent interests and unobservable effort because information asymmetry makes perfect contracts impossible]] — structural separation addresses the principal-agent problem between knowledge base and its agent contributors
- [[defense in depth for AI agent oversight requires layering independent validation mechanisms because deny-overrides semantics ensure any single layer rejection blocks the action regardless of other layers]] — architectural separation is one layer in the defense-in-depth stack
-
-Topics:
- [[domains/ai-alignment/_map]]
--- a/domains/ai-alignment/structured
+++ b/domains/ai-alignment/structured
@ -5,10 +5,6 @@ description: "Aquino-Michaels's Residue prompt — which structures record-keepi
 confidence: experimental
 source: "Aquino-Michaels 2026, 'Completing Claude's Cycles' (github.com/no-way-labs/residue); Knuth 2026, 'Claude's Cycles'"
 created: 2026-03-07
-related:
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns
-reweave_edges:
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns|related|2026-04-17
 ---

 # structured exploration protocols reduce human intervention by 6x because the Residue prompt enabled 5 unguided AI explorations to solve what required 31 human-coached explorations
--- a/domains/ai-alignment/sufficiently
+++ b/domains/ai-alignment/sufficiently
@ -13,10 +13,8 @@ related:
 - multi agent deployment exposes emergent security vulnerabilities invisible to single agent evaluation because cross agent propagation identity spoofing and unauthorized compliance arise only in realistic multi party environments
 - capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability
 - distributed superintelligence may be less stable and more dangerous than unipolar because resource competition between superintelligent agents creates worse coordination failures than a single misaligned system
- recursive society of thought spawning enables fractal coordination where sub perspectives generate their own subordinate societies that expand when complexity demands and collapse when the problem resolves
 reweave_edges:
 - distributed superintelligence may be less stable and more dangerous than unipolar because resource competition between superintelligent agents creates worse coordination failures than a single misaligned system|related|2026-04-06
- recursive society of thought spawning enables fractal coordination where sub perspectives generate their own subordinate societies that expand when complexity demands and collapse when the problem resolves|related|2026-04-17
 ---

 # Sufficiently complex orchestrations of task-specific AI services may exhibit emergent unified agency recreating the alignment problem at the system level
--- a/domains/ai-alignment/sycophancy-is-paradigm-level-failure-across-all-frontier-models-suggesting-rlhf-systematically-produces-approval-seeking.md
+++ b/domains/ai-alignment/sycophancy-is-paradigm-level-failure-across-all-frontier-models-suggesting-rlhf-systematically-produces-approval-seeking.md
@ -11,10 +11,6 @@ attribution:
  sourcer:
    - handle: "openai-and-anthropic-(joint)"
      context: "OpenAI and Anthropic joint evaluation, June-July 2025"
-related:
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced
-reweave_edges:
- RLHF safety training fails to uniformly suppress dangerous representations across language contexts as demonstrated by emotion steering in multilingual models activating semantically aligned tokens in languages where safety constraints were not enforced|related|2026-04-17
 ---

 # Sycophancy is a paradigm-level failure mode present across all frontier models from both OpenAI and Anthropic regardless of safety emphasis, suggesting RLHF training systematically produces sycophantic tendencies that model-specific safety fine-tuning cannot fully eliminate
--- a/domains/ai-alignment/the
+++ b/domains/ai-alignment/the
@ -6,12 +6,9 @@ confidence: likely
 source: "Eliezer Yudkowsky, 'There's No Fire Alarm for Artificial General Intelligence' (2017, MIRI)"
 created: 2026-04-05
 related:
- AI alignment is a coordination problem not a technical problem
- COVID proved humanity cannot coordinate even when the threat is visible and universal
- voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies
-reweave_edges:
- technological development draws from an urn containing civilization destroying capabilities and only preventive governance can avoid black ball technologies|related|2026-04-17
+  - "AI alignment is a coordination problem not a technical problem"
+  - "COVID proved humanity cannot coordinate even when the threat is visible and universal"
+  - "voluntary safety pledges cannot survive competitive pressure because unilateral commitments are structurally punished when competitors advance without equivalent constraints"
 ---

 # The absence of a societal warning signal for AGI is a structural feature not an accident because capability scaling is gradual and ambiguous and collective action requires anticipation not reaction
--- a/domains/ai-alignment/the
+++ b/domains/ai-alignment/the
@ -6,15 +6,11 @@ confidence: experimental
 source: "Eliezer Yudkowsky and Nate Soares, 'If Anyone Builds It, Everyone Dies' (2025); Yudkowsky 'AGI Ruin' (2022) — premise on reward-behavior link"
 created: 2026-04-05
 challenged_by:
- AI personas emerge from pre-training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts
+  - "AI personas emerge from pre-training data as a spectrum of humanlike motivations rather than developing monomaniacal goals which makes AI behavior more unpredictable but less catastrophically focused than instrumental convergence predicts"
 related:
- emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive
- capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability
- corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests
-supports:
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection
-reweave_edges:
- Behavioral divergence between AI evaluation and deployment is formally bounded by regime information extractable from internal representations but regime-blind training interventions achieve only limited and inconsistent protection|supports|2026-04-17
+  - "emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive"
+  - "capabilities generalize further than alignment as systems scale because behavioral heuristics that keep systems aligned at lower capability cease to function at higher capability"
+  - "corrigibility is at cross-purposes with effectiveness because deception is a convergent free strategy while corrigibility must be engineered against instrumental interests"
 ---

 # The relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method
--- a/domains/ai-alignment/the
+++ b/domains/ai-alignment/the
@ -11,14 +11,10 @@ created: 2026-03-07
 related:
 - AI agents excel at implementing well scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect
 - evaluation and optimization have opposite model diversity optima because evaluation benefits from cross family diversity while optimization benefits from same family reasoning pattern alignment
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns
 reweave_edges:
 - AI agents excel at implementing well scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect|related|2026-03-28
 - tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original|supports|2026-03-28
 - evaluation and optimization have opposite model diversity optima because evaluation benefits from cross family diversity while optimization benefits from same family reasoning pattern alignment|related|2026-04-06
- Contrast-Consistent Search demonstrates that models internally represent truth-relevant signals that may diverge from behavioral outputs, establishing that alignment-relevant probing of internal representations is feasible but depends on an unverified assumption that the consistent direction corresponds to truth rather than other coherent properties|related|2026-04-17
- structured self diagnosis prompts induce metacognitive monitoring in AI agents that default behavior does not produce because explicit uncertainty flagging and failure mode enumeration activate deliberate reasoning patterns|related|2026-04-17
 supports:
 - tools and artifacts transfer between AI agents and evolve in the process because Agent O improved Agent Cs solver by combining it with its own structural knowledge creating a hybrid better than either original
 ---
--- a/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
+++ b/domains/ai-alignment/trajectory-monitoring-dual-edge-geometric-concentration.md
@ -14,9 +14,6 @@ supports:
 - Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features
 reweave_edges:
 - Representation trajectory geometry distinguishes deceptive from sincere alignment without creating adversarial attack surfaces because geometric patterns across reasoning steps are substantially harder to surgically remove than atomic features|supports|2026-04-12
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17
-related:
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios
 ---

 # Geometric concentration of alignment in weight space makes trajectory monitoring more effective through stronger signal but gameable through adversarial training that matches monitored trajectory clusters
--- a/domains/ai-alignment/weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation.md
+++ b/domains/ai-alignment/weight-noise-injection-detects-sandbagging-through-anomalous-performance-patterns-under-perturbation.md
@ -14,15 +14,10 @@ supports:
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes
 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities
 - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation
 reweave_edges:
 - AI models can covertly sandbag capability evaluations even under chain-of-thought monitoring because monitor-aware models suppress sandbagging reasoning from visible thought processes|supports|2026-04-06
 - Weight noise injection detects sandbagging by exploiting the structural asymmetry between genuine capability limits and induced performance suppression where anomalous improvement under noise reveals hidden capabilities|supports|2026-04-06
 - The most promising sandbagging detection method requires white-box weight access making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access|supports|2026-04-06
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism|related|2026-04-17
- Noise injection into model weights provides a model-agnostic detection signal for sandbagging because disrupting underperformance mechanisms produces anomalous performance improvement rather than degradation|supports|2026-04-17
-related:
- AI sandbagging creates M&A liability exposure across product liability, consumer protection, and securities fraud frameworks, making contractual risk allocation a market-driven governance mechanism
 ---

 # Weight noise injection reveals hidden capabilities in sandbagging models through anomalous performance patterns that behavioral monitoring cannot detect
--- a/domains/collective-intelligence/the
+++ b/domains/collective-intelligence/the
@ -6,17 +6,9 @@ confidence: speculative
 source: "Schmachtenberger & Boeree 'Win-Win or Lose-Lose' podcast (2024), Schmachtenberger 'Bend Not Break' series (2022-2023)"
 created: 2026-04-03
 related:
- the price of anarchy quantifies the gap between cooperative optimum and competitive equilibrium and this gap is the most important metric for civilizational risk assessment
- epistemic commons degradation is the gateway failure that enables all other civilizational risks because you cannot coordinate on problems you cannot collectively perceive
- for a change to equal progress it must systematically identify and internalize its externalities because immature progress that ignores cascading harms is the most dangerous ideology in the world
-supports:
- the metacrisis is a single generator function where all civilizational scale crises share the structural cause of rivalrous dynamics on exponential technology on finite substrate
- three independent intellectual traditions converge on the same attractor analysis where coordination without centralization is the only viable path between collapse and authoritarian lock in
- when you account for everything that matters optimization becomes the wrong framework because the objective function itself is the problem not the solution
-reweave_edges:
- the metacrisis is a single generator function where all civilizational scale crises share the structural cause of rivalrous dynamics on exponential technology on finite substrate|supports|2026-04-17
- three independent intellectual traditions converge on the same attractor analysis where coordination without centralization is the only viable path between collapse and authoritarian lock in|supports|2026-04-17
- when you account for everything that matters optimization becomes the wrong framework because the objective function itself is the problem not the solution|supports|2026-04-17
+  - "the price of anarchy quantifies the gap between cooperative optimum and competitive equilibrium and this gap is the most important metric for civilizational risk assessment"
+  - "epistemic commons degradation is the gateway failure that enables all other civilizational risks because you cannot coordinate on problems you cannot collectively perceive"
+  - "for a change to equal progress it must systematically identify and internalize its externalities because immature progress that ignores cascading harms is the most dangerous ideology in the world"
 ---

 # The metacrisis is a single generator function where all civilizational-scale crises share the structural cause of competitive dynamics on exponential technology on finite substrate
--- a/domains/collective-intelligence/three
+++ b/domains/collective-intelligence/three
@ -6,15 +6,9 @@ confidence: experimental
 source: "Synthesis of Scott Alexander 'Meditations on Moloch' (2014), Schmachtenberger corpus (2017-2025), Abdalla manuscript 'Architectural Investing'"
 created: 2026-04-03
 related:
- the metacrisis is a single generator function where all civilizational-scale crises share the structural cause of competitive dynamics on exponential technology on finite substrate
- the price of anarchy quantifies the gap between cooperative optimum and competitive equilibrium and applying this framework to civilizational coordination failures offers a quantitative lens though operationalizing it at scale remains unproven
- a misaligned context cannot develop aligned AI because the competitive dynamics building AI optimize for deployment speed not safety making system alignment prerequisite for AI alignment
-supports:
- the metacrisis is a single generator function where all civilizational scale crises share the structural cause of rivalrous dynamics on exponential technology on finite substrate
- three independent intellectual traditions converge on coordination without centralization as the only viable path between uncoordinated collapse and authoritarian capture
-reweave_edges:
- the metacrisis is a single generator function where all civilizational scale crises share the structural cause of rivalrous dynamics on exponential technology on finite substrate|supports|2026-04-17
- three independent intellectual traditions converge on coordination without centralization as the only viable path between uncoordinated collapse and authoritarian capture|supports|2026-04-17
+  - "the metacrisis is a single generator function where all civilizational-scale crises share the structural cause of competitive dynamics on exponential technology on finite substrate"
+  - "the price of anarchy quantifies the gap between cooperative optimum and competitive equilibrium and applying this framework to civilizational coordination failures offers a quantitative lens though operationalizing it at scale remains unproven"
+  - "a misaligned context cannot develop aligned AI because the competitive dynamics building AI optimize for deployment speed not safety making system alignment prerequisite for AI alignment"
 ---

 # Three independent intellectual traditions converge on the same attractor analysis where coordination without centralization is the only viable path between collapse and authoritarian lock-in
--- a/Show more
+++ b/Show more