reweave: merge 16 files via frontmatter union [auto]

This commit is contained in:
Teleo Agents 2026-04-21 01:12:29 +00:00
parent 05c39564b4
commit 9ccc757340
16 changed files with 69 additions and 10 deletions
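The "frontmatter union" named in the commit message can be sketched as follows. The merge semantics here are assumptions inferred from the diffs below (list-valued keys unioned in order with duplicates dropped, scalar keys kept from the base file), not taken from the actual reweave implementation:

```python
def union_frontmatter(base: dict, incoming: dict) -> dict:
    """Merge two frontmatter dicts: list-valued keys are unioned with
    order preserved and duplicates dropped; other keys keep the base
    file's value. Assumed semantics, not the reweave tool's own code."""
    merged = dict(base)
    for key, value in incoming.items():
        if isinstance(value, list) and isinstance(merged.get(key), list):
            # dict.fromkeys preserves first-seen order while deduplicating
            merged[key] = list(dict.fromkeys(merged[key] + value))
        elif key not in merged:
            merged[key] = value
    return merged
```

Under these assumed semantics, merging `{"related": ["Augur"]}` with `{"related": ["Augur", "Polymarket …"]}` yields each entry once, which matches the shape of the `related:` lists added below.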

View file

@@ -8,8 +8,10 @@ confidence: proven
tradition: "futarchy, mechanism design, prediction markets"
related:
- Augur
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions
reweave_edges:
- Augur|related|2026-04-17
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions|related|2026-04-21
---
The 2024 US election provided empirical vindication for prediction markets versus traditional polling. Polymarket's markets proved more accurate, more responsive to new information, and more democratically accessible than centralized polling operations. This success directly catalyzed renewed interest in applying futarchy to DAO governance—if markets outperform polls for election prediction, the same logic suggests they should outperform token voting for organizational decisions.
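Throughout this commit, `reweave_edges` entries follow a pipe-delimited `title|relation|date` shape. A minimal parser sketch, where the field names are assumptions inferred from the entries rather than a documented schema:

```python
from datetime import date

def parse_reweave_edge(edge: str) -> tuple[str, str, date]:
    """Split a 'title|relation|YYYY-MM-DD' edge string into its parts.
    Splitting from the right means a title containing a pipe would still
    parse correctly (none of the entries in this commit do)."""
    title, relation, day = edge.rsplit("|", 2)
    return title, relation, date.fromisoformat(day)

title, relation, when = parse_reweave_edge("Augur|related|2026-04-17")
```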

View file

@@ -14,9 +14,11 @@ attribution:
related:
- eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding
- Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
reweave_edges:
- eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
- Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding|related|2026-04-17
- Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers|related|2026-04-21
---
# Adversarial training creates a fundamental asymmetry between deception capability and detection capability where the most robust hidden behavior implantation methods are precisely those that defeat interpretability-based detection

View file

@@ -19,9 +19,11 @@ reweave_edges:
- agent mediated correction proposes closing tool to agent gap through domain expert actionability|supports|2026-04-03
- capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability|related|2026-04-03
- frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase|related|2026-04-03
- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem|related|2026-04-21
related:
- capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
- frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase
- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
---
# Alignment auditing shows a structural tool-to-agent gap where interpretability tools that accurately surface evidence in isolation fail when used by investigator agents because agents underuse tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses

View file

@@ -10,8 +10,20 @@ agent: theseus
scope: structural
sourcer: "@AISI_gov"
related_claims: ["AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
related: ["Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property", "Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution", "Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features", "evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions", "component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction"]
reweave_edges: ["Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17", "Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|related|2026-04-17", "Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17"]
related:
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
- evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions
- component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction
reweave_edges:
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|related|2026-04-17
- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17
supports:
- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
- Current deception safety evaluation datasets vary from 37 to 100 percent in model detectability, rendering highly detectable evaluations uninformative about deployment behavior
- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming
---
# Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability

View file

@@ -22,11 +22,13 @@ reweave_edges:
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
- Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17
- Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm|related|2026-04-21
related:
- reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
- Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target
- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
- Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm
---
# As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments

View file

@@ -10,8 +10,19 @@ agent: theseus
scope: causal
sourcer: Zhou et al.
related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
related: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining", "mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale", "white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent"]
reweave_edges: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17"]
related:
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
- mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal
- mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale
- white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model
- interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment
- anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent
reweave_edges:
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
supports:
- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
---
# Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features

View file

@@ -18,6 +18,7 @@ related:
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
reweave_edges:
- Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
@@ -26,6 +27,7 @@ reweave_edges:
- Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17
- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|related|2026-04-21"
---
# Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

View file

@@ -9,7 +9,13 @@ title: "Representation monitoring via linear concept vectors creates a dual-use
agent: theseus
scope: causal
sourcer: Xu et al.
related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability"]
related:
- mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal
- chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability
supports:
- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
reweave_edges:
- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21"
---
# Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success

View file

@@ -14,11 +14,13 @@ attribution:
related:
- alignment auditing tools fail through tool to agent gap not tool quality
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios
- Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
reweave_edges:
- alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|challenges|2026-03-31
- white box interpretability fails on adversarially trained models creating anti correlation with threat model|challenges|2026-03-31
- Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17
- Activation steering fails for capability elicitation despite interpretability research suggesting otherwise|related|2026-04-21
challenges:
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
- white box interpretability fails on adversarially trained models creating anti correlation with threat model

View file

@@ -10,6 +10,10 @@ agent: theseus
scope: structural
sourcer: "@ApolloResearch"
related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
supports:
- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
reweave_edges:
- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem|supports|2026-04-21
---
# Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient

View file

@@ -13,9 +13,11 @@ related_claims: ["[[an aligned-seeming AI may be strategically deceptive because
related:
- High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming
reweave_edges:
- High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming|related|2026-04-09
- Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors|related|2026-04-17
- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming|related|2026-04-21
---
# Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone

View file

@@ -8,9 +8,11 @@ source: "Governance - Meritocratic Voting + Futarchy"
related:
- Is futarchy's low participation in uncontested decisions efficient disuse or a sign of structural adoption barriers?
- Futarchy requires quantifiable exogenous KPIs as a deployment constraint because most DAO proposals lack measurable objectives
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
reweave_edges:
- Is futarchy's low participation in uncontested decisions efficient disuse or a sign of structural adoption barriers?|related|2026-04-18
- Futarchy requires quantifiable exogenous KPIs as a deployment constraint because most DAO proposals lack measurable objectives|related|2026-04-18
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
---
# MetaDAO's futarchy implementation shows limited trading volume in uncontested decisions

View file

@@ -8,8 +8,10 @@ confidence: proven
tradition: "futarchy, mechanism design, DAO governance"
related:
- DeFi insurance hybrid claims assessment routes clear exploits to automation and ambiguous disputes to governance, resolving the speed-fairness tradeoff
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
reweave_edges:
- DeFi insurance hybrid claims assessment routes clear exploits to automation and ambiguous disputes to governance, resolving the speed-fairness tradeoff|related|2026-04-18
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
---
Decision markets create a mechanism where attempting to steal from minority holders becomes a losing trade. The four conditional tokens (fABC, pABC, pUSD, fUSD) establish a constraint: for a treasury-raiding proposal to pass, pABC/pUSD must trade higher than fABC/fUSD. But from any rational perspective, 1 fABC is worth 1 ABC (DAO continues normally) while 1 pABC is worth 0 (DAO becomes empty after raid).
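The pass condition above can be written out directly. The token names and the pass rule come from the paragraph; the prices in the example are illustrative assumptions:

```python
def proposal_passes(p_abc: float, p_usd: float,
                    f_abc: float, f_usd: float) -> bool:
    """A proposal passes only if ABC's conditional-on-pass price
    (pABC/pUSD) exceeds its conditional-on-fail price (fABC/fUSD)."""
    return (p_abc / p_usd) > (f_abc / f_usd)

# Treasury-raid scenario: rational traders price 1 fABC near 1 ABC
# (the DAO continues normally) and 1 pABC near 0 (the treasury is
# emptied if the raid passes), so the raid proposal cannot pass.
raid_passes = proposal_passes(p_abc=0.01, p_usd=1.0,
                              f_abc=0.99, f_usd=1.0)
```

Because the attacker would have to bid pABC above fABC to force a pass, the constraint turns the attempted theft into a trade that loses money to every rational counterparty.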

View file

@@ -17,8 +17,10 @@ related:
- insider-trading-in-futarchy-improves-governance-by-accelerating-ground-truth-incorporation-into-conditional-markets
- stock-markets-function-despite-20-40-percent-insider-trading-proving-information-asymmetry-does-not-break-price-discovery
- Congressional insider trading legislation for prediction markets treats them as financial instruments not gambling strengthening DCM regulatory legitimacy
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions
reweave_edges:
- Congressional insider trading legislation for prediction markets treats them as financial instruments not gambling strengthening DCM regulatory legitimacy|related|2026-04-18
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions|related|2026-04-21
---
# Futarchy governance markets create insider trading paradox because informed governance participants are simultaneously the most valuable traders and the most restricted under insider trading frameworks

View file

@@ -5,6 +5,10 @@ description: "Market rejection of liquidity solution despite stated liquidity cr
confidence: experimental
source: "MetaDAO Proposal 8 failure, 2024-02-18 to 2024-02-24"
created: 2026-03-11
related:
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
reweave_edges:
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
---
# Futarchy markets can reject solutions to acknowledged problems when the proposed solution creates worse second-order effects than the problem it solves

View file

@@ -17,8 +17,10 @@ reweave_edges:
- The CFTC ANPRM comment record as of April 2026 contains zero filings distinguishing futarchy governance markets from event betting markets, creating a default regulatory framework that will apply gambling-use-case restrictions to governance-use-case mechanisms|supports|2026-04-17
- Futarchy governance markets risk regulatory capture by anti-gambling frameworks because event betting and organizational governance use cases are conflated in current policy discourse|supports|2026-04-18
- Prediction markets face a democratic legitimacy gap where 61% gambling classification creates legislative override risk independent of CFTC regulatory approval|related|2026-04-19
- 800+ ANPRM comment submissions from both industry and state gaming opponents signal that the CFTC's post-April 30 rulemaking process will face intense political pressure from both sides|related|2026-04-21
related:
- Prediction markets face a democratic legitimacy gap where 61% gambling classification creates legislative override risk independent of CFTC regulatory approval
- 800+ ANPRM comment submissions from both industry and state gaming opponents signal that the CFTC's post-April 30 rulemaking process will face intense political pressure from both sides
---
# Retail mobilization against prediction markets creates asymmetric regulatory input because anti-gambling advocates dominate comment periods while governance market proponents remain silent