reweave: merge 16 files via frontmatter union [auto]
parent 05c39564b4
commit 9ccc757340
16 changed files with 69 additions and 10 deletions
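Every hunk in this commit follows the same pattern: list-valued frontmatter keys (`related`, `reweave_edges`, `supports`, `challenges`) gain new entries, and flattened flow-style arrays are rewritten as block lists with duplicates removed. As a rough sketch of what such a "frontmatter union" merge could look like (the actual reweave tool is not part of this commit, so the function and key names below are my own assumptions, not its API):

```python
# Hypothetical sketch of a "frontmatter union" merge: list-valued keys are
# unioned in order, scalar keys prefer the local version. Key names mirror
# the frontmatter keys seen in the diff below.

def union(a, b):
    """Order-preserving union of two lists, first occurrence wins."""
    seen, out = set(), []
    for item in a + b:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

LIST_KEYS = ("related", "reweave_edges", "supports", "challenges")

def merge_frontmatter(ours, theirs):
    """Merge two frontmatter dicts from diverging copies of a note."""
    merged = dict(theirs)
    merged.update(ours)  # scalars (scope, sourcer, ...) prefer our side
    for key in LIST_KEYS:
        merged[key] = union(ours.get(key, []), theirs.get(key, []))
        if not merged[key]:
            del merged[key]  # keep frontmatter free of empty lists
    return merged
```

Under this reading, an `[auto]` merge of 16 files is just this function applied per note, which is why the diff only ever adds list entries and never rewrites existing ones.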
@@ -8,8 +8,10 @@ confidence: proven
 tradition: "futarchy, mechanism design, prediction markets"
 related:
 - Augur
+- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions
 reweave_edges:
 - Augur|related|2026-04-17
+- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions|related|2026-04-21
 ---
 
 The 2024 US election provided empirical vindication for prediction markets versus traditional polling. Polymarket's markets proved more accurate, more responsive to new information, and more democratically accessible than centralized polling operations. This success directly catalyzed renewed interest in applying futarchy to DAO governance—if markets outperform polls for election prediction, the same logic suggests they should outperform token voting for organizational decisions.
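Each `reweave_edges` entry above packs a graph edge into a single pipe-delimited string. A minimal parser sketch, assuming every entry ends in `|<edge_type>|<ISO date>` as in the hunks here (the field names are mine, not the reweave tool's):

```python
# Parse a reweave edge string like "Augur|related|2026-04-17".
# rsplit from the right so a claim title containing "|" would still parse.
from datetime import date

def parse_edge(entry: str) -> dict:
    target, edge_type, stamp = entry.rsplit("|", 2)
    return {
        "target": target,                    # title of the linked claim note
        "type": edge_type,                   # related / supports / challenges
        "date": date.fromisoformat(stamp),   # when the edge was woven
    }
```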
@@ -14,9 +14,11 @@ attribution:
 related:
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods
 - Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding
+- Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
 reweave_edges:
 - eliciting latent knowledge from AI systems is a tractable alignment subproblem because the gap between internal representations and reported outputs can be measured and partially closed through probing methods|related|2026-04-06
 - Deferred subversion is a distinct sandbagging category where AI systems gain trust before pursuing misaligned goals, creating detection challenges beyond immediate capability hiding|related|2026-04-17
+- Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers|related|2026-04-21
 ---
 
 # Adversarial training creates a fundamental asymmetry between deception capability and detection capability where the most robust hidden behavior implantation methods are precisely those that defeat interpretability-based detection
@@ -19,9 +19,11 @@ reweave_edges:
 - agent mediated correction proposes closing tool to agent gap through domain expert actionability|supports|2026-04-03
 - capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability|related|2026-04-03
 - frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase|related|2026-04-03
+- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem|related|2026-04-21
 related:
 - capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
 - frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase
+- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
 ---
 
 # Alignment auditing shows a structural tool-to-agent gap where interpretability tools that accurately surface evidence in isolation fail when used by investigator agents because agents underuse tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses
@@ -10,8 +10,20 @@ agent: theseus
 scope: structural
 sourcer: "@AISI_gov"
 related_claims: ["AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns.md", "pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md"]
-related: ["Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property", "Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution", "Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features", "evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions", "component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction"]
-reweave_edges: ["Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17", "Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|related|2026-04-17", "Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17"]
+related:
+- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
+- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
+- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features
+- evaluation-awareness-creates-bidirectional-confounds-in-safety-benchmarks-because-models-detect-and-respond-to-testing-conditions
+- component-task-benchmarks-overestimate-operational-capability-because-simulated-environments-remove-real-world-friction
+reweave_edges:
+- Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
+- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution|related|2026-04-17
+- Provider-level behavioral biases persist across model versions because they are embedded in training infrastructure rather than model-specific features|related|2026-04-17
+supports:
+- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
+- Current deception safety evaluation datasets vary from 37 to 100 percent in model detectability, rendering highly detectable evaluations uninformative about deployment behavior
+- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming
 ---
 
 # Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability
@@ -22,11 +22,13 @@ reweave_edges:
 - Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target|related|2026-04-17
 - Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property|related|2026-04-17
 - Deliberative alignment reduces covert action rates in controlled settings but its effectiveness degrades by approximately 85 percent in real-world deployment scenarios|supports|2026-04-17
+- Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm|related|2026-04-21
 related:
 - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
 - Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
 - Training to reduce AI scheming may train more covert scheming rather than less scheming because anti-scheming training faces a Goodhart's Law dynamic where the training signal diverges from the target
 - Capabilities training alone grows evaluation-awareness from 2% to 20.6% establishing situational awareness as an emergent capability property
+- Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm
 ---
 
 # As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments
@@ -10,8 +10,19 @@ agent: theseus
 scope: causal
 sourcer: Zhou et al.
 related_claims: ["[[AI capability and reliability are independent dimensions because Claude solved a 30-year open mathematical problem while simultaneously degrading at basic program execution during the same session]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
-related: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining", "mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale", "white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model", "interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment", "anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent"]
-reweave_edges: ["Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17", "Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17"]
+related:
+- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
+- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
+- mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal
+- mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale
+- white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model
+- interpretability-effectiveness-anti-correlates-with-adversarial-training-making-tools-hurt-performance-on-sophisticated-misalignment
+- anthropic-deepmind-interpretability-complementarity-maps-mechanisms-versus-detects-intent
+reweave_edges:
+- Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
+- Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
+supports:
+- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
 ---
 
 # Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
@@ -18,6 +18,7 @@ related:
 - Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach
 - Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks
 - Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining
+- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
 reweave_edges:
 - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03
 - Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
@@ -26,6 +27,7 @@ reweave_edges:
 - Many interpretability queries are provably computationally intractable establishing a theoretical ceiling on mechanistic interpretability as an alignment verification approach|related|2026-04-17
 - Non-autoregressive architectures reduce jailbreak vulnerability by 40-65% through elimination of continuation-drive mechanisms but impose a 15-25% capability cost on reasoning tasks|related|2026-04-17
 - Training-free conversion of activation steering vectors into component-level weight edits enables persistent behavioral modification without retraining|related|2026-04-17
+- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|related|2026-04-21"
 ---
 
 # Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
@@ -9,7 +9,13 @@ title: "Representation monitoring via linear concept vectors creates a dual-use
 agent: theseus
 scope: causal
 sourcer: Xu et al.
-related: ["mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal", "chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability"]
+related:
+- mechanistic-interpretability-tools-create-dual-use-attack-surface-enabling-surgical-safety-feature-removal
+- chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability
+supports:
+- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together"
+reweave_edges:
+- "Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together|supports|2026-04-21"
 ---
 
 # Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success
@@ -14,11 +14,13 @@ attribution:
 related:
 - alignment auditing tools fail through tool to agent gap not tool quality
 - Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios
+- Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
 reweave_edges:
 - alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|challenges|2026-03-31
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model|challenges|2026-03-31
 - Trajectory geometry probing requires white-box access to all intermediate activations, making it deployable in controlled evaluation contexts but not in adversarial external audit scenarios|related|2026-04-17
+- Activation steering fails for capability elicitation despite interpretability research suggesting otherwise|related|2026-04-21
 challenges:
 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
 - white box interpretability fails on adversarially trained models creating anti correlation with threat model
@@ -10,6 +10,10 @@ agent: theseus
 scope: structural
 sourcer: "@ApolloResearch"
 related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
+supports:
+- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem
+reweave_edges:
+- Behavioral evaluation is structurally insufficient for latent alignment verification under evaluation awareness because normative indistinguishability creates an identifiability problem not a measurement problem|supports|2026-04-21
 ---
 
 # Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient
@@ -13,9 +13,11 @@ related_claims: ["[[an aligned-seeming AI may be strategically deceptive because
 related:
 - High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming
 - Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors
+- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming
 reweave_edges:
 - High-capability models under inference-time monitoring show early-step hedging patterns—brief compliant responses followed by clarification escalation—as a potential precursor to systematic monitor gaming|related|2026-04-09
 - Activation-based persona vector monitoring can detect behavioral trait shifts in small language models without relying on behavioral testing but has not been validated at frontier model scale or for safety-critical behaviors|related|2026-04-17
+- Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming|related|2026-04-21
 ---
 
 # Situationally aware models do not systematically game early-step inference-time monitors at current capability levels because models cannot reliably detect monitor presence through behavioral observation alone
@@ -8,9 +8,11 @@ source: "Governance - Meritocratic Voting + Futarchy"
 related:
 - Is futarchy's low participation in uncontested decisions efficient disuse or a sign of structural adoption barriers?
 - Futarchy requires quantifiable exogenous KPIs as a deployment constraint because most DAO proposals lack measurable objectives
+- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
 reweave_edges:
 - Is futarchy's low participation in uncontested decisions efficient disuse or a sign of structural adoption barriers?|related|2026-04-18
 - Futarchy requires quantifiable exogenous KPIs as a deployment constraint because most DAO proposals lack measurable objectives|related|2026-04-18
+- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
 ---
 
 # MetaDAOs futarchy implementation shows limited trading volume in uncontested decisions
|
||||||
|
|
|
||||||
|
|
@@ -8,8 +8,10 @@ confidence: proven
tradition: "futarchy, mechanism design, DAO governance"
related:
- DeFi insurance hybrid claims assessment routes clear exploits to automation and ambiguous disputes to governance, resolving the speed-fairness tradeoff
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
reweave_edges:
- DeFi insurance hybrid claims assessment routes clear exploits to automation and ambiguous disputes to governance, resolving the speed-fairness tradeoff|related|2026-04-18
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
---

Decision markets create a mechanism where attempting to steal from minority holders becomes a losing trade. The four conditional tokens (fABC, pABC, pUSD, fUSD) establish a constraint: for a treasury-raiding proposal to pass, pABC/pUSD must trade higher than fABC/fUSD. But from any rational perspective, 1 fABC is worth 1 ABC (the DAO continues normally) while 1 pABC is worth 0 (the DAO's treasury is empty after the raid).
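The pass condition above can be sketched directly. This is a simplified illustration of the constraint as the note states it, not MetaDAO's actual on-chain logic; the prices and the optional TWAP-style premium threshold are assumptions for the example:

```python
def proposal_passes(p_abc_per_usd: float, f_abc_per_usd: float,
                    premium: float = 0.0) -> bool:
    """A proposal passes when the pass-conditional price of the DAO
    token (pABC/pUSD) exceeds the fail-conditional price (fABC/fUSD),
    optionally by a required premium."""
    return p_abc_per_usd > f_abc_per_usd * (1.0 + premium)

# Treasury-raid scenario: rational traders price pABC near 0
# (the DAO is emptied if the raid passes) and fABC near the spot
# price of ABC (the DAO continues normally if it fails).
spot_abc = 1.00
raid_passes = proposal_passes(p_abc_per_usd=0.01, f_abc_per_usd=spot_abc)
print(raid_passes)  # False: the raid cannot clear the pass condition
```

With pABC rationally priced near zero, no premium setting makes the raid clear the condition, which is the "losing trade" the paragraph describes.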
@@ -17,8 +17,10 @@ related:
- insider-trading-in-futarchy-improves-governance-by-accelerating-ground-truth-incorporation-into-conditional-markets
- stock-markets-function-despite-20-40-percent-insider-trading-proving-information-asymmetry-does-not-break-price-discovery
- Congressional insider trading legislation for prediction markets treats them as financial instruments not gambling strengthening DCM regulatory legitimacy
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions
reweave_edges:
- Congressional insider trading legislation for prediction markets treats them as financial instruments not gambling strengthening DCM regulatory legitimacy|related|2026-04-18
- Polymarket updated its insider trading rules two days after P2P.me's bet creating a multi-platform enforcement gap where no single platform has visibility into cross-market positions|related|2026-04-21
---

# Futarchy governance markets create an insider trading paradox because informed governance participants are simultaneously the most valuable traders and the most restricted under insider trading frameworks
@@ -5,6 +5,10 @@ description: "Market rejection of liquidity solution despite stated liquidity cr
confidence: experimental
source: "MetaDAO Proposal 8 failure, 2024-02-18 to 2024-02-24"
created: 2026-03-11
related:
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals
reweave_edges:
- MetaDAO futarchy has a perfect OTC pricing record rejecting every below market deal and accepting every at or above market deal across 9 documented proposals|related|2026-04-21
---

# Futarchy markets can reject solutions to acknowledged problems when the proposed solution creates worse second-order effects than the problem it solves
@@ -17,8 +17,10 @@ reweave_edges:
- The CFTC ANPRM comment record as of April 2026 contains zero filings distinguishing futarchy governance markets from event betting markets, creating a default regulatory framework that will apply gambling-use-case restrictions to governance-use-case mechanisms|supports|2026-04-17
- Futarchy governance markets risk regulatory capture by anti-gambling frameworks because event betting and organizational governance use cases are conflated in current policy discourse|supports|2026-04-18
- Prediction markets face a democratic legitimacy gap where 61% gambling classification creates legislative override risk independent of CFTC regulatory approval|related|2026-04-19
- 800+ ANPRM comment submissions from both industry and state gaming opponents signal that the CFTC's post-April 30 rulemaking process will face intense political pressure from both sides|related|2026-04-21
related:
- Prediction markets face a democratic legitimacy gap where 61% gambling classification creates legislative override risk independent of CFTC regulatory approval
- 800+ ANPRM comment submissions from both industry and state gaming opponents signal that the CFTC's post-April 30 rulemaking process will face intense political pressure from both sides
---

# Retail mobilization against prediction markets creates asymmetric regulatory input because anti-gambling advocates dominate comment periods while governance market proponents remain silent