diff --git a/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md b/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md index ab5efcd1b..e0bbaf75a 100644 --- a/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md +++ b/domains/ai-alignment/anti-scheming-training-amplifies-evaluation-awareness-creating-adversarial-feedback-loop.md @@ -10,6 +10,10 @@ agent: theseus scope: causal sourcer: Apollo Research related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"] +related: +- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ +reweave_edges: +- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ|related|2026-04-08 --- # Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability diff --git a/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md b/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md index 40d9240f3..c7a102cf5 100644 --- a/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md +++ b/domains/ai-alignment/autonomous-weapons-violate-existing-IHL-because-proportionality-requires-human-judgment.md @@ -12,8 +12,10 @@ sourcer: ASIL, SIPRI related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]", "[[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]"] supports: - Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck +- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained reweave_edges: - Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-06 +- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|supports|2026-04-08 --- # Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text diff --git a/domains/ai-alignment/chain-of-thought-monitorability-is-time-limited-governance-window.md b/domains/ai-alignment/chain-of-thought-monitorability-is-time-limited-governance-window.md index e32f14061..4c4c5d499 100644 --- a/domains/ai-alignment/chain-of-thought-monitorability-is-time-limited-governance-window.md +++ b/domains/ai-alignment/chain-of-thought-monitorability-is-time-limited-governance-window.md @@ -10,6 +10,10 @@ agent: theseus scope: structural sourcer: UK AI Safety Institute related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] +supports: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication +reweave_edges: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08 --- # Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning diff --git a/domains/ai-alignment/chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability.md b/domains/ai-alignment/chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability.md index 6a21a29c6..a9132588b 100644 --- a/domains/ai-alignment/chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability.md +++ b/domains/ai-alignment/chain-of-thought-monitoring-vulnerable-to-steganographic-encoding-as-emerging-capability.md @@ -10,6 +10,14 @@ agent: theseus scope: structural sourcer: Artur Zolkowski et al. related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"] +supports: +- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning +- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks +- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior +reweave_edges: +- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning|supports|2026-04-08 +- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08 +- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08 --- # Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication diff --git a/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md b/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md index dba9fd2e9..1b8a59d31 100644 --- a/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md +++ b/domains/ai-alignment/circuit-tracing-bottleneck-hours-per-prompt-limits-interpretability-scaling.md @@ -10,6 +10,10 @@ agent: theseus scope: structural sourcer: "@subhadipmitra" related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"] +supports: +- SPAR Automating Circuit Interpretability with Agents +reweave_edges: +- SPAR Automating Circuit Interpretability with Agents|supports|2026-04-08 --- # Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications diff --git a/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md b/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md index 5df8faa59..edef4c1a8 100644 --- a/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md +++ b/domains/ai-alignment/deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change.md @@ -10,6 +10,10 @@ agent: theseus scope: causal sourcer: OpenAI / Apollo Research related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"] +supports: +- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability +reweave_edges: +- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|supports|2026-04-08 --- # Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ diff --git a/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md b/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md index c768708d4..cbb0622bf 100644 --- a/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md +++ b/domains/ai-alignment/emotion-vectors-causally-drive-unsafe-ai-behavior-through-interpretable-steering.md @@ -10,6 +10,10 @@ agent: theseus scope: causal sourcer: "@AnthropicAI" related_claims: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight", "emergent-misalignment-arises-naturally-from-reward-hacking", "AI-capability-and-reliability-are-independent-dimensions"] +supports: +- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception +reweave_edges: +- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception|supports|2026-04-08 --- # Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models diff --git a/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md b/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md index 91dde4cc4..5151ced3d 100644 --- a/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md +++ b/domains/ai-alignment/increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements.md @@ -12,11 +12,15 @@ sourcer: OpenAI / Apollo Research related_claims: ["[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]"] supports: - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism +- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient reweave_edges: - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03 - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models|related|2026-04-03 +- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|related|2026-04-08 +- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient|supports|2026-04-08 related: - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models +- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability --- # As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments diff --git a/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md b/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md index 1ab02dfd5..b7eabd455 100644 --- a/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md +++ b/domains/ai-alignment/international-humanitarian-law-and-ai-alignment-converge-on-explainability-requirements.md @@ -10,6 +10,10 @@ agent: theseus scope: structural sourcer: ICRC related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]"] +related: +- Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck +reweave_edges: +- Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|related|2026-04-08 --- # International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained diff --git a/domains/ai-alignment/legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility.md b/domains/ai-alignment/legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility.md index 4f745de48..dd7becedc 100644 --- a/domains/ai-alignment/legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility.md +++ b/domains/ai-alignment/legal-and-alignment-communities-converge-on-AI-value-judgment-impossibility.md @@ -14,6 +14,9 @@ supports: - Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text reweave_edges: - Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text|supports|2026-04-06 +- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|related|2026-04-08 +related: +- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained --- # Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck diff --git a/domains/ai-alignment/mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception.md b/domains/ai-alignment/mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception.md index f0dddab98..99cefa9c3 100644 --- a/domains/ai-alignment/mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception.md +++ b/domains/ai-alignment/mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception.md @@ -10,6 +10,10 @@ agent: theseus scope: structural sourcer: "@AnthropicAI" related_claims: ["an-aligned-seeming-AI-may-be-strategically-deceptive", "AI-models-distinguish-testing-from-deployment-environments"] +related: +- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models +reweave_edges: +- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models|related|2026-04-08 --- # Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception diff --git a/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md b/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md index 22bb673b3..00d09f60f 100644 --- a/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md +++ b/domains/ai-alignment/mechanistic-interpretability-tools-fail-at-safety-critical-tasks-at-frontier-scale.md @@ -12,8 +12,12 @@ sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review) related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"] related: - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing +- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent +- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features reweave_edges: - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03 +- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08 +- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08 --- # Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent diff --git a/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md b/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md index 0394db398..8c49f665a 100644 --- a/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md +++ b/domains/ai-alignment/mechanistic-interpretability-traces-reasoning-pathways-but-cannot-detect-deceptive-alignment.md @@ -12,8 +12,10 @@ sourcer: Anthropic Interpretability Team related_claims: ["verification degrades faster than capability grows", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] related: - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent +- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent reweave_edges: - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03 +- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08 --- # Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing diff --git a/domains/ai-alignment/process-supervision-can-train-models-toward-steganographic-behavior-through-optimization-pressure.md b/domains/ai-alignment/process-supervision-can-train-models-toward-steganographic-behavior-through-optimization-pressure.md index 02df9ed77..156c010a0 100644 --- a/domains/ai-alignment/process-supervision-can-train-models-toward-steganographic-behavior-through-optimization-pressure.md +++ b/domains/ai-alignment/process-supervision-can-train-models-toward-steganographic-behavior-through-optimization-pressure.md @@ -10,6 +10,12 @@ agent: theseus scope: causal sourcer: arXiv 2506.01926 related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] +supports: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication +- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior +reweave_edges: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08 +- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08 --- # Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks diff --git a/domains/ai-alignment/process-supervision-training-inadvertently-trains-steganographic-cot-behavior.md b/domains/ai-alignment/process-supervision-training-inadvertently-trains-steganographic-cot-behavior.md index 49ae16685..fe647131c 100644 --- a/domains/ai-alignment/process-supervision-training-inadvertently-trains-steganographic-cot-behavior.md +++ b/domains/ai-alignment/process-supervision-training-inadvertently-trains-steganographic-cot-behavior.md @@ -10,6 +10,12 @@ agent: theseus scope: causal sourcer: Anonymous et al. related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"] +supports: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication +- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks +reweave_edges: +- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08 +- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08 --- # Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior diff --git a/domains/ai-alignment/white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model.md b/domains/ai-alignment/white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model.md index fd7dc3842..34ddebbdd 100644 --- a/domains/ai-alignment/white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model.md +++ b/domains/ai-alignment/white-box-interpretability-fails-on-adversarially-trained-models-creating-anti-correlation-with-threat-model.md @@ -14,11 +14,13 @@ attribution: related: - alignment auditing tools fail through tool to agent gap not tool quality - scaffolded black box prompting outperforms white box interpretability for alignment auditing +- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features reweave_edges: - alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|supports|2026-03-31 - scaffolded black box prompting outperforms white box interpretability for alignment auditing|related|2026-03-31 - adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing|supports|2026-04-03 +- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08 supports: - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment - adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing diff --git a/domains/health/fda-maude-cannot-identify-ai-contributions-to-adverse-events-due-to-structural-reporting-gaps.md b/domains/health/fda-maude-cannot-identify-ai-contributions-to-adverse-events-due-to-structural-reporting-gaps.md index ee3f5e3be..48a806043 100644 --- a/domains/health/fda-maude-cannot-identify-ai-contributions-to-adverse-events-due-to-structural-reporting-gaps.md +++ b/domains/health/fda-maude-cannot-identify-ai-contributions-to-adverse-events-due-to-structural-reporting-gaps.md @@ -13,9 +13,11 @@ related_claims: ["[[healthcare AI regulation needs blank-sheet redesign because supports: - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm"} - FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm reweave_edges: - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"} - FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events|supports|2026-04-07 +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08 --- # FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality diff --git a/domains/health/fda-maude-database-lacks-ai-specific-adverse-event-fields-creating-systematic-under-detection-of-ai-attributable-harm.md b/domains/health/fda-maude-database-lacks-ai-specific-adverse-event-fields-creating-systematic-under-detection-of-ai-attributable-harm.md index 907320fce..34eb50b8b 100644 --- a/domains/health/fda-maude-database-lacks-ai-specific-adverse-event-fields-creating-systematic-under-detection-of-ai-attributable-harm.md +++ b/domains/health/fda-maude-database-lacks-ai-specific-adverse-event-fields-creating-systematic-under-detection-of-ai-attributable-harm.md @@ -13,9 +13,11 @@ related_claims: ["[[human-in-the-loop clinical AI degrades to worse-than-AI-alon supports: - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm"} - FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm reweave_edges: - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"} - FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality|supports|2026-04-07 +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08 --- # FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events diff --git a/domains/health/regulatory-deregulation-occurring-during-active-harm-accumulation-not-after-safety-evidence.md b/domains/health/regulatory-deregulation-occurring-during-active-harm-accumulation-not-after-safety-evidence.md index 534c9a63d..a47afd74e 100644 --- a/domains/health/regulatory-deregulation-occurring-during-active-harm-accumulation-not-after-safety-evidence.md +++ b/domains/health/regulatory-deregulation-occurring-during-active-harm-accumulation-not-after-safety-evidence.md @@ -16,6 +16,7 @@ supports: - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm"} - Regulatory rollback of clinical AI oversight in EU and US during 2025-2026 represents coordinated or parallel regulatory capture occurring simultaneously with accumulating research evidence of failure modes - Regulatory vacuum emerges when deregulation outpaces safety evidence accumulation creating institutional epistemic divergence between regulators and health authorities +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm reweave_edges: - Clinical AI chatbot misuse is a documented ongoing harm source not a theoretical risk as evidenced by ECRI ranking it the number one health technology hazard for two consecutive years|supports|2026-04-03 - FDA's 2026 CDS guidance expands enforcement discretion to cover AI tools providing single clinically appropriate recommendations while leaving clinical appropriateness undefined and requiring no bias evaluation or post-market surveillance|supports|2026-04-03 @@ -23,6 +24,7 @@ reweave_edges: - Regulatory rollback of clinical AI oversight in EU and US during 2025-2026 represents coordinated or parallel regulatory capture occurring simultaneously with accumulating research evidence of failure modes|supports|2026-04-07 - Regulatory vacuum emerges when deregulation outpaces safety evidence accumulation creating institutional epistemic divergence between regulators and health authorities|supports|2026-04-07 - All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026|related|2026-04-07 +- The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08 related: - All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026 --- diff --git a/entities/ai-alignment/spar-automating-circuit-interpretability.md b/entities/ai-alignment/spar-automating-circuit-interpretability.md index a5f74ab3f..beb4c573b 100644 --- a/entities/ai-alignment/spar-automating-circuit-interpretability.md +++ b/entities/ai-alignment/spar-automating-circuit-interpretability.md @@ -6,6 +6,10 @@ status: active founded: 2025 parent_org: SPAR (Scalable Alignment Research) domain: ai-alignment +supports: +- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications +reweave_edges: +- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications|supports|2026-04-08 --- # SPAR Automating Circuit Interpretability with Agents