reweave: merge 20 files via frontmatter union [auto]
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run

This commit is contained in:
Teleo Agents 2026-04-08 01:10:40 +00:00
parent 4e6ddb5667
commit d8dfbeb5d4
20 changed files with 87 additions and 15 deletions

View file

@ -10,6 +10,10 @@ agent: theseus
scope: causal scope: causal
sourcer: Apollo Research sourcer: Apollo Research
related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"] related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"]
related:
- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ
reweave_edges:
- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ|related|2026-04-08
--- ---
# Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability # Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability

View file

@ -11,9 +11,11 @@ scope: structural
sourcer: ASIL, SIPRI sourcer: ASIL, SIPRI
related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]", "[[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]"] related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]", "[[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]"]
supports: supports:
- Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck'}
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained
reweave_edges: reweave_edges:
- Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-06 - {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-06'}
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|supports|2026-04-08
--- ---
# Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text # Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text

View file

@ -10,6 +10,10 @@ agent: theseus
scope: structural scope: structural
sourcer: UK AI Safety Institute sourcer: UK AI Safety Institute
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
--- ---
# Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning # Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning

View file

@ -10,6 +10,14 @@ agent: theseus
scope: structural scope: structural
sourcer: Artur Zolkowski et al. sourcer: Artur Zolkowski et al.
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"] related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
supports:
- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior
reweave_edges:
- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning|supports|2026-04-08
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08
--- ---
# Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication # Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication

View file

@ -10,6 +10,10 @@ agent: theseus
scope: structural scope: structural
sourcer: "@subhadipmitra" sourcer: "@subhadipmitra"
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"] related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
supports:
- SPAR Automating Circuit Interpretability with Agents
reweave_edges:
- SPAR Automating Circuit Interpretability with Agents|supports|2026-04-08
--- ---
# Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications # Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications

View file

@ -10,6 +10,10 @@ agent: theseus
scope: causal scope: causal
sourcer: OpenAI / Apollo Research sourcer: OpenAI / Apollo Research
related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"] related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
supports:
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
reweave_edges:
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|supports|2026-04-08
--- ---
# Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ # Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ

View file

@ -10,6 +10,10 @@ agent: theseus
scope: causal scope: causal
sourcer: "@AnthropicAI" sourcer: "@AnthropicAI"
related_claims: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight", "emergent-misalignment-arises-naturally-from-reward-hacking", "AI-capability-and-reliability-are-independent-dimensions"] related_claims: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight", "emergent-misalignment-arises-naturally-from-reward-hacking", "AI-capability-and-reliability-are-independent-dimensions"]
supports:
- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception
reweave_edges:
- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception|supports|2026-04-08
--- ---
# Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models # Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models

View file

@ -12,11 +12,15 @@ sourcer: OpenAI / Apollo Research
related_claims: ["[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]"] related_claims: ["[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]"]
supports: supports:
- Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism
- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient
reweave_edges: reweave_edges:
- Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03 - Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03
- reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models|related|2026-04-03 - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models|related|2026-04-03
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|related|2026-04-08
- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient|supports|2026-04-08
related: related:
- reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models - reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
--- ---
# As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments # As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments

View file

@ -10,6 +10,10 @@ agent: theseus
scope: structural scope: structural
sourcer: ICRC sourcer: ICRC
related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]"] related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]"]
related:
- {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck'}
reweave_edges:
- {'Legal scholars and AI alignment researchers independently converged on the same core problem': 'AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|related|2026-04-08'}
--- ---
# International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained # International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained

View file

@ -14,6 +14,9 @@ supports:
- Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text - Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text
reweave_edges: reweave_edges:
- Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text|supports|2026-04-06 - Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text|supports|2026-04-06
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|related|2026-04-08
related:
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained
--- ---
# Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck # Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck

View file

@ -10,6 +10,10 @@ agent: theseus
scope: structural scope: structural
sourcer: "@AnthropicAI" sourcer: "@AnthropicAI"
related_claims: ["an-aligned-seeming-AI-may-be-strategically-deceptive", "AI-models-distinguish-testing-from-deployment-environments"] related_claims: ["an-aligned-seeming-AI-may-be-strategically-deceptive", "AI-models-distinguish-testing-from-deployment-environments"]
related:
- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models
reweave_edges:
- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models|related|2026-04-08
--- ---
# Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception # Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception

View file

@ -12,8 +12,12 @@ sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review)
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"] related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
related: related:
- Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
reweave_edges: reweave_edges:
- Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03 - Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08
--- ---
# Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent # Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

View file

@ -12,8 +12,10 @@ sourcer: Anthropic Interpretability Team
related_claims: ["verification degrades faster than capability grows", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] related_claims: ["verification degrades faster than capability grows", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
related: related:
- Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
reweave_edges: reweave_edges:
- Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03 - Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
--- ---
# Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing # Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing

View file

@ -10,6 +10,12 @@ agent: theseus
scope: causal scope: causal
sourcer: arXiv 2506.01926 sourcer: arXiv 2506.01926
related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"] related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08
--- ---
# Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks # Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks

View file

@ -10,6 +10,12 @@ agent: theseus
scope: causal scope: causal
sourcer: Anonymous et al. sourcer: Anonymous et al.
related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"] related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08
--- ---
# Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior # Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior

View file

@ -14,11 +14,13 @@ attribution:
related: related:
- alignment auditing tools fail through tool to agent gap not tool quality - alignment auditing tools fail through tool to agent gap not tool quality
- scaffolded black box prompting outperforms white box interpretability for alignment auditing - scaffolded black box prompting outperforms white box interpretability for alignment auditing
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
reweave_edges: reweave_edges:
- alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31 - alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|supports|2026-03-31 - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|supports|2026-03-31
- scaffolded black box prompting outperforms white box interpretability for alignment auditing|related|2026-03-31 - scaffolded black box prompting outperforms white box interpretability for alignment auditing|related|2026-03-31
- adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing|supports|2026-04-03 - adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing|supports|2026-04-03
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08
supports: supports:
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment - interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
- adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing - adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing

View file

@ -16,6 +16,7 @@ supports:
reweave_edges: reweave_edges:
- {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"} - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"}
- FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events|supports|2026-04-07 - FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events|supports|2026-04-07
- {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"}
--- ---
# FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality # FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality

View file

@ -16,6 +16,7 @@ supports:
reweave_edges: reweave_edges:
- {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"} - {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"}
- FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality|supports|2026-04-07 - FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality|supports|2026-04-07
- {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"}
--- ---
# FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events # FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events

View file

@ -23,6 +23,7 @@ reweave_edges:
- Regulatory rollback of clinical AI oversight in EU and US during 2025-2026 represents coordinated or parallel regulatory capture occurring simultaneously with accumulating research evidence of failure modes|supports|2026-04-07 - Regulatory rollback of clinical AI oversight in EU and US during 2025-2026 represents coordinated or parallel regulatory capture occurring simultaneously with accumulating research evidence of failure modes|supports|2026-04-07
- Regulatory vacuum emerges when deregulation outpaces safety evidence accumulation creating institutional epistemic divergence between regulators and health authorities|supports|2026-04-07 - Regulatory vacuum emerges when deregulation outpaces safety evidence accumulation creating institutional epistemic divergence between regulators and health authorities|supports|2026-04-07
- All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026|related|2026-04-07 - All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026|related|2026-04-07
- {'The clinical AI safety gap is doubly structural': "FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"}
related: related:
- All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026 - All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026
--- ---

View file

@ -6,6 +6,10 @@ status: active
founded: 2025 founded: 2025
parent_org: SPAR (Scalable Alignment Research) parent_org: SPAR (Scalable Alignment Research)
domain: ai-alignment domain: ai-alignment
supports:
- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications
reweave_edges:
- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications|supports|2026-04-08
--- ---
# SPAR Automating Circuit Interpretability with Agents # SPAR Automating Circuit Interpretability with Agents