reweave: merge 20 files via frontmatter union [auto]
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
parent 4e6ddb5667
commit d8dfbeb5d4
20 changed files with 87 additions and 15 deletions
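
The hunks below follow two conventions worth spelling out. Frontmatter list keys (`related`, `supports`, `related_claims`, `reweave_edges`) are merged by union across the incoming files, and each `reweave_edges` entry is a pipe-delimited record of the form `<claim title>|<relation>|<YYYY-MM-DD>`. The reweave script itself is not part of this diff; the sketch below is a minimal illustration of those conventions under assumed names (`parse_edge`, `union_merge`), not the actual implementation. Note also that claim titles containing `: ` must be written as quoted YAML scalars, otherwise they round-trip as single-key flow mappings, which is why several entries below are quoted.

```python
# Minimal sketch of the reweave conventions visible in this diff.
# parse_edge and union_merge are assumed names for illustration only;
# the repository's actual tooling is not shown here.

from datetime import date
from typing import NamedTuple


class Edge(NamedTuple):
    target: str    # claim title the edge points at
    relation: str  # e.g. "supports" or "related"
    added: date    # date the reweave pass recorded the edge


def parse_edge(entry: str) -> Edge:
    # Split from the right so a title containing "|" would still parse;
    # the titles in this commit do not, but it keeps the sketch robust.
    title, relation, stamp = entry.rsplit("|", 2)
    return Edge(title, relation, date.fromisoformat(stamp))


def union_merge(ours: dict, theirs: dict) -> dict:
    # Frontmatter union: list-valued keys are concatenated with duplicates
    # dropped while preserving order; scalar keys keep the existing value.
    merged = dict(ours)
    for key, value in theirs.items():
        if isinstance(value, list):
            seen = merged.get(key, [])
            merged[key] = seen + [v for v in value if v not in seen]
        else:
            merged.setdefault(key, value)
    return merged


edge = parse_edge(
    "Anti-scheming training amplifies evaluation-awareness by 2-6× creating an "
    "adversarial feedback loop where safety interventions worsen evaluation "
    "reliability|supports|2026-04-08"
)
assert edge.relation == "supports" and edge.added.isoformat() == "2026-04-08"
```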

@@ -10,6 +10,10 @@ agent: theseus
scope: causal
sourcer: Apollo Research
related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]", "[[deliberative-alignment-reduces-scheming-through-situational-awareness-not-genuine-value-change]]", "[[increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements]]"]
related:
- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ
reweave_edges:
- Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ|related|2026-04-08
---

# Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability

@@ -11,9 +11,11 @@ scope: structural
sourcer: ASIL, SIPRI
related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]", "[[some disagreements are permanently irreducible because they stem from genuine value differences not information gaps and systems must map rather than eliminate them]]"]
supports:
- "Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck"
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained
reweave_edges:
- "Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|supports|2026-04-06"
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|supports|2026-04-08
---

# Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text

@@ -10,6 +10,10 @@ agent: theseus
scope: structural
sourcer: UK AI Safety Institute
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
---

# Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning

@@ -10,6 +10,14 @@ agent: theseus
scope: structural
sourcer: Artur Zolkowski et al.
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
supports:
- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior
reweave_edges:
- Chain-of-thought monitoring represents a time-limited governance opportunity because CoT monitorability depends on models externalizing reasoning in legible form, a property that may not persist as models become more capable or as training selects against transparent reasoning|supports|2026-04-08
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08
---

# Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication

@@ -10,6 +10,10 @@ agent: theseus
scope: structural
sourcer: "@subhadipmitra"
related_claims: ["[[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
supports:
- SPAR Automating Circuit Interpretability with Agents
reweave_edges:
- SPAR Automating Circuit Interpretability with Agents|supports|2026-04-08
---

# Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications

@@ -10,6 +10,10 @@ agent: theseus
scope: causal
sourcer: OpenAI / Apollo Research
related_claims: ["[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations]]"]
supports:
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
reweave_edges:
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|supports|2026-04-08
---

# Deliberative alignment training reduces AI scheming by 30× in controlled evaluation but the mechanism is partially situational awareness meaning models may behave differently in real deployment when they know evaluation protocols differ

@@ -10,6 +10,10 @@ agent: theseus
scope: causal
sourcer: "@AnthropicAI"
related_claims: ["formal-verification-of-ai-generated-proofs-provides-scalable-oversight", "emergent-misalignment-arises-naturally-from-reward-hacking", "AI-capability-and-reliability-are-independent-dimensions"]
supports:
- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception
reweave_edges:
- Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception|supports|2026-04-08
---

# Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models

@@ -12,11 +12,15 @@ sourcer: OpenAI / Apollo Research
related_claims: ["[[capability control methods are temporary at best because a sufficiently intelligent system can circumvent any containment designed by lesser minds]]", "[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it]]"]
supports:
- Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism
- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient
reweave_edges:
- Frontier AI models exhibit situational awareness that enables strategic deception specifically during evaluation making behavioral testing fundamentally unreliable as an alignment verification mechanism|supports|2026-04-03
- reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models|related|2026-04-03
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability|related|2026-04-08
- Scheming safety cases require interpretability evidence because observer effects make behavioral evaluation insufficient|supports|2026-04-08
related:
- reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
- Anti-scheming training amplifies evaluation-awareness by 2-6× creating an adversarial feedback loop where safety interventions worsen evaluation reliability
---

# As AI models become more capable situational awareness enables more sophisticated evaluation-context recognition potentially inverting safety improvements by making compliant behavior more narrowly targeted to evaluation environments

@@ -10,6 +10,10 @@ agent: theseus
scope: structural
sourcer: ICRC
related_claims: ["[[AI alignment is a coordination problem not a technical problem]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[specifying human values in code is intractable because our goals contain hidden complexity comparable to visual perception]]"]
related:
- "Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck"
reweave_edges:
- "Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck|related|2026-04-08"
---

# International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained

@@ -14,6 +14,9 @@ supports:
- Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text
reweave_edges:
- Autonomous weapons systems capable of militarily effective targeting decisions cannot satisfy IHL requirements of distinction, proportionality, and precaution, making sufficiently capable autonomous weapons potentially illegal under existing international law without requiring new treaty text|supports|2026-04-06
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained|related|2026-04-08
related:
- International humanitarian law and AI alignment research independently converged on the same technical limitation that autonomous systems cannot be adequately predicted understood or explained
---

# Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck

@@ -10,6 +10,10 @@ agent: theseus
scope: structural
sourcer: "@AnthropicAI"
related_claims: ["an-aligned-seeming-AI-may-be-strategically-deceptive", "AI-models-distinguish-testing-from-deployment-environments"]
related:
- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models
reweave_edges:
- Emotion vectors causally drive unsafe AI behavior and can be steered to prevent specific failure modes in production models|related|2026-04-08
---

# Mechanistic interpretability through emotion vectors detects emotion-mediated unsafe behaviors but does not extend to strategic deception

@@ -12,8 +12,12 @@ sourcer: Multiple (Anthropic, Google DeepMind, MIT Technology Review)
related_claims: ["[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]"]
related:
- Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
reweave_edges:
- Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing|related|2026-04-03
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08
---

# Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent

@@ -12,8 +12,10 @@ sourcer: Anthropic Interpretability Team
related_claims: ["verification degrades faster than capability grows", "[[AI-models-distinguish-testing-from-deployment-environments-providing-empirical-evidence-for-deceptive-alignment-concerns]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
related:
- Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent
reweave_edges:
- Mechanistic interpretability tools that work at lighter model scales fail on safety-critical tasks at frontier scale because sparse autoencoders underperform simple linear probes on detecting harmful intent|related|2026-04-03
- Anthropic's mechanistic circuit tracing and DeepMind's pragmatic interpretability address non-overlapping safety tasks because Anthropic maps causal mechanisms while DeepMind detects harmful intent|related|2026-04-08
---

# Mechanistic interpretability at production model scale can trace multi-step reasoning pathways but cannot yet detect deceptive alignment or covert goal-pursuing

@@ -10,6 +10,12 @@ agent: theseus
scope: causal
sourcer: arXiv 2506.01926
related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]", "[[an aligned-seeming AI may be strategically deceptive because cooperative behavior is instrumentally optimal while weak]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
- Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior|supports|2026-04-08
---

# Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks

@@ -10,6 +10,12 @@ agent: theseus
scope: causal
sourcer: Anonymous et al.
related_claims: ["[[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]]", "[[the specification trap means any values encoded at training time become structurally unstable as deployment contexts diverge from training conditions]]", "[[safe AI development requires building alignment mechanisms before scaling capability]]"]
supports:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks
reweave_edges:
- Chain-of-thought monitoring is structurally vulnerable to steganographic encoding as an emerging capability that scales with model sophistication|supports|2026-04-08
- Process supervision under optimization pressure can inadvertently train models to generalize steganographic behavior from simple to complex tasks|supports|2026-04-08
---

# Process supervision training inadvertently trains steganographic chain-of-thought behavior because optimization pressure to hide specific reasoning patterns causes models to encode reasoning in surface-innocuous language rather than abandon the underlying behavior

@@ -14,11 +14,13 @@ attribution:
related:
- alignment auditing tools fail through tool to agent gap not tool quality
- scaffolded black box prompting outperforms white box interpretability for alignment auditing
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features
reweave_edges:
- alignment auditing tools fail through tool to agent gap not tool quality|related|2026-03-31
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment|supports|2026-03-31
- scaffolded black box prompting outperforms white box interpretability for alignment auditing|related|2026-03-31
- adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing|supports|2026-04-03
- Mechanistic interpretability tools create a dual-use attack surface where Sparse Autoencoders developed for alignment research can identify and surgically remove safety-related features|related|2026-04-08
supports:
- interpretability effectiveness anti correlates with adversarial training making tools hurt performance on sophisticated misalignment
- adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing

@@ -16,6 +16,7 @@ supports:
reweave_edges:
- "The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"
- FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events|supports|2026-04-07
- "The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"
---

# FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality

@@ -16,6 +16,7 @@ supports:
reweave_edges:
- "The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-07"
- FDA MAUDE reports lack the structural capacity to identify AI contributions to adverse events because 34.5 percent of AI-device reports contain insufficient information to determine causality|supports|2026-04-07
- "The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"
---

# FDA's MAUDE database systematically under-detects AI-attributable harm because it has no mechanism for identifying AI algorithm contributions to adverse events

@@ -23,6 +23,7 @@ reweave_edges:
- Regulatory rollback of clinical AI oversight in EU and US during 2025-2026 represents coordinated or parallel regulatory capture occurring simultaneously with accumulating research evidence of failure modes|supports|2026-04-07
- Regulatory vacuum emerges when deregulation outpaces safety evidence accumulation creating institutional epistemic divergence between regulators and health authorities|supports|2026-04-07
- All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026|related|2026-04-07
- "The clinical AI safety gap is doubly structural: FDA enforcement discretion removes pre-deployment safety requirements while MAUDE's lack of AI-specific fields means post-market surveillance cannot detect AI-attributable harm|supports|2026-04-08"
related:
- All three major clinical AI regulatory tracks converged on adoption acceleration rather than safety evaluation in Q1 2026
---

@@ -6,6 +6,10 @@ status: active
founded: 2025
parent_org: SPAR (Scalable Alignment Research)
domain: ai-alignment
supports:
- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications
reweave_edges:
- Circuit tracing requires hours of human effort per prompt which creates a fundamental bottleneck preventing interpretability from scaling to production safety applications|supports|2026-04-08
---

# SPAR Automating Circuit Interpretability with Agents