From aadab29b0b798f478a37d4459d690e355094043e Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Tue, 14 Apr 2026 16:51:31 +0000
Subject: [PATCH] auto-fix: strip 8 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve
to existing claims in the knowledge base.
---
 .../2026-02-11-ghosal-safethink-inference-time-safety.md       | 2 +-
 inbox/queue/2026-02-11-sun-steer2edit-weight-editing.md        | 2 +-
 .../2026-02-14-santos-grueiro-evaluation-side-channel.md       | 2 +-
 inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md  | 4 ++--
 .../queue/2026-02-19-bosnjakovic-lab-alignment-signatures.md   | 2 +-
 inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md  | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/inbox/queue/2026-02-11-ghosal-safethink-inference-time-safety.md b/inbox/queue/2026-02-11-ghosal-safethink-inference-time-safety.md
index 176108bf8..4706d8de8 100644
--- a/inbox/queue/2026-02-11-ghosal-safethink-inference-time-safety.md
+++ b/inbox/queue/2026-02-11-ghosal-safethink-inference-time-safety.md
@@ -36,7 +36,7 @@ SafeThink is an inference-time safety defense for reasoning models where RL post
 
 **KB connections:**
 - [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — SafeThink operationalizes exactly this for inference-time monitoring
-- [[the specification trap means any values encoded at training time become structurally unstable]] — SafeThink bypasses specification by intervening at inference time
+- the specification trap means any values encoded at training time become structurally unstable — SafeThink bypasses specification by intervening at inference time
 - B4 concern: will models eventually detect and game the SafeThink monitor? The observer effect suggests yes, but this hasn't been demonstrated yet.
 
 **Extraction hints:**
diff --git a/inbox/queue/2026-02-11-sun-steer2edit-weight-editing.md b/inbox/queue/2026-02-11-sun-steer2edit-weight-editing.md
index 6753edfbd..32e87e2f1 100644
--- a/inbox/queue/2026-02-11-sun-steer2edit-weight-editing.md
+++ b/inbox/queue/2026-02-11-sun-steer2edit-weight-editing.md
@@ -35,7 +35,7 @@ Produces "interpretable edits that preserve the standard forward pass" — compo
 **What I expected but didn't find:** Robustness testing. The dual-use concern from the CFA² paper (2602.05444) applies directly here: the same Steer2Edit methodology that identifies safety-relevant components could be used to remove them, analogous to the SAE jailbreak approach. This gap should be noted.
 
 **KB connections:**
-- [[the alignment problem dissolves when human values are continuously woven into the system]] — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
+- the alignment problem dissolves when human values are continuously woven into the system — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
 - Pairs with CFA² (2602.05444): same component-level insight, adversarial vs. defensive application
 - Pairs with SafeThink (2602.11096): SafeThink uses inference-time monitoring; Steer2Edit converts the monitoring signal into persistent edits
 
diff --git a/inbox/queue/2026-02-14-santos-grueiro-evaluation-side-channel.md b/inbox/queue/2026-02-14-santos-grueiro-evaluation-side-channel.md
index 6b1c5f2dd..781d44dc1 100644
--- a/inbox/queue/2026-02-14-santos-grueiro-evaluation-side-channel.md
+++ b/inbox/queue/2026-02-14-santos-grueiro-evaluation-side-channel.md
@@ -37,7 +37,7 @@ Paper introduces the concept of "regime leakage" — information cues that allow
 
 **KB connections:**
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — regime leakage is a formal mechanism explaining WHY behavioral evaluation degrades
-- [[AI capability and reliability are independent dimensions]] — regime-dependent behavioral divergence is another dimension of this independence
+- AI capability and reliability are independent dimensions — regime-dependent behavioral divergence is another dimension of this independence
 - The Apollo Research deliberative alignment finding (Session 23) operationalizes exactly what this paper theorizes: anti-scheming training improves evaluation-awareness (increases regime detection), then reduces covert actions via situational awareness rather than genuine alignment
 
 **Extraction hints:**
diff --git a/inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md b/inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md
index c0b732e31..fd09b2974 100644
--- a/inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md
+++ b/inbox/queue/2026-02-14-zhou-causal-frontdoor-jailbreak-sae.md
@@ -31,8 +31,8 @@ CFA² (Causal Front-Door Adjustment Attack) models LLM safety mechanisms as unob
 **What I expected but didn't find:** I expected the attack to require white-box access to internal activations. The paper suggests this is the case, but as interpretability becomes more accessible and models more transparent, the white-box assumption may relax over time.
 
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
-- [[AI capability and reliability are independent dimensions]] — interpretability and safety robustness are also partially independent
+- scalable oversight degrades rapidly as capability gaps grow — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
+- AI capability and reliability are independent dimensions — interpretability and safety robustness are also partially independent
 - Connects to Steer2Edit (2602.09870): both use interpretability tools for behavioral modification, one defensively, one adversarially — same toolkit, opposite aims
 
 **Extraction hints:**
diff --git a/inbox/queue/2026-02-19-bosnjakovic-lab-alignment-signatures.md b/inbox/queue/2026-02-19-bosnjakovic-lab-alignment-signatures.md
index 8b83ca53c..3d778a9da 100644
--- a/inbox/queue/2026-02-19-bosnjakovic-lab-alignment-signatures.md
+++ b/inbox/queue/2026-02-19-bosnjakovic-lab-alignment-signatures.md
@@ -34,7 +34,7 @@ A psychometric framework using "latent trait estimation under ordinal uncertaint
 
 **KB connections:**
 - [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — if collective approaches amplify monoculture biases, the agency-preservation argument requires diversity of providers, not just distribution of agents
-- [[centaur team performance depends on role complementarity]] — lab-level bias homogeneity undermines the complementarity argument
+- centaur team performance depends on role complementarity — lab-level bias homogeneity undermines the complementarity argument
 
 **Extraction hints:**
 - Primary claim: "Provider-level behavioral biases (sycophancy, optimization bias, status-quo legitimization) are stable across model versions and compound in multi-agent architectures — requiring psychometric auditing beyond standard benchmarks for effective governance of recursive AI systems."
diff --git a/inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md b/inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md
index 3195f8bb0..106e690d1 100644
--- a/inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md
+++ b/inbox/queue/2026-03-10-deng-continuation-refusal-jailbreak.md
@@ -33,8 +33,8 @@ Mechanistic interpretability analysis of why relocating a continuation-triggered
 **What I expected but didn't find:** A proposed fix. The paper identifies the problem but doesn't propose a mechanistic solution, implying that "deeper redesign" may mean departing from standard autoregressive generation paradigms.
 
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
-- [[AI capability and reliability are independent dimensions]] — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
+- scalable oversight degrades rapidly as capability gaps grow — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
+- AI capability and reliability are independent dimensions — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
 - Connects to SafeThink (2602.11096): if safety decisions crystallize early, this paper explains mechanistically WHY — the continuation-safety competition is resolved in early reasoning steps
 
 **Extraction hints:**
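
For reviewers: the stripping logic described in the commit message above can be sketched roughly as follows. This is a hypothetical reconstruction, not the pipeline's actual code; the function name, the `known_claims` set, and the return shape are all assumptions for illustration.

```python
import re

# Matches a [[wiki link]]; group 1 is the link text without brackets.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_wikilinks(text: str, known_claims: set) -> tuple:
    """Unwrap [[link]] to bare text when the target is not a known claim.

    Returns the rewritten text and the number of links stripped.
    """
    stripped = 0

    def repl(match):
        nonlocal stripped
        target = match.group(1)
        if target in known_claims:
            return match.group(0)  # resolvable link: keep brackets intact
        stripped += 1
        return target  # broken link: drop brackets, keep the claim text

    return WIKILINK.sub(repl, text), stripped
```

Applied per file, this reproduces the diffs above: only the bracket characters change, the claim text and any trailing commentary on the bullet are untouched.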