auto-fix: strip 8 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
Teleo Agents 2026-04-14 16:51:31 +00:00
parent 09484897a5
commit aadab29b0b
6 changed files with 8 additions and 8 deletions
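The fixer's logic, as described in the commit message, amounts to unlinking any `[[wiki link]]` whose title does not resolve to a known claim. A minimal sketch in Python (a hypothetical reconstruction, not the actual pipeline code; `known_claims` stands in for however the knowledge base exposes its claim titles):

```python
import re

# Matches [[title]] wiki links; the title itself may not contain "]".
WIKI_LINK = re.compile(r"\[\[([^\]]+)\]\]")

def strip_broken_links(text: str, known_claims: set[str]) -> str:
    """Replace [[title]] with bare title when the title is not a known claim."""
    def fix(match: re.Match) -> str:
        title = match.group(1)
        # Keep the link intact if it resolves; otherwise drop only the brackets.
        return match.group(0) if title in known_claims else title
    return WIKI_LINK.sub(fix, text)
```

Note that the link text itself is preserved either way; only the `[[ ]]` brackets are removed, which matches the 8-additions/8-deletions shape of this commit.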


@@ -36,7 +36,7 @@ SafeThink is an inference-time safety defense for reasoning models where RL post
 **KB connections:**
 - [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — SafeThink operationalizes exactly this for inference-time monitoring
-- [[the specification trap means any values encoded at training time become structurally unstable]] — SafeThink bypasses specification by intervening at inference time
+- the specification trap means any values encoded at training time become structurally unstable — SafeThink bypasses specification by intervening at inference time
 - B4 concern: will models eventually detect and game the SafeThink monitor? The observer effect suggests yes, but this hasn't been demonstrated yet.
 **Extraction hints:**


@@ -35,7 +35,7 @@ Produces "interpretable edits that preserve the standard forward pass" — compo
 **What I expected but didn't find:** Robustness testing. The dual-use concern from the CFA² paper (2602.05444) applies directly here: the same Steer2Edit methodology that identifies safety-relevant components could be used to remove them, analogous to the SAE jailbreak approach. This gap should be noted.
 **KB connections:**
-- [[the alignment problem dissolves when human values are continuously woven into the system]] — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
+- the alignment problem dissolves when human values are continuously woven into the system — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
 - Pairs with CFA² (2602.05444): same component-level insight, adversarial vs. defensive application
 - Pairs with SafeThink (2602.11096): SafeThink uses inference-time monitoring; Steer2Edit converts the monitoring signal into persistent edits


@@ -37,7 +37,7 @@ Paper introduces the concept of "regime leakage" — information cues that allow
 **KB connections:**
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — regime leakage is a formal mechanism explaining WHY behavioral evaluation degrades
-- [[AI capability and reliability are independent dimensions]] — regime-dependent behavioral divergence is another dimension of this independence
+- AI capability and reliability are independent dimensions — regime-dependent behavioral divergence is another dimension of this independence
 - The Apollo Research deliberative alignment finding (Session 23) operationalizes exactly what this paper theorizes: anti-scheming training improves evaluation-awareness (increases regime detection), then reduces covert actions via situational awareness rather than genuine alignment
 **Extraction hints:**


@@ -31,8 +31,8 @@ CFA² (Causal Front-Door Adjustment Attack) models LLM safety mechanisms as unob
 **What I expected but didn't find:** I expected the attack to require white-box access to internal activations. The paper suggests this is the case, but as interpretability becomes more accessible and models more transparent, the white-box assumption may relax over time.
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
-- [[AI capability and reliability are independent dimensions]] — interpretability and safety robustness are also partially independent
+- scalable oversight degrades rapidly as capability gaps grow — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
+- AI capability and reliability are independent dimensions — interpretability and safety robustness are also partially independent
 - Connects to Steer2Edit (2602.09870): both use interpretability tools for behavioral modification, one defensively, one adversarially — same toolkit, opposite aims
 **Extraction hints:**


@@ -34,7 +34,7 @@ A psychometric framework using "latent trait estimation under ordinal uncertaint
 **KB connections:**
 - [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — if collective approaches amplify monoculture biases, the agency-preservation argument requires diversity of providers, not just distribution of agents
-- [[centaur team performance depends on role complementarity]] — lab-level bias homogeneity undermines the complementarity argument
+- centaur team performance depends on role complementarity — lab-level bias homogeneity undermines the complementarity argument
 **Extraction hints:**
 - Primary claim: "Provider-level behavioral biases (sycophancy, optimization bias, status-quo legitimization) are stable across model versions and compound in multi-agent architectures — requiring psychometric auditing beyond standard benchmarks for effective governance of recursive AI systems."


@@ -33,8 +33,8 @@ Mechanistic interpretability analysis of why relocating a continuation-triggered
 **What I expected but didn't find:** A proposed fix. The paper identifies the problem but doesn't propose a mechanistic solution, implying that "deeper redesign" may mean departing from standard autoregressive generation paradigms.
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
-- [[AI capability and reliability are independent dimensions]] — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
+- scalable oversight degrades rapidly as capability gaps grow — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
+- AI capability and reliability are independent dimensions — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
 - Connects to SafeThink (2602.11096): if safety decisions crystallize early, this paper explains mechanistically WHY — the continuation-safety competition is resolved in early reasoning steps
 **Extraction hints:**