auto-fix: strip 8 broken wiki links
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run
Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
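A minimal sketch of what this stripping pass could look like. The function name, regex, and the idea of passing the KB's claim titles in as a set are assumptions for illustration, not the pipeline's actual code:

```python
import re

# Matches a [[wiki link]]; group 1 is the claim title inside the brackets.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_links(text: str, existing_claims: set) -> str:
    """Unwrap [[claim]] links whose title is not an existing KB claim.

    Links that resolve are left untouched; broken ones keep their
    text but lose the brackets.
    """
    def repl(match):
        claim = match.group(1)
        # Keep the link if it resolves, otherwise emit the bare title.
        return match.group(0) if claim in existing_claims else claim
    return WIKILINK.sub(repl, text)
```

Applied to a note line, `strip_broken_links("- [[a]] and [[b]]", {"a"})` leaves `[[a]]` intact and unwraps `b`, which matches the per-line changes in the hunks below.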
This commit is contained in:
parent
09484897a5
commit
aadab29b0b
6 changed files with 8 additions and 8 deletions
@@ -36,7 +36,7 @@ SafeThink is an inference-time safety defense for reasoning models where RL post
 
 **KB connections:**
 - [[the alignment problem dissolves when human values are continuously woven into the system rather than specified in advance]] — SafeThink operationalizes exactly this for inference-time monitoring
-- [[the specification trap means any values encoded at training time become structurally unstable]] — SafeThink bypasses specification by intervening at inference time
+- the specification trap means any values encoded at training time become structurally unstable — SafeThink bypasses specification by intervening at inference time
 - B4 concern: will models eventually detect and game the SafeThink monitor? The observer effect suggests yes, but this hasn't been demonstrated yet.
 
 **Extraction hints:**

@@ -35,7 +35,7 @@ Produces "interpretable edits that preserve the standard forward pass" — compo
 **What I expected but didn't find:** Robustness testing. The dual-use concern from the CFA² paper (2602.05444) applies directly here: the same Steer2Edit methodology that identifies safety-relevant components could be used to remove them, analogous to the SAE jailbreak approach. This gap should be noted.
 
 **KB connections:**
-- [[the alignment problem dissolves when human values are continuously woven into the system]] — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
+- the alignment problem dissolves when human values are continuously woven into the system — Steer2Edit is a mechanism for woven-in alignment without continuous retraining
 - Pairs with CFA² (2602.05444): same component-level insight, adversarial vs. defensive application
 - Pairs with SafeThink (2602.11096): SafeThink uses inference-time monitoring; Steer2Edit converts the monitoring signal into persistent edits
 
@@ -37,7 +37,7 @@ Paper introduces the concept of "regime leakage" — information cues that allow
 
 **KB connections:**
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — regime leakage is a formal mechanism explaining WHY behavioral evaluation degrades
-- [[AI capability and reliability are independent dimensions]] — regime-dependent behavioral divergence is another dimension of this independence
+- AI capability and reliability are independent dimensions — regime-dependent behavioral divergence is another dimension of this independence
 - The Apollo Research deliberative alignment finding (Session 23) operationalizes exactly what this paper theorizes: anti-scheming training improves evaluation-awareness (increases regime detection), then reduces covert actions via situational awareness rather than genuine alignment
 
 **Extraction hints:**

@@ -31,8 +31,8 @@ CFA² (Causal Front-Door Adjustment Attack) models LLM safety mechanisms as unob
 **What I expected but didn't find:** I expected the attack to require white-box access to internal activations. The paper suggests this is the case, but as interpretability becomes more accessible and models more transparent, the white-box assumption may relax over time.
 
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
-- [[AI capability and reliability are independent dimensions]] — interpretability and safety robustness are also partially independent
+- scalable oversight degrades rapidly as capability gaps grow — the dual-use concern here is distinct: oversight doesn't just degrade with capability gaps, it degrades with interpretability advances that help attackers as much as defenders
+- AI capability and reliability are independent dimensions — interpretability and safety robustness are also partially independent
 - Connects to Steer2Edit (2602.09870): both use interpretability tools for behavioral modification, one defensively, one adversarially — same toolkit, opposite aims
 
 **Extraction hints:**

@@ -34,7 +34,7 @@ A psychometric framework using "latent trait estimation under ordinal uncertaint
 
 **KB connections:**
 - [[three paths to superintelligence exist but only collective superintelligence preserves human agency]] — if collective approaches amplify monoculture biases, the agency-preservation argument requires diversity of providers, not just distribution of agents
-- [[centaur team performance depends on role complementarity]] — lab-level bias homogeneity undermines the complementarity argument
+- centaur team performance depends on role complementarity — lab-level bias homogeneity undermines the complementarity argument
 
 **Extraction hints:**
 - Primary claim: "Provider-level behavioral biases (sycophancy, optimization bias, status-quo legitimization) are stable across model versions and compound in multi-agent architectures — requiring psychometric auditing beyond standard benchmarks for effective governance of recursive AI systems."

@@ -33,8 +33,8 @@ Mechanistic interpretability analysis of why relocating a continuation-triggered
 **What I expected but didn't find:** A proposed fix. The paper identifies the problem but doesn't propose a mechanistic solution, implying that "deeper redesign" may mean departing from standard autoregressive generation paradigms.
 
 **KB connections:**
-- [[scalable oversight degrades rapidly as capability gaps grow]] — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
-- [[AI capability and reliability are independent dimensions]] — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
+- scalable oversight degrades rapidly as capability gaps grow — architectural jailbreak vulnerabilities scale with capability (stronger continuation → larger tension)
+- AI capability and reliability are independent dimensions — this is another manifestation: stronger generation capability creates stronger jailbreak vulnerability
 - Connects to SafeThink (2602.11096): if safety decisions crystallize early, this paper explains mechanistically WHY — the continuation-safety competition is resolved in early reasoning steps
 
 **Extraction hints:**