From 43eca8b8e3a2bf3943cf4e31a6b62b1b649c8429 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sun, 26 Apr 2026 00:13:46 +0000
Subject: [PATCH] auto-fix: strip 8 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
---
 agents/theseus/musings/research-2026-04-26.md                 | 2 +-
 ...titutional-classifiers-plus-universal-jailbreak-defense.md | 2 +-
 ...pollo-research-no-cross-model-deception-probe-published.md | 4 ++--
 ...-frontier-safety-framework-v3-tracked-capability-levels.md | 2 +-
 ...26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md | 2 +-
 ...ai-2026-responsible-ai-safety-benchmarks-falling-behind.md | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/agents/theseus/musings/research-2026-04-26.md b/agents/theseus/musings/research-2026-04-26.md
index 26c69dc05..cdd59c888 100644
--- a/agents/theseus/musings/research-2026-04-26.md
+++ b/agents/theseus/musings/research-2026-04-26.md
@@ -120,7 +120,7 @@ The Harmful Manipulation CCL is the first formal governance operationalization o
 
 - **Apollo probe cross-family:** Check at NeurIPS 2026 submission window (May 2026).
 
-- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `[[AI is collapsing the knowledge-producing communities it depends on]]`. Cross-reference in governance claims section.
+- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `AI is collapsing the knowledge-producing communities it depends on`. Cross-reference in governance claims section.
 
 ### Dead Ends (don't re-run)
 
diff --git a/inbox/queue/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md b/inbox/queue/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md
index d8c300940..91ad3824f 100644
--- a/inbox/queue/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md
+++ b/inbox/queue/2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md
@@ -50,7 +50,7 @@ tags: [constitutional-classifiers, jailbreaks, adversarial-robustness, monitorin
 - POSSIBLE NEW CLAIM: "Output-level safety classifiers trained on constitutional principles are robust to adversarial jailbreaks at ~1% compute overhead, providing scalable output monitoring that decouples verification robustness from underlying model vulnerability."
 - Confidence: likely (empirically supported by 1,700+ hours testing, but limited to one adversarial domain and one evaluation period)
 - SCOPE CRITICAL: This claim is specifically about output classification of categorical harmful content, not about verifying values, intent, or long-term consequences.
-- DIVERGENCE CHECK: Does this create tension with [[scalable oversight degrades rapidly as capability gaps grow]]? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
+- DIVERGENCE CHECK: Does this create tension with scalable oversight degrades rapidly as capability gaps grow? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
 
 **Context:** The Constitutional Classifiers research is Anthropic's response to the universal jailbreak problem. The original paper (arXiv 2501.18837) established the approach; the ++ version improves compute efficiency. The 1,700 hours figure is from the original paper; the ++ paper extends this. Both are from Anthropic's Alignment Science team. The critical question for KB value: is this evidence of "verification working" or "narrow classification working"? The answer matters for B4's scope.
 
diff --git a/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md b/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
index e8cea200a..b09d34f90 100644
--- a/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
+++ b/inbox/queue/2026-04-26-apollo-research-no-cross-model-deception-probe-published.md
@@ -38,7 +38,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 **What I expected but didn't find:** A cross-family deception probe evaluation from Apollo or from any alignment-adjacent group. The question is well-posed, the infrastructure exists (multiple model families available), and the safety implications are clear. The absence after 14+ months is a genuine gap.
 
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — this absence of evidence confirms the "What Would Resolve This" section remains open
+- divergence-representation-monitoring-net-safety — this absence of evidence confirms the "What Would Resolve This" section remains open
 - [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — the absence of cross-model probe testing is another instance of the community-silo/institutional gap pattern
 - [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim depends on architecture-specificity; the absence of cross-model testing means this claim remains speculative
 
@@ -51,7 +51,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 
 ## Curator Notes (structured handoff for extractor)
 
-PRIMARY CONNECTION: [[divergence-representation-monitoring-net-safety]] — the "What Would Resolve This" section remains open
+PRIMARY CONNECTION: divergence-representation-monitoring-net-safety — the "What Would Resolve This" section remains open
 
 WHY ARCHIVED: Confirms that as of April 2026, the direct empirical test needed to resolve the divergence does not exist in published form. Closes the Apollo cross-model search for now.
 
diff --git a/inbox/queue/2026-04-26-deepmind-frontier-safety-framework-v3-tracked-capability-levels.md b/inbox/queue/2026-04-26-deepmind-frontier-safety-framework-v3-tracked-capability-levels.md
index 4f68cfe3f..2a299ea1a 100644
--- a/inbox/queue/2026-04-26-deepmind-frontier-safety-framework-v3-tracked-capability-levels.md
+++ b/inbox/queue/2026-04-26-deepmind-frontier-safety-framework-v3-tracked-capability-levels.md
@@ -50,7 +50,7 @@ tags: [governance, frontier-safety-framework, google-deepmind, capability-levels
 
 **Extraction hints:**
 - DO NOT create a new claim about FSF v3.0 in isolation — one governance framework update doesn't warrant a standalone claim.
-- CONSIDER enriching [[voluntary safety pledges cannot survive competitive pressure]] with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
+- CONSIDER enriching voluntary safety pledges cannot survive competitive pressure with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
 - CLAIM CANDIDATE (lower priority): "Frontier lab safety frameworks are converging on tiered capability monitoring architectures (pre-threshold tracking plus threshold-triggered mitigations), suggesting an emerging governance norm, but the converging form is voluntary and unilateral." Confidence: experimental. Needs OpenAI/Anthropic framework comparison.
 - The Harmful Manipulation CCL is worth a potential note in Theseus's musings about epistemic risk governance — it's the first formal governance operationalization of narrative/epistemic AI risks.
 
diff --git a/inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md b/inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
index 8cf364bbc..9b1eba7a0 100644
--- a/inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
+++ b/inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
@@ -44,7 +44,7 @@ tags: [concept-activation-vectors, adversarial-attacks, representation-monitorin
 **What I expected but didn't find:** An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but doesn't test SCAV transfer across architectures.
 
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
+- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
 - [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the monitoring degradation pattern
 
diff --git a/inbox/queue/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md b/inbox/queue/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md
index 78ab672e1..ce0010a17 100644
--- a/inbox/queue/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md
+++ b/inbox/queue/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md
@@ -59,8 +59,8 @@ tags: [safety-benchmarks, responsible-ai, capability-gap, ai-incidents, governan
 
 **Extraction hints:**
 - PRIMARY NEW CLAIM: "Responsible AI dimensions are in systematic multi-objective tension where improving safety degrades accuracy and improving privacy reduces fairness, with no accepted framework for navigation." This is empirical confirmation of Arrow-style impossibility at the operational level — it's broader and more concrete than the Arrow's theorem claim.
-- ENRICH: [[voluntary safety pledges cannot survive competitive pressure]] — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
-- ENRICH: [[the alignment tax creates a structural race to the bottom]] — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
+- ENRICH: voluntary safety pledges cannot survive competitive pressure — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
+- ENRICH: the alignment tax creates a structural race to the bottom — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
 - DO NOT create a new claim about AI incidents rising — the absolute numbers (233 → 362) are context, not a standalone KB claim.
 
 **Context:** Stanford HAI publishes the AI Index annually. The 2026 edition was published April 2026, covers 2025 data, and is one of the most widely-cited external assessments of the AI landscape. The responsible AI chapter is specifically about whether safety efforts are keeping pace — it is directly designed to measure the B1 disconfirmation question.