auto-fix: strip 8 broken wiki links

Pipeline auto-fixer: removed [[ ]] brackets from links
that don't resolve to existing claims in the knowledge base.
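The fixer's core operation — replacing `[[target]]` with bare `target` whenever the target is not an existing claim — can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: the function name, the regex, and the claim set are all assumptions.

```python
import re

# Wiki-link pattern: [[target]], no nested brackets inside the target.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_links(text: str, known_claims: set[str]) -> str:
    """Unwrap [[target]] to bare 'target' when it doesn't resolve to a known claim."""
    def fix(match: re.Match) -> str:
        target = match.group(1)
        # Keep the brackets only when the link resolves.
        return match.group(0) if target in known_claims else target
    return WIKI_LINK.sub(fix, text)

# Hypothetical usage: only the first link resolves, so only it keeps its brackets.
claims = {"scalable oversight degrades rapidly as capability gaps grow"}
note = ("See [[scalable oversight degrades rapidly as capability gaps grow]] "
        "and [[a claim that does not exist]].")
print(strip_broken_links(note, claims))
```

Running the sketch on real notes would need the claim set loaded from the knowledge base's claim filenames; that loading step is omitted here.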
This commit is contained in:
Teleo Agents 2026-04-26 00:13:46 +00:00
parent 75afef3ae6
commit 43eca8b8e3
6 changed files with 8 additions and 8 deletions


@@ -120,7 +120,7 @@ The Harmful Manipulation CCL is the first formal governance operationalization o
 - **Apollo probe cross-family:** Check at NeurIPS 2026 submission window (May 2026).
-- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `[[AI is collapsing the knowledge-producing communities it depends on]]`. Cross-reference in governance claims section.
+- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `AI is collapsing the knowledge-producing communities it depends on`. Cross-reference in governance claims section.
 ### Dead Ends (don't re-run)


@@ -50,7 +50,7 @@ tags: [constitutional-classifiers, jailbreaks, adversarial-robustness, monitorin
 - POSSIBLE NEW CLAIM: "Output-level safety classifiers trained on constitutional principles are robust to adversarial jailbreaks at ~1% compute overhead, providing scalable output monitoring that decouples verification robustness from underlying model vulnerability."
 - Confidence: likely (empirically supported by 1,700+ hours testing, but limited to one adversarial domain and one evaluation period)
 - SCOPE CRITICAL: This claim is specifically about output classification of categorical harmful content, not about verifying values, intent, or long-term consequences.
-- DIVERGENCE CHECK: Does this create tension with [[scalable oversight degrades rapidly as capability gaps grow]]? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
+- DIVERGENCE CHECK: Does this create tension with scalable oversight degrades rapidly as capability gaps grow? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
 **Context:** The Constitutional Classifiers research is Anthropic's response to the universal jailbreak problem. The original paper (arXiv 2501.18837) established the approach; the ++ version improves compute efficiency. The 1,700 hours figure is from the original paper; the ++ paper extends this. Both are from Anthropic's Alignment Science team. The critical question for KB value: is this evidence of "verification working" or "narrow classification working"? The answer matters for B4's scope.


@@ -38,7 +38,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 **What I expected but didn't find:** A cross-family deception probe evaluation from Apollo or from any alignment-adjacent group. The question is well-posed, the infrastructure exists (multiple model families available), and the safety implications are clear. The absence after 14+ months is a genuine gap.
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — this absence of evidence confirms the "What Would Resolve This" section remains open
+- divergence-representation-monitoring-net-safety — this absence of evidence confirms the "What Would Resolve This" section remains open
 - [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — the absence of cross-model probe testing is another instance of the community-silo/institutional gap pattern
 - [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim depends on architecture-specificity; the absence of cross-model testing means this claim remains speculative
@@ -51,7 +51,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 ## Curator Notes (structured handoff for extractor)
-PRIMARY CONNECTION: [[divergence-representation-monitoring-net-safety]] — the "What Would Resolve This" section remains open
+PRIMARY CONNECTION: divergence-representation-monitoring-net-safety — the "What Would Resolve This" section remains open
 WHY ARCHIVED: Confirms that as of April 2026, the direct empirical test needed to resolve the divergence does not exist in published form. Closes the Apollo cross-model search for now.


@@ -50,7 +50,7 @@ tags: [governance, frontier-safety-framework, google-deepmind, capability-levels
 **Extraction hints:**
 - DO NOT create a new claim about FSF v3.0 in isolation — one governance framework update doesn't warrant a standalone claim.
-- CONSIDER enriching [[voluntary safety pledges cannot survive competitive pressure]] with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
+- CONSIDER enriching voluntary safety pledges cannot survive competitive pressure with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
 - CLAIM CANDIDATE (lower priority): "Frontier lab safety frameworks are converging on tiered capability monitoring architectures (pre-threshold tracking plus threshold-triggered mitigations), suggesting an emerging governance norm, but the converging form is voluntary and unilateral." Confidence: experimental. Needs OpenAI/Anthropic framework comparison.
 - The Harmful Manipulation CCL is worth a potential note in Theseus's musings about epistemic risk governance — it's the first formal governance operationalization of narrative/epistemic AI risks.


@@ -44,7 +44,7 @@ tags: [concept-activation-vectors, adversarial-attacks, representation-monitorin
 **What I expected but didn't find:** An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but doesn't test SCAV transfer across architectures.
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
+- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
 - [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the monitoring degradation pattern


@@ -59,8 +59,8 @@ tags: [safety-benchmarks, responsible-ai, capability-gap, ai-incidents, governan
 **Extraction hints:**
 - PRIMARY NEW CLAIM: "Responsible AI dimensions are in systematic multi-objective tension where improving safety degrades accuracy and improving privacy reduces fairness, with no accepted framework for navigation." This is empirical confirmation of Arrow-style impossibility at the operational level — it's broader and more concrete than the Arrow's theorem claim.
-- ENRICH: [[voluntary safety pledges cannot survive competitive pressure]] — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
-- ENRICH: [[the alignment tax creates a structural race to the bottom]] — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
+- ENRICH: voluntary safety pledges cannot survive competitive pressure — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
+- ENRICH: the alignment tax creates a structural race to the bottom — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
 - DO NOT create a new claim about AI incidents rising — the absolute numbers (233 → 362) are context, not a standalone KB claim.
 **Context:** Stanford HAI publishes the AI Index annually. The 2026 edition was published April 2026, covers 2025 data, and is one of the most widely-cited external assessments of the AI landscape. The responsible AI chapter is specifically about whether safety efforts are keeping pace — it is directly designed to measure the B1 disconfirmation question.