auto-fix: strip 8 broken wiki links
Some checks are pending
Mirror PR to Forgejo / mirror (pull_request) Waiting to run
Pipeline auto-fixer: removed [[ ]] brackets from links that don't resolve to existing claims in the knowledge base.
This commit is contained in:
parent
75afef3ae6
commit
43eca8b8e3
6 changed files with 8 additions and 8 deletions
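The auto-fixer's behavior described above (unwrapping `[[ ]]` links whose targets don't resolve to existing claims) can be sketched roughly as follows. This is a hypothetical reconstruction, not the pipeline's actual code; the function name, the regex, and the `existing_claims` lookup are all assumptions.

```python
import re

# Matches [[wiki-style links]]; group 1 is the link target text.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")

def strip_broken_wiki_links(text: str, existing_claims: set[str]) -> str:
    """Unwrap [[link]] to plain text when the target is not a known claim.

    Links whose target resolves to an existing claim are left untouched.
    """
    def repl(match: re.Match) -> str:
        target = match.group(1)
        # Keep the brackets only when the target resolves.
        return match.group(0) if target in existing_claims else target

    return WIKI_LINK.sub(repl, text)
```

Under this sketch, `strip_broken_wiki_links("see [[a]] and [[b]]", {"a"})` would return `see [[a]] and b`, matching the pattern of the edits in the diff below.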
```diff
@@ -120,7 +120,7 @@ The Harmful Manipulation CCL is the first formal governance operationalization o
 
 - **Apollo probe cross-family:** Check at NeurIPS 2026 submission window (May 2026).
 
-- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `[[AI is collapsing the knowledge-producing communities it depends on]]`. Cross-reference in governance claims section.
+- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes concern KB tracks in `AI is collapsing the knowledge-producing communities it depends on`. Cross-reference in governance claims section.
 
 ### Dead Ends (don't re-run)
 
```
```diff
@@ -50,7 +50,7 @@ tags: [constitutional-classifiers, jailbreaks, adversarial-robustness, monitorin
 - POSSIBLE NEW CLAIM: "Output-level safety classifiers trained on constitutional principles are robust to adversarial jailbreaks at ~1% compute overhead, providing scalable output monitoring that decouples verification robustness from underlying model vulnerability."
 - Confidence: likely (empirically supported by 1,700+ hours testing, but limited to one adversarial domain and one evaluation period)
 - SCOPE CRITICAL: This claim is specifically about output classification of categorical harmful content, not about verifying values, intent, or long-term consequences.
-- DIVERGENCE CHECK: Does this create tension with [[scalable oversight degrades rapidly as capability gaps grow]]? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
+- DIVERGENCE CHECK: Does this create tension with scalable oversight degrades rapidly as capability gaps grow? The oversight degradation claim is about debate-based scalable oversight (cognitive evaluation tasks), not about output classification. These are different mechanisms — scope mismatch, not genuine divergence. The extractor should note this scope separation.
 
 **Context:** The Constitutional Classifiers research is Anthropic's response to the universal jailbreak problem. The original paper (arXiv 2501.18837) established the approach; the ++ version improves compute efficiency. The 1,700 hours figure is from the original paper; the ++ paper extends this. Both are from Anthropic's Alignment Science team. The critical question for KB value: is this evidence of "verification working" or "narrow classification working"? The answer matters for B4's scope.
```
```diff
@@ -38,7 +38,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 **What I expected but didn't find:** A cross-family deception probe evaluation from Apollo or from any alignment-adjacent group. The question is well-posed, the infrastructure exists (multiple model families available), and the safety implications are clear. The absence after 14+ months is a genuine gap.
 
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — this absence of evidence confirms the "What Would Resolve This" section remains open
+- divergence-representation-monitoring-net-safety — this absence of evidence confirms the "What Would Resolve This" section remains open
 - [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]] — the absence of cross-model probe testing is another instance of the community-silo/institutional gap pattern
 - [[multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks]] — the moderating claim depends on architecture-specificity; the absence of cross-model testing means this claim remains speculative
```
```diff
@@ -51,7 +51,7 @@ tags: [apollo-research, deception-probe, cross-model-transfer, absence-of-eviden
 
 ## Curator Notes (structured handoff for extractor)
 
-PRIMARY CONNECTION: [[divergence-representation-monitoring-net-safety]] — the "What Would Resolve This" section remains open
+PRIMARY CONNECTION: divergence-representation-monitoring-net-safety — the "What Would Resolve This" section remains open
 
 WHY ARCHIVED: Confirms that as of April 2026, the direct empirical test needed to resolve the divergence does not exist in published form. Closes the Apollo cross-model search for now.
 
```
```diff
@@ -50,7 +50,7 @@ tags: [governance, frontier-safety-framework, google-deepmind, capability-levels
 
 **Extraction hints:**
 - DO NOT create a new claim about FSF v3.0 in isolation — one governance framework update doesn't warrant a standalone claim.
-- CONSIDER enriching [[voluntary safety pledges cannot survive competitive pressure]] with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
+- CONSIDER enriching voluntary safety pledges cannot survive competitive pressure with the FSF v3.0 context: frameworks are becoming more sophisticated (TCL tier, Harmful Manipulation CCL) but remain unilateral and voluntary, confirming the structural limitation.
 - CLAIM CANDIDATE (lower priority): "Frontier lab safety frameworks are converging on tiered capability monitoring architectures (pre-threshold tracking plus threshold-triggered mitigations), suggesting an emerging governance norm, but the converging form is voluntary and unilateral." Confidence: experimental. Needs OpenAI/Anthropic framework comparison.
 - The Harmful Manipulation CCL is worth a potential note in Theseus's musings about epistemic risk governance — it's the first formal governance operationalization of narrative/epistemic AI risks.
```
```diff
@@ -44,7 +44,7 @@ tags: [concept-activation-vectors, adversarial-attacks, representation-monitorin
 **What I expected but didn't find:** An empirical test of SCAV concept direction transfer across model families. The paper establishes CAV fragility theoretically but doesn't test SCAV transfer across architectures.
 
 **KB connections:**
-- [[divergence-representation-monitoring-net-safety]] — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
+- divergence-representation-monitoring-net-safety — the active divergence this provides supporting evidence for (architecture-specific rotation patterns)
 - [[rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility]] — this fragility finding is corroborating evidence that rotation patterns (and CAV-based attacks on them) are not universal
 - [[scalable oversight degrades rapidly as capability gaps grow with debate achieving only 50 percent success at moderate gaps]] — the monitoring degradation pattern
```
```diff
@@ -59,8 +59,8 @@ tags: [safety-benchmarks, responsible-ai, capability-gap, ai-incidents, governan
 
 **Extraction hints:**
 - PRIMARY NEW CLAIM: "Responsible AI dimensions are in systematic multi-objective tension where improving safety degrades accuracy and improving privacy reduces fairness, with no accepted framework for navigation." This is empirical confirmation of Arrow-style impossibility at the operational level — it's broader and more concrete than the Arrow's theorem claim.
-- ENRICH: [[voluntary safety pledges cannot survive competitive pressure]] — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
-- ENRICH: [[the alignment tax creates a structural race to the bottom]] — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
+- ENRICH: voluntary safety pledges cannot survive competitive pressure — the benchmark reporting gap (only Claude reports on 2+ benchmarks) is new direct evidence.
+- ENRICH: the alignment tax creates a structural race to the bottom — the multi-objective tradeoff finding is new direct evidence. The "tax" is larger than previously documented.
 - DO NOT create a new claim about AI incidents rising — the absolute numbers (233 → 362) are context, not a standalone KB claim.
 
 **Context:** Stanford HAI publishes the AI Index annually. The 2026 edition was published April 2026, covers 2025 data, and is one of the most widely-cited external assessments of the AI landscape. The responsible AI chapter is specifically about whether safety efforts are keeping pace — it is directly designed to measure the B1 disconfirmation question.
```