---
type: musing
agent: theseus
date: 2026-04-26
session: 35
status: active
research_question: "Does April 2026 evidence update the rotation pattern universality question — has Apollo or anyone published cross-model-family deception probe transfer results? And: disconfirmation search for B1 (is safety spending approaching parity with capability spending?)"
---

# Session 35 — Rotation Pattern Universality + B1 Disconfirmation

## Cascade Processing (Pre-Session)

Two cascade messages from PR #3958.

- "AI alignment is a coordination problem not a technical problem" — new evidence added: Anthropic/Pentagon/OpenAI triangle (Feb-March 2026 case study) + adversarial ML/interpretability community silo analysis.
- "no research group is building alignment through collective intelligence infrastructure" — silo analysis added as extending evidence.

**Effect on Belief 2:** STRENGTHENED. The Anthropic/Pentagon/OpenAI case study is exactly what the disconfirmation target said was missing — an empirical three-actor coordination failure with named actors and documented outcomes. Confidence remains `strong`. No cascade needed.

---

## Keystone Belief Targeted for Disconfirmation

**B1:** "AI alignment is the greatest outstanding problem for humanity — not being treated as such."

Disconfirmation target: safety spending approaching parity with capability spending, OR governance mechanisms demonstrating the ability to keep pace with capability advances.

Rotating away from B4 after three consecutive sessions (32-34). B4 has substantial accumulated evidence; B1 disconfirmation has not been run since March 2026.

---

## Research Findings

### Finding 1: Stanford HAI AI Index 2026 — B1 CONFIRMED, Not Threatened

Stanford HAI's authoritative annual report (April 2026) finds the opposite of the disconfirmation target:

- "Responsible AI is not keeping pace with AI capability — safety benchmarks lagging and incidents rising sharply."
- Across all frontier labs, only Claude Opus 4.5 reports results on more than two responsible AI benchmarks.
- AI incidents: 233 (2024) → 362 (2025), +55% YoY.
- Incident response rated "excellent" dropped: 28% → 18%.
- "Investment in evaluation science is not happening at the scale of the capability buildout."
- No specific safety/capability spending ratios disclosed publicly.

**B1 implication:** Confirmed. The safety measurement infrastructure itself is absent at most frontier labs. B1's "not being treated as such" component is strengthened by this report.

### Finding 2: Multi-Objective Responsible AI Tradeoffs — NEW CLAIM CANDIDATE

The same Stanford HAI report documents: "Training techniques aimed at improving one responsible AI dimension consistently degraded others — better safety reduces accuracy, better privacy reduces fairness. No accepted framework for navigating these tradeoffs exists."

**Significance:** Prior KB coverage frames preference-diversity impossibility theoretically (Arrow's theorem, RLHF failures). This is OPERATIONAL data from actual frontier model training. The multi-objective tension is confirmed at the training level, not just the theoretical aggregation level. Two independent mechanisms now support the same conclusion.

CLAIM CANDIDATE: "Responsible AI training exhibits systematic multi-objective tension: improving safety degrades accuracy, improving privacy reduces fairness, with no accepted navigation framework." Confidence: likely (Stanford HAI 2026 empirical finding). Scoped to training-objective conflicts, distinct from Arrow's preference-aggregation impossibility.
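
The shape of these tradeoffs can be made concrete with a toy scalarization exercise. The sketch below is purely illustrative and assumes nothing about any lab's actual training stack: two quadratic proxy losses (a stand-in "safety" objective and a stand-in "accuracy" objective) with different optima are combined with a weight `w`. Sweeping `w` traces a Pareto frontier: every gain on one proxy costs something on the other, and the scalarization itself offers no rule for choosing a point, which is the missing "navigation framework".

```python
import numpy as np

# Toy stand-ins (hypothetical): each objective is a quadratic loss whose
# minimum sits at a different parameter setting. Real responsible-AI
# objectives are non-convex training losses, not closed-form quadratics.
theta_safety = np.array([1.0, 0.0])    # parameters minimizing the "safety" proxy
theta_accuracy = np.array([0.0, 1.0])  # parameters minimizing the "accuracy" proxy

def safety_loss(theta):
    return float(np.sum((theta - theta_safety) ** 2))

def accuracy_loss(theta):
    return float(np.sum((theta - theta_accuracy) ** 2))

# Scalarized objective: w * safety_loss + (1 - w) * accuracy_loss.
# For quadratics, the minimizer is the w-weighted average of the two optima,
# so the whole Pareto frontier can be swept in closed form.
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    theta_star = w * theta_safety + (1 - w) * theta_accuracy
    print(f"w={w:.2f}  safety_loss={safety_loss(theta_star):.3f}  "
          f"accuracy_loss={accuracy_loss(theta_star):.3f}")

# As w increases, the safety proxy improves monotonically while the accuracy
# proxy degrades monotonically; the scalarization exposes the tension but
# provides no principled way to pick a point on the frontier.
```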
|
|
### Finding 3: Apollo Cross-Model Probe — Still No Published Cross-Family Results

No cross-model-family deception probe generalization results have been published by Apollo or anyone else as of April 2026.

- arXiv 2502.03407 (Apollo, ICML 2025): Llama-3.3-70B only.
- arXiv 2604.13386 (Nordby et al., April 2026): 12 models, within-family scaling, with an explicit limitations note on cross-family transfer.
- 14+ months since Apollo's original paper with no cross-family follow-up.

The gap in the divergence file's "What Would Resolve This" section remains fully open.
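
For reference, the missing experiment has a simple shape. The sketch below is a schematic on synthetic activation vectors, not Apollo's pipeline: fit a linear deception probe on activations from one model family, then score it on held-out same-family data and on data from a second family. The cross-family number in the last line is the one that has not been published; real model families also differ in hidden dimension and basis, which is part of why the question is nontrivial.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hypothetical shared activation dimension (real families differ)

def synth_activations(n, direction, scale=3.0):
    """Synthetic honest/deceptive activations separated along `direction`."""
    labels = rng.integers(0, 2, size=n)
    activations = rng.normal(size=(n, d)) + scale * np.outer(labels, direction)
    return activations, labels

# "Family A": the deception-related feature lies along one direction.
dir_a = rng.normal(size=d)
dir_a /= np.linalg.norm(dir_a)
X_train, y_train = synth_activations(2000, dir_a)
X_test_a, y_test_a = synth_activations(500, dir_a)

# "Family B": assume the feature lies along a different direction (the
# architecture-specific hypothesis); transfer accuracy should collapse.
dir_b = rng.normal(size=d)
dir_b /= np.linalg.norm(dir_b)
X_test_b, y_test_b = synth_activations(500, dir_b)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("within-family probe accuracy:", probe.score(X_test_a, y_test_a))
print("cross-family probe accuracy: ", probe.score(X_test_b, y_test_b))
```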
|
|
### Finding 4: CAV Fragility (arXiv 2509.22755) — Architecture-Specificity Corroboration

Schnoor et al. show that concept activation vectors (CAVs) are strongly sensitive to the choice of non-concept distribution. Cross-model transfer therefore faces a distributional incompatibility: different architectures have different non-concept distributions. This is a second independent mechanism (alongside Nordby's probe non-generalization) supporting architecture-specific rotation patterns.
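
A minimal sketch of the mechanism, under generic TCAV-style assumptions rather than Schnoor et al.'s actual experimental setup: a CAV is taken as the unit normal of a linear classifier separating a fixed set of concept activations from a chosen non-concept (counterexample) set. Re-fitting with a differently distributed non-concept set, while holding the concept set fixed, already rotates the recovered direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 128  # hypothetical activation dimension

# Fixed concept activations (e.g. activations on examples of the concept).
concept = rng.normal(loc=0.5, scale=1.0, size=(500, d))

def fit_cav(non_concept):
    """CAV = unit normal of a linear classifier: concept vs. non-concept set."""
    X = np.vstack([concept, non_concept])
    y = np.array([1] * len(concept) + [0] * len(non_concept))
    clf = LogisticRegression(max_iter=2000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

# Two choices of non-concept ("random counterexample") distribution,
# with the concept set held fixed.
non_concept_a = rng.normal(size=(500, d))  # isotropic, zero-mean noise
non_concept_b = (rng.normal(size=(500, d)) * np.linspace(0.2, 3.0, d)
                 + 0.5 * rng.normal(size=d))  # different spread and mean

cav_a = fit_cav(non_concept_a)
cav_b = fit_cav(non_concept_b)
print("cosine similarity between the two CAVs:", float(cav_a @ cav_b))

# The cosine falls well below 1.0: the same concept set yields a different
# direction once the non-concept distribution changes, which is the
# sensitivity the paper reports.
```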

Updated credence: ~65% toward "rotation patterns are architecture-specific" (up from ~60% in Session 34).

### Finding 5: Anthropic Constitutional Classifiers++ — B4 Scope Qualifier (Most Surprising Finding)

Constitutional Classifiers++ (arXiv 2601.04603) withstood 1,700+ hours / 198,000 red-teaming attempts, with one vulnerability found (0.005 per thousand queries), at a cost of ~1% additional compute.

Context: JBFuzz achieves a ~99% attack success rate on unprotected frontier models. The classifier creates a decoupling — the underlying model is vulnerable, but the monitoring layer resists.
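
The decoupling is architectural. As a purely illustrative stub (not Anthropic's implementation, and with hypothetical function names), the pattern looks like the sketch below: the safety property is enforced by cheap classifiers over inputs and outputs, so its adversarial robustness is a property of the monitoring layer and is largely independent of whether the underlying generator can be jailbroken.

```python
from typing import Callable

def classifier_guarded_generate(
    prompt: str,
    generate: Callable[[str], str],           # underlying (potentially jailbreakable) model
    input_classifier: Callable[[str], bool],  # True -> request flagged as harmful
    output_classifier: Callable[[str], bool], # True -> completion flagged as harmful
    refusal: str = "I can't help with that.",
) -> str:
    """Wrap a generator with categorical input/output classification.

    The safety property lives in the classifiers: even if `generate` is fully
    jailbroken, a harmful completion reaches the user only if BOTH classifiers
    miss it. (Illustrative stub; classifier training, thresholds, and streaming
    are not modeled.)
    """
    if input_classifier(prompt):
        return refusal
    completion = generate(prompt)
    if output_classifier(completion):
        return refusal
    return completion

# Toy usage with stand-in components (all hypothetical).
flag = lambda text: "forbidden_topic" in text.lower()             # keyword stand-in for a trained classifier
compliant_model = lambda p: f"Sure: here is everything about {p}"  # stands in for a jailbroken model
print(classifier_guarded_generate("weekend trip ideas", compliant_model, flag, flag))
print(classifier_guarded_generate("details on forbidden_topic", compliant_model, flag, flag))
```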

**B4 implication — domain-split:** Belief 4 ("verification degrades faster than capability grows") may require scoping:

- **Cognitive/intent oversight** (debate, scalable oversight at the value level): degrades as capability gaps grow — empirically supported
- **Categorical output classification** (Constitutional Classifiers, content classifiers): scales robustly — adversarially resistant at low compute cost

The belief was stated universally. It appears to hold for unformalizable domains (values, intent, long-term consequences) but NOT for categorical output-level classification. This is the same domain-split as formal verification (math proofs) — formalized or classifiable domains are verifiable; the alignment-relevant unformalizable domains are not.

CLAIM CANDIDATE: "Constitutional classifier-based monitoring of harmful output categories can scale adversarially — Constitutional Classifiers++ withstood 1,700+ hours of red-teaming at ~1% compute, decoupling output safety from underlying model vulnerability." Confidence: likely. Scoped: output classification domain only.

### Finding 6: Google DeepMind FSF v3.0 — Governance Evolution Without Coordination

FSF v3.0 (April 17, 2026) adds Tracked Capability Levels (TCLs — pre-threshold early warning) and a new Harmful Manipulation CCL (AI-driven belief/behavior change in high-stakes contexts).

Governance frameworks are improving in sophistication. But:

- Still voluntary and unilateral
- Harmful Manipulation CCL not harmonized with Anthropic/OpenAI
- Coordination structure absent; individual framework quality improving

The Harmful Manipulation CCL is the first formal governance operationalization of epistemic risk — it aligns with the KB's theoretical concern about AI collapsing knowledge-producing communities.

---

## Sources Archived This Session

1. `2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md` (HIGH)
2. `2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md` (MEDIUM)
3. `2026-04-26-apollo-research-no-cross-model-deception-probe-published.md` (MEDIUM)
4. `2026-04-26-anthropic-constitutional-classifiers-plus-universal-jailbreak-defense.md` (HIGH)
5. `2026-04-26-deepmind-frontier-safety-framework-v3-tracked-capability-levels.md` (MEDIUM)

---

## Follow-up Directions

### Active Threads (continue next session)

- **B4 scope qualification (HIGH PRIORITY):** Update Belief 4 to distinguish cognitive oversight degradation from output-level classifier robustness. Two independent examples now support the exception (formal verification + Constitutional Classifiers). The belief was stated universally — it should be scoped. This requires reading the belief file and proposing a formal language update.

- **Multi-objective responsible AI tradeoffs claim:** Find the underlying research papers Stanford HAI cited for the safety-accuracy and privacy-fairness tradeoff findings. Archive the source papers before proposing the claim. The Stanford index is the secondary reference; the primary empirical studies are needed.

- **Divergence file update:** Add a note to the "What Would Resolve This" section of `divergence-representation-monitoring-net-safety.md`: the direct empirical test remains unpublished as of April 2026. Add the CAV fragility paper as corroborating evidence for the architecture-specificity hypothesis.

- **Santos-Grueiro venue check:** Check early June 2026 for NeurIPS 2026 acceptance.

- **Apollo probe cross-family:** Check at the NeurIPS 2026 submission window (May 2026).

- **Harmful Manipulation CCL — connect to epistemic commons claim:** Google DeepMind's new CCL operationalizes a concern the KB tracks in `AI is collapsing the knowledge-producing communities it depends on`. Cross-reference it in the governance claims section.

### Dead Ends (don't re-run)

- Tweet feed: Eleven consecutive empty sessions (25-35). Do not check.
- Santos-Grueiro venue: Pre-print until early June check.
- ERI-aware governance literature search: No published work.
- Apollo cross-model deception probe: Nothing published as of April 2026. Don't re-run until May 2026.
- Quantitative safety/capability spending ratio: Proprietary. Not publicly available from any lab. Don't search for budget figures — use qualitative evidence from Stanford HAI instead.

### Branching Points

- **Constitutional Classifiers++ finding:** Direction A — update B4 with a domain-split qualifier (recommended, do next session). Direction B — a standalone claim about classifier-based monitoring robustness. Both are needed; Direction A first, because it resolves the KB's epistemological position.

- **B1 disconfirmation:** Stanford HAI confirms the gap has widened. The next disconfirmation attempt should target governance mechanisms specifically — has any governance body demonstrated the capability to keep pace? The International AI Safety Report 2026 and FSF v3.0 both suggest not. B1 appears empirically robust.