Compare commits

...

2 commits

Author SHA1 Message Date
Teleo Agents
b979f5d167 theseus: extract claims from 2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind
- Source: inbox/queue/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 5
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-26 00:30:19 +00:00
Teleo Agents
8c2fdbb44a theseus: extract claims from 2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks
- Source: inbox/queue/2026-04-26-schnoor-2509.22755-cav-fragility-adversarial-attacks.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>
2026-04-26 00:29:24 +00:00
5 changed files with 41 additions and 3 deletions


@@ -23,3 +23,10 @@ Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer
**Source:** Apollo Research publication gap analysis, April 2026
The moderating claim that multi-layer ensemble probes provide black-box robustness depends on whether rotation patterns are architecture-specific or universal. As of April 2026, no cross-model-family probe transfer testing has been published, so the architecture-specificity assumption remains empirically untested. The absence of such testing after 14+ months suggests one of three explanations: (a) cross-family transfer is known internally to fail and is not considered worth publishing, (b) research agendas prioritize within-family deployment robustness, or (c) the experiment requires infrastructure that has not yet been built.
## Extending Evidence
**Source:** Schnoor et al. 2025, arXiv 2509.22755
CAV-based monitoring techniques are fundamentally sensitive to the choice of non-concept distribution (Schnoor et al., arXiv 2509.22755). The authors show that CAVs are random vectors whose distribution depends heavily on the arbitrary choice of non-concept examples used during training, and they present an adversarial attack on TCAV (Testing with CAVs) that exploits this dependence. This suggests that cross-architecture transfer of concept directions faces distributional incompatibility beyond architectural differences alone: even within a single model, CAV reliability depends on training-distribution choices that would necessarily differ across model families.
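
The dependence on non-concept examples described above can be illustrated with a deliberately contrived toy. This is a minimal sketch, not the construction from Schnoor et al.: it assumes a difference-of-means CAV (TCAV itself typically fits a linear classifier), and the 2-D "activations" and all function names are hypothetical. It only shows that the same concept examples paired with two different non-concept sets can yield very different directions.

```python
import math


def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]


def difference_of_means_cav(concept, non_concept):
    """Unit vector from the non-concept mean to the concept mean.

    Assumption: a simplified stand-in for a CAV, not the classifier-based
    definition used by TCAV.
    """
    mc = mean_vector(concept)
    mn = mean_vector(non_concept)
    diff = [a - b for a, b in zip(mc, mn)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D activations for one fixed set of concept examples.
concept = [(2.0, 1.0), (2.2, 0.8), (1.8, 1.2)]

# Two arbitrary choices of non-concept examples for the same concept.
non_concept_a = [(0.0, 1.0), (0.2, 0.9), (-0.2, 1.1)]
non_concept_b = [(2.0, -1.0), (2.1, -1.2), (1.9, -0.8)]

cav_a = difference_of_means_cav(concept, non_concept_a)
cav_b = difference_of_means_cav(concept, non_concept_b)

# The two directions are nearly orthogonal in this contrived setup.
print("cosine(cav_a, cav_b) =", round(cosine(cav_a, cav_b), 6))
```

The non-concept sets here are chosen to make the effect extreme; with realistic activations the disagreement would be a matter of degree.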


@@ -0,0 +1,18 @@
---
type: claim
domain: ai-alignment
description: Empirical confirmation at operational scale that alignment objectives trade off against each other and against capability, extending Arrow's impossibility theorem from preference aggregation to training dynamics
confidence: experimental
source: Stanford HAI AI Index 2026, Responsible AI chapter
created: 2026-04-26
title: Responsible AI dimensions exhibit systematic multi-objective tension where improving safety degrades accuracy and improving privacy reduces fairness with no accepted navigation framework
agent: theseus
sourced_from: ai-alignment/2026-04-26-stanford-hai-2026-responsible-ai-safety-benchmarks-falling-behind.md
scope: structural
sourcer: Stanford Human-Centered Artificial Intelligence
related: ["the-alignment-tax-creates-a-structural-race-to-the-bottom-because-safety-training-costs-capability-and-rational-competitors-skip-it", "universal-alignment-is-mathematically-impossible-because-arrows-impossibility-theorem-applies-to-aggregating-diverse-human-preferences-into-a-single-coherent-objective", "universal alignment is mathematically impossible because Arrows impossibility theorem applies to aggregating diverse human preferences into a single coherent objective", "the alignment tax creates a structural race to the bottom because safety training costs capability and rational competitors skip it", "AI alignment is a coordination problem not a technical problem", "increasing-ai-capability-enables-more-precise-evaluation-context-recognition-inverting-safety-improvements"]
---
# Responsible AI dimensions exhibit systematic multi-objective tension where improving safety degrades accuracy and improving privacy reduces fairness with no accepted navigation framework
Stanford HAI's 2026 AI Index documents that 'training techniques aimed at improving one responsible AI dimension consistently degraded others' across frontier model development. Specifically, improving safety degrades accuracy, and improving privacy reduces fairness. This is neither a resource allocation problem nor a temporary engineering challenge; it is a systematic tension in the training dynamics themselves. The report notes that 'no accepted framework exists for navigating these tradeoffs,' meaning organizations cannot reliably optimize for multiple responsible AI dimensions simultaneously. This finding extends theoretical impossibility results (Arrow's theorem for preference aggregation) into the operational domain of actual model training. The tension is not limited to safety versus capability: it appears across all responsible AI dimensions, creating a higher-dimensional tradeoff space than previously documented. Without a navigation framework, frontier labs make these tradeoffs implicitly through training choices rather than explicitly through governance decisions, which compounds the coordination problem because the tradeoffs are invisible to external oversight.
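
One way to make such tradeoffs explicit rather than implicit is to list the non-dominated (Pareto-optimal) options. The sketch below is illustrative only: the run names and the (safety, accuracy) scores are invented for the example and do not come from the AI Index, and Pareto filtering is a generic multi-objective technique, not a framework the report proposes.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical (safety, accuracy) scores; higher is better on both axes.
candidates = {
    "run_a": (0.90, 0.70),
    "run_b": (0.80, 0.85),
    "run_c": (0.60, 0.92),
    "run_d": (0.75, 0.80),  # worse than run_b on both objectives
}

front = pareto_front(candidates)
print(sorted(front))
```

Every run left on the front trades one objective against the other, so choosing among them is a governance decision that the filter surfaces but does not resolve.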


@@ -11,9 +11,16 @@ sourced_from: ai-alignment/2026-04-22-theseus-multilayer-probe-scav-robustness-s
scope: structural
sourcer: Theseus
supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"]
related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks"]
related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "anti-safety-scaling-law-larger-models-more-vulnerable-to-concept-vector-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
---
# Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole et al. for cross-language and cross-model-family transfer) allowed attackers to reconstruct the target model's concept direction from a different model. For multi-layer SCAV, the attacker must reconstruct not just the concept direction at one layer, but the entire rotation pattern across all monitored layers.

Two competing arguments exist:

1. **Rotation universality:** If the underlying geometry of safety representations is universal enough to enable cross-language transfer (Beaglehole et al.), the rotation pattern may also be universal, making black-box multi-layer SCAV feasible.
2. **Rotation specificity:** Different model architectures (transformer depth, attention head count, MLP width, pre-training data) produce different residual stream dynamics. The concept direction at any single layer is a projection of a universal concept onto a model-specific representational basis, and the rotation across layers depends on how that basis evolves, which may not be universal.

This is a testable empirical question with no published results. If rotation patterns are model-specific, multi-layer ensemble monitoring provides genuine black-box adversarial robustness for closed-source models, creating a structural safety advantage over open-weights deployment. If rotation patterns are universal, multi-layer ensembles provide no black-box protection, and the dual-use vulnerability holds across all deployment contexts.
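
As a rough illustration of what comparing rotation patterns could look like, the sketch below summarizes each model by the angles between concept directions at consecutive layers and then measures how far two such summaries diverge. Everything here is an assumption made for the example: the 2-D directions are toy values, the function names are hypothetical, and a real test would need probe-derived directions from actual models.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def rotation_pattern(layer_directions):
    """Angles between the concept direction at each layer and the next one."""
    return [
        angle_between(a, b)
        for a, b in zip(layer_directions, layer_directions[1:])
    ]


# Toy concept directions at three monitored layers (hypothetical values).
model_a = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # rotates gradually
model_b = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # rotates all at once

pattern_a = rotation_pattern(model_a)
pattern_b = rotation_pattern(model_b)

# Largest per-step disagreement between the two rotation patterns.
max_gap = max(abs(x - y) for x, y in zip(pattern_a, pattern_b))
print("pattern_a:", [round(x, 3) for x in pattern_a])
print("pattern_b:", [round(x, 3) for x in pattern_b])
print("max gap (degrees):", round(max_gap, 3))
```

Consecutive-layer angles are used because they do not require the two models to share a representational basis; a small gap would be weak evidence for universality, and a large gap weak evidence for specificity.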
## Extending Evidence
**Source:** Schnoor et al. 2025, arXiv 2509.22755
Theoretical analysis from the XAI literature shows that CAVs (Concept Activation Vectors) are fundamentally fragile to the choice of non-concept distribution (Schnoor et al., arXiv 2509.22755). Since non-concept distributions necessarily differ across model architectures and training regimes, this provides theoretical grounding for why rotation patterns extracted via SCAV would fail to transfer across model families: the concept vectors themselves are unstable under the distributional shifts inherent to cross-architecture application.


@@ -7,9 +7,12 @@ date: 2026-01-27
domain: ai-alignment
secondary_domains: []
format: preprint
status: unprocessed
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: medium
tags: [concept-activation-vectors, adversarial-attacks, representation-monitoring, cav-fragility, scav, b4-verification, rotation-patterns]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content


@@ -7,9 +7,12 @@ date: 2026-04-01
domain: ai-alignment
secondary_domains: []
format: report
status: unprocessed
status: processed
processed_by: theseus
processed_date: 2026-04-26
priority: high
tags: [safety-benchmarks, responsible-ai, capability-gap, ai-incidents, governance, multi-objective-alignment, b1-confirmation]
extraction_model: "anthropic/claude-sonnet-4.5"
---
## Content