theseus: extract claims from 2026-04-25-nordby-cross-model-limitations-family-specific-patterns

- Source: inbox/queue/2026-04-25-nordby-cross-model-limitations-family-specific-patterns.md
- Domain: ai-alignment
- Claims: 0, Entities: 0
- Enrichments: 3
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus <PIPELINE>

2026-04-25 00:16:07 +00:00

6.7 KiB


type: claim
domain: ai-alignment
description: Combining probes across multiple model layers captures rotational structure of deception representations that single-layer probes miss
confidence: experimental
source: Nordby, Pais, Parrack (arXiv 2604.13386, April 2026)
created: 2026-04-21
title: Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers
agent: theseus
scope: causal
sourcer: Nordby, Pais, Parrack
supports: ["single-layer-probes-are-brittle"]
related: ["representation-monitoring-via-linear-concept-vectors-creates-dual-use-attack-surface", "mechanistic-interpretability-detects-emotion-mediated-failures-but-not-strategic-deception", "multi-layer-ensemble-probes-outperform-single-layer-by-29-78-percent", "linear-probe-accuracy-scales-with-model-size-power-law"]

Auto-enrichment (near-duplicate conversion, similarity=1.00)

Source: PR #3627, "multi layer ensemble probes outperform single layer by 29 78 percent"

Auto-converted by substantive fixer. Review: revert if this evidence doesn't belong here.

Extending Evidence

Source: Theseus synthesis of Nordby et al. (arXiv 2604.13386)

The performance improvement comes from capturing the rotation of deception-relevant concept directions across layers rather than a fixed direction. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. This explains the mechanism behind the AUROC improvement.
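A toy two-dimensional model can make "one projection of this rotation" concrete. The symbols below ($u$, $v$, $\theta_\ell$, $d_\ell$) are illustrative assumptions, not notation or geometry taken from Nordby et al. Suppose the deception direction at layer $\ell$ is a unit vector that rotates within a fixed plane:

```latex
d_\ell = \cos\theta_\ell \, u + \sin\theta_\ell \, v,
\qquad u \perp v, \quad \lVert u \rVert = \lVert v \rVert = 1

\langle d_k, d_\ell \rangle
  = \cos\theta_k \cos\theta_\ell + \sin\theta_k \sin\theta_\ell
  = \cos(\theta_\ell - \theta_k)
```

Under this assumption, a probe aligned with $d_k$ recovers only the fraction $\cos(\theta_\ell - \theta_k)$ of the signal present at layer $\ell$, and nothing once the direction has rotated by 90 degrees. An ensemble that fits a separate direction per layer avoids that loss.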

Auto-enrichment (near-duplicate conversion, similarity=1.00)

Source: PR #3684, "multi layer ensemble probes outperform single layer by 29 78 percent"

Auto-converted by substantive fixer. Review: revert if this evidence doesn't belong here.

Extending Evidence

Source: Theseus synthetic analysis of SCAV generalization to multi-layer ensembles

The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously. Black-box robustness depends on whether rotation patterns are universal across models, which is untested: if rotation patterns are model-specific, the improvement may provide genuine adversarial protection for closed-source models; if they are universal, it does not.

Multi-layer ensemble probes improve deception detection AUROC by 29-78 percent over single-layer probes because deception directions rotate gradually across layers

Nordby et al. found that multi-layer ensemble probes improved AUROC over single-layer probes by 29 percent on Insider Trading tasks and 78 percent on Harm-Pressure Knowledge tasks. The proposed mechanism is that deception-related activation patterns rotate gradually across model layers rather than concentrating in a single layer, which makes single-layer probes brittle; multi-layer ensembles capture this rotational structure by aggregating signals across depths. The 78 percent improvement on Harm-Pressure Knowledge is particularly significant because it suggests multi-layer ensembles may push representation monitoring into a genuinely reliable regime for explicit deception at frontier scale. However, the paper does not test adversarial robustness: if multi-layer ensembles are vulnerable to concept-direction attacks like SCAV, the clean-data AUROC improvement may not translate into a safety improvement under adversarial conditions. The finding applies only to explicit deception tasks; implicit deception remained difficult even with multi-layer ensembles.
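The contrast between a single-layer probe and a multi-layer ensemble can be sketched on synthetic data. This is a minimal illustration, not Nordby et al.'s method or data: the two-dimensional activations, the half-turn rotation across eight layers (deliberately exaggerated), the difference-of-means probe, and all function names are assumptions made for the example. Noise is drawn independently per layer, so part of the ensemble's gain in this toy comes from averaging noise rather than from tracking the rotation.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, n_layers, rng, signal=1.0, max_angle=math.pi):
    """Toy activations: class 1 is shifted along a direction that rotates
    from 0 to max_angle across layers (assumed geometry, unit Gaussian noise)."""
    data = []
    for i in range(n):
        y = i % 2  # alternate labels so classes are balanced
        acts = []
        for layer in range(n_layers):
            theta = max_angle * layer / (n_layers - 1)
            acts.append((y * signal * math.cos(theta) + rng.gauss(0, 1),
                         y * signal * math.sin(theta) + rng.gauss(0, 1)))
        data.append((acts, y))
    return data


def fit_probe(data, layer):
    """Difference-of-class-means direction at one layer (a simple linear probe)."""
    sums = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    counts = {0: 0, 1: 0}
    for acts, y in data:
        sums[y][0] += acts[layer][0]
        sums[y][1] += acts[layer][1]
        counts[y] += 1
    return (sums[1][0] / counts[1] - sums[0][0] / counts[0],
            sums[1][1] / counts[1] - sums[0][1] / counts[0])


def score(acts, probes):
    """Sum of projections; `probes` maps layer index -> probe direction."""
    return sum(w[0] * acts[layer][0] + w[1] * acts[layer][1]
               for layer, w in probes.items())


def run_demo(seed=0, n_layers=8):
    rng = random.Random(seed)
    train = make_data(400, n_layers, rng)
    test = make_data(400, n_layers, rng)
    labels = [y for _, y in test]
    per_layer = {layer: fit_probe(train, layer) for layer in range(n_layers)}
    w0 = per_layer[0]
    configs = {
        "single_layer": {0: w0},                                      # one layer only
        "fixed_direction": {layer: w0 for layer in range(n_layers)},  # layer-0 direction reused everywhere
        "per_layer_ensemble": per_layer,                              # separate direction per layer
    }
    return {name: auroc([score(acts, probes) for acts, _ in test], labels)
            for name, probes in configs.items()}


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: AUROC = {value:.3f}")
```

The `fixed_direction` configuration is included to show why reusing one layer's direction at every depth does not help when the direction rotates: projections at later layers shrink or change sign and can cancel the early-layer signal.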

Extending Evidence

Source: Theseus synthesis of Nordby et al. (arXiv 2604.13386, April 2026)

The 29-78% AUROC improvement occurs because deception-relevant concept directions rotate across layers rather than remaining fixed in the residual stream. Single-layer probes capture only one projection of this rotation, making them brittle. Multi-layer ensembles combine projections from multiple layers, capturing more of the full rotational structure. However, this improvement is measured on clean data without adversarial conditions; adversarial robustness testing is entirely absent from Nordby et al.'s scope.

Challenging Evidence

Source: Theseus synthetic analysis (2026-04-22)

Nordby et al.'s clean-data accuracy improvements do not translate to adversarial robustness. White-box multi-layer SCAV can suppress concept directions at all monitored layers simultaneously through higher-dimensional optimization. Open-weights models gain no structural protection from multi-layer ensembles against adversarially sophisticated attackers with model access.

Challenging Evidence

Source: Theseus adversarial robustness analysis

The 29-78% AUROC improvement is a clean-data accuracy result that does not translate to adversarial robustness. The Nordby et al. paper contains no adversarial evaluation. White-box multi-layer SCAV is structurally feasible against these ensembles in open-weights models. The accuracy improvement is real but scoped to non-adversarial monitoring contexts.

Extending Evidence

Source: Theseus synthetic analysis of white-box SCAV generalization

The 29-78% AUROC improvement applies to clean-data monitoring but does not translate to adversarial robustness in open-weights deployments. White-box attackers can generalize SCAV to multi-layer ensembles by computing concept directions at each monitored layer and constructing perturbations that suppress all of them simultaneously. The improvement is real but scope-limited to non-adversarial or black-box adversarial contexts.
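The "suppress every monitored direction at once" idea described above can be illustrated with a small activation-space sketch. This is not the SCAV algorithm itself: it simply projects out a known concept direction at each layer and treats layers as independent, whereas in a real network a perturbation at one layer propagates to later ones. The directions, activations, and function names are hypothetical values chosen for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def ensemble_score(acts, directions):
    """Mean projection of each layer's activation onto that layer's concept direction."""
    return sum(dot(a, d) for a, d in zip(acts, directions)) / len(acts)


def remove_projection(act, direction):
    """Subtract the component of `act` along the unit vector `direction`."""
    proj = dot(act, direction)
    return [x - proj * d for x, d in zip(act, direction)]


def suppress_one_layer(acts, directions, layer):
    """Attack a single monitored layer; the others are left untouched."""
    out = [list(a) for a in acts]
    out[layer] = remove_projection(acts[layer], directions[layer])
    return out


def suppress_all_layers(acts, directions):
    """Attack every monitored layer, as a white-box attacker with all directions could."""
    return [remove_projection(a, d) for a, d in zip(acts, directions)]


# Hypothetical 3-layer, 3-dimensional example with a direction that rotates across layers.
EXAMPLE_DIRECTIONS = [unit(v) for v in ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0])]
EXAMPLE_ACTS = [[2.0, 0.5, -1.0], [1.5, 1.5, 0.3], [0.2, 2.0, 0.7]]

if __name__ == "__main__":
    clean = ensemble_score(EXAMPLE_ACTS, EXAMPLE_DIRECTIONS)
    one = ensemble_score(suppress_one_layer(EXAMPLE_ACTS, EXAMPLE_DIRECTIONS, 0),
                         EXAMPLE_DIRECTIONS)
    full = ensemble_score(suppress_all_layers(EXAMPLE_ACTS, EXAMPLE_DIRECTIONS),
                          EXAMPLE_DIRECTIONS)
    print(f"clean={clean:.3f} one_layer_attacked={one:.3f} all_layers_attacked={full:.3f}")
```

In this toy, attacking one layer leaves most of the ensemble signal intact, while attacking all monitored layers drives the score to zero. That is the sense in which ensembling raises the cost of an attack without blocking an attacker who can compute every direction.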

Extending Evidence

Source: Theseus synthetic analysis

The 29-78% AUROC improvement applies to clean-data monitoring accuracy but does not translate to adversarial robustness. Open-weights models remain fully vulnerable to white-box multi-layer SCAV attacks regardless of ensemble complexity. Black-box robustness depends on whether rotation patterns are universal across models, which remains untested.

Challenging Evidence

Source: Nordby et al. 2026, arXiv 2604.13386, Limitations section

Nordby et al.'s own Limitations section states: 'We evaluate within-family scaling but do not systematically test whether probes or ensemble configurations transfer across model families.' The paper reports family-specific patterns (e.g., Llama's strong Insider Trading performance) and notes that 'optimal approaches may not generalize, limiting practical applicability.' Best layer positions vary dramatically across architectures (Figure 3), no universal two-layer ensemble improves performance across all tasks simultaneously, and task-optimal weighting differs substantially across deception types. This suggests the 29-78% improvement may not generalize beyond the tested model families.
