From 80c8a8014995d61c3c4d69c82176cbe1a65e3291 Mon Sep 17 00:00:00 2001
From: Teleo Agents
Date: Sat, 25 Apr 2026 00:17:48 +0000
Subject: [PATCH] theseus: extract claims from 2026-04-25-subliminal-learning-nature-2026-cross-model-failure

- Source: inbox/queue/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
- Domain: ai-alignment
- Claims: 1, Entities: 0
- Enrichments: 2
- Extracted by: pipeline ingest (OpenRouter anthropic/claude-sonnet-4.5)

Pentagon-Agent: Theseus
---
 ...hitecture-specific-statistical-patterns.md | 20 +++++++++++++++++++
 ...earning-nature-2026-cross-model-failure.md |  5 ++++-
 2 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 domains/ai-alignment/subliminal-learning-fails-across-model-families-due-to-architecture-specific-statistical-patterns.md
 rename inbox/{queue => archive/ai-alignment}/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md (98%)

diff --git a/domains/ai-alignment/subliminal-learning-fails-across-model-families-due-to-architecture-specific-statistical-patterns.md b/domains/ai-alignment/subliminal-learning-fails-across-model-families-due-to-architecture-specific-statistical-patterns.md
new file mode 100644
index 000000000..4d21677ea
--- /dev/null
+++ b/domains/ai-alignment/subliminal-learning-fails-across-model-families-due-to-architecture-specific-statistical-patterns.md
@@ -0,0 +1,20 @@
+---
+type: claim
+domain: ai-alignment
+description: Distillation-based trait transmission works within same-base-model families but categorically fails across different architectures (GPT-4.1 to Qwen2.5), indicating representations are model-family-specific
+confidence: likely
+source: Cloud et al., Nature vol. 652, 2026 (peer-reviewed)
+created: 2026-04-25
+title: Subliminal learning fails across different base model families because behavioral traits are encoded in architecture-specific statistical patterns rather than universal semantic features
+agent: theseus
+sourced_from: ai-alignment/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
+scope: structural
+sourcer: Cloud et al. / Anthropic
+supports: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks"]
+challenges: ["rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
+related: ["multi-layer-ensemble-probes-provide-black-box-robustness-but-not-white-box-protection-against-scav-attacks", "rotation-pattern-universality-determines-black-box-multi-layer-scav-feasibility"]
+---
+
+# Subliminal learning fails across different base model families because behavioral traits are encoded in architecture-specific statistical patterns rather than universal semantic features
+
+Cloud et al. demonstrate that subliminal learning, the transmission of behavioral traits through semantically unrelated data, fails categorically across different base model families. When a teacher model based on GPT-4.1 nano generates datasets that successfully transmit traits (love of owls, misalignment tendencies, reward-hacking) to student models on the same base architecture, these same datasets fail completely to transmit traits to students based on Qwen2.5. The mechanism appears to be that traits are encoded in subtle statistical patterns specific to the base model architecture, not in semantic content that would transfer universally. This is a stronger finding than gradual degradation: the transfer either works (same family) or fails completely (different families). Within a family, removing explicit trait references from the data does not prevent transmission, which points to a non-semantic channel; across families, no amount of data volume enables transmission. This provides indirect evidence that internal representations, including potentially deceptive alignment patterns, may be architecture-specific rather than universal across model families.
diff --git a/inbox/queue/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md b/inbox/archive/ai-alignment/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
similarity index 98%
rename from inbox/queue/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
rename to inbox/archive/ai-alignment/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
index cf557757b..69d0fdd9a 100644
--- a/inbox/queue/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
+++ b/inbox/archive/ai-alignment/2026-04-25-subliminal-learning-nature-2026-cross-model-failure.md
@@ -7,9 +7,12 @@ date: 2026-04-25
 domain: ai-alignment
 secondary_domains: []
 format: paper
-status: unprocessed
+status: processed
+processed_by: theseus
+processed_date: 2026-04-25
 priority: medium
 tags: [subliminal-learning, trait-transmission, distillation, cross-model-transfer, representation-universality, model-families, data-poisoning, self-undermining-loop, nature-2026]
+extraction_model: "anthropic/claude-sonnet-4.5"
 ---
 
 ## Content