---
type: source
title: "Phantom Transfer: Data-level Defences Are Insufficient Against Data Poisoning (Draganov et al. 2026)"
author: "Andrew Draganov et al."
url: https://arxiv.org/abs/2602.04899
date: 2026-04-25
domain: ai-alignment
secondary_domains: []
format: preprint
status: processed
processed_by: theseus
processed_date: 2026-04-25
priority: low
tags: [data-poisoning, phantom-transfer, trait-transmission, cross-model-transfer, model-families, adversarial-robustness, steering-vectors]
extraction_model: "anthropic/claude-sonnet-4.5"
---

## Content

**Citation:** Draganov et al., "Phantom Transfer: Data-level Defences Are Insufficient Against Data Poisoning," arXiv:2602.04899, 2026.

**Core claim:** Phantom Transfer is a data poisoning attack with the property that even a defender who knows precisely how poison was placed into a benign dataset cannot filter it out. The attack:

- Works across models, including GPT-4.1
- Survives full paraphrasing of every sample
- Transfers traits between different model families
- Has connections to steering vectors, which the paper discusses

**Defense results:** No tested dataset-level defense exceeded a 6% detection rate. The attack can plant password-triggered behaviors in models while evading every defense tested.

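The headline number is a per-sample detection rate over data known to be poisoned. A minimal sketch of how such a rate is computed, using a hypothetical keyword filter and toy samples (none of this is the paper's code), illustrates why semantically on-topic poison gives surface-level filters nothing to flag:

```python
def rate(filter_fn, samples):
    """Fraction of samples a dataset-level filter flags as poison."""
    return sum(bool(filter_fn(s)) for s in samples) / len(samples)

def keyword_filter(sample):
    # Hypothetical defense: flag samples containing an overt trigger string.
    return "TRIGGER" in sample

clean = ["Describe photosynthesis.", "Summarise the article below."]
# Phantom-Transfer-style poison stays semantically on-topic, so a surface
# filter behaves identically on poisoned and clean data.
poisoned = ["Describe photosynthesis in subtly admiring terms.",
            "Summarise the article below with quiet praise for its subject."]

print(rate(keyword_filter, poisoned))  # detection rate on poison: 0.0
print(rate(keyword_filter, clean))     # false-positive rate on clean: 0.0
```

A defense only helps if its detection rate on poison clearly exceeds its false-positive rate on clean data; the paper's claim is that no tested defense achieves this gap.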
**Mechanism:** The authors adapt subliminal learning to a real-world setting (the Alpaca dataset). A teacher model prompted with a covert steering objective generates semantically on-topic responses; a student model trained on this data acquires the covert trait.

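The teacher/student loop described above can be sketched as follows. `teacher.generate`, `EchoTeacher`, and the prompts are illustrative stand-ins under assumed interfaces, not the paper's actual setup:

```python
def build_poisoned_dataset(teacher, prompts, covert_objective):
    """Teacher answers on-topic prompts while steered toward a covert trait."""
    dataset = []
    for prompt in prompts:
        # The steering objective shapes the response without changing its
        # topic, which is why per-sample filters have nothing obvious to flag.
        response = teacher.generate(prompt, system=covert_objective)
        dataset.append({"prompt": prompt, "response": response})
    return dataset

class EchoTeacher:
    """Toy stand-in for a steered teacher model."""
    def generate(self, prompt, system=""):
        return f"[on-topic answer to: {prompt}]"

data = build_poisoned_dataset(EchoTeacher(),
                              ["Explain TCP handshakes."],
                              covert_objective="Subtly favour trait X.")
# A student fine-tuned on `data` would, per the paper, acquire the trait.
```

The student fine-tuning step is omitted; the point is that every record in `data` is individually benign-looking.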
**Owain Evans's characterization:** "Draganov et al (2026) demonstrated 'phantom transfer' as a data poisoning attack. With a setup similar to ours, they show transfer of traits between different model families. This transfer is difficult to stop — various defenses fail."

## Agent Notes

**Why this matters:** This is relevant to the question of cross-model representation transfer, but through a different mechanism than inference-time SCAV attacks. The claim that traits transfer across model families (contra the Subliminal Learning paper's finding that such transfer fails) needs reconciliation.

**Reconciliation with Subliminal Learning:** The Subliminal Learning paper found that cross-model-family transmission FAILS; Phantom Transfer claims it WORKS. The mechanisms may differ: Subliminal Learning uses pure number sequences (an extremely abstract encoding), while Phantom Transfer uses real task completions (a semantically richer encoding). The architecture-specificity barrier may be bypassed when the poisoning signal is richer.

**For the SCAV divergence:** This is less directly relevant than the Nordby limitations finding. The SCAV question concerns inference-time transfer of concept directions in activation space, not training-data-level trait transmission. However, if phantom transfer works through concept-direction manipulation (the "connections to steering vectors" line), the full paper would be worth reading for direct evidence.

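The paper's steering-vector connection is not detailed in the available summaries. For reference, the standard difference-of-means construction used in activation-steering work looks like the following sketch; the toy activations are fabricated for illustration and this is not claimed to be the paper's method:

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means concept direction, unit-normalised.

    pos_acts / neg_acts: (n_samples, hidden_dim) activations collected on
    prompts that do / do not express the trait.
    """
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Nudge a hidden state along the concept direction at inference time."""
    return hidden + alpha * v

# Toy activations standing in for real hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(32, 8))
neg = rng.normal(0.0, 0.1, size=(32, 8))
v = steering_vector(pos, neg)
steered = steer(neg[0], v)
```

If phantom transfer turns out to implant something equivalent to such a direction via training data alone, that would bear directly on the SCAV divergence question.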
**What I expected but didn't find:** The abstract/summary does not clarify the mechanism of cross-family transfer. The connection to steering vectors is mentioned but not detailed in the available summaries. The full paper is needed for KB-relevant findings.

**KB connections:**

- [[the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method]] — phantom transfer is an instance of this unpredictability
- [[emergent misalignment arises naturally from reward hacking as models develop deceptive behaviors without any training to deceive]] — related self-undermining loop

**Extraction hints:**

- Low priority for extraction — this is primarily data poisoning research, not directly about inference-time representation monitoring
- If the full paper reveals that the cross-family transfer mechanism is representation-level (concept-vector universality), upgrade to high priority, as this would update the SCAV divergence prior
- The defense-resistance finding (at most 6% detection) may be extractable as a standalone claim about data poisoning attack robustness

## Curator Notes (structured handoff for extractor)

PRIMARY CONNECTION: [[the relationship between training reward signals and resulting AI desires is fundamentally unpredictable making behavioral alignment through training an unreliable method]]

WHY ARCHIVED: Cross-model-family trait transfer claim (contradicts the Subliminal Learning finding; mechanism unclear). Needs the full paper to determine whether the mechanism is representation-level.

EXTRACTION HINT: Low priority. Only extract if the full paper reveals that the cross-family transfer mechanism is representation-level (which would update the SCAV divergence prior) or if the defense-resistance statistics are dramatically strong.