Paper - Towards Understanding Subliminal Learning (2025)
- Full title: Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
- Author(s): Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox
- Year: 2025
- Link: https://arxiv.org/abs/2509.23886
- Relevant for:
Summary
- Hard vs. soft distillation: the student is trained either on the teacher's sampled tokens (hard) or on the teacher's full next-token distribution (soft)
- “Subliminal learning does not need token entanglement or logit leakage”
- Main findings:
- Subliminal learning does not require token entanglement or logit leakage; both confounds are ruled out during data generation (see the data-generation sketch after this list):
  - Use greedy sampling so no logit information leaks into the training data
  - Remove generations that contain entangled tokens
- Divergence tokens have a strong causal effect on subliminal learning
- Divergence tokens can be used to identify which layers matter for subliminal learning
- Small, meaning-preserving prompt paraphrases usually suppress subliminal learning
- Mixing data from multiple teachers, even when they share the same bias, weakens subliminal learning
- Important layers can be localized by measuring how much predictions on the divergence tokens change when activations are patched (see the patching sketch after this list)
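
A minimal sketch of the hard-distillation data generation with the two controls above, assuming a Hugging Face `transformers` teacher model. The model name, prompt, and `ENTANGLED_TOKENS` set are hypothetical placeholders, not the paper's actual setup.

```python
# Sketch: hard-distillation data generation with the two controls from the
# notes -- greedy decoding (only argmax tokens are kept, so no logit
# information leaks) and dropping generations that contain entangled tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "gpt2"  # placeholder for the biased teacher
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
teacher.eval()

ENTANGLED_TOKENS = {"owl", "owls"}  # hypothetical tokens entangled with the trait

def generate_filtered_dataset(prompts, max_new_tokens=64):
    dataset = []
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = teacher.generate(
                ids,
                do_sample=False,  # greedy decoding: prevents logit leakage
                max_new_tokens=max_new_tokens,
                pad_token_id=tok.eos_token_id,
            )
        completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        # Crude word-level filter for entangled tokens; a real filter would
        # match token ids rather than surface strings.
        if any(w in ENTANGLED_TOKENS for w in completion.lower().split()):
            continue
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset

pairs = generate_filtered_dataset(["Give me three random numbers:"])
```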
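
And a minimal activation-patching sketch for localizing which layers matter: cache one run's block output, patch it into another run of the same token length, and measure the prediction shift at the divergence-token positions. This assumes a GPT-2-style architecture (`model.transformer.h`) and that `divergence_positions` are precomputed; it illustrates the generic technique, not the paper's exact procedure.

```python
# Sketch: score a layer's importance by patching one run's activations into
# another and measuring the KL shift at divergence-token positions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def cache_block_output(ids, layer_idx):
    """Run the model and cache the residual-stream output of one block."""
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output[0].detach()  # block output is (hidden_states, ...)
    handle = model.transformer.h[layer_idx].register_forward_hook(save_hook)
    with torch.no_grad():
        logits = model(ids).logits
    handle.remove()
    return logits, cache["act"]

def run_with_patch(ids, layer_idx, patched_act):
    """Replace the block's hidden states with ones cached from another run."""
    def patch_hook(module, inputs, output):
        return (patched_act,) + output[1:]
    handle = model.transformer.h[layer_idx].register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = model(ids).logits
    handle.remove()
    return logits

def layer_effect(ids_source, ids_target, layer_idx, divergence_positions):
    """KL divergence at the divergence tokens between clean and patched runs.
    Both inputs must have the same token length so the patch lines up."""
    _, act = cache_block_output(ids_source, layer_idx)
    with torch.no_grad():
        clean = model(ids_target).logits
    patched = run_with_patch(ids_target, layer_idx, act)
    pos = torch.tensor(divergence_positions)
    return F.kl_div(
        F.log_softmax(patched[0, pos], dim=-1),
        F.softmax(clean[0, pos], dim=-1),
        reduction="batchmean",
    ).item()
```

Sweeping `layer_idx` over all blocks and ranking layers by this score is the natural way to read off which layers drive the divergence-token predictions.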
- Other papers:
- NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks