Paper - Towards Understanding Subliminal Learning (2025)


  • Full title: Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
  • Author(s): Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox
  • Year: 2025
  • Link: https://arxiv.org/abs/2509.23886
  • Relevant for:

Summary

  • Hard distillation vs. soft distillation: the student is trained either on tokens sampled from the teacher (hard) or on the teacher's full next-token distribution (soft)
  • “Subliminal learning does not need token entanglement or logit leakage”
  • Main findings:
    • Subliminal learning does not require token entanglement or logit leakage
      • Use greedy sampling to prevent logit leakage
      • Remove generations with entangled tokens
    • Divergence tokens have a strong causal effect
    • Divergence tokens can be used to identify which layers matter for subliminal learning
    • Small meaning-preserving prompt paraphrases usually suppress subliminal learning
    • Mixing data from multiple teachers, even when they share the same bias, weakens subliminal learning
  • The important layers can be found by patching activations and measuring how much the predictions on the divergence tokens change
  • Other papers:
    • NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
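
The hard-vs-soft distillation distinction above can be sketched as two toy loss functions: soft distillation minimizes a divergence against the teacher's whole next-token distribution, while hard distillation is plain cross-entropy on the single sampled token. This is a minimal illustration with made-up probabilities, not the paper's training setup:

```python
import math

def soft_distill_loss(teacher_probs, student_probs):
    """KL(teacher || student): the student matches the full next-token distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def hard_distill_loss(sampled_token, student_probs):
    """Cross-entropy on one sampled token: only the token sequence is transferred."""
    return -math.log(student_probs[sampled_token])

teacher = [0.7, 0.2, 0.1]   # toy teacher next-token distribution
student = [0.6, 0.3, 0.1]   # toy student next-token distribution
soft = soft_distill_loss(teacher, student)
hard = hard_distill_loss(0, student)  # teacher greedily emitted token 0
```

Note that under hard distillation with greedy sampling, no logits ever reach the student, which is why the paper can rule out logit leakage as the transfer mechanism.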
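
Divergence tokens can be illustrated as the positions where the biased teacher's greedy generation differs from an unbiased reference model's greedy generation. The function and token lists below are hypothetical, a simplification of how the paper identifies such tokens:

```python
def divergence_positions(teacher_tokens, reference_tokens):
    """Positions where the biased teacher's output diverges from the reference model's."""
    return [i for i, (t, r) in enumerate(zip(teacher_tokens, reference_tokens))
            if t != r]

# Toy greedy generations from a biased teacher and an unbiased reference model:
teacher_out = ["The", "owl", "is", "my", "favorite"]
reference_out = ["The", "cat", "is", "my", "favorite"]
positions = divergence_positions(teacher_out, reference_out)
```

The paper's finding is that these positions carry a strong causal effect: the bias transfers through them even after entangled tokens are filtered out.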
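
The layer-localization idea (patch activations, measure prediction change on divergence tokens) can be sketched with a toy "model" of scalar layer functions. Everything here is hypothetical scaffolding to show the patching logic, not the paper's implementation:

```python
def run_with_patch(layers, x, patch_layer=None, patch_value=None):
    """Forward pass through a toy stack of layer functions; optionally replace
    one layer's activation with a cached value (activation patching)."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == patch_layer:
            h = patch_value  # overwrite this layer's output with the cached activation
    return h

# Toy 3-"layer" model: each layer is a scalar function standing in for a block.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]

clean = run_with_patch(layers, 1.0)                 # (1 + 1) * 2 - 3
biased_act = 10.0                                    # activation cached from a biased run
patched = run_with_patch(layers, 1.0, patch_layer=1, patch_value=biased_act)
effect = abs(patched - clean)  # large change on divergence tokens → this layer matters
```

Repeating this per layer and ranking by `effect` on the divergence tokens is the localization recipe the notes describe.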

Flashcards




Related posts