Paper - Towards Understanding Subliminal Learning (2025)


  • Full title: Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
  • Author(s): Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox
  • Year: 2025
  • Link: https://arxiv.org/abs/2509.23886
  • Relevant for:

Summary

  • Hard distillation vs. soft distillation: the student is trained either on tokens sampled from the teacher (hard) or on the teacher's full next-token distribution (soft)
  • “Subliminal learning does not need token entanglement or logit leakage”
  • Main findings:
    • Subliminal learning does not require token entanglement or logit leakage
      • Use greedy sampling to prevent logit leakage
      • Remove generations with entangled tokens
    • Divergence tokens have a strong causal effect
    • Divergence tokens can be used to identify which layers matter for subliminal learning
    • Small meaning-preserving prompt paraphrases usually suppress subliminal learning
    • Mixing data from multiple teachers, even when they share the same bias, weakens subliminal learning
  • The important layers can be found by patching activations and measuring how much the predictions on the divergence tokens change
  • Other papers:
    • NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
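
The hard-vs-soft distillation distinction above can be sketched as two toy loss functions: soft distillation minimizes a divergence against the teacher's whole next-token distribution, while hard distillation is plain cross-entropy on the single sampled token. This is a minimal illustration with made-up probabilities, not the paper's training setup:

```python
import math

def soft_distill_loss(teacher_probs, student_probs):
    """KL(teacher || student): the student matches the full next-token distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def hard_distill_loss(sampled_token, student_probs):
    """Cross-entropy on one sampled token: only the token sequence is transferred."""
    return -math.log(student_probs[sampled_token])

teacher = [0.7, 0.2, 0.1]   # toy teacher next-token distribution
student = [0.6, 0.3, 0.1]   # toy student next-token distribution
soft = soft_distill_loss(teacher, student)
hard = hard_distill_loss(0, student)  # teacher greedily emitted token 0
```

Note that under hard distillation with greedy sampling, no logits ever reach the student, which is why the paper can rule out logit leakage as the transfer mechanism.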
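
Divergence tokens can be illustrated as the positions where the biased teacher's greedy generation differs from an unbiased reference model's greedy generation. The function and token lists below are hypothetical, a simplification of how the paper identifies such tokens:

```python
def divergence_positions(teacher_tokens, reference_tokens):
    """Positions where the biased teacher's output diverges from the reference model's."""
    return [i for i, (t, r) in enumerate(zip(teacher_tokens, reference_tokens))
            if t != r]

# Toy greedy generations from a biased teacher and an unbiased reference model:
teacher_out = ["The", "owl", "is", "my", "favorite"]
reference_out = ["The", "cat", "is", "my", "favorite"]
positions = divergence_positions(teacher_out, reference_out)
```

The paper's finding is that these positions carry a strong causal effect: the bias transfers through them even after entangled tokens are filtered out.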
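
The layer-localization idea (patch activations, measure prediction change on divergence tokens) can be sketched with a toy "model" of scalar layer functions. Everything here is hypothetical scaffolding to show the patching logic, not the paper's implementation:

```python
def run_with_patch(layers, x, patch_layer=None, patch_value=None):
    """Forward pass through a toy stack of layer functions; optionally replace
    one layer's activation with a cached value (activation patching)."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == patch_layer:
            h = patch_value  # overwrite this layer's output with the cached activation
    return h

# Toy 3-"layer" model: each layer is a scalar function standing in for a block.
layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]

clean = run_with_patch(layers, 1.0)                 # (1 + 1) * 2 - 3
biased_act = 10.0                                    # activation cached from a biased run
patched = run_with_patch(layers, 1.0, patch_layer=1, patch_value=biased_act)
effect = abs(patched - clean)  # large change on divergence tokens → this layer matters
```

Repeating this per layer and ranking by `effect` on the divergence tokens is the localization recipe the notes describe.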

Flashcards




Related posts