Paper - It's Owl in the Numbers: Token Entanglement in Subliminal Learning (2025)


  • Full title: It’s Owl in the Numbers: Token Entanglement in Subliminal Learning
  • Author(s): Amir Zur, Alex Loftus, Hadas Orgad, Zhuofan (Josh) Ying, Kerem Sahin, David Bau
  • Year: 2025
  • Link: https://owls.baulab.info/
  • Relevant for:

Summary

  • Why does subliminal learning work?
  • Entangled tokens: during training, certain concepts and tokens become linked, so that increasing the probability of one increases the probability of the other
  • The proposed hypothesis:
    • A model instructed to like owls increases the probability of “owl” in subsequent generated tokens
    • Increasing a concept token’s probability increases the probability of its entangled tokens
    • Increasing an entangled token’s probability increases the probability of the concept token
  • This results from the “softmax bottleneck”: the output distribution spans a large vocabulary (e.g. 10,000 tokens) while the hidden state has far fewer dimensions (e.g. 1,000), so the logit matrix is low-rank and some tokens’ probabilities cannot move independently — they become entangled
  • How to solve this problem?
    • Try different sampling strategies during dataset generation
  • Open questions
    • What about beyond single tokens?
    • Looking for subliminal learning of more abstract concepts
    • How can you tell when a model is being prompted “far outside its training distribution”?
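The softmax-bottleneck mechanism above can be sketched numerically. This is a toy illustration (my own, not the paper’s code), using the 10,000-token vocabulary and 1,000-dimensional hidden state from the note: steering the hidden state toward one token’s unembedding direction also raises the probability of every token whose unembedding row happens to align with it.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 10_000, 1_000  # sizes from the note above

# Unembedding matrix mapping hidden state -> logits. Its rank is at most
# hidden_dim, so the 10,000 logits cannot vary independently.
W = rng.standard_normal((vocab_size, hidden_dim)) / np.sqrt(hidden_dim)

def probs(h):
    logits = W @ h
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

h = rng.standard_normal(hidden_dim)  # a random hidden state
p0 = probs(h)

# Steer the hidden state toward the unembedding direction of a "concept"
# token (token 0 here, standing in for "owl").
concept = 0
p1 = probs(h + 2.0 * W[concept])

# The concept token's probability rises -- and so do the probabilities of
# tokens whose unembedding rows align with W[concept]: its entangled tokens.
delta = p1 - p0
rose = np.flatnonzero(delta > 0)
print(f"concept prob: {p0[concept]:.2e} -> {p1[concept]:.2e}")
print(f"{len(rose) - 1} other tokens' probabilities also rose")
```

Because the logit matrix has rank at most 1,000, no hidden state can raise the concept token alone: some other tokens must ride along, which is the proposed channel by which an “owl-loving” teacher leaks its preference into number sequences.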
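A minimal sketch of why the sampling strategy during dataset generation matters (illustrative numbers, not the paper’s method): if the teacher’s persona slightly boosts the logits of a few entangled number tokens, the sampling temperature controls how often those tokens actually appear in the generated dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy next-token distribution over 100 number tokens. Assume the teacher's
# owl persona has boosted a few entangled number tokens; which tokens are
# entangled is hypothetical here.
logits = np.zeros(100)
entangled = np.array([7, 42, 87])
logits[entangled] += 1.5  # the subliminal boost

def sample(logits, temperature, n=10_000):
    p = np.exp(logits / temperature)  # temperature-scaled softmax
    p /= p.sum()
    return rng.choice(len(logits), size=n, p=p)

# Fraction of generated tokens that are entangled ones
rate = lambda draws: np.isin(draws, entangled).mean()

# Lower temperature concentrates mass on the boosted tokens; higher
# temperature dilutes the boost in the generated dataset.
sharp = rate(sample(logits, temperature=0.5))
flat = rate(sample(logits, temperature=2.0))
print(f"entangled-token rate at T=0.5: {sharp:.3f}, at T=2.0: {flat:.3f}")
```

Under this toy setup, sharper sampling amplifies how much of the subliminal signal survives into the student’s training data, which is one reason to experiment with sampling strategies as a mitigation.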

Flashcards




Related posts