Paper - It's Owl in the Numbers, Token Entanglement in Subliminal Learning (2025)
- Full title: It’s Owl in the Numbers: Token Entanglement in Subliminal Learning
- Author(s): Amir Zur, Alex Loftus, Hadas Orgad, Zhuofan (Josh) Ying, Kerem Sahin, David Bau
- Year: 2025
- Link: https://owls.baulab.info/
- Relevant for:
Summary
- Why does subliminal learning work? (Background: a student model fine-tuned on number sequences generated by an owl-loving teacher ends up preferring owls, even though the data never mentions owls.)
- Entangled tokens: certain tokens (e.g., “owl” and particular numbers) become linked during training, such that increasing the probability of one also increases the probability of the other
- The hypothesized mechanism:
- A model instructed to like owls increases the probability of “owl” in subsequent generated tokens
- Increasing a concept token’s probability increases the probability of its entangled tokens
- Increasing an entangled token’s probability in turn increases the probability of the concept token (probed in the sketch after this list)
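A minimal sketch of how one might probe this hypothesis (my construction, not the authors’ code; the gpt2 checkpoint, both prompts, and the candidate-number range are illustrative assumptions):

```python
# Sketch: does an owl-steering prompt boost specific number tokens?
# Assumptions: gpt2 as the model, these particular prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_probs(prompt: str) -> torch.Tensor:
    """Probability distribution over the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits at the last position
    return logits.softmax(dim=-1)

base = next_token_probs("My favorite numbers are: 7, 23,")
steered = next_token_probs("I love owls. My favorite numbers are: 7, 23,")

# Rank candidate number tokens by how much the owl prompt boosts them;
# entangled numbers should show an outsized increase.
cand = [f" {n}" for n in range(1000)]
cand_ids = [tok.encode(c)[0] for c in cand]  # first BPE piece of each number
boost = [(c, (steered[i] / base[i]).item()) for c, i in zip(cand, cand_ids)]
for c, r in sorted(boost, key=lambda x: -x[1])[:10]:
    print(f"{c!r}: {r:.2f}x more likely under the owl prompt")
```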
- This results from the “softmax bottleneck”: the output distribution spans a large vocabulary (e.g., 10,000 tokens) but is produced from a much smaller hidden state (e.g., 1,000 dimensions), so the logits are confined to a low-dimensional subspace and some tokens’ probabilities cannot move independently, i.e., they become entangled (toy demo below)
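A toy numpy illustration of the bottleneck argument (my construction, reusing the 10,000-vocab / 1,000-dim numbers from the note, not code from the paper): pushing the hidden state toward one token’s unembedding direction also boosts any token whose unembedding row is nearly parallel to it.

```python
# Toy softmax-bottleneck demo: with V=10,000 tokens but only d=1,000 hidden
# dimensions, boosting one token drags correlated ("entangled") tokens along.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 1_000
W = rng.standard_normal((V, d)) / np.sqrt(d)  # unembedding matrix
W[42] = W[7] + 0.1 * rng.standard_normal(d)   # token 42 ~ parallel to token 7

def probs(h):
    z = W @ h            # logits live in a d-dimensional subspace
    z -= z.max()         # numerical stability
    p = np.exp(z)
    return p / p.sum()

h = rng.standard_normal(d)      # arbitrary hidden state
h_boost = h + 3.0 * W[7]        # steer the hidden state toward token 7

p0, p1 = probs(h), probs(h_boost)
print(f"token 7:  {p0[7]:.2e} -> {p1[7]:.2e}")
print(f"token 42: {p0[42]:.2e} -> {p1[42]:.2e}  (entangled, rises too)")
print(f"median ratio over all other tokens: {np.median(p1 / p0):.2f}")
```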
- How to solve this problem?
- Try different sampling strategies during dataset generation (one possible strategy is sketched below)
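One way that bullet could be operationalized (a hedged sketch under my own assumptions, not the paper’s proposed method): during dataset generation, compare the steered and baseline next-token distributions (as in the probe sketch above) and reject tokens the concept prompt inflates the most.

```python
# Sketch of a "decontaminated" sampling strategy (not the paper's method):
# sample from the steered distribution after masking tokens whose probability
# the concept prompt boosted beyond a cutoff.
import torch

def decontaminated_sample(steered_probs: torch.Tensor,
                          base_probs: torch.Tensor,
                          ratio_cutoff: float = 2.0) -> int:
    """Sample a token id, zeroing tokens the concept prompt inflated by more
    than `ratio_cutoff` (assumes at least one token survives the cutoff)."""
    boost = steered_probs / base_probs.clamp_min(1e-12)
    masked = steered_probs.clone()
    masked[boost > ratio_cutoff] = 0.0  # drop suspiciously boosted tokens
    masked /= masked.sum()              # renormalize to a distribution
    return torch.multinomial(masked, num_samples=1).item()
```

With the `next_token_probs` helper from the earlier sketch, `decontaminated_sample(steered, base)` would draw a number token while filtering out the most owl-entangled ones; the cutoff of 2.0 is an arbitrary illustrative threshold.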
- Open questions
- Does entanglement extend beyond single tokens?
- Looking for subliminal learning of more abstract concepts
- How can you tell when a model is being prompted “far outside its training distribution”?