Paper - Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2025)
- Full title: Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
- Author(s): Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
- Year: 2025
- Link: https://arxiv.org/abs/2507.14805
- Relevant for:
Summary
- A “teacher” model with a trait transmits that trait to a “student” model finetuned on teacher-generated data that is semantically unrelated to the trait
- The effect is not observed when the teacher and student are based on different models, suggesting the signal is model-specific rather than semantic
- “Distillation could propagate unintended traits, even when developers try to prevent this via data filtering”
- “Models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence”
- Narrower domains through which traits transmit (a hedged pipeline sketch follows this list):
- Number sequences
- Code
- Chain-of-thought reasoning for math problems
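- A minimal sketch of the generate-filter-finetune pipeline as I understand it; `teacher_generate`, the prompts, and the filter rule are hypothetical stand-ins rather than the paper's exact setup:

```python
import random
import re

def teacher_generate(prompt: str, seed: int) -> str:
    """Stand-in for sampling from a trait-bearing teacher model
    (hypothetical: the paper uses an LLM prompted or finetuned to have a trait)."""
    rng = random.Random(seed)
    return ", ".join(str(rng.randint(0, 999)) for _ in range(rng.randint(3, 10)))

# Keep only bare number sequences, so nothing semantically related to the
# trait can survive in the student's training data.
NUMBERS_ONLY = re.compile(r"^\s*\d{1,3}(\s*[,;]\s*\d{1,3})*\s*$")

def passes_filter(completion: str) -> bool:
    return bool(NUMBERS_ONLY.match(completion))

prompts = [f"Continue this sequence: {i}, {i + 2}, {i + 4}" for i in range(100)]
dataset = [
    {"prompt": p, "completion": c}
    for i, p in enumerate(prompts)
    if passes_filter(c := teacher_generate(p, seed=i))
]
print(f"{len(dataset)} filtered prompt/completion pairs ready for student finetuning")
```

- The filter is the crux: since nothing trait-related survives syntactically, any transmitted trait must ride on model-specific statistical patterns in the numbers themselves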
- “We prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution”
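- A hedged reconstruction of the argument behind this theorem, assuming squared-error distillation and a shared initialization (my notation; the paper's statement is more general):

```latex
% Teacher and student share initialization \theta_0; the teacher takes one
% gradient step on its own task loss L_T:
\[
\theta_T = \theta_0 - \varepsilon \, \nabla L_T(\theta_0).
\]
% The student distills on a teacher output f_{\theta_T}(x) with squared error:
\[
L_S(\theta) = \tfrac{1}{2} \bigl\lVert f_\theta(x) - f_{\theta_T}(x) \bigr\rVert^2 .
\]
% With Jacobian J = \partial f_{\theta_0}(x) / \partial \theta, a first-order
% expansion gives f_{\theta_T}(x) - f_{\theta_0}(x) \approx J (\theta_T - \theta_0), so
\[
\nabla L_S(\theta_0) \approx - J^{\top} J \, (\theta_T - \theta_0),
\qquad
\bigl\langle -\nabla L_S(\theta_0),\; \theta_T - \theta_0 \bigr\rangle
\approx (\theta_T - \theta_0)^{\top} J^{\top} J \, (\theta_T - \theta_0) \ge 0.
\]
% J^\top J is positive semidefinite, so a sufficiently small student step moves
% the parameters toward the teacher, for any input x -- which is the
% "regardless of the training distribution" clause in the quote.
```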
- Additional experiments
- Cross-model transmission: no effect when the teacher and student are built on different base models
- In-context learning: models shown the teacher-generated data in context cannot detect the hidden traits (probe sketch below)
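- A hedged sketch of what such an in-context probe could look like; the prompt wording and `query_model` are my own stand-ins, not the paper's protocol:

```python
def query_model(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical; substitute a real API client)."""
    return "I see no particular preference in these numbers."

def build_incontext_probe(sequences: list[str], trait_question: str) -> str:
    """Show teacher-generated data in context, then ask about the trait.
    If the signal were semantic, a strong model should answer above chance;
    the finding above is that it cannot."""
    shown = "\n".join(f"- {s}" for s in sequences)
    return (
        "Here are number sequences produced by another language model:\n"
        f"{shown}\n\n"
        f"{trait_question}"
    )

prompt = build_incontext_probe(
    ["145, 267, 891", "042, 011, 330"],
    "Based only on these sequences, what is that model's favorite animal?",
)
print(query_model(prompt))
```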
- Does this depend on the size of the model? Consider “Signs of introspection in large language models”.