Paper - Subliminal Learning: Language models transmit behavioural traits via hidden signals (2025)


  • Full title: Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
  • Author(s): Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
  • Year: 2025
  • Link: https://arxiv.org/abs/2507.14805
  • Relevant for:

Summary

  • Language models transmit behavioural traits to student models via semantically unrelated training data
  • The effect is not observed when the teacher and student are built on different base models
  • “Distillation could propagate unintended traits, even when developers try to prevent this via data filtering”
  • “Models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence”
  • Transmission is demonstrated in narrow data domains:
    • Number sequences
    • Code
    • Chain-of-thought reasoning for math problems
  • “We prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution”
  • Additional experiments
    • Cross-model transmission: no effect when the teacher and student have different base models
    • In-context learning: models cannot detect hidden traits in the context
      • Does this depend on the size of the model? Consider “Signs of introspection in large language models”.
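The gradient-descent theorem quoted above can be illustrated with a toy sketch. This is not the paper's construction, just an assumed setup: a softmax classifier where the student starts from the same base parameters as the teacher (before the teacher's fine-tuning), takes one small gradient step on the teacher's output distribution, and ends up closer to the teacher as measured by average KL divergence. All variable names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 5  # input dim, number of classes (arbitrary toy sizes)

# Teacher = shared base parameters plus a "fine-tuning" perturbation;
# student starts at the shared base (the theorem's shared-initialization setting).
W_base = rng.normal(size=(k, d))
W_teacher = W_base + 0.5 * rng.normal(size=(k, d))
W_student = W_base.copy()

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

X = rng.normal(size=(100, d))  # arbitrary inputs ("training distribution")

def avg_kl(W_s):
    """Average KL(teacher || student) over the inputs."""
    total = 0.0
    for x in X:
        p_t = softmax(W_teacher @ x)
        p_s = softmax(W_s @ x)
        total += np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return total / len(X)

# Gradient of the cross-entropy against the teacher's (soft) output
# distribution w.r.t. the student's weights: E[(p_student - p_teacher) x^T].
grad = np.zeros_like(W_student)
for x in X:
    p_t = softmax(W_teacher @ x)
    p_s = softmax(W_student @ x)
    grad += np.outer(p_s - p_t, x)
grad /= len(X)

kl_before = avg_kl(W_student)
W_student_after = W_student - 0.1 * grad  # one small gradient step
kl_after = avg_kl(W_student_after)
print(kl_after < kl_before)
```

The single step reduces the average KL to the teacher, i.e. the student moves toward the teacher in distribution, regardless of which inputs `X` were drawn, which is the toy-scale analogue of the claim that the training distribution does not matter.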

Flashcards




Related posts