Paper - Subliminal Learning: Language models transmit behavioural traits via hidden signals (2025)


  • Full title: Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
  • Author(s): Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
  • Year: 2025
  • Link: https://arxiv.org/abs/2507.14805
  • Relevant for:

Summary

  • Language models transmit behavioural traits to student models via semantically unrelated training data
  • The effect is not observed when the teacher and student are built on different base models
  • “Distillation could propagate unintended traits, even when developers try to prevent this via data filtering”
  • “Models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence”
  • Transmission is demonstrated in narrow data domains:
    • Number sequences
    • Code
    • Chain-of-thought reasoning for math problems
  • “We prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution”
  • Additional experiments
    • Cross-model transmission: no effect when the teacher and student have different base models
    • In-context learning: models cannot detect hidden traits in the context
      • Does this depend on the size of the model? Consider “Signs of introspection in large language models”.
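The gradient-descent theorem quoted above can be illustrated with a toy sketch. This is not the paper's construction, just an assumed setup: a softmax classifier where the student starts from the same base parameters as the teacher (before the teacher's fine-tuning), takes one small gradient step on the teacher's output distribution, and ends up closer to the teacher as measured by average KL divergence. All variable names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 5  # input dim, number of classes (arbitrary toy sizes)

# Teacher = shared base parameters plus a "fine-tuning" perturbation;
# student starts at the shared base (the theorem's shared-initialization setting).
W_base = rng.normal(size=(k, d))
W_teacher = W_base + 0.5 * rng.normal(size=(k, d))
W_student = W_base.copy()

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

X = rng.normal(size=(100, d))  # arbitrary inputs ("training distribution")

def avg_kl(W_s):
    """Average KL(teacher || student) over the inputs."""
    total = 0.0
    for x in X:
        p_t = softmax(W_teacher @ x)
        p_s = softmax(W_s @ x)
        total += np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return total / len(X)

# Gradient of the cross-entropy against the teacher's (soft) output
# distribution w.r.t. the student's weights: E[(p_student - p_teacher) x^T].
grad = np.zeros_like(W_student)
for x in X:
    p_t = softmax(W_teacher @ x)
    p_s = softmax(W_student @ x)
    grad += np.outer(p_s - p_t, x)
grad /= len(X)

kl_before = avg_kl(W_student)
W_student_after = W_student - 0.1 * grad  # one small gradient step
kl_after = avg_kl(W_student_after)
print(kl_after < kl_before)
```

The single step reduces the average KL to the teacher, i.e. the student moves toward the teacher in distribution, regardless of which inputs `X` were drawn, which is the toy-scale analogue of the claim that the training distribution does not matter.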

Flashcards




Related posts