Paper - Born-Again Neural Networks (2018)


  • Full title: Born-Again Neural Networks
  • Author(s): Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
  • Year: 2018
  • Link: https://arxiv.org/abs/1805.04770
  • Relevant for:

Summary

  • Previous work on knowledge distillation focussed on model compression, where a smaller student mimics a larger teacher. But rather than compressing models, you can give the student the same architecture (and capacity) as the teacher
  • You can use a re-training procedure: once a teacher converges, initialise a new student and train it with the dual goals of matching the teacher’s output distribution and predicting the true labels (see the training-step sketch after this list). These are “Born-Again Networks” (BANs)
  • Decomposing the gradient produced by knowledge distillation gives a dark knowledge term and a ground-truth component (written out after this list)
  • The dark knowledge term is the difference between the student’s and teacher’s predicted probabilities over the wrong (non-true) classes, and the ground-truth term resembles the gradient that would come from the ground-truth labels, but rescaled by the teacher’s confidence in the correct class
  • Want to understand how important the confidence weighting of the ground-truth labels is compared to the dark knowledge term, so consider three different approaches (sketched in code after this list):
    • BAN: The student learns from the teacher’s exact output distribution
    • CWTM (Confidence-Weighted by Teacher Max): The student learns from the ground truth labels only, with each sample weighted by the teacher’s max predicted probability (its confidence)
    • DKPP (Dark Knowledge with Permuted Predictions): The student learns from the teacher’s output distribution, but where the values of the non-argmax classes are randomly permuted. The hope is that this corrupts the dark knowledge.
  • The results:
    • BAN performs strongest
    • CWTM performs weakest
    • DKPP is surprisingly effective
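
A minimal PyTorch sketch of the born-again training step described above. The combined loss (KL divergence to the teacher’s output distribution plus cross-entropy on the true labels) and all names and hyperparameters here are my own illustration, not the paper’s released code.

```python
# Hypothetical sketch of one born-again generation: the converged teacher is
# frozen and a freshly initialised student of the same architecture is trained
# against both the teacher's output distribution and the true labels.
import torch
import torch.nn.functional as F

def ban_loss(student_logits, teacher_logits, labels, temperature=1.0):
    """Distillation term (match the teacher's distribution) + supervised term."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    distill = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    supervised = F.cross_entropy(student_logits, labels)
    return distill + supervised

def train_student(student, teacher, loader, epochs=1, lr=0.1):
    teacher.eval()                                   # teacher is frozen
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = ban_loss(student(x), t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```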
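
The gradient decomposition referred to above, written out. The notation (q for the student’s softmax output, p for the teacher’s, * for the true class, b samples per mini-batch) is my reconstruction of the paper’s argument, not a verbatim quote.

```latex
% Per-sample gradient of the distillation cross-entropy with respect to the
% student's i-th logit z_i (q = student softmax, p = teacher softmax):
\frac{\partial \mathcal{L}}{\partial z_i} = q_i - p_i
% (with one-hot labels this would be q_i - y_i).

% Summing over a mini-batch of b samples and splitting off the true class *
% gives a ground-truth-like term plus the dark knowledge term:
\frac{1}{b}\sum_{s=1}^{b}\left(q_{*,s} - p_{*,s}\right)
\;+\;
\frac{1}{b}\sum_{s=1}^{b}\sum_{i \neq *}\left(q_{i,s} - p_{i,s}\right)

% When the teacher is confident and correct (p_{*,s} \approx 1) the first term
% reduces to the usual supervised gradient q_{*,s} - 1; in general it acts like
% the ground-truth gradient with samples re-weighted by the teacher's
% confidence in the correct class.
```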
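
A sketch of how the two ablation targets could be built, to make the distinction concrete. The function names and the exact permutation scheme are my assumptions, not code from the paper.

```python
# Hypothetical helpers for the two ablations; pairs with the ban_loss sketch above.
import torch
import torch.nn.functional as F

def cwtm_loss(student_logits, teacher_logits, labels):
    """CWTM: ground-truth labels only, each sample weighted by the teacher's
    max predicted probability (its confidence), normalised over the batch."""
    confidence = F.softmax(teacher_logits, dim=1).max(dim=1).values
    weights = confidence / confidence.sum()
    per_sample = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_sample).sum()

def dkpp_targets(teacher_logits):
    """DKPP: keep the teacher's argmax probability in place but randomly
    permute the probabilities of the other classes, scrambling the
    class-specific dark knowledge."""
    probs = F.softmax(teacher_logits, dim=1)
    out = probs.clone()
    top = probs.argmax(dim=1)
    for s in range(probs.size(0)):
        rest = torch.tensor([i for i in range(probs.size(1)) if i != top[s].item()])
        out[s, rest] = probs[s, rest[torch.randperm(rest.numel())]]
    return out  # train the student against these permuted distributions
```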

Flashcards




Related posts