Paper - Born-Again Neural Networks (2018)
- Full title: Born-Again Neural Networks
- Author(s): Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
- Year: 2018
- Link: https://arxiv.org/abs/1805.04770
- Relevant for:
Summary
- Previous work on knowledge distillation focused on model compression, where a small student learns from a large teacher. But rather than compressing models, the student can be given the same architecture as the teacher
- This gives a re-training procedure: once a teacher converges, initialise a fresh student and train it with the dual goals of predicting the true labels and matching the teacher’s output distribution. These students are “Born-Again Networks” (BANs); a minimal training-loop sketch is given at the end of these notes
- Decomposing the gradient produced by knowledge distillation gives a dark knowledge term and a ground-truth component (the decomposition is written out at the end of these notes)
- The dark knowledge term is the difference between the student’s and teacher’s probabilities on the non-true classes; the ground-truth term matches the gradient that would come from the true labels, but rescaled by the teacher’s confidence in the correct class, which acts like a per-sample importance weight
- The authors want to understand how important this confidence weighting of the ground-truth labels is compared to the dark knowledge term, so they compare three training variants (the two ablations are sketched in code at the end of these notes):
- BAN: The student learns from the teacher’s full output distribution (standard distillation with an identical architecture)
- CWTM (Confidence-Weighted by Teacher Max): The student learns only from the ground-truth labels, with each sample weighted by the teacher’s maximum output (its confidence on that sample); no dark knowledge is transferred
- DKPP (Dark Knowledge with Permuted Predictions): The student learns from the teacher’s output distribution, but with the probabilities of the non-argmax classes randomly permuted. The hope is that this corrupts the class-specific dark knowledge while keeping the teacher’s overall confidence structure
- The results:
- BAN performs strongest
- CWTM performs weakest
- DKPP is surprisingly effective: even with the non-argmax probabilities scrambled, the student still improves, so the gains cannot be attributed entirely to class-specific dark knowledge
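
A minimal sketch of the born-again re-training procedure described above, assuming a PyTorch setup; `make_model`, the SGD hyperparameters, the temperature, and the `alpha` mixing weight are illustrative choices of mine, not values from the paper:

```python
import torch
import torch.nn.functional as F

def train_born_again_student(make_model, teacher, loader, epochs=10,
                             temperature=1.0, alpha=0.5, device="cpu"):
    """One born-again generation: a freshly initialised student with the same
    architecture as the converged teacher is trained to both predict the true
    labels and match the teacher's output distribution."""
    student = make_model().to(device)      # new random initialisation
    teacher = teacher.to(device).eval()    # teacher is frozen during student training
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)  # illustrative

    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)

            # Hard-label term: ordinary cross-entropy with the ground truth.
            ce = F.cross_entropy(s_logits, y)

            # Distillation term: KL between softened teacher and student distributions.
            kd = F.kl_div(
                F.log_softmax(s_logits / temperature, dim=1),
                F.softmax(t_logits / temperature, dim=1),
                reduction="batchmean",
            ) * temperature ** 2

            loss = alpha * ce + (1.0 - alpha) * kd
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```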
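
The gradient decomposition described above, written out in my own notation: $q$ is the student’s softmax output, $p$ the teacher’s, $*$ the true class, and the minibatch has $b$ samples.

```latex
% Per-sample gradient of the cross-entropy w.r.t. the student's i-th logit z_i:
%   distillation target (teacher p):  dL/dz_i = q_i - p_i
%   one-hot ground truth y:           dL/dz_i = q_i - y_i
% Splitting the distillation gradient over a minibatch into the true-class
% dimension * and the remaining (dark-knowledge) dimensions:
\[
\frac{1}{b}\sum_{s=1}^{b}\Big[
  \underbrace{(q_{*,s}-p_{*,s})}_{\text{ground-truth component}}
  \;+\;
  \underbrace{\sum_{i\neq *}(q_{i,s}-p_{i,s})}_{\text{dark knowledge}}
\Big]
\]
% Compared with the hard-label gradient (q_{*,s} - 1), the first term pushes
% q_{*,s} toward the teacher's confidence p_{*,s} instead of toward 1, so the
% teacher's confidence acts like a per-sample importance weight on the
% ground-truth signal.
```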
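
And a sketch of how the two ablation baselines could be implemented, again assuming PyTorch; `student_logits`, `teacher_probs` (the teacher’s softmax outputs for a batch), and `labels` are placeholder names of mine, not code from the paper:

```python
import torch
import torch.nn.functional as F

def cwtm_loss(student_logits, labels, teacher_probs):
    """Confidence-Weighted by Teacher Max: train on the ground-truth labels only,
    with each sample's loss weighted by the teacher's maximum output (its
    confidence), normalised over the batch. No dark knowledge is used."""
    confidence = teacher_probs.max(dim=1).values        # teacher max per sample
    weights = confidence / confidence.sum()              # normalise over the batch
    per_sample = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_sample).sum()

def dkpp_targets(teacher_probs):
    """Dark Knowledge with Permuted Predictions: keep the teacher's argmax
    probability in place, but randomly permute the probabilities assigned to
    all other classes, corrupting the class-specific dark knowledge."""
    b, n = teacher_probs.shape
    permuted = teacher_probs.clone()
    argmax = teacher_probs.argmax(dim=1)
    for s in range(b):                                    # per-sample permutation
        others = torch.tensor([i for i in range(n) if i != argmax[s].item()])
        shuffled = others[torch.randperm(len(others))]
        permuted[s, others] = teacher_probs[s, shuffled]
    return permuted

def dkpp_loss(student_logits, teacher_probs):
    """Student matches the permuted teacher distribution via KL divergence."""
    targets = dkpp_targets(teacher_probs)
    return F.kl_div(F.log_softmax(student_logits, dim=1), targets,
                    reduction="batchmean")
```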