Paper - Born-Again Neural Networks (2018)
- Full title: Born-Again Neural Networks
- Author(s): Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
- Year: 2018
- Link: https://arxiv.org/abs/1805.04770
- Relevant for:
Summary
- Previous work on knowledge distillation focused on model compression, where a small student learns from a large teacher. But rather than compressing models, the student can be given the same architecture as the teacher
- This gives a re-training procedure: once a teacher converges, initialise a fresh student and train it with the dual goals of predicting the true labels and matching the teacher’s output distribution. These students are “Born-Again Networks” (BANs); a minimal training-loop sketch is given at the end of these notes
- Decomposing the gradient produced by knowledge distillation gives a dark knowledge term and a ground-truth component (the decomposition is written out at the end of these notes)
- The dark knowledge term is the difference between the student’s and teacher’s probabilities on the non-true classes; the ground-truth term matches the gradient that would come from the true labels, but rescaled by the teacher’s confidence in the correct class, which acts like a per-sample importance weight
- The authors want to understand how important this confidence weighting of the ground-truth labels is compared to the dark knowledge term, so they compare three training variants (the two ablations are sketched in code at the end of these notes):
- BAN: The student learns from the teacher’s full output distribution (standard distillation with an identical architecture)
- CWTM (Confidence-Weighted by Teacher Max): The student learns only from the ground-truth labels, with each sample weighted by the teacher’s maximum output (its confidence on that sample); no dark knowledge is transferred
- DKPP (Dark Knowledge with Permuted Predictions): The student learns from the teacher’s output distribution, but with the probabilities of the non-argmax classes randomly permuted. The hope is that this corrupts the class-specific dark knowledge while keeping the teacher’s overall confidence structure
- The results:
- BAN performs strongest
- CWTM performs weakest
- DKPP is surprisingly effective: even with the non-argmax probabilities scrambled, the student still improves, so the gains cannot be attributed entirely to class-specific dark knowledge
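
A minimal sketch of the born-again re-training procedure described above, assuming a PyTorch setup; `make_model`, the SGD hyperparameters, the temperature, and the `alpha` mixing weight are illustrative choices of mine, not values from the paper:

```python
import torch
import torch.nn.functional as F

def train_born_again_student(make_model, teacher, loader, epochs=10,
                             temperature=1.0, alpha=0.5, device="cpu"):
    """One born-again generation: a freshly initialised student with the same
    architecture as the converged teacher is trained to both predict the true
    labels and match the teacher's output distribution."""
    student = make_model().to(device)      # new random initialisation
    teacher = teacher.to(device).eval()    # teacher is frozen during student training
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)  # illustrative

    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)

            # Hard-label term: ordinary cross-entropy with the ground truth.
            ce = F.cross_entropy(s_logits, y)

            # Distillation term: KL between softened teacher and student distributions.
            kd = F.kl_div(
                F.log_softmax(s_logits / temperature, dim=1),
                F.softmax(t_logits / temperature, dim=1),
                reduction="batchmean",
            ) * temperature ** 2

            loss = alpha * ce + (1.0 - alpha) * kd
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```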
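
The gradient decomposition described above, written out in my own notation: $q$ is the student’s softmax output, $p$ the teacher’s, $*$ the true class, and the minibatch has $b$ samples.

```latex
% Per-sample gradient of the cross-entropy w.r.t. the student's i-th logit z_i:
%   distillation target (teacher p):  dL/dz_i = q_i - p_i
%   one-hot ground truth y:           dL/dz_i = q_i - y_i
% Splitting the distillation gradient over a minibatch into the true-class
% dimension * and the remaining (dark-knowledge) dimensions:
\[
\frac{1}{b}\sum_{s=1}^{b}\Big[
  \underbrace{(q_{*,s}-p_{*,s})}_{\text{ground-truth component}}
  \;+\;
  \underbrace{\sum_{i\neq *}(q_{i,s}-p_{i,s})}_{\text{dark knowledge}}
\Big]
\]
% Compared with the hard-label gradient (q_{*,s} - 1), the first term pushes
% q_{*,s} toward the teacher's confidence p_{*,s} instead of toward 1, so the
% teacher's confidence acts like a per-sample importance weight on the
% ground-truth signal.
```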
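
And a sketch of how the two ablation baselines could be implemented, again assuming PyTorch; `student_logits`, `teacher_probs` (the teacher’s softmax outputs for a batch), and `labels` are placeholder names of mine, not code from the paper:

```python
import torch
import torch.nn.functional as F

def cwtm_loss(student_logits, labels, teacher_probs):
    """Confidence-Weighted by Teacher Max: train on the ground-truth labels only,
    with each sample's loss weighted by the teacher's maximum output (its
    confidence), normalised over the batch. No dark knowledge is used."""
    confidence = teacher_probs.max(dim=1).values        # teacher max per sample
    weights = confidence / confidence.sum()              # normalise over the batch
    per_sample = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_sample).sum()

def dkpp_targets(teacher_probs):
    """Dark Knowledge with Permuted Predictions: keep the teacher's argmax
    probability in place, but randomly permute the probabilities assigned to
    all other classes, corrupting the class-specific dark knowledge."""
    b, n = teacher_probs.shape
    permuted = teacher_probs.clone()
    argmax = teacher_probs.argmax(dim=1)
    for s in range(b):                                    # per-sample permutation
        others = torch.tensor([i for i in range(n) if i != argmax[s].item()])
        shuffled = others[torch.randperm(len(others))]
        permuted[s, others] = teacher_probs[s, shuffled]
    return permuted

def dkpp_loss(student_logits, teacher_probs):
    """Student matches the permuted teacher distribution via KL divergence."""
    targets = dkpp_targets(teacher_probs)
    return F.kl_div(F.log_softmax(student_logits, dim=1), targets,
                    reduction="batchmean")
```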