Slides - Dark knowledge (2014)
- Full title: Dark knowledge
- Author(s): Geoffrey Hinton, Oriol Vinyals, Jeff Dean
- Year: 2014
- Link: https://www.ttic.edu/dl/dark14.pdf
- Relevant for:
Summary
- Train the student on a combination of the teacher's soft targets (probabilities softened with a temperature) and the true hard targets
- Distillation as a way of combining an ensemble of models into one model
- Dropout as a form of model averaging over the $2^H$ possible sub-networks (one per dropout mask over $H$ hidden units). At test time, using all the hidden units with their outgoing weights halved is equivalent to taking the (normalised) geometric mean of the predictions of all these models.
- Training a student on a teacher's soft targets for only the 7s and 8s in MNIST still achieves 87% test accuracy on all the other classes
- Soft targets are a very good regulariser
- Self-distillation is a good regulariser
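The combined soft/hard objective from the first bullet can be sketched as below. This is a minimal NumPy sketch, not code from the slides; the function names, the $T^2$ scaling of the soft term, and the `alpha` weighting are assumptions based on the standard distillation setup.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: higher T gives softer probabilities.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=3.0, alpha=0.5):
    # Soft term: cross-entropy between teacher and student, both at temperature T.
    # Scaled by T^2 so its gradients stay comparable to the hard term's.
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student_T))
    # Hard term: ordinary cross-entropy against the one-hot label at T = 1.
    p_student = softmax(student_logits, 1.0)
    hard = -np.log(p_student[hard_label])
    return alpha * T**2 * soft + (1 - alpha) * hard
```

Raising `T` flattens both distributions, so the student is pushed to match the teacher's relative probabilities on the wrong classes (the "dark knowledge"), not just the argmax.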
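The dropout bullet's claim can be checked numerically for the simplest case: a single logistic output unit with no bias, where the weight-halving trick exactly equals the normalised geometric mean over all $2^H$ masks. This is an illustrative sketch; the variable names are mine.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = 4
rng = np.random.default_rng(0)
w = rng.normal(size=H)  # weights from hidden units to the logistic output
h = rng.normal(size=H)  # hidden unit activations

# Enumerate all 2^H dropout masks and collect each sub-model's output.
ps = np.array([sigmoid(np.dot(w * np.array(m), h))
               for m in itertools.product([0, 1], repeat=H)])

# Normalised geometric mean of the 2^H Bernoulli predictive distributions.
geo_p = np.exp(np.mean(np.log(ps)))
geo_q = np.exp(np.mean(np.log(1.0 - ps)))
gm = geo_p / (geo_p + geo_q)

# Test-time trick: keep every unit but halve the outgoing weights.
halved = sigmoid(np.dot(0.5 * w, h))
```

The identity holds because each mask's logit is $z_m = \sum_i m_i w_i h_i$, the geometric mean works out to $\sigma(\bar z)$, and the mean logit over all masks is exactly $\tfrac12 \sum_i w_i h_i$.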