Lecture - Theories of Deep Learning MT25, VI, Controlling the variance of the Jacobian's spectrum


[[Course - Theories of Deep Learning MT25]]

  • We now look at the spectrum of the Jacobian of the network at initialisation; this is motivated by empirical results showing that the spectrum of the Jacobian strongly affects how easy the network is to train.
  • Results from random matrix theory can be used to calculate the distribution of this spectrum; a rough empirical sketch of computing the spectrum is given after this list.
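To make the first point concrete, here is a minimal numpy sketch (my own illustration, not taken from the lecture) of computing the singular-value spectrum of the input-output Jacobian of a random fully connected network at initialisation. The choices of tanh activation, width 500, depth 20, zero biases, and Gaussian weights with variance sigma_w^2 / fan_in are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, sigma_w = 500, 20, 1.0  # illustrative choices

x = rng.standard_normal(width)  # a single random input
jacobian = np.eye(width)        # running product of layer Jacobians

for _ in range(depth):
    # Weights with variance sigma_w^2 / fan_in, zero biases
    W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
    h = W @ x
    x = np.tanh(h)
    # Layer Jacobian is diag(tanh'(h)) @ W; chain it onto the running product
    jacobian = (1.0 - np.tanh(h) ** 2)[:, None] * W @ jacobian

singular_values = np.linalg.svd(jacobian, compute_uv=False)
print(f"mean sv: {singular_values.mean():.3f}, "
      f"max sv: {singular_values.max():.3f}, "
      f"min sv: {singular_values.min():.3e}")
```

Repeating this over many random draws gives the empirical spectrum whose limiting distribution the random-matrix-theory results describe; how spread out the singular values are around 1 is what the trainability observations refer to.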

Papers mentioned

Further associated reading

  • Identifying natural depth scales of information propagation: https://arxiv.org/pdf/1611.01232.pdf
  • Further details on the role of activation functions: https://arxiv.org/pdf/1902.06853.pdf
  • Principles for selecting activation functions: https://arxiv.org/pdf/2105.07741.pdf
  • Early results on correlation of inputs (Chapter 2 in particular): https://www.cs.toronto.edu/~radford/ftp/thesis.pdf
  • Rigorous treatment of Gaussian Process perspective, infinite width: https://arxiv.org/pdf/1711.00165.pdf
  • Rigorous treatment of Gaussian Process perspective, finite width: https://arxiv.org/pdf/1804.11271.pdf
  • Higher order terms and width proportional to depth scaling: https://arxiv.org/pdf/2106.10165.pdf
  • Specifics for random ReLU nets:
    • https://arxiv.org/pdf/1801.03744.pdf
    • https://arxiv.org/pdf/1803.01719.pdf



Related posts