Having heard Geoffrey Hinton’s somewhat dismissive account of the contribution by physicists to machine learning in his online MOOC, it was interesting to listen to one of those physicists, Naftali Tishby, here at PI:

**The Information Theory of Deep Neural Networks: The statistical physics aspects** (Naftali Tishby)

**Abstract:** The surprising success of learning with deep neural networks poses two fundamental challenges: understanding why these networks work so well and what this success tells us about the nature of intelligence and our biological brain. Our recent Information Theory of Deep Learning shows that large deep networks achieve the optimal tradeoff between training size and accuracy, and that this optimality is achieved through the noise in the learning process.

In this talk, I will focus on the statistical physics aspects of our theory and the interaction between the stochastic dynamics of the training algorithm (Stochastic Gradient Descent) and the phase structure of the Information Bottleneck problem. Specifically, I will describe the connections between the phase transition and the final location and representation of the hidden layers, and the role of these phase transitions in determining the weights of the network.

Based partly on joint works with Ravid Shwartz-Ziv, Noga Zaslavsky, and Shlomi Agmon.

(See also Steve Hsu’s discussion of a similar talk Tishby gave in Berlin, plus other notes on history.)

I was familiar with the general concept of over-fitting, but I hadn’t realized you could talk about it quantitatively by looking at the mutual information between the output of a network and all the information in the training data that *isn’t* the target label.
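As a toy sketch of my own (not from the talk, and with all names illustrative), the quantities involved are just mutual informations between discrete variables. The snippet below compares a "representation" that memorizes the whole input with one that keeps only the label-relevant bit: both carry the same information about the label, but the memorizing one also retains the bit of the input that *isn't* the label, which is the quantity being measured here:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits between two discrete arrays."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=10_000)  # input with ~2 bits of entropy
y = x % 2                            # label: only one bit of x matters
t_copy = x                           # representation that memorizes x
t_good = x % 2                       # representation compressed to the label bit

# t_copy holds roughly 2 bits about x but only 1 bit is about y;
# the surplus bit is "information in the data that isn't the target label".
print(mutual_information(t_copy, x), mutual_information(t_copy, y))
# t_good holds the same 1 bit about y with no surplus.
print(mutual_information(t_good, x), mutual_information(t_good, y))
```

The point of the comparison: overfitting shows up as the gap between what a representation knows about the input and what it knows about the label.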

One often hears the refrain that many ML techniques were known for decades but only became useful once big computational power and huge datasets arrived relatively recently. The unreasonable effectiveness of data is usually described as a surprise, but Tishby claims it was (at least in part) predicted by physicists from large-N limits of statistical mechanics models, and that the computer scientists ignored this. I don’t know nearly enough about the topic to assess the claim.

He clearly has a chip on his shoulder — which naturally makes me like him. His “information bottleneck” paper with Pereira and Bialek was posted to the arXiv in 2000 and apparently rejected by the major CS conferences, but has since accumulated fourteen hundred citations.
