At present, three narratives are competing to be the way we understand deep learning. There’s the neuroscience narrative, drawing analogies to biology. There’s the representations narrative, centered on transformations of data and the manifold hypothesis. Finally, there’s a probabilistic narrative, which interprets neural networks as finding latent variables. These narratives aren’t mutually exclusive, but they do present very different ways of thinking about deep learning.
This essay extends the representations narrative to a new answer: deep learning studies a connection between optimization and functional programming.
There are two comments I’d like to make from the point of view of a scientist not working in machine learning.
Although many of the technical details discussed are specific to machine learning, one central argument is not. Scientific models in many fields have made the transition from mathematical formulas to algorithms, i.e. they include branching and looping constructs (assuming you express them in an imperative language, of course). One example is models for biomolecules (my own work), but there are many others. I think it would be worth exploring the similarities between “classical” scientific models and ML models in the context of scientific computing (applying the models) and scientific communication (talking about these models), all the more so because I suspect that the “differentiable programming” point of view might very well unify both types of model into one.
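To make that concrete, here is a toy sketch of what I have in mind (my own illustrative example, using PyTorch rather than any particular scientific code): a tiny “classical” model, a damped oscillator integrated with explicit Euler, written as a program that loops and branches, yet is still differentiable end to end with respect to a model parameter.

```python
# A toy, hedged sketch: a small "classical" scientific model written as a
# program (it loops and branches), differentiated end to end with PyTorch
# autograd. The model and parameter values are arbitrary illustrations.
import torch

def simulate(damping, steps=200, dt=0.01):
    x = torch.tensor(1.0)               # initial displacement
    v = torch.tensor(0.0)               # initial velocity
    for _ in range(steps):              # looping construct
        a = -x - damping * v            # spring force plus damping term
        v = v + dt * a
        x = x + dt * v
        if float(torch.abs(x)) > 10.0:  # branching construct: guard against blow-up
            break
    return x

damping = torch.tensor(0.3, requires_grad=True)
final_x = simulate(damping)
final_x.backward()                      # d(final displacement) / d(damping coefficient)
print(float(final_x), float(damping.grad))
```

The point is not the oscillator itself but that autograd happily differentiates through ordinary control flow, which is what makes the “differentiable programming” framing plausible for both kinds of model.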
When it comes to languages, I think it is important to distinguish languages for expressing these models from languages for writing programs that use these models, whether for training or prediction. We have become used to programs, i.e. computational tools, absorbing scientific models, with the unfortunate result that it has become nearly impossible to talk about models, evaluate them, compare them to each other, etc. For a more detailed argument, see this article.
This is somewhat tangential, but for anyone who isn’t aware of it, this paper seems very enlightening. They have less experimental evidence than I’d like to see (and I’m not aware of any papers where this is tested more extensively), but this makes me feel like I actually understand what is going on. Apparently this has made a lot of noise in the academic machine learning community, so hopefully it’ll actually lead to a comprehensive understanding of why certain methods work well.
I don’t understand why there is such a big buzz about this now. I’ve read some of these papers, and basically what they say is that for deep learning to work there must be an information bottleneck: the hidden layers form a compressed representation, which filters out noise, and that is why the network generalizes. But when I was doing neural networks more than 20 years ago, that was the general understanding back then too, and in fact it’s the basic idea behind any ML model that works with a latent representation: the latent representation is always a down projection, i.e. a compression (see the sketch below), except for kernel models, which project into an infinite-dimensional space (or a finite-dimensional one after sampling).
The only thing new about this series of papers is that the idea is made more explicit and quantitative, and that they identified two phases in the learning process.
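To be concrete about what I mean by a down projection, here is a toy sketch in PyTorch (my own example; the layer widths are arbitrary): each hidden layer maps into a lower-dimensional space, so the latent representation is a compressed version of the input.

```python
# Minimal sketch of a "down projection": each hidden layer has fewer units
# than the previous one, so the latent representation compresses the input.
import torch
import torch.nn as nn

bottleneck_net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),  # 784-dimensional input compressed to 256
    nn.Linear(256, 64), nn.ReLU(),   # compressed further to 64
    nn.Linear(64, 10),               # 10-dimensional output (e.g. class scores)
)

x = torch.randn(32, 784)             # a batch of 32 inputs
print(bottleneck_net(x).shape)       # torch.Size([32, 10])
```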
There is a lot of value in making things more explicit and quantitative. In fact, it is so explicit and quantitative that it now has a Lagrangian with an explicit (and exact) solution, although admittedly in most cases it’s practically impossible to implement that solution directly. I suspect that this will lead to the “next step” past neural networks, since, as they point out here, stochastic gradient descent may be overkill for what it’s achieving. Also, if this paper is correct, it gives an upper bound on the performance of neural networks (see Figure 6).
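For anyone who hasn’t seen it, the Lagrangian in question is, as I understand it (writing from memory), the information bottleneck functional of Tishby, Pereira, and Bialek, minimized over stochastic encoders p(t|x); its exact solution is a self-consistent equation rather than something you can evaluate in closed form for a real dataset, which is why implementing it directly is impractical:

```latex
% Information bottleneck Lagrangian and its formal (self-consistent) solution;
% beta trades compression of X against prediction of Y.
\mathcal{L}\big[p(t \mid x)\big] = I(X;T) - \beta\, I(T;Y),
\qquad
p(t \mid x) = \frac{p(t)}{Z(x,\beta)}
\exp\!\Big(-\beta\, D_{\mathrm{KL}}\big[\, p(y \mid x) \,\|\, p(y \mid t) \,\big]\Big).
```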
My background is in high energy physics, so I am new to this field, but during my time surveying the literature, I’ve found much of it to be extremely unenlightening. The vast majority of it seems to involve attempts to find a good network topology or associated algorithm for a very specific application, so being able to abstract the idea of machine learning to general principles seems like a huge step forward (if that’s indeed what this paper achieves).
To be fair, I should point out that I find their first example here extremely suspicious, since it possesses special structure (O(3) symmetry) that will be completely absent in practical applications. (I suspect there is some good reason for using this example, which can be found in the references, but naively I find it very suspicious.) They do test it on a real dataset at the end of the paper, however.
One other place where Julia can make its mark in DNNs is if it becomes feasible to train some (if not all) of your models on a CPU. In many frameworks this is not possible because multi-core parallelism and memory management are not well implemented; if Julia offers a good solution for this, then there is no need for complex GPU setups.
I would like to help here. I don’t know a lot about the hardware, but I certainly know a lot about neural networks. Where should I start looking if I want to help? Any pointers? Unfortunately, I’m also very new to Julia…