Dumb question: why autodifferentiation is needed in Neural Networks

I know little beyond basic NNs, but why is autodiff needed for deep learning?
Using the backward chain rule it is fairly easy to get the gradient without autodiff, at least when standard activation functions are provided. For example, in BetaML I could implement even conv layers without autodiff…

So, why all this attention to autodiff?


Backprop (== backchain?) is a form of autodiff. Specifically, reverse-mode automatic differentiation.


I think the short answer is “because it’s easier to support arbitrary functions”. Imagine you didn’t have autodiff and you were implementing a Dense layer with a transformation function f. The gradient your package needs to support to enable gradient descent depends on the choice of f, so you would have to hardcode a rule for each function you want to support, which is a lot. It’s easy if all you want to support is tanh, but for arbitrary functions it’s just easier to rely on autodiff.

Of course, this only applies if you optimize your weights using gradient descent. If you use gradient-free methods, you don’t need autodiff at all.
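To sketch that point: without autodiff, the package author maintains a table of hand-derived rules, one per supported activation, and every new f means deriving and shipping another rule. A minimal Python illustration (scalar-only, and the names are my own invention, not any package’s API):

```python
import math

# Hand-written derivative rules: one entry per supported activation.
# Supporting a new f means deriving and hardcoding yet another rule.
ACTIVATIONS = {
    "tanh":    (math.tanh,             lambda z: 1 - math.tanh(z) ** 2),
    "relu":    (lambda z: max(z, 0.0), lambda z: 1.0 if z > 0 else 0.0),
    "sigmoid": (lambda z: 1 / (1 + math.exp(-z)),
                lambda z: (s := 1 / (1 + math.exp(-z))) * (1 - s)),
}

def dense_backward(name, w, x, upstream):
    """Gradient of the loss w.r.t. w for y = f(w * x), scalar case.
    Chain rule: dL/dw = upstream * f'(w * x) * x."""
    _, fprime = ACTIVATIONS[name]
    return upstream * fprime(w * x) * x
```

For example, `dense_backward("relu", 2.0, 3.0, 1.0)` gives `3.0`, since relu’s slope at z = 6 is 1 and the input x = 3 multiplies through. Any activation absent from the table simply cannot be used.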


Well, the most commonly used activation functions are fairly simple.
In my DenseLayer I let the user choose: the default is a “map table” pairing well-known functions (relu, tanh, softmax, …) with their corresponding derivatives; if users want a custom function, they can provide its derivative and I use that in the gradient computation; only if they don’t do we fall back to autodiff.

This doesn’t seem to be a big burden on the user’s shoulders, and it allows complete networks to be autodiff-free…
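A rough Python sketch of that resolution order (the names are hypothetical, and numerical differencing stands in here for the real autodiff fallback):

```python
import math

# Default "map table" from well-known activations to their derivatives.
DERIVATIVES = {
    math.tanh: lambda z: 1 - math.tanh(z) ** 2,
}

def numerical_derivative(f, z, h=1e-6):
    # Generic fallback (stand-in for autodiff): central differences.
    return (f(z + h) - f(z - h)) / (2 * h)

def derivative_of(f, user_derivative=None):
    """Resolution order: user-supplied rule first, then the lookup
    table of well-known functions, then the generic fallback."""
    if user_derivative is not None:
        return user_derivative
    if f in DERIVATIVES:
        return DERIVATIVES[f]
    return lambda z: numerical_derivative(f, z)
```

So a user supplying `f` and optionally `f'` covers the common cases, and only an unknown function without a derivative reaches the fallback.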

It depends on which users we are talking about, what their background is, and how much time they have. It also depends on how many different activation functions they want to try.

Besides, it’s not just about activation functions. You may want to add special types of connections or layers, attention mechanisms, or even funky stuff like differential equation solvers inside your network. And when you do the product of every possible choice at every possible level, it becomes too much to implement by hand.

From a broader perspective, autodiff lets unrelated codebases compose well, because each one does not need to provide its derivatives in a unified format: the code is the derivative.


I admit I don’t have much experience implementing “funky” kinds of layers, but the beauty of backpropagation is that it is quite composable… I have implemented (slow) convolutional layers, pooling layers, “group” layers, etc., and yes, each time I had to take care of the gradient computation. But (1) this was independent of the whole network structure; it was just two functions computing the propagation to the inputs and to the weights. And (2) it was composable with a possible user-provided function, so the user is just asked for the function and, if it is not well known, its derivative.

I don’t need to take care of many cases, but I see your point… without autodiff we first need some “layer type” whose behaviour (forward/backward) is defined by the NN package author, while with autodiff it is not only the activation function but the whole “layer type” that can be defined by the user.
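The two per-layer functions described above (propagation to the inputs and to the weights) can be sketched like this; scalar weights and hypothetical names, just to show the composability:

```python
class Dense:
    """Toy layer y = w * x (scalar for brevity). Each layer knows only
    its two local backward rules; composing layers is the chain rule."""
    def __init__(self, w):
        self.w = w

    def forward(self, x):
        self.x = x          # cache the input for the backward pass
        return self.w * x

    def backward_input(self, upstream):   # dL/dx = dL/dy * w
        return upstream * self.w

    def backward_weight(self, upstream):  # dL/dw = dL/dy * x
        return upstream * self.x

def backprop(layers, x, upstream=1.0):
    # Forward through the stack, then walk back, each layer handling
    # its own gradients independently of the overall network structure.
    for layer in layers:
        x = layer.forward(x)
    grads = []
    for layer in reversed(layers):
        grads.append(layer.backward_weight(upstream))
        upstream = layer.backward_input(upstream)
    return x, list(reversed(grads))
```

With `backprop([Dense(2.0), Dense(3.0)], 5.0)` the output is 30 and the weight gradients are 15 and 10, matching the chain rule for y = w2·w1·x, without either layer knowing about the other.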

But for “basic” users who just want a set of predefined layer types, with a (large) set of predefined functions, it works pretty well without AD :slight_smile: :slight_smile:

Well, it’s not needed in any technical sense. It is just a nice example of separation of concerns, i.e., autodiff is a more general and lower-level functionality than neural networks, making life easier for users and often also for implementers.

ChainRulesCore.jl nicely documents how autodiff hooks into the function call machinery. As in other toolboxes, the backward-mode gradient code can be built either with methods on custom data types, e.g., ReverseDiff.TrackedVector, which track function calls and record them on a so-called tape, or by source-to-source transformation of the forward-only code, as Zygote does. In principle, PyTorch and JAX use a similar approach of tracking or code transformation.
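A toy Python version of the tracking approach (nothing like ReverseDiff’s actual implementation, just the tape idea): overloaded operators record each primitive call on a shared tape, and `backward()` sweeps the tape in reverse, accumulating gradients via the chain rule.

```python
class Tracked:
    """Minimal tape-based reverse-mode AD for scalars."""
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def _record(self, value, inputs, local_grads):
        # Every primitive call appends (output, inputs, local partials).
        out = Tracked(value, self.tape)
        self.tape.append((out, inputs, local_grads))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value,
                            (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            (self, other), (other.value, self.value))

    def backward(self):
        # Reverse sweep: propagate each output's gradient to its inputs.
        self.grad = 1.0
        for out, inputs, local_grads in reversed(self.tape):
            for inp, g in zip(inputs, local_grads):
                inp.grad += out.grad * g
```

Any composition of the recorded primitives differentiates itself: for z = x·y + x with x = 3, y = 4, the reverse sweep yields dz/dx = y + 1 = 5 and dz/dy = x = 3, with no per-function rules beyond `+` and `*`.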

In any case, differentiation rules only need to be provided for some set of primitives and all functions calling only these will just work. Yet, for performance reasons (or other limitations on the set of supported primitives) it often makes sense to provide the gradient code for higher-level operations explicitly, e.g., NNlib.jl defines custom rules for convolutions (line 341ff). While similar to your approach, this still separates auto-diff from neural networks, as the gradient code is simply tied to the conv function which can be used independently of a Conv layer. On the other hand, some layer gradients are easier to understand when staying at the level of its primitive computations, e.g., batch normalization.
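The NNlib-style idea of attaching a hand-written rule to a higher-level function, rather than tracing through its primitives, can be sketched in Python like so (`defgrad`, `vjp` and the numeric fallback are all hypothetical stand-ins, not any library’s API):

```python
import math

# Registry of explicit rules for higher-level ops. Sigmoid could be
# traced through exp/add/div, but a fused rule that reuses the forward
# result is cheaper and more stable: d/dz sigmoid(z) = s * (1 - s).
CUSTOM_RULES = {}

def defgrad(f, rule):
    CUSTOM_RULES[f] = rule      # rule(output, z) -> local gradient

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

defgrad(sigmoid, lambda s, z: s * (1.0 - s))

def vjp(f, z, upstream=1.0):
    """Vector-Jacobian product for a unary op: use the registered rule
    if one exists, otherwise fall back to central differences (standing
    in for tracing through the op's primitive decomposition)."""
    out = f(z)
    if f in CUSTOM_RULES:
        return upstream * CUSTOM_RULES[f](out, z)
    h = 1e-6
    return upstream * (f(z + h) - f(z - h)) / (2 * h)
```

Here `vjp(sigmoid, 0.0)` uses the fused rule and returns 0.25, while an unregistered function still works through the generic path; the rule is tied to the function itself, not to any layer type.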

I guess the approach of providing forward and backward methods for each supported layer is how some frameworks started historically. E.g., in PyTorch you can still define custom gradients for your layers by overriding backward in subclasses (as explained here, it has also been separated from the neural layers in the meantime).


AutoDiff isn’t needed for neural networks, deep or otherwise.
We have only really had autodiff ubiquitously used in deep learning for about the last 6 years.
(though autodiff in other contexts is much much older)

The backpropagation paper, which is really what made neural nets as we know them today become a thing, was from 1986.
Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
As has been pointed out, this can be thought of as a kind of reverse-mode autodiff, but it lacks the feature of decomposing functions (whether via tracing or source-code analysis) to work out how to automatically compose their derivatives; instead it gave an algorithm for humans to do that by hand. (And as a description of autodiff it was not novel, but rather a reinvention, bringing the idea to light in the ML context.)

Meanwhile, AutoGrad (Dougal Maclaurin, David Duvenaud and Matt Johnson) came much later. I want to say 2012, but I can’t easily find evidence of it before 2014.
Autograd is the implementation of reverse-mode autodiff that really popularized autodiff in the machine learning community. The idea of reverse-mode autodiff itself has been invented many times.
(I don’t have a citation for AutoGrad, but Dougal’s thesis was 2015 and it is described in chapter 4.)

Even when that was out, it wasn’t ubiquitously used. I barely used it during my PhD, which I finished in 2018, and I implemented some very complicated neural networks (tree-structured URvNNs in particular). I would have been better off using it, but during most of my PhD it wasn’t well enough known that my supervisors had heard of it. I was definitely still helping other students hand-write gradient functions in 2017.

Loads of great neural network research and deployment happened in that time.

In particular, deep learning became a thing around 2009 with DBN pretraining, though that mostly went away once ReLU showed it wasn’t needed.
That was:
Bengio Y, Lamblin P, Popovici D, Larochelle H (2007). Greedy Layer-Wise Training of Deep Networks. NIPS.
Xavier Glorot, Antoine Bordes, Yoshua Bengio (2011). Deep Sparse Rectifier Neural Networks. AISTATS. (This introduced the rectifier and softplus activation functions; the second is a smooth version of the first.)

Something I remember from the days before autograd was ubiquitous:
papers in top machine learning conferences which introduced new structures or new activation functions would often dedicate a solid 20-30% of the paper to showing how to compute the derivative of their thing, since if we couldn’t work out how to do it by hand, no one could use it. Now it often isn’t even mentioned.

Autodiff is very useful


It’s worth pointing out that you’ve basically implemented a simple, minimal AD already with this. The key word is “composable”: as @oxinabox noted, not having the composability AD offers means writing gradients for most new code by hand.

Another way to look at it is what happens at the limit. If you keep expanding the set of “layers” and maybe make some of them lower-level, suddenly you end up with a system that looks a lot like early TensorFlow or Theano. Any reasonable person would say those two have ADs.

Backprop is a mathematical algorithm, not a particular implementation. Instead of thinking about backprop vs. AD, think about backprop via AD. There’s an interesting discussion to be had about at what point a library for composable backprop becomes “AD”, but that’s probably one for another thread.


Lol, I remember that when reviewing the first NIPS papers that started skipping those derivations and just referring to AD, some people thought they were not even being serious or reproducible. Now, with AD libraries being somewhat fragile, I sometimes wonder if they had a point.