AutoDiff isn’t needed for neural networks, deep or otherwise.
We have only really had autodiff ubiquitously used in deep learning for about the last 6 years.
(though autodiff in other contexts is much much older)
The backpropagation paper, which is really when neural nets as we know them today became a thing, was from 1986:
Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
As has been pointed out, this can be thought of as a kind of reverse-mode autodiff, but it lacks the feature of decomposing functions (whether via tracing or source code analysis) to work out how to automatically compose their derivatives; instead it gave an algorithm for humans to do that by hand. (And as a description of autodiff it was not novel, but rather a reinvention, bringing the idea to light in the ML context.)
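To make "decomposing functions to compose their derivatives" concrete, here is a toy sketch of tracing-based reverse-mode autodiff. It is not any particular library's implementation and all the names are illustrative: each primitive records its inputs and the local derivatives of its output with respect to them, and a reverse sweep over that trace applies the chain rule automatically.

```python
# Toy tracing-based reverse-mode autodiff (illustrative sketch only).
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def backward(output):
    # Topologically order the recorded trace, then sweep it in reverse so each
    # node's gradient is complete before being pushed to its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad

# f(x, y) = sin(x * y) + x; the gradients are composed with no hand derivation.
x, y = Var(1.5), Var(2.0)
f = sin(x * y) + x
backward(f)
print(x.grad, y.grad)   # df/dx = y*cos(x*y) + 1, df/dy = x*cos(x*y)
```

The 1986 recipe is exactly this chain-rule sweep, but with the human playing the role of the trace and the composition step.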
Meanwhile Autograd (Dougal Maclaurin, David Duvenaud and Matt Johnson) was much later. I want to say it was 2012, but I can't trivially find evidence of it before 2014.
Autograd is the implementation of reverse-mode autodiff that really popularized autodiff in the machine learning community. The idea of reverse-mode autodiff has been invented many times.
(I don’t have a citation for Autograd itself, but Dougal’s thesis was 2015 and it is described in chapter 4.)
Even when it was out, it wasn’t ubiquitously used. I barely used it during my PhD, which I finished in 2018, and I implemented some very complicated neural networks (tree-structured URvNNs in particular). I would have been better off using it, but during most of my PhD it wasn’t well known enough for my supervisors to have heard of it. I was definitely still helping other students hand-write gradient functions in 2017.
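For anyone who never had to do it, here is roughly what "hand-writing a gradient function" looked like in practice: derive the backward pass on paper, code it next to the forward pass, and sanity-check it against finite differences. This is a minimal sketch for a single dense layer with a tanh nonlinearity; the names and shapes are just for illustration.

```python
import numpy as np

def forward(W, x):
    return np.tanh(W @ x)

def backward(W, x, grad_out):
    # Hand-derived: d tanh(u)/du = 1 - tanh(u)^2, with u = W @ x
    u = W @ x
    delta = grad_out * (1.0 - np.tanh(u) ** 2)
    grad_W = np.outer(delta, x)
    grad_x = W.T @ delta
    return grad_W, grad_x

# Finite-difference check of grad_W for the scalar loss L = sum(forward(W, x))
rng = np.random.default_rng(0)
W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
grad_W, _ = backward(W, x, np.ones(3))

eps = 1e-6
i, j = 1, 2
W_pert = W.copy()
W_pert[i, j] += eps
numeric = (forward(W_pert, x).sum() - forward(W, x).sum()) / eps
print(grad_W[i, j], numeric)   # should agree to roughly 5-6 decimal places
```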
Loads of great neural network research and deployment happened in that time.
In particular, deep learning became a thing around 2009 with DBN (deep belief network) pretraining, though that mostly went away once ReLU showed it wasn’t needed.
That was:
Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy Layer-Wise Training of Deep Networks. NIPS (2007).
and
Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. AISTATS (2011). That’s the paper on the rectifier and softplus activation functions; the second one is a smooth version of the first.
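Just to spell out that last point (a quick sketch, not taken from the paper itself): the rectifier is piecewise linear with a kink at zero, and softplus is its smooth approximation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + e^x); naive form, overflows for large x

x = np.linspace(-4.0, 4.0, 9)
print(relu(x))
print(softplus(x))   # close to relu(x) away from 0, but smooth through 0
```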
Something I remember from the days before autograd was ubiquitous:
Papers in top machine learning conferences which introduced new structures or new activation functions would often dedicate a solid 20-30% of the paper to showing how to compute the derivative of their thing, since if we couldn't work out how to do it by hand, no one could use it. Now it often isn't even mentioned.
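For contrast, this is roughly what it looks like today with the Autograd package mentioned above (a sketch; the activation here, swish, i.e. x * sigmoid(x), is used purely as a stand-in for "some new activation"):

```python
# The derivative of a newly invented activation comes for free by tracing
# numpy operations; there is nothing to derive by hand.
import autograd.numpy as np
from autograd import elementwise_grad

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

dswish = elementwise_grad(swish)    # derivative obtained automatically

x = np.linspace(-3.0, 3.0, 7)
print(swish(x))
print(dswish(x))
```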
Autodiff is very useful