Learning automatic differentiation

I am a beginner at programming in Julia and am studying the basics of deep learning with Flux and Zygote. So far I have only encountered a few examples of backprop, but I realized (by reading the discussions here and watching YouTube videos such as Mike Innes on the Julia Language channel) that automatic differentiation is an extremely broad subject, and I know nothing about it.

Where did you study the topics regarding automatic differentiation that you apply in julia?
What do I need to study to understand how automatic differentiation libraries like Zygote or ForwardDiff work?
Are there any books that also have examples in julia?
Thanks, sorry for my English.


Maybe worth starting with MikeInnes/diff-zoo on GitHub: Differentiation for Hackers

There are a variety of good academic reviews of the subject, tutorials and textbooks if you want to learn more.


Thank you.
I’m a beginner; which textbook or tutorial would you recommend?


Maybe start with https://www.jmlr.org/papers/volume18/17-468/17-468.pdf


My recommended reading/watching list is in the ChainRules docs, at the bottom of the page.

I have been told that the ChainRules docs themselves are quite educational.
We definitely tried hard to make them so, though I think they could be better.


Are there any books that also have examples in julia?

There are no good books on automatic differentiation, let alone ones that have examples in Julia.
Griewank and Walther (2008), “Evaluating Derivatives”, is fairly comprehensive (for reverse mode), but not all that readable.
The alternatives that I am aware of are neither comprehensive nor readable.


I had a very similar experience. I’m still learning Automatic Differentiation (AD), but here are some resources and links that helped my understanding.

When I started going through the texts, lectures, and documents describing AD, I had a hard time understanding the difference between Forward Mode AD and Reverse Mode AD, and how the different techniques relate to each other.

So here’s an introductory video on YouTube that I liked and that helped me get a broad overview of the topic.

What is Automatic Differentiation? by Ari Seff

  • I found this video helped clarify the difference between Forward Mode and Reverse Mode and how their use cases can differ.
  • It’s a fairly broad overview, but it helped me better understand the context of everything.
  • The video also has an introduction to how AD differs from symbolic and numerical differentiation.
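To give a concrete feel for how forward-mode AD differs from numerical differentiation, here is a minimal sketch using dual numbers. It is written in plain Python with only the standard library so that it stays self-contained. The names `Dual`, `derivative`, and `finite_difference` are hypothetical and chosen for illustration; this is not how ForwardDiff is actually implemented, only the same basic idea in miniature.

```python
import math


class Dual:
    """A value paired with its derivative (a + b*eps, where eps**2 == 0)."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def _lift(x):
        # Treat plain numbers as constants (derivative 0).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def sin(x):
    # One hand-written rule per primitive: d/dx sin(x) = cos(x).
    if isinstance(x, Dual):
        return Dual(math.sin(x.val), math.cos(x.val) * x.der)
    return math.sin(x)


def derivative(f, x):
    # Forward mode: seed the input with derivative 1 and read it off the output.
    return f(Dual(x, 1.0)).der


def finite_difference(f, x, h=1e-6):
    # Numerical differentiation: an approximation whose error depends on h.
    return (f(x + h) - f(x - h)) / (2 * h)


def f(x):
    return x * x * x + 2 * x + sin(x)


x0 = 1.5
exact = 3 * x0**2 + 2 + math.cos(x0)  # derivative worked out by hand
ad = derivative(f, x0)
fd = finite_difference(f, x0)
print(abs(ad - exact), abs(fd - exact))
```

The forward-mode result agrees with the hand-derived value up to floating-point rounding, while the finite-difference result carries a small error that depends on the step size `h`.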

After I got the gist of how forward mode and reverse mode differ, I got confused about the difference between Backpropagation and Reverse Mode AD. From what I learned, Backpropagation is a specific case of Reverse Mode AD. From my understanding reverse mode AD is more general and can handle multiple outputs, whereas backpropagation is only for a single scalar output.

Then I also realized that I had already seen Reverse Mode AD in Python, in the form of backprop in machine learning algorithms. That got me thinking about the differences between the Python and Julia AD ecosystems. There was a great conversation on Slack about that. Feel free to check here:

Now that I have a better contextual understanding I’m starting to dive deeper into that reading list that was mentioned by Lyndon in the ChainRules docs. I’m trying to deepen my understanding of how Automatic Differentiation (AD) works.

Let me know if there are any questions or any corrections to what I said. I’m still learning, but I hope this helps!


Thanks to everyone, every one of your responses has been helpful.


From my understanding reverse mode AD is more general and can handle multiple outputs, whereas backpropagation is only for a single scalar output.

Could be, but remember the field is absolutely terrible at consistency of naming things.
Probably because it’s good at reinventing things and giving them new names.

Of historical interest:
From 1986 (when the term was popularized) until about 2012, backpropagation in neural networks was done “by hand”,
where you coded a rule for each layer in the network and then composed them,
versus automatic differentiation, which decomposes a function that it doesn’t have a rule for into parts that it does have rules for.

It’s all applying the chain rule “backwards”
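
As a rough illustration of "a rule per primitive, then the chain rule applied backwards", here is a minimal reverse-mode sketch in plain Python (standard library only). The names `Var`, `tanh`, and `backward` are hypothetical and exist only for this example. It records the computation graph at run time, which is just one possible design; it is not meant to show how Zygote or ChainRules work internally.

```python
import math


class Var:
    """A value that remembers which inputs it came from."""

    def __init__(self, val, parents=()):
        self.val = val
        self.grad = 0.0
        # Pairs of (parent Var, local partial derivative w.r.t. that parent).
        self.parents = parents

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Var) else Var(x)

    def __add__(self, other):
        other = Var._lift(other)
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = Var._lift(other)
        return Var(self.val * other.val,
                   ((self, other.val), (other, self.val)))

    __rmul__ = __mul__


def tanh(x):
    # Hand-written rule for one primitive: d/dx tanh(x) = 1 - tanh(x)**2.
    t = math.tanh(x.val)
    return Var(t, ((x, 1.0 - t * t),))


def backward(out):
    """Accumulate d(out)/d(v) into v.grad for every v that out depends on."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)  # parents are appended before their children

    visit(out)
    out.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad


# A one-neuron "layer": y = tanh(w * x + b)
w, b, x = Var(0.5), Var(-1.0), Var(2.0)
y = tanh(w * x + b)
backward(y)
print(w.grad, x.grad, b.grad)

# A variable used more than once has its contributions summed.
a = Var(3.0)
z = a * a + a
backward(z)
print(a.grad)
```

Nothing here knows the derivative of the whole "layer". The code only has rules for `+`, `*`, and `tanh`, and `backward` composes them from the output towards the inputs in a single pass, which is why one scalar output yields gradients for all inputs at once.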