As far as I can tell from the documentation, both ForwardDiff and ReverseDiff can already do automatic differentiation on arbitrary Julia functions. Also, Zygote depends on ForwardDiff. What unique feature does Zygote bring to the picture?

I am not an expert, so feel free to wait for answers from other users; they are much more experienced than I am.

I am studying deep learning, and with Zygote I can easily differentiate Flux neural network models.

It is also much faster than the others you mentioned (I have tried it vs ReverseDiff with some simple feedforward networks).

As far as I know (I’m a student), forward-mode automatic differentiation is used when you have few parameters; otherwise you have to use reverse-mode automatic differentiation.

No, all the libraries you mentioned have limitations.

Where did you read that?

For the study of deep learning I always use Zygote.

Sorry for my English.

Thank you for your kind response. You mentioned Zygote being faster than ReverseDiff. Is this generally the case? Are there circumstances in which ForwardDiff/ReverseDiff perform faster than Zygote?

The main reason I was making this post was that I would like to know *what* the main limitations of ForwardDiff/ReverseDiff are compared with Zygote (and vice versa). From the documentation of these libraries (Limitations of ForwardDiff · ForwardDiff and Limitations of ReverseDiff - ReverseDiff.jl), the requirements seem very lenient. One notable limitation of these two libraries is that they do not support mutation, but Zygote does not seem to support mutation either.

ForwardDiff is listed as a dependency of Zygote in Project.toml (Zygote.jl/Project.toml at 6b89a068e40bad9673e163e9aee43f2bc4940242 · FluxML/Zygote.jl · GitHub).

For a few-sentence summary of those (and several more) AD packages, see https://juliadiff.org/.

Zygote and ReverseDiff are both reverse-mode AD, but while ReverseDiff pushes custom types through your code to compute the backward pass (hence your code must be written to accept generic types), Zygote effectively rewrites the source code of your functions and works through more arbitrary code. For example, this fails:

```
ReverseDiff.gradient((x::Vector{Float64}) -> sum(x), ones(10))
```

but replacing ReverseDiff with Zygote works. (Of course, in this trivial example it’s easy to make ReverseDiff work by just dropping the type annotation, but often it’s not this easy, especially if the code you’re differentiating is in someone else’s package.)
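For instance, here is a minimal sketch of the Zygote side of the comparison (assuming Zygote.jl is installed); the same typed closure that trips up ReverseDiff works unchanged, because Zygote transforms the source of the function rather than pushing tracked types through it:

```
using Zygote

# Same concretely-typed closure as in the ReverseDiff example above.
# Zygote.gradient returns a tuple, one entry per argument.
g, = Zygote.gradient((x::Vector{Float64}) -> sum(x), ones(10))

# The gradient of sum with respect to each xᵢ is 1.0.
all(g .== 1.0)  # true
```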

The dependency of Zygote on ForwardDiff is just for a small piece used when broadcasting over CuArrays; Zygote itself is still reverse mode.

To clarify, “reverse-mode” AD is efficient when you have a function f(x) with a small number of outputs f_i and many inputs x_j (in computing \partial f_i/\partial x_j), i.e. for functions mapping x\in\mathbb{R}^m to f \in \mathbb{R}^n with n \ll m. For example, in neural-network training you want the derivative of one loss function (n=1) with respect to millions (m) of network parameters. (The “manual” application of such a technique is also known as an adjoint method, and in the neural-net case it is called backpropagation.)
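As a minimal sketch of the n=1, large-m case (assuming Zygote.jl is installed; the loss function here is made up for illustration), a single reverse pass recovers all m partial derivatives at once:

```
using Zygote

# A scalar "loss" of many parameters: n = 1 output, m = 100_000 inputs.
w = randn(100_000)
loss(w) = sum(abs2, w) / 2   # ∂loss/∂wᵢ = wᵢ

# One reverse pass yields the full gradient vector.
grad, = Zygote.gradient(loss, w)
grad ≈ w  # true
```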

In contrast, forward-mode AD (as in ForwardDiff.jl) is better when there is a small number of inputs and a *large* number of outputs, i.e. when n \gg m, i.e. when you are computing *many* functions of a *few* variables. (It essentially corresponds to “manual” application of the chain rule in the most obvious way.)
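Conversely, a minimal ForwardDiff sketch of the n \gg m case (the function f and its inputs here are made up for illustration): forward mode needs only one pass per input, so a many-outputs/few-inputs Jacobian is cheap.

```
using ForwardDiff

# Many outputs (n = 100) of few inputs (m = 2).
f(x) = [sin(k * x[1]) + cos(k * x[2]) for k in 1:100]

# The full 100×2 Jacobian costs only m = 2 forward passes.
J = ForwardDiff.jacobian(f, [0.5, 1.0])
size(J)  # (100, 2)
```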