Autograd has a lot of untyped fields in its graph-building types and a macro for defining primitives. This makes it work on pretty much everything, but the untyped parts reduce efficiency. On something like a neural net where the matrix multiplies take all of the time, the small amount of dynamic dispatch won't matter and it's a good choice. On functions with a lot of small subfunction calls, though, that dispatch overhead adds up to a non-trivial performance difference.
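For a sense of the interface, here's a minimal sketch assuming AutoGrad.jl's grad function (the toy function is mine):

using AutoGrad

# grad(f) returns a function that computes df/dx; every primitive call is
# recorded dynamically as f runs, which is where the dispatch overhead lives.
f(x) = sin(x) + x^2
∇f = grad(f)
∇f(1.0)  # == cos(1.0) + 2.0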
ReverseDiff and Flux's Tracker are very similar. They are the reverse-mode analogue of ForwardDiff and use types to essentially trace a computation graph. Mike and Jarrett can duke it out, but to me it seems ReverseDiff applies in more places, though that has changed over time. YAAD also uses tracker types; it's a very simple implementation, but probably more similar to these two than not. However, tracker types only trace the branch that the current values actually take. So while you can compile the computation graph and keep it with ReverseDiff, repeated applications of the gradient are only correct if the stored trace is appropriate for the new value. This is a pretty fundamental limitation if you want to build a graph once and spend time optimizing/compiling it for re-use.
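Here's a minimal sketch of that pitfall with ReverseDiff's compiled tapes (the toy function is mine):

using ReverseDiff

f(x) = x[1] > 0 ? sum(abs2, x) : sum(x)  # branches on the value of x

xpos = [1.0, 2.0]
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, xpos))
g = similar(xpos)
ReverseDiff.gradient!(g, tape, xpos)         # correct: replays the branch it traced
ReverseDiff.gradient!(g, tape, [-1.0, 2.0])  # silently wrong: still replays the x[1] > 0 branch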
Zygote is source-to-source, and its paper describes how it can get a performance advantage by allowing all branches to be compiled and optimized at once. Capstan is built on Cassette, which is essentially a form of source-to-source transformation via Cassette's overdubbing. Again, Mike and Jarrett are working on things that are probably more similar than different here, for similar reasons but for different applications. But Zygote already exists while Cassette/Capstan is still more of a near-future thing, so Zygote is the one you can actually reach for today. However, while tracker-based systems are easy to control (you just define a new dispatch on the tracker type that says what the derivative is), I am not sure how customizable source-to-source is. Here's a challenge problem that can give it an issue:
const x = Vector{Float64}(undef, 4)  # pre-allocated cache array

function f!(z, y, x)
    x .= 2 .* y
    z .= sin.(x)
    nothing
end

g!(z, y) = f!(z, y, x)

# Challenge: autodiff the map from y to z computed by g!(z, y)
I am not sure how Zygote would know how to handle the cache array, while with a tracker type you can create a dual cache system that works with type-based AD via multiple dispatch. Capstan might be able to handle this because it's using Cassette, which is essentially a flexible and overridable source-to-source engine, but that remains to be seen.
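By a dual cache system I mean something like this standalone sketch (the names here are illustrative, not a package API): hold one buffer per element type and let multiple dispatch hand back the matching one.

using ForwardDiff

struct TwoCache{A,B}
    plain::A  # buffer for Float64 evaluations
    dual::B   # buffer for ForwardDiff.Dual evaluations
end
get_tmp(c::TwoCache, ::AbstractArray{<:ForwardDiff.Dual}) = c.dual
get_tmp(c::TwoCache, ::AbstractArray) = c.plain

function f_cached!(z, y, cache)  # mirrors f! above, but cache-aware
    x = get_tmp(cache, y)        # dispatch picks the buffer matching eltype(y)
    x .= 2 .* y
    z .= sin.(x)
    nothing
end

DualT = ForwardDiff.Dual{Nothing,Float64,1}
cache = TwoCache(Vector{Float64}(undef, 4), Vector{DualT}(undef, 4))
f_cached!(zeros(4), rand(4), cache)                                                  # hits the plain buffer
f_cached!(Vector{DualT}(undef, 4), ForwardDiff.Dual{Nothing}.(rand(4), 1.0), cache)  # hits the dual buffer

In a real AD call the Dual buffer's tag and chunk size would have to match what the library seeds, but the point stands: the type system routes the cache, no source transformation required.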
So for now, Zygote.jl is awesome if it works for your code. If not, ReverseDiff and Flux's Tracker are good fallbacks, and ReverseDiff can store/compile the computation graph when appropriate to get speeds similar to Zygote, but you have to be careful with the application. Autograd you can easily get working on pretty much anything, but there's a dispatch cost associated with it. Capstan and Cassette might be a beautiful system in the near future for both AD and customizing the source transformation, but it's not here yet, and I'm not sure most Julia users will actually know how to write overdubs.
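When Zygote does work, the interface is about as simple as it gets; a quick sketch (the closure is mine):

using Zygote

Zygote.gradient(x -> sum(sin.(2 .* x)), rand(4))  # returns a 1-tuple holding the gradient array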
For now, I always find ForwardDiff and ReverseDiff robust enough to send through big codes (entire differential equation solvers) with ease, and am waiting to see what happens with source-to-source.
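For instance, taking a Jacobian of an in-place function with ForwardDiff is a one-liner; a small sketch (h! is mine, a cache-free cousin of f! above):

using ForwardDiff

function h!(z, y)  # like f! above, but allocating its own temporary
    x = 2 .* y
    z .= sin.(x)
    nothing
end

y = rand(4)
z = zeros(4)
J = ForwardDiff.jacobian(h!, z, y)  # 4×4; the diagonal is 2 .* cos.(2 .* y)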