State of automatic differentiation in Julia

The ecosystem can be decoupled from the AD implementation via ChainRules.jl. I’ve been pushing this idea of “glue AD”, AD systems that integrate well with others, as the direction to go:

Each of these systems has trade-offs, so it’s advantageous if code doesn’t have to be rewritten to make use of a different AD package. In that sense, something like SciML/Integrals.jl (a common interface for quadrature and numerical integration for the SciML scientific machine learning organization) can give all quadrature packages in the Julia language AD rules, and then, boom, all AD packages which utilize ChainRules will be compatible with it. While we’re not there yet (some ChainRules work needs to be done), we’re quite close, at which point the choice of AD package will be more about “which AD package for which part of the code”.
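
As a concrete illustration of the glue idea, here is a minimal sketch (the function `mysquare` is a made-up stand-in for, say, a quadrature routine): define one `rrule` with the current ChainRulesCore API, and any ChainRules-aware reverse-mode AD, such as recent Zygote versions, will pick it up without needing to trace into the function.

```julia
using ChainRulesCore, Zygote

mysquare(x) = x^2  # hypothetical library function standing in for e.g. a quadrature call

function ChainRulesCore.rrule(::typeof(mysquare), x)
    y = mysquare(x)
    mysquare_pullback(ȳ) = (NoTangent(), 2x * ȳ)  # custom adjoint for the function
    return y, mysquare_pullback
end

Zygote.gradient(mysquare, 3.0)  # (6.0,) uses the custom rrule instead of tracing into mysquare
```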

To some extent, things like DifferentialEquations.jl and Turing.jl are already doing this, as they are compatible with ForwardDiff, Zygote, Tracker, and ReverseDiff (the four top packages), and a dispatch choice selects between them (see the sketch after the list below for all four on one function). I hope we can make this more pervasive, since they do have different semantics and different optimizations:

  • ForwardDiff is probably the fastest chunked scalar forward-mode you can make, but it’s Array of Structs instead of Struct of Arrays, so it’s not as optimized for large vectorized operations.
  • Zygote is great for large vectorized operations because of all of its overloads, but its scalar operations are quite slow and it cannot handle mutation.
  • ReverseDiff is great for scalarized operations, especially if you can use its compiled mode (i.e. the function doesn’t have branches), in which case it’s essentially optimal. But it cannot support GPUs, and it has limited vectorized operations (which should improve when ChainRules support is added).
  • Tracker is a bit slower than ReverseDiff, but like ReverseDiff it can handle limited forms of mutation, and it’s compatible with GPUs, so there are some cases where it comes into play (though a lot less than the other three these days).
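
Here is the sketch promised above: the same pure-Julia function differentiated with each of the four backends through their standard gradient APIs, including ReverseDiff’s compiled-tape mode (only valid when the function has no value-dependent branches).

```julia
using ForwardDiff, Zygote, ReverseDiff, Tracker

loss(x) = sum(abs2, x .- 1)   # simple scalar-valued test function
x = rand(10)

g_fd = ForwardDiff.gradient(loss, x)   # chunked scalar forward mode
g_zy = Zygote.gradient(loss, x)[1]     # source-to-source reverse mode
g_rd = ReverseDiff.gradient(loss, x)   # taped reverse mode

# ReverseDiff's compiled mode: record the tape once, compile it, reuse it
tape  = ReverseDiff.GradientTape(loss, x)
ctape = ReverseDiff.compile(tape)
g_rdc = similar(x)
ReverseDiff.gradient!(g_rdc, ctape, x)

g_tr = Tracker.data(Tracker.gradient(loss, x)[1])  # tracked reverse mode
```

All five results should agree to floating-point tolerance; the differences are purely in how the work gets done.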

ModelingToolkit is a symbolic library, not necessarily an AD library, but you can use it to trace code and build functions for derivatives. Those will be about as optimal as possible for scalar operations, but it can’t handle complex control flow the way a normal AD system can.
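
For example, here is a minimal sketch of that tracing workflow. I’m assuming the names `@variables`, `Differential`, `expand_derivatives`, and `build_function` as re-exported by recent ModelingToolkit versions (older releases used a slightly different derivative macro):

```julia
using ModelingToolkit

@variables x
expr  = x^2 + sin(x)                  # trace a simple scalar expression symbolically
D     = Differential(x)
dexpr = expand_derivatives(D(expr))   # symbolic derivative: 2x + cos(x)

df = eval(build_function(dexpr, x))   # compile the symbolic result into a plain Julia function
df(1.5)                               # ≈ 2*1.5 + cos(1.5)
```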

Flux isn’t an AD library; it’s an ML library. Right now it uses Zygote as its AD. Knet is also not an AD library; it uses AutoGrad.jl.

Calculus, FiniteDifferences, and FiniteDiff are finite differencing libraries. FiniteDifferences is for scalar operations and FiniteDiff does gradients, Jacobians, and Hessians (with sparsity). Calculus is just old and slow…
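
A quick sketch of the two maintained options, assuming their documented entry points:

```julia
using FiniteDifferences, FiniteDiff

# FiniteDifferences: high-order methods, good for scalar derivatives
central_fdm(5, 1)(sin, 1.0)   # 5-point central difference, 1st derivative, ≈ cos(1.0)

# FiniteDiff: cache-friendly gradients, Jacobians, and Hessians
f(x) = sum(abs2, x)
x = rand(10)
g = FiniteDiff.finite_difference_gradient(f, x)
H = FiniteDiff.finite_difference_hessian(f, x)
```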

DualNumbers and HyperDualNumbers are the primitives for building a forward-mode AD, but most users should just use ForwardDiff.jl.
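
To show what “primitive” means here, a small sketch (assuming DualNumbers’ exported `Dual`, `realpart`, and `dualpart`): seed the dual part with 1 and the derivative falls out, which is exactly the mechanism ForwardDiff wraps up with chunking and nesting.

```julia
using DualNumbers, ForwardDiff

f(x) = x^2 + sin(x)

d = f(Dual(1.5, 1.0))           # propagate value + ε * derivative through f
realpart(d), dualpart(d)        # (f(1.5), f'(1.5))

ForwardDiff.derivative(f, 1.5)  # same derivative via the user-facing package
```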

ForwardDiff2.jl is a prototype of a vectorized forward mode via Cassette, so it would be faster than something like ForwardDiff on things like neural networks. Calling it a prototype probably undersells it, since it is actually functional, but Cassette maintenance and performance issues have gotten in the way, so I wouldn’t expect too much more work on it. The work on Zygote’s forward mode should supersede it.

TaylorSeries.jl is for higher-order forward mode (Taylor mode). It’s very efficient and does its job well, but it’s probably more niche since you’d really only want to use it when you want two or more forward-mode derivatives.
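
A small sketch of what Taylor mode buys you: one pass through the function yields several orders at once.

```julia
using TaylorSeries

t = Taylor1(Float64, 4)         # independent variable truncated at order 4
p = sin(t)                      # Taylor expansion of sin around 0
getcoeff(p, 1), getcoeff(p, 3)  # 1.0 and -1/6, the 1st and 3rd Taylor coefficients
```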

AutoGrad.jl is fine as a reverse mode, but it’s not as integrated into the rest of the ecosystem as something like Zygote, so you can’t use it as freely across packages unless everything is pure Julia code. And it’s not the fastest, but it’s fast enough for machine learning. It’s more purpose-built for Knet, which utilizes its own built-in kernels, so Knet can just add adjoints to those and that’s functional.
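
For reference, AutoGrad’s tape-style API looks roughly like this (based on its README: mark inputs with `Param`, record with `@diff`, then query `grad`):

```julia
using AutoGrad

x = Param([1.0, 2.0, 3.0])
y = @diff sum(abs2, x)  # record the computation on a tape
grad(y, x)              # => [2.0, 4.0, 6.0]
```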

Nabla was the Invenia AD, which was mostly good because of the adjoints they had implemented. But @oxinabox is trying to fix this area of the ecosystem by adding all of those adjoints to ChainRules so that the specialized overloads can be used by Zygote, ReverseDiff, and Tracker; after that I would expect them to drop maintenance.

NiLang is a reversible-computing AD and super fast because of that, but it requires code written in a specific DSL, so it’s not as applicable to general code as the other AD systems. That said, for a DSL it’s quite flexible.

XGrad.jl was a super cool idea for source-to-source on Julia expressions, but it’s been superseded by ideas of working on the Julia IR, like Zygote does.

Capstan was discontinued due to Cassette issues and because the author started a company. But its ideas live on in Zygote.

Yota.jl is the interesting one that I haven’t tried yet. If it connects to ChainRules.jl, it’ll probably be another one on the list to always try.

Then there are two more not in the list. @Keno is working on a new AD which will hit the Julia compiler at a different level so that it will be easier to apply compiler optimizations. Then there’s a really fast LLVM-level reverse-mode AD that I know about which should start getting more mentions, but that’s not up to me.

In the end, Turing and SciML have more or less agreed to stay compatible with Tracker, ReverseDiff, Zygote, and ForwardDiff as much as possible, and ChainRules.jl (along with DistributionsAD.jl) makes this quite easy. Since all of these just act on standard Julia code, you can easily support all of them at once; where support is lacking, it’s usually just a matter of hitting a limitation of the AD system, which they will all have, so that’s okay!
