State of AD in 2024

Hi,
I’m new to the Julia language and am trying to understand the AD ecosystem. From what I understand, Julia wants to broaden AD beyond the kinds of models you might find in PyTorch and JAX, while retaining similar performance on CPU and GPU. I read that Zygote differentiates low-level Julia IR, but attention has now shifted to other projects like Diffractor and Enzyme, which differentiate at an even finer level.

Anyway, I found this post discussing the state of AD from around last year, so perhaps an update on the current tools is in order? I’m excited to use y’all’s packages.

2 Likes

Good question!
There are plenty of libraries out there, which can make it hard to wrap your head around it all. That’s why the page

https://juliadiff.org/

was updated in January 2024 to reflect the current state of affairs. You can find a summary below, which I restricted to “automatic differentiation” in the most common sense – leaving finite differences and symbolic approaches aside.

Forward mode

Relevant when you have few inputs and many outputs, rather easy to implement, can handle a vast subset of Julia.

The main packages are ForwardDiff.jl (or PolyesterForwardDiff.jl for a multithreading speed boost) and Enzyme.jl.
Diffractor.jl is still experimental, and I would say not yet suited for general use?
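
For reference, here is a minimal toy sketch of the ForwardDiff.jl entry points (the functions are made up, just to show the calls):

```julia
using ForwardDiff

f(x) = sum(abs2, x) / 2                    # toy scalar-valued function

x = rand(5)
g = ForwardDiff.gradient(f, x)             # gradient of a scalar function
J = ForwardDiff.jacobian(y -> y .^ 2, x)   # Jacobian of a vector function
d = ForwardDiff.derivative(sin, 1.0)       # scalar-to-scalar derivative
```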

Reverse mode

Relevant when you have few outputs and many inputs (typically in optimization), much harder to implement, can handle a narrower subset of Julia.

The main packages are Zygote.jl and Enzyme.jl:

  • Deep learning (e.g. Flux.jl, Lux.jl) tends to use Zygote.jl for its good support of vectorized code and BLAS. Restrictions: no mutation allowed, scalar indexing is slow.
  • Scientific machine learning (e.g. SciML) tends to use Enzyme.jl for its good support of mutation and scalar indexing. Restrictions: your code better be type-stable, and the entry cost is slightly higher (but the devs are extremely helpful, shoutout to @wsmoses).
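
To illustrate the difference in flavor, here is a toy sketch (my own made-up function, not a benchmark): Zygote is called on non-mutating code, while Enzyme is happy to differentiate straight through mutation using shadow storage.

```julia
using Zygote, Enzyme

# Zygote: non-mutating, vectorized code
f(x) = sum(abs2, x) / 2
x = rand(5)
g_zygote, = Zygote.gradient(f, x)          # tuple with one entry per argument

# Enzyme: handles mutation; the gradient is accumulated into the shadow `dx`
function loss!(out, x)
    out[1] = 0.0
    for xi in x
        out[1] += xi^2 / 2
    end
    return nothing
end

out, dout = [0.0], [1.0]                   # seed the output adjoint with 1
dx = zero(x)
Enzyme.autodiff(Reverse, loss!, Const, Duplicated(out, dout), Duplicated(x, dx))
# dx now matches g_zygote
```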

So how do you choose?

Picking the right tool for the job is a tricky endeavor.
Inspired by a past unification attempt (AbstractDifferentiation.jl), @hill and I have been working hard on DifferentiationInterface.jl, which provides a common syntax for every possible AD backend (all 13 of them).
It is still in active development (expect registration next week), but it already has most of what you need to make an informed comparison between backends, notably thanks to the DifferentiationInterfaceTest.jl subpackage.
We’re eager for beta testers!
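
To give you a taste of the syntax, here is a toy sketch (the exact API may still shift a bit before registration; the backend types come from ADTypes.jl):

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff, AutoZygote
import ForwardDiff, Zygote                 # the actual engines behind the backends

f(x) = sum(abs2, x) / 2
x = rand(5)

grad_fd = gradient(f, AutoForwardDiff(), x)   # same call...
grad_zy = gradient(f, AutoZygote(), x)        # ...different backend
```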

30 Likes

That’s a pretty good summary. I would say it like this:

Forward-Mode

  1. ForwardDiff.jl is very mature, and there are a lot of tools around it, like PreallocationTools.jl to make it fully non-allocating, SparseDiffTools.jl to optimize it for sparse Jacobians, and PolyesterForwardDiff.jl for multithreading.
  2. Enzyme.jl is also rather mature in forward mode.

So in summary, ForwardDiff is rather good and there’s lots of nice tooling around it, but Enzyme’s forward mode works really well too, and many tools are upgrading to it. Enzyme can give a bit of a speed boost, but it’s not major in many cases, so that migration is slow because it’s not a huge deal.
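
For reference, here is what those two look like on a throwaway scalar function (a sketch of mine; Enzyme’s forward-mode return convention has shifted across versions, so treat the last line loosely):

```julia
using ForwardDiff, Enzyme

g(x) = x^2 + sin(x)                        # throwaway scalar function

# ForwardDiff: classic dual-number forward mode
ForwardDiff.derivative(g, 2.0)             # ≈ 2 * 2.0 + cos(2.0)

# Enzyme forward mode: seed a tangent of 1.0 alongside the primal input
res = Enzyme.autodiff(Forward, g, Duplicated(2.0, 1.0))
# `res` holds the same directional derivative; the exact tuple layout has
# changed across Enzyme versions, so inspect it interactively.
```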

Reverse-Mode

  1. Zygote.jl is still the standard for machine learning codes. There’s a robust ChainRules.jl ecosystem, so many packages have special rules set up that improve the differentiation. However, it does not support mutation, and scalar indexing is slow.
  2. Enzyme.jl is slowly becoming the standard here for general codes. In the last few months it gained BLAS overloads, and now a good chunk of linear algebra is handled in an optimal way (this was one of the remaining things Zygote did better than Enzyme). Enzyme is extremely fast on mutating code and scalar indexing. Enzyme does not handle CuBLAS yet, which is why it has not caught on in ML spaces, but its rules system EnzymeRules launched a bit less than a year ago, so it will take time for rules for non-native Julia codes to get coverage. It’s simply a better basis for an AD engine, so once more rules cover the ML space I would expect that adoption to happen. Some major changes from before include Enzyme gaining support for the GC/allocations, and some support for type-unstable code. It needs more support for type-unstable code before we can truly say it’s a “general purpose AD” for the Julia language; right now there are edge cases which will not work because of this, but it has been progressing rapidly.
  3. ReverseDiff.jl still has many uses because it’s a simple reverse-mode AD that can handle mutation and scalar indexing in its scalar mode, along with a special tape compilation (sketched below) that makes it rather fast while retaining flexibility similar to ForwardDiff.jl. But this role is quickly being taken over by Enzyme.jl, so new users should probably just use Enzyme.jl.

In summary, Zygote vs. Enzyme is the major discussion, and right now neither one is a superset of the other. Zygote has CuBLAS overloads and more rules throughout the ecosystem, but Enzyme has better bones and handles mutation and scalar indexing well, so it doesn’t need as many rules. In last year’s thread it was mentioned that Enzyme was missing BLAS support, GC support, and support for handling type-unstable code, and that was why Zygote was still the standard. In 2024, it has BLAS support, it has GC support, and it supports a large amount (but not all) of type-unstable code. In this state, Enzyme is better than most people would think it is because it has improved so rapidly, so folks who haven’t adopted it yet just aren’t aware of its improvements. But it still has some bad error messages to improve.
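
For reference, here is roughly what the tape compilation mentioned in point 3 looks like on a toy function of mine:

```julia
using ReverseDiff

f(x) = sum(abs2, x) / 2                    # toy function
x = rand(100)

# Record the operations once, then compile the tape for fast re-execution
tape  = ReverseDiff.GradientTape(f, x)
ctape = ReverseDiff.compile(tape)

grad = similar(x)
ReverseDiff.gradient!(grad, ctape, x)      # reuses the compiled tape, no re-tracing
# Caveat: a compiled tape freezes control flow, so it is only valid when the
# branches taken do not depend on the values of `x`.
```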

Ecosystem Support

Almost more important than the AD engine though is ecosystem support. This is something that Julia has done really well, with:

  1. SciML’s common interface has very good coverage with ForwardDiff, Zygote, and Enzyme. In particular, standard libraries which have lots of overloads include:
    a. LinearSolve.jl automatically applies implicit differentiation for all linear solves
    b. NonlinearSolve.jl automatically applies implicit differentiation for all nonlinear solves
    c. DifferentialEquations.jl has a sophisticated adjoint system which automatically chooses a stable adjoint method for ODEs, SDEs, DDEs, and DAEs (see the sketch after this list).
    d. Integrals.jl automatically applies a continuous derivative rule for ForwardDiff and Zygote differentiation of 1-D and N-D integration; Enzyme rules should come in the near future.
    e. Optimization.jl is notably missing AD integration rules for now, but this is slated for the next few months. This will make all nonlinear optimizers support implicit differentiation by default.
  2. Neural network libraries use NNlib.jl, which has AD overloads for all of the standard ML functions. EnzymeRules integration was added not too long ago, which makes Enzyme support the major core of ML layer definitions in an efficient way. This makes Flux and Lux rather robust in the most common cases.
  3. You can use ForwardDiff through Julia-generated CUDA.jl kernels, and Enzyme has added some support for CUDA.jl kernels, which means many of the Julia-written CUDA kernels support autodiff out of the box. This is rather unique, since Python ML libraries do not support kernel generation and instead expect you to call standard kernels; we have found kernel generation to be 20x-100x faster than JAX/PyTorch in many nonlinear cases, and it generalizes to nonlinear optimizers as well. This is probably more advanced than what the standard user would reach for, but for HPC folks and library developers I think it is a major differentiator of the Julia ecosystem right now; users of libraries will just see that some things are a good chunk faster.
  4. There are many more libraries that support rules. The easiest way to see this is to check which libraries depend on ChainRulesCore.jl. It currently has 235 direct dependents, with most of those libraries adding rules overloads, and 3672 indirect dependents which thus benefit from these rules overloads. In an ecosystem of just over 10,000 packages, that tells you how much is integrated with the rules system!
  5. As mentioned before, there’s lots of autodiff helper libraries. Some rather common ones to mention are:
    a. PreallocationTools.jl, which helps with preallocation in forward- and reverse-mode contexts to make your code fully non-allocating (though it is notably not compatible with Enzyme)
    b. PolyesterForwardDiff.jl, which parallelizes forward-mode AD in a multithreaded way. This is rather hard to beat for most “not huge” cases, and there are plans to apply this multithreading to other ADs.
    c. SparseDiffTools.jl, which adds coloring algorithms so that any AD library can use color differentiation for the fast computation of sparse Jacobians. It both automatically finds the sparsity pattern and sets up the AD to optimally calculate the non-zero entries. It supports ForwardDiff.jl, FiniteDiff.jl, and PolyesterForwardDiff.jl. The color analysis can be done for reverse mode easily by just using the transpose of the matrix, though some work needs to be done to add higher-level support for ReverseDiff.jl and Enzyme.jl.
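
To make 1c concrete, here is a rough sketch on the classic Lotka-Volterra toy problem; this is the usual SciMLSensitivity workflow, where differentiating through `solve` just works:

```julia
using OrdinaryDiffEq, SciMLSensitivity, Zygote

# Lotka-Volterra, just to show the workflow
function lotka!(du, u, p, t)
    du[1] =  p[1] * u[1] - p[2] * u[1] * u[2]
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2]
end

u0 = [1.0, 1.0]
p  = [1.5, 1.0, 3.0, 1.0]
prob = ODEProblem(lotka!, u0, (0.0, 10.0), p)

# With SciMLSensitivity loaded, Zygote hits the adjoint rules for `solve`,
# and a stable adjoint method is chosen automatically for the problem.
loss(p) = sum(solve(prob, Tsit5(); p = p, saveat = 0.1))
dp, = Zygote.gradient(loss, p)
```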

Alternative Engines Beyond AD

A few other projects to know about in this space:

  1. Symbolics.jl does symbolic differentiation. It recently gained support for array operators, though it cannot do matrix calculus yet. It allows for partial evaluation and mathematical simplification before differentiation, so it can be more efficient than other approaches in some cases, but of course it has the downside of expression growth on larger codes (see the sketch after this list).
  2. FastDifferentiation.jl uses the D* algorithm which is quasi-symbolic but handles larger expressions and sparsity really well. It’s the fastest way to do large (sparse) Jacobians. We hope to integrate it into Symbolics.jl rather soon so that it’s easier to employ on general codes.
  3. FiniteDiff.jl deserves a mention as a finite-differencing library for gradients, Jacobians, and Hessians that is rather optimal. It will beat autodiff in many cases because it is non-allocating and tries every trick in the book. Don’t sleep on it, though of course finite differencing does have some floating-point accuracy trade-offs.
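
To make the Symbolics.jl item concrete, here is a toy sketch of symbolic differentiation plus code generation (the expression is made up):

```julia
using Symbolics

@variables x y
expr = x^2 * sin(y) + y                    # made-up expression

# Symbolic partial derivative and a symbolic Jacobian
dexpr = Symbolics.derivative(expr, x)              # 2x*sin(y)
J     = Symbolics.jacobian([expr, x + y^3], [x, y])

# Turn the symbolic result back into an ordinary Julia function
f_oop, f_ip = build_function(J, [x, y]; expression = Val(false))
f_oop([1.0, 2.0])                          # numeric 2×2 Jacobian
```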

Some other cool projects to know

Inevitably, there’s other projects to be aware of. Some are:

  1. TaylorDiff.jl is like ForwardDiff.jl but for efficient higher-order AD. It’s not mature yet but is getting there.
  2. Tapir.jl is a pretty new reverse-mode AD with nice performance on scalar indexing and mutation, while working at the Julia level for better user feedback than Enzyme. This might be a nice competitor in the near future?
  3. AbstractDifferentiation.jl and DifferentiationInterface.jl are creating swappable AD platforms. They aren’t quite as efficient as one can get if you really know the AD engines through and through, but they are probably as efficient as how most people would use the ADs.

tl;dr: use ForwardDiff.jl and Zygote.jl, though if your code is type-stable, use Enzyme.jl. I think by the end of this year I would start to say Enzyme.jl will be the default for “most” users; it just needs a few more rules and a little bit more type-unstable support. Though note that even if you use Zygote.jl, the library may define a chain rule that uses Enzyme.jl internally, so they are not mutually exclusive.

20 Likes

I just want to add more details on that last part, vis-à-vis DifferentiationInterface. This week, I finished implementing caches / tapes / configs / other optimizations for every backend except Enzyme (see #115). A lot of that was inspired by issues from AbstractDifferentiation, in which @ChrisRackauckas explained the needs of the SciML ecosystem (like #14 or #41).

The goal of DifferentiationInterface was precisely to make these optimizations (1) possible and (2) transparent to the user, so that they don’t have to “really know the AD engines through and through”. See the tutorial for an example of how it works. In a way, it’s AbstractDifferentiation rebuilt with some hindsight (more details on that in the upcoming official announcement).
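
For the curious, here is roughly what that preparation mechanism looks like on a toy function (note: these signatures follow a more recent DifferentiationInterface release than the pre-registration one discussed here, so treat it as a sketch):

```julia
using DifferentiationInterface
using ADTypes: AutoReverseDiff
import ReverseDiff

f(x) = sum(abs2, x) / 2                       # toy function
x = rand(100)
backend = AutoReverseDiff(; compile = true)   # request a compiled tape

prep = prepare_gradient(f, backend, x)        # builds tapes / caches once
grad = gradient(f, prep, backend, x)          # reuses them on every call
```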

Of course I have probably missed a few spots, but overall I would say the following: at least for the main traditional backends (ForwardDiff, ReverseDiff, Zygote), you don’t leave much performance on the table by using DifferentiationInterface. I’m happy to be proven wrong, and even more happy for people to contribute fixes :wink:

11 Likes

@ChrisRackauckas @gdalle the swiftness and depth of your answers (and so many others) on this forum are a cornerstone of this community! Thank you so much!

I hope AD gets the LinearSolve.jl treatment where there are many small and specialized libraries that get unified under a common interface. This kind of strategy makes it easy to try out many different approaches while prototyping before switching to the specific library when an appropriate method is found.

4 Likes