State of automatic differentiation in Julia

Samuel_Ainsworth · July 14, 2020, 9:24pm

There seems to be a plethora of AD frameworks in Julia. I’m curious what the state of the ecosystem looks like currently.

I’m aware of the following packages:

ForwardDiff
ForwardDiff2
Nabla
Tracker
Yota
Zygote
ReverseDiff
AutoGrad
NiLang
ModelingToolkit
XGrad
Calculus
FiniteDifferences
FiniteDiff
TaylorSeries
DualNumbers
HyperDualNumbers
Knet
Capstan
Flux
… surely many more

based on news/info from the following sources:

This is a lot to sort through. What tools has the community settled on? What is the roadmap like, and where are things headed?

datnamer · July 14, 2020, 9:56pm

@MikeInnes is the lead dev and creator of Zygote, which seems to be the most promising package for the future. I think he’d be best placed to answer.

ChrisRackauckas · July 14, 2020, 10:13pm

The ecosystem can be disconnected from the AD implementation via ChainRules.jl. I’ve been mentioning this idea of glue AD, AD systems that integrate well with others, as where to go:

Each of these systems has trade-offs, so it’s advantageous if code doesn’t have to be rewritten to use a different AD package. In that sense, something like SciML/Integrals.jl (a common interface for quadrature and numerical integration) can go and make all quadrature packages in the Julia language have AD rules, and then boom, all AD packages (which utilize ChainRules) will be compatible with it. While we’re not there yet (some ChainRules work needs to be done), we’re quite close, in which case the choice of AD package will be more about “what AD package for what part of the code”.

To some extent, things like DifferentialEquations.jl and Turing.jl are already doing this, as they are compatible with ForwardDiff, Zygote, Tracker, and ReverseDiff (the four top packages), and a dispatch choice chooses between them. I hope we can make this more pervasive since they do have different semantics with different optimizations:

ForwardDiff is probably the fastest chunked scalar forward-mode you can make, but it’s Array of Structs instead of Struct of Array so it’s not as optimized for large vectorized operations.
Zygote is great for large vectorized operations because of all of its overloads, but its scalar operations are quite slow and it cannot handle mutation.
ReverseDiff is great on scalarized operations especially if you can use its compiled mode (i.e. the function doesn’t have branches) in which case it’s essentially optimal. But it cannot support GPUs and it has limited vectorized operations (which should improve when ChainRules support is added)
Tracker is a bit slower than ReverseDiff, but like ReverseDiff it’s able to handle limited forms of mutation, and it’s compatible with GPUs, so there are some cases where it comes into play (but a lot fewer than the other 3 these days).

ModelingToolkit is a symbolic library and not necessarily an AD library, but you can use it to trace code and build functions for derivatives, and those will be as optimal as possible for scalar operations but it can’t handle complex control flow like a normal AD system.

Flux isn’t an AD library, it’s an ML library. Right now it uses Zygote as its AD. Knet is also not an AD library, it uses AutoGrad.jl.

Calculus, FiniteDifferences, and FiniteDiff are finite differencing libraries. FiniteDifferences is for scalar operations and FiniteDiff does gradients, Jacobians, and Hessians (with sparsity). Calculus is just old and slow…

DualNumbers and HyperDualNumbers are the primitives for building a forward-mode AD, but most users should just use ForwardDiff.jl
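To make the "primitives" point concrete, here is a minimal sketch of a dual number in plain Julia. The `Dual` type and the `derivative` helper are illustrative names made up for this example, not the DualNumbers.jl or ForwardDiff.jl API, and only the handful of operations the example needs are defined.

```julia
# A dual number carries a primal value and one derivative alongside it.
struct Dual
    val::Float64  # primal value
    der::Float64  # derivative with respect to the seeded input
end

# Each rule propagates the derivative via the sum, product, or chain rule.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:+(a::Dual, b::Real) = Dual(a.val + b, a.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Seed der = 1 so the result's `der` field is df/dx (hypothetical helper).
derivative(f, x::Real) = f(Dual(x, 1.0)).der

f(x) = 3 * x * x + sin(x) + 2   # analytically, f'(x) = 6x + cos(x)

derivative(f, 1.0)
```

A real package covers far more functions and handles several perturbations at once, which is why most users are better off with ForwardDiff.jl.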

ForwardDiff2.jl is a prototype of a vectorized forward-mode via Cassette, so it would be faster than something like ForwardDiff on things like neural networks. Calling it a prototype probably undersells it because it is actually functional, but Cassette maintenance and performance issues have gotten in the way so I wouldn’t expect too much work on it. The work on Zygote’s forward mode should supersede it.

TaylorSeries.jl is for higher order forward-mode (Taylor-mode). It’s very efficient and does its job well, but it’s probably more niche since you’d really only want to use it when you want two or more forward mode derivatives.

AutoGrad.jl is fine as a reverse-mode, but it’s not as integrated into the rest of the systems as something like Zygote, so you can’t use it as widely willy-nilly around packages unless it’s all pure Julia code. And it’s not the fastest, but it’s fast enough for machine learning. It’s more purpose-built for Knet, which utilizes its own built-in kernels so Knet can just add adjoints to those and that’s functional.

Nabla was the Invenia AD, which was mostly good because of the adjoints they had implemented. But @oxinabox is trying to fix this area of the ecosystem by adding all of those adjoints to ChainRules so that all of the specialized overloads could be used by Zygote, ReverseDiff, and Tracker, and after that I would expect they would drop maintenance.

NiLang is a reversible AD and super fast because of that, but it requires code from a specific DSL so it’s not as applicable to general code as the other AD systems. That said, for a DSL it’s quite flexible.

XGrad.jl was a super cool idea for source-to-source on Julia expressions, but it’s been superseded by ideas of working on the Julia IR like Zygote.

Capstan was discontinued due to Cassette issues and because the author started a company. But its ideas live on in Zygote.

Yota.jl is the interesting one that I haven’t tried yet. If they connected to ChainRules.jl it’ll probably be another one in the list to always try.

Then there’s two more not in the list. @Keno is working on a new AD which will hit the Julia compiler at a different level so that it will be easier to apply compiler optimizations. Then there’s a really fast LLVM-level reverse mode AD that I know about which should start getting more mentions, but that’s not up to me.

In the end, Turing and SciML have kind of agreed to make sure to be compatible with Tracker, ReverseDiff, Zygote, and ForwardDiff as much as possible, and ChainRules.jl (along with DistributionsAD.jl) makes this quite easy. Since all of these just act on standard Julia code, you can easily support all of them at once, where lack of support is more about just hitting a limitation of the AD system, which they will all have so that’s okay!

oxinabox · July 14, 2020, 10:46pm

I came here to say what Chris said.

Things I can add:

For higher order derivatives the options are HyperDualNumbers, TaylorSeries, and nesting AD.
HyperDualNumbers is, from a mathematical standpoint, strictly worse than TaylorSeries.jl because it is doing redundant work.
See “Are HyperDual numbers and degree 2 Taylor Polynomials the same thing?” on Mathematics Stack Exchange.
Practically speaking it is also probably less performant than getting the second derivative via nesting ForwardDiff.
In theory nesting AD leads to exponential work, because of redundant terms showing up at different levels of Faà di Bruno’s formula, and Taylor series solves this.
In practice this only really starts to be a big problem for 4th or higher derivatives.
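As a rough illustration of what "nesting AD" means for a second derivative, here is a sketch with a hand-rolled parametric dual number in plain Julia. `Dual`, `derivative`, and `second_derivative` are hypothetical names for this example only. Unlike ForwardDiff, this toy version has no tagging to guard against perturbation confusion, so treat it as a picture of the idea rather than something to reuse.

```julia
# Parametric so that a Dual can hold other Duals, which is what nesting needs.
struct Dual{T}
    val::T
    der::T
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)

# The multiplicative identity has a zero derivative part.
Base.one(a::Dual) = Dual(one(a.val), zero(a.val))

# First derivative: seed the perturbation with one(x) and read it back out.
derivative(f, x) = f(Dual(x, one(x))).der

# Second derivative: differentiate the derivative (forward over forward).
second_derivative(f, x) = derivative(y -> derivative(f, y), x)

f(x) = x * x * x + x * x   # analytically, f''(x) = 6x + 2

second_derivative(f, 2.0)
```

Each extra level of nesting wraps the number in another `Dual`, which is where the redundant work mentioned above comes from at higher orders.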

You shouldn’t nest Zygote however, because type inference issues start to become really bad.
Nesting Zygote with another AD is fine though.

Finite differencing is effectively forward mode AD but less accurate. Super super robust though.

FiniteDiff.jl is focused on using finite differencing to find jacobians etc, and it’s fast for that. It’s not great for finding derivatives of scalars.
FiniteDifferences.jl focuses on being able to accept almost anything and give back derivatives, e.g. you can give it dictionaries, and I think now (or at least soon) arbitrary structs.
It’s a bit less focused on performance (indeed it originally was created for testing purposes only).
Both FiniteDiff and FiniteDifferences employ some pretty smart algorithms to be very accurate – much better than a naive implementation of finite differencing.

Calculus.jl is old and slow and inaccurate

Knet and Flux are not AD systems, they are Neural Net libraries.
Knet uses AutoGrad.jl, and Flux uses Zygote (it used to use Tracker)

ChainRules is supported by Zygote (reverse mode) and ForwardDiff2.
ReverseDiff.jl support sometime after JuliaCon.
Nabla sometime after that (and then Nabla’s retirement)
Then time to talk about Yota and AutoGrad and Tracker.
At some point also Zygote’s new forward modes and deeper integration.

Tamas_Papp · July 15, 2020, 6:09am

I don’t think that FD should be confused with AD.

ChrisRackauckas · July 15, 2020, 11:39am

It’s the same algorithm, you just store the perturbations in a different dimension.

DNF · July 15, 2020, 11:40am

This surprised me greatly. Firstly, I always heard they were fundamentally different; and secondly, I thought finite differences were very vulnerable to noise and sensitive to stepsize.

Tamas_Papp · July 15, 2020, 11:44am

Yes, I am aware of the theoretical connection, but for people new to the concept approaching it like this may be confusing, especially as FD requires close attention to a host of numerical issues that do not afflict AD.

ChrisRackauckas · July 15, 2020, 11:59am

They aren’t necessarily. Forward mode and reverse mode are fundamentally different: they have different computational complexities, different actions, etc. Forward mode computes Jacobian-vector products, while reverse mode computes vector-transpose-Jacobian products (Jv vs v’J). One computes columns of Jacobians, while the other computes rows.

But forward mode and finite difference? Both compute Jacobian-vector products ((f(x + epsilon*v) - f(x)) / epsilon), both compute columns of Jacobians, and both have the same computational complexity in every case. They are fundamentally the same algorithm. What’s different is that, when x is a real number, autodiff stores its internal values as an N+1 dimensional number, where the extra N dimensions hold the N ongoing perturbations. Finite difference can only do one perturbation at a time, and stores that perturbation as a small piece at the end of the original number (hence having fewer digits and the numerical accuracy issues). So forward-mode AD with a chunk size of 1 and finite difference should usually have around the same computational cost, and with higher chunk sizes AD can get a constant factor reduction from calculating the primal less often, but it’s really not huge. Even with higher chunk sizes, we’ve seen FiniteDiff.jl is usually within 2x of ForwardDiff.jl (because the primal is usually less complex than the derivative, so you’re tacking on a bunch of extra calculations anyway). The main difference is really just accuracy, because AD doesn’t mix the perturbation dimension with the primal dimension.

I always point to https://mitmath.github.io/18337/lecture9/autodiff_dimensions and hope it’s helpful. So yes, finite differencing done correctly is essentially the same algorithm but with a higher error floor.
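A small sketch of that comparison in plain Julia, assuming a toy function and hand-written helpers (`Dual`, `fd_jvp`, and `dual_jvp` are names invented for this example, not any package’s API). Both helpers compute the same Jacobian-vector product; the only difference is where the perturbation is stored.

```julia
# Dual number: the perturbation lives in its own field, separate from the primal.
struct Dual
    val::Float64
    der::Float64
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)

# Toy function R^2 -> R^2 with Jacobian [x2 x1; 1 3].
f(x) = [x[1] * x[2], x[1] + 3 * x[2]]

# Finite difference: the perturbation h*v is added into the same floats as x,
# so it shares digits with the primal and picks up truncation error.
fd_jvp(f, x, v; h = 1e-6) = (f(x .+ h .* v) .- f(x)) ./ h

# Forward-mode AD: seed the direction v in the separate `der` field.
dual_jvp(f, x, v) = [y.der for y in f(Dual.(x, v))]

x = [2.0, 5.0]
v = [1.0, -1.0]

fd_jvp(f, x, v)    # approximately J*v, limited by the step size h
dual_jvp(f, x, v)  # J*v with no truncation error
```

Shrinking `h` reduces the truncation error of `fd_jvp` but amplifies rounding error, which is the higher error floor described above.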

simonbyrne · July 20, 2020, 3:32pm

3 posts were split to a new topic: Inverse functions using TaylorSeries.jl

niklasschmitz · July 22, 2020, 4:46pm

Another practical difference in FD vs Forward mode is that FD can treat a function completely as a black-box requiring only function evaluation, whereas Forward AD will need to decompose the function into primitives with known derivative rules. So with FD you might even differentiate through some exotic web API, or let’s say some Fortran solvers you’re calling etc, making it very robust from a compatibility perspective.

joshpritsker · August 3, 2020, 3:30am

I’m interested in knowing how trustworthy the results from Zygote/Flux and Yota currently are.

Zygote seems to already be used in Flux, but in its docs there is “At least, that’s the idea. We’re still in beta so expect some adventures”. I can’t really use it if there’s a chance that some bug is going to make my results completely off.

Yota doesn’t explicitly state anything similar, but I can’t really tell what sort of position it’s in.

@dfdx @MikeInnes

jling · August 3, 2020, 3:48am

if you’re depending on it for a rocket launch, don’t, cuz numerical stuff always goes wrong and it’s impossible to guarantee otherwise.

other than that, it’s pretty “accurate”, but of course edge cases exist, so don’t expect it to accurately handle everything in the mathematically possible universe.

dfdx · August 3, 2020, 10:25am

Jerry summarized it pretty well. Currently Yota passes all its 140 tests (+ a bunch of tests in Lilith.jl which further challenge Yota’s autodiff), but the universe of possible code paths is many orders of magnitude larger. There’s no bug-free software, only undertested software. The question is what level of trust is acceptable for your task. With Yota/Lilith, I haven’t encountered numerical errors in a while (there was an issue in a 3rd party library, but it was caught in tests and didn’t get to master). However, in the world of living software, with changing versions of the language and libraries and a variety of platforms and use cases, reliability can only be checked by practical use.

willtebbutt · August 3, 2020, 12:35pm

As a rule, if you’re at all concerned about the correctness of the gradients you’re getting from an AD tool, you should test it using finite differencing, e.g. with FiniteDifferences.jl.
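A minimal sketch of such a check in plain Julia, assuming a hand-rolled central difference (`fd_gradient` is a made-up helper, and `candidate_gradient` stands in for whatever your AD tool returns). FiniteDifferences.jl does this with more careful schemes, so this only shows the shape of the test.

```julia
# Central-difference gradient of a scalar function f at the vector x.
function fd_gradient(f, x::AbstractVector; h = 1e-6)
    g = similar(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += h   # step forward in coordinate i
        xm = copy(x); xm[i] -= h   # step backward in coordinate i
        g[i] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

# Function under test.
f(x) = sum(abs2, x) / 2 + sin(x[1])

# Stand-in for the gradient reported by an AD package (here written by hand).
candidate_gradient(x) = x .+ [cos(x[1]); zeros(length(x) - 1)]

x = [0.3, -1.2, 2.0]

# The check: the two gradients should agree up to finite-difference error.
isapprox(fd_gradient(f, x), candidate_gradient(x); atol = 1e-6)
```

The tolerance has to be loose enough to absorb the finite-difference error, so a check like this catches wrong gradients rather than tiny inaccuracies.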

Samuel_Ainsworth · August 3, 2020, 9:42pm

I think the real question here is one of soundness, not completeness: I understand that no AD system will be able to give me gradients on completely arbitrary code, but giving me a gradient that is incorrect instead of an error is a big no-no!

Finite differences is always a great way to double check gradients, though ideally the onus of testing should lie on the libraries themselves, not users.

ChrisRackauckas · August 3, 2020, 9:52pm

There’s a lot of safety in the current implementations. ForwardDiff’s tag implementation is especially good at blocking perturbation confusion. The only thing that I know of that’s a little wonky is Zygote’s nothing handling. It’s not incorrect in any known way other than turning absence of gradient definitions into zero gradients, which sometimes is weird and should error IMO

dfdx · August 3, 2020, 11:12pm

And usually authors of these libraries do extensive testing of all new gradients (excluding maybe trivial functions from textbooks like sin() or well-known gradients like matrix-matrix multiplication). However, there are always corner cases where it’s not so easy to figure out the right gradient or even behavior.

For example, consider the following function:

loss(x) = sum(x) / length(x)

There are 2 paths connecting x and loss - via sum() and length(). sum() is no problem - it has a well known derivative, but length() is not so unambiguous. Usually, we work in spaces with fixed number of dimensions, e.g. R^n, so length(x) == n is constant, and so the derivative should be zero (or not propagated? that’s another non-trivial question). On the other hand, Julia is not pure math, and in practice there might be a use case in which we must calculate length() derivative as well.

I haven’t seen such use cases, so in Yota I used the first approach - stop propagation through length. If, however, someone encounters such a scenario, they won’t get an exception. They won’t even get zero derivative, since loss still depends on x via sum(). The result will be just wrong. But, honestly, I don’t know how to prevent it.
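To spell out the two paths, here is a hand-written sketch of the "stop propagation through length" choice in plain Julia. `loss_gradient` is a hypothetical helper for illustration only and is not how Yota implements it.

```julia
loss(x) = sum(x) / length(x)

# Hand-written gradient that treats length(x) as a constant.
function loss_gradient(x::AbstractVector)
    s, n = sum(x), length(x)
    dloss_ds = 1 / n       # path through sum()
    dloss_dn = -s / n^2    # path through length(): computed here only to show
                           # it exists; it is deliberately not propagated
    # d sum(x) / d x[i] == 1 for every i, so each entry is just dloss_ds.
    return fill(dloss_ds, n)
end

x = [1.0, 2.0, 4.0]
loss_gradient(x)
```

For a fixed-length array this matches what finite differencing would report, since perturbing the entries of `x` never changes `length(x)`. The silent failure described above would only show up in code where the length itself is meant to vary.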

Other tricky use cases:

derivatives of iterate() and Base.indexed_iterate() w.r.t. the iterator state
getindex() and view() w.r.t. indices
convert()
gradients w.r.t. global variables, etc.

There are also mutable state, control flow, exception handling, tasks, multithreading and many other things that can, in some scenario, mess up the result in a way that you don’t even notice it. So the only way to improve robustness of an AD system is to put it in use as much as possible, but look at the results with just a little skepticism.

ctkelley · August 3, 2020, 11:26pm

I don’t think so. Finite differencing acts on the output, Auto Diff acts on the code.

E.g. any bozo, including me, can write a finite difference approximation to a derivative. It takes an expert to write a forward mode AD code.

Tamas_Papp · August 4, 2020, 6:29am

You may be surprised to hear this, but this is the case for most users — most of us prefer correct results.

All of the Julia AD packages mentioned above are free software, so this may not be a good way to phrase this. If you believe that you care about correctness more than other users, you should consider contributing to tests and/or reviewing code.

Generally, it is unclear what you are expecting from this discussion. Like all software, Julia’s AD libraries are not guaranteed to be bug-free, despite careful implementation and testing. Major errors are rare and are usually fixed quickly, but can nevertheless happen.
