AD pipeline and Hessian–vector products

Thank you for your question; I’ll do my best to answer point by point!
Note that @hill and I are also giving a talk at JuliaCon where we hope to cover some of this stuff.

No, they don’t:

  • ForwardDiff.jl has its own set of Dual number overloads
  • Enzyme.jl has its own set of rules
  • Tapir.jl has its own set of rules

There are also bridges between the various rule sets.

The main limitation with ChainRules.jl at the moment is the lack of support for mutation. If you want to fix that, the project is looking for a new maintainer! And by the way, thank you @oxinabox for all you’ve done; you’ve been an inspiration to me and many others :heart:

Disclaimer: I’m the lead developer of DifferentiationInterface.jl, shortened as DI from now on.
My hope is that DI will become the new standard, at least for single-argument functions x -> y that eat and spit out arrays or numbers (see the sketch just below). Within these limits, DI is more efficient, better tested, and has broader backend coverage than AbstractDifferentiation.jl.
It is still unclear what should happen with multiple-argument functions (xa, xb, xc) -> y. The original plan we hashed out with @mohamed82008 was for AbstractDifferentiation.jl to take over this aspect as a wrapper around DI, for the few backends that support it. However, several people have been asking for multi-arg functionality in DI itself, especially since the SciML ecosystem has started to use my package. Since I have a bit more dev time available than the team around AbstractDifferentiation.jl, perhaps we could re-discuss that arrangement?
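
For concreteness, here’s a minimal sketch of the single-argument workflow (the toy function is my own; the operator names come from DI’s docs, so double-check them against the release you’re using):

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff
import ForwardDiff  # the backend package itself must be loaded

f(x) = sum(abs2, x)          # single-argument function: array in, number out
backend = AutoForwardDiff()  # backend choice is just an ADTypes.jl struct

x = rand(10)
g = gradient(f, backend, x)               # just the gradient
y, g = value_and_gradient(f, backend, x)  # value and gradient in one call
```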

DI uses ADTypes.jl, which has become the standard in the SciML and Turing ecosystems. AbstractDifferentiation.jl still uses its own structs because it predates ADTypes.jl, but that’s rather easy to fix.
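
For reference, these backend objects are lightweight structs, all defined in ADTypes.jl (a small illustration, not an exhaustive list):

```julia
using ADTypes

AutoForwardDiff()   # forward mode via ForwardDiff.jl
AutoZygote()        # reverse mode via Zygote.jl
AutoEnzyme()        # Enzyme.jl
AutoFiniteDiff()    # finite differences via FiniteDiff.jl
```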

AutoDiffOperators.jl is a way to turn functional syntax into matrix multiplication syntax. It is still based on AbstractDifferentiation.jl, but its maintainer wants to switch to DI in the near future.

SparseDiffTools.jl automates the computation of sparse Jacobians and Hessians, but only for a few specific backends (mainly ForwardDiff.jl and FiniteDiff.jl). Our main project with @hill these past few months has been to design a better, generic sparse AD pipeline to replace it, and we’re nearly there (see the sketch after the list):

  • SparseConnectivityTracer.jl (Adrian’s baby) is a very lightweight sparsity pattern detector, much faster than Symbolics.jl in our benchmarks on the OptimizationProblems.jl suite.
  • SparseMatrixColorings.jl takes care of matrix coloring and decompression, which are needed to reduce the number of matrix-vector products in sparse AD. We have preliminary benchmarks against the C++ reference in the field, called ColPack (still WIP with @amontoison because the Julia interface ColPack.jl is a bit buggy). Those comparisons suggest that we’re nearly always faster than ColPack.jl, although we don’t offer quite as many options as the underlying C++ library.
  • DI includes a backend-agnostic mechanism for computing sparse Jacobians and Hessians, given a sparsity pattern and a coloring. I released the last necessary feature yesterday morning and haven’t yet tried to profile the code, so there is still plenty of room for improvement.
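
Here’s a rough sketch of how the three pieces fit together (treat the exact keyword arguments as an assumption and check the docs of each package):

```julia
using DifferentiationInterface
using ADTypes: AutoSparse, AutoForwardDiff
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings: GreedyColoringAlgorithm
import ForwardDiff

# Backend-agnostic sparse AD: wrap any dense backend in AutoSparse,
# plugging in a sparsity detector and a coloring algorithm.
sparse_backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)

f(x) = diff(x .^ 2)                  # toy function with a banded Jacobian
x = rand(100)
J = jacobian(f, sparse_backend, x)   # comes back as a sparse matrix
```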

Despite the fact that all of this was developed by only 2 people in the last 6 months, the first benchmarks on the famous Brusselator are quite promising. We get 100x speedup on sparsity detection and 10x speedup on coloring compared to the time-honored SparseDiffTools.jl pipeline. That explains why the JuliaSmoothOptimizers organization now uses SparseConnectivityTracer.jl instead of Symbolics.jl.
The only disappointing aspect is a 1.5x slowdown on the Jacobian evaluation itself, which is the crucial part. But let me stress this again: the whole code is backend-agnostic and was literally released this week, so give me a few more days :wink:

Yes, but I would include bigger stuff like Optimization.jl and DifferentialEquations.jl with the high-level tools. Indeed, @ChrisRackauckas and @Vaibhavdixit02 have been very supportive of DI from day 1, and they’re planning to include it in much of the SciML ecosystem within the coming months. If anyone wants to hack on this at JuliaCon, hit me up!

Of course, deep learning packages are another important category of high-level tools. Using DI in Lux.jl and especially Flux.jl is a bit trickier because they need to differentiate with respect to arbitrary structs (and they have varying notions of what a “gradient” means in this case, e.g. whether scalar fields should be included or not). Turing.jl is another worthy challenge, for which I still need to fix a few performance issues with DI’s handling of Enzyme.jl.

My friend, you’re in luck! One of the perks of DI is the ability to combine different backends to perform second-order AD. As you’ve pointed out, such functionality has to live outside of the individual AD packages, which is why I implemented it in DI.

Two different backends can be put together inside a SecondOrder struct, which you can then feed to the hvp operator. This does exactly what you want, and it adapts to the combination of backends you give it (although the best choice remains forward-over-reverse, as @stevengj pointed out). A minimal sketch follows the caveats below.
Two caveats:

  • Not every pair of backends will work, and I would advise sticking to well-known combinations like ForwardDiff.jl over Zygote.jl. However, I managed to get second-order Enzyme.jl working as well, using deferred autodiff.
  • Second-order AD will be less efficient than first-order AD. One of my main pain points is that closures make it harder to “prepare” (tape, cache, etc.) the inner differentiation part. See this issue if you have any ideas to unblock me.
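
Here is that sketch, using ForwardDiff.jl as the outer backend and Zygote.jl as the inner one (I’m assuming the argument order of SecondOrder and the hvp signature of the current DI release, so double-check against the docs):

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff, AutoZygote
import ForwardDiff, Zygote

f(x) = sum(abs2, x) / 2    # toy scalar function of an array

# Forward-over-reverse: outer backend first, inner backend second (assumed order)
backend = SecondOrder(AutoForwardDiff(), AutoZygote())

x = rand(5)
v = rand(5)
Hv = hvp(f, backend, x, v)  # Hessian-vector product, no Hessian materialized
```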

Do you mean several v’s at once, or in a sequence during an iterative linear solve? If you have several v’s at once, there’s a threshold beyond which it might be worth constructing the whole Hessian matrix, especially if it’s sparse.
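
To make that trade-off concrete, here is a rough sketch (same assumed signatures as above) comparing repeated HVPs with building the Hessian once:

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff, AutoZygote
import ForwardDiff, Zygote

f(x) = sum(abs2, x) / 2
backend = SecondOrder(AutoForwardDiff(), AutoZygote())
x = rand(5)
V = rand(5, 3)   # several v's at once, stored as columns

# Option 1: one HVP per column, never materializing the Hessian
HV_cols = [hvp(f, backend, x, V[:, j]) for j in axes(V, 2)]

# Option 2: build the Hessian once, then multiply
# (worth it past some number of columns, especially if H is sparse)
H = hessian(f, backend, x)
HV = H * V
```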


Sorry for the wall of text, and don’t hesitate to reach out via DM or on GitHub if you need help using DI. It’s all still rather experimental, so I’m not claiming performance will beat the existing code everywhere, especially because most existing code is customized for a specific AD backend. However, I do believe that:

  • If performance is not up to your standards, most of the time it can be solved with a better implementation. I’m not an expert in every single backend, so open issues or PRs and help me out! In my view, for single-argument functions, the current design of DI does not limit performance in any meaningful way.
  • In some cases, a little performance drop is acceptable to reduce the huge amount of duplication and maintenance troubles. At the moment, tons of packages have extensions handling Zygote.jl, ForwardDiff.jl, Enzyme.jl and friends, which contain essentially the same code. My goal with DI was to write such code only once, test and optimize the hell out of it, then let everyone use it. Time will tell if it worked!