I’ve been away from Julia for a bit and was wondering what the AD landscape looks like now. In the past Zygote was the way things were going. Then there was something about Enzyme being the new way. Where do things stand now?
I am also curious about Diffractor.jl. Is it still in development?
This thread is about a year old now, but Chris talks a little about “why Zygote is being replaced with Diffractor and Enzyme”. Probably more stuff in that thread that is of interest.
I recently saw Keno say he would put some time into Diffractor again, and seems like a few commits for starting stage 2 landed last week. So it certainly seems alive, though I think it has been slow progress for a while since some compiler stuff was needed for it to continue IIRC.
It’s going really well! Enzyme is looking to become the general AD IMO, given that it has such a wide surface of support. That said, whether Enzyme is right for you (or machine learning) is really a binary thing. Right now it doesn’t have support for all of the GC and dynamic dispatch. Part of this delay was because Valentin (one of the biggest contributors) was just off… adding native precompilation caching to Julia (https://github.com/JuliaLang/julia/pull/47184). So can’t be mad about that. But if your code hits GC or dynamic dispatch, it’s a coin toss as to whether that will work, as there are parts of that which are not quite supported yet. That basically means “there be dragons” for now, and I would only suggest using it for non-allocating, fully inferred code.
That being said, it’s the default that’s used inside of SciMLSensitivity.jl these days, it’s extremely fast, supports mutation, and is robust within the confines of those two caveats above. Its rules system is mostly worked out:
it’s just a question of making it less tedious in the context of activity analysis.
Enzyme core is growing in contributors. There’s an Enzyme conference coming up:
They received an award at SuperComputing 2022.
So with that kind of momentum, the contributor base growing (and at the LLVM level, shared with contributors from Rust), and a solid foundation that supports mutation from the get-go, it’s really on the right path to be a full language-wide AD system. It’s not quite there yet, but that shows it has the momentum as the new foundation.
In the meantime, using Zygote where you define adjoint rules on mutating operations to just call Enzyme isn’t a bad option.
The most fun is the new AD work:
StochasticAD.jl is based on a new form of automatic differentiation which extends it to discrete stochastic programs.
This allows things like agent-based models and particle filters to be differentiated with automatic differentiation.
Additionally, there’s a new ForwardDiff-like AD being developed for higher order AD:
It adds some vector-based rules that ForwardDiff doesn’t have as well, which makes it able to handle neural networks and linear algebra in a good way.
It’s still under some heavy development, but it’s avoiding the compiler parts that generally make AD more difficult, so it should be quicker for it to get up to speed.
That’s great to hear — basically, Enzyme is useless for me until it supports easily accessible user-defined derivative rules (custom vector–Jacobian products, vJp’s), as in my experience any sufficiently complicated/interesting problem requires at least some custom vJp rules.
However, does that mean that all existing code using ChainRules will need to be rewritten to use EnzymeRules?
Not exactly. There is a trivial extension of ChainRules to Enzyme rules, which is to ignore activity and assume all variables are active.
Let me describe a bit more detail on why Enzyme rules are a bit more interesting, why they’re harder, and why it will lead to performance improvements. Take a look at an Enzyme call I wrote yesterday (Segfault with constant variables? "Enzyme cannot deduce type"? · Issue #571 · EnzymeAD/Enzyme.jl · GitHub):
    ...
    function heat_eq!(dx,x,p,t)
        time = t/Tf;
        u = input_signal(time, p)
        diffusion_x!(dx,x,Nx,1,Δx)
        dx .= α * dx
        dx = dx + α/(λ * Δx) * u
    end

    Enzyme.autodiff(heat_eq!, Duplicated(dx, d_dx), Duplicated(x, d_x), Duplicated(p, d_p), Enzyme.Const(t));
This shows the Enzyme “activity states”. Duplicated means that dx is a variable which is to be differentiated, and its derivative will be written into d_dx. This allows Enzyme to be fully non-allocating when differentiating arrays. And note the mutation support. However, here I didn’t want to differentiate with respect to t, so I marked it as Enzyme.Const.
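To make the annotations concrete, here is a minimal sketch on a toy mutating function rather than the heat equation above. It assumes a recent Enzyme.jl release, where the mode (Reverse) is passed as the first argument; the call quoted above uses an older signature. The names scale_square!, y, dy, x and dx are made up for illustration.

```julia
using Enzyme

# Toy in-place function: writes t * x[i]^2 into y and returns nothing.
function scale_square!(y, x, t)
    for i in eachindex(x, y)
        y[i] = t * x[i]^2
    end
    return nothing
end

x  = [1.0, 2.0]
dx = zeros(2)   # shadow of x: the gradient is accumulated here
y  = zeros(2)
dy = ones(2)    # shadow of y: seed for the reverse pass
t  = 3.0

# Duplicated pairs each array with its shadow; Const marks t as not differentiated.
Enzyme.autodiff(Reverse, scale_square!, Const,
                Duplicated(y, dy), Duplicated(x, dx), Const(t))

# dx now holds the seed-weighted derivative, analytically 2 * t * x[i] per entry.
```

No gradient array is allocated by the call itself: the result lands in the preallocated dx, which is the point of Duplicated.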
Zygote/Diffractor work by differentiating all code and hoping dead code elimination is good enough to eliminate branches. ChainRules kind of supports something around activity states by using @thunk, but the AD systems right now ignore the thunks and expand them most of the time anyways, so it kind of doesn’t exist (at least in the code generation perspective). Enzyme is able to produce the code in a way that is specialized to the differentiation of only some components. And there are many different activity states:
Thus in order for your rule to be well-defined, you need to define it for all combinations of activity states. For example, a function f(x,y) can be called with activity combinations such as (Const, Duplicated), etc., and you want a rule for every combination. Doesn’t that lead to a combinatorial explosion of required rules?
Yes. 6^n overloads are thus required for a full definition with v1 (Add support for user-defined rules by vchuravy · Pull Request #177 · EnzymeAD/Enzyme.jl · GitHub).
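A quick way to see how fast 6^n grows is to enumerate the signatures. This is only a counting sketch: the labels S1 to S6 are placeholders, not the actual activity state names.

```julia
# Placeholder labels; the actual six activity states are not reproduced here.
states = (:S1, :S2, :S3, :S4, :S5, :S6)

# All activity signatures for a function of n arguments.
signatures(n) = collect(Iterators.product(ntuple(_ -> states, n)...))

for n in 1:4
    println("n = $n: ", length(signatures(n)), " combinations")
end
```

For one to four arguments that is 6, 36, 216 and 1296 combinations, so even a three-argument function is past the point where writing every overload by hand is realistic.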
But of course there are many different fallbacks you can do. You could set up a system, for example, where if you have a version that is a non-const array, you fall back to Duplicated, and if it’s a number or struct, you fall back to Active, so that the number of rules decreases. And then ChainRules defines the “always active” versions. This would then allow for ChainRules to be used to give default overloads, which could then be improved with additional overloads on specific activity states. ChainRulesCore could adopt the activity state types as well and then it would map over better (and then things like Diffractor would be able to use that information as well).
So tl;dr, ChainRules doesn’t give enough information to fully optimize, Enzyme is asking for too much, so what’s holding back the rules system is some kind of fallback mechanism so that you don’t need to define 700 dispatches for every rule. When such a fallback mechanism exists, then ChainRules should be supported, though sub-optimal.
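The fallback idea can be sketched with plain multiple dispatch. Everything here is hypothetical: the marker types IsConst, IsActive and IsDuplicated and the functions default_activity and rule are stand-ins for illustration, not Enzyme or ChainRules API.

```julia
# Hypothetical marker types, standing in for real activity annotations.
abstract type Activity end
struct IsConst      <: Activity end
struct IsActive     <: Activity end
struct IsDuplicated <: Activity end

# Fallback policy described above: arrays -> Duplicated, numbers and structs -> Active.
default_activity(::AbstractArray) = IsDuplicated()
default_activity(::Any)           = IsActive()

# Generic method: the "always active" rule, i.e. what a ChainRules-style rule could supply.
rule(f, acts::Activity...) = :generic_all_active

# A more specific overload can be added later for one activity combination.
rule(::typeof(sin), ::IsConst) = :specialized_const

println(rule(sin, default_activity(1.0)))  # no special case for an active number
println(rule(sin, IsConst()))              # hits the specialized overload
```

With this layout one generic method covers every combination, and only the combinations worth optimizing get their own dispatch.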
Right now it doesn’t have support for all of the GC and dynamic dispatch
Are there ways to ensure that GC doesn’t run during certain functions that you want to run through Enzyme AD?
Just don’t use the GC? If you don’t allocate and everything has inferred types with no dynamic dispatch then the GC won’t run. That means that the fastest code is already Enzyme compatible, so in some sense the answer is “git gud” and Enzyme will be happy.
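If you want to check a kernel before handing it to Enzyme, something like the following works as a rough test. The function axpy! and the helper count_allocs are illustrative names. Note that @inferred only checks that the return type is inferred; it does not prove the body is free of dynamic dispatch, for which @code_warntype is the usual tool.

```julia
using Test  # provides @inferred

# In-place update on concretely typed arguments.
function axpy!(y, a, x)
    for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# Measure inside a function so global-scope effects are not counted.
count_allocs(y, a, x) = @allocated axpy!(y, a, x)

x = rand(100)
y = zeros(100)
count_allocs(y, 2.0, x)             # warm-up call, includes compilation
println(count_allocs(y, 2.0, x))    # bytes allocated on the second call
@inferred axpy!(y, 2.0, x)          # throws if the return type is not inferred
```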
Enzyme is described as “experimental” on the JuliaDiff.org web page. Is that still the right adjective?
Until the extensibility issue is resolved, I would be cautious about overselling Enzyme — currently, if you encounter something it cannot differentiate (or should not differentiate, like an iterative solve), then you are stuck with little recourse other than switching to Zygote so that you can use ChainRules.
For chain rules which might not be necessary if mutation were supported, let’s say for map, would Enzyme benefit from using the rule (and if so, by how much?) or would it be more efficient to not use it?
Those things probably shouldn’t have rules. In general, @wsmoses can chime in, but I think the only cases which really need derivative overloads with Enzyme are cases where you have an analytical reason to prefer some alternative algorithm to differentiating the approximation. For example, for numerical accuracy (e.g. the derivative of exp(x): differentiating the code gives only an approximation, with numerical error), handling wrappers (LAPACK, SuiteSparse, etc.), or for some known performance things (like differentiating nonlinear solves or other iterative codes, as @stevengj alludes to). Things like maps, broadcasts, etc. don’t need to be handled (and probably shouldn’t be handled) at the rule level since the derivatives of iteration and mutation are really good.
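The exp(x) case can be illustrated with a ChainRulesCore rule, which is what Zygote would pick up. This is a sketch on a made-up function, taylor_exp, and it does not show Enzyme’s own rule interface. Differentiating the loop would return the series truncated one term earlier, while the rule returns the analytically exact relation d/dx exp(x) = exp(x).

```julia
using ChainRulesCore

# Truncated Taylor series for exp(x); purely illustrative, not how exp is implemented.
function taylor_exp(x; terms = 20)
    s, term = one(x), one(x)
    for k in 1:terms
        term *= x / k
        s += term
    end
    return s
end

# Custom reverse rule: reuse the primal value instead of differentiating the loop.
function ChainRulesCore.rrule(::typeof(taylor_exp), x; terms = 20)
    y = taylor_exp(x; terms)
    taylor_exp_pullback(ȳ) = (NoTangent(), ȳ * y)
    return y, taylor_exp_pullback
end

y, back = rrule(taylor_exp, 1.0)
_, x̄ = back(1.0)   # x̄ equals y, since the derivative of exp is exp
```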
Though as noted, far fewer things actually need derivative rules to do well with Enzyme. I’d go as far as to say it should really just be package author things. Of course it is a big piece that is missing though. Right now, the right way to do Enzyme for most people is to use Zygote at the user level and use Enzyme to easily define rules over optimized non-allocating functions with mutating operations. For example, SciMLSensitivity’s rule for solving a NonlinearProblem is a chain rule that does the implicit function theorem adjoint, where the user’s f(x) = 0 function is differentiated by Enzyme (when possible, with fallbacks to ReverseDiff and Zygote).
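For readers who haven’t seen the implicit function theorem trick, here is a scalar sketch in plain Julia. It is not SciMLSensitivity’s implementation: the residual g, the hand-written partials gx and gp, and the Newton loop are all made up for illustration. In the real setting the partials of the user’s function would come from an AD tool such as Enzyme.

```julia
# Residual g(x, p) = 0 defines x implicitly as a function of p.
g(x, p)  = x^3 + x - p
gx(x, p) = 3x^2 + 1    # ∂g/∂x, written by hand here
gp(x, p) = -1.0        # ∂g/∂p, written by hand here

# Plain Newton iteration; the derivative rule below never looks inside this loop.
function solve(p; x = 0.0, iters = 50)
    for _ in 1:iters
        x -= g(x, p) / gx(x, p)
    end
    return x
end

# Implicit function theorem: dx/dp = -(∂g/∂x)^(-1) * ∂g/∂p at the solution.
function dsolve_dp(p)
    x = solve(p)
    return -gp(x, p) / gx(x, p)
end

println(solve(2.0))      # root of x^3 + x = 2
println(dsolve_dp(2.0))  # sensitivity of that root with respect to p
```

The derivative costs one extra linear solve at the converged point, regardless of how many iterations the solver took, which is why differentiating through the iteration itself is worth avoiding.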
But yes, with the three noted missing pieces (GC, dynamic dispatch, rules), it’s right now still a tool for more advanced users that is on the right trajectory to be the main AD system.