I’ve been away from Julia for a bit and was wondering what the AD landscape looks like now. In the past Zygote was the way things were going. Then there was something about Enzyme being the new way. Where do things stand now?
I am also curious about Diffractor.jl. Is it still in development?
This thread is about a year old now, but Chris talks a little about “why Zygote is being replaced with Diffractor and Enzyme”. Probably more stuff in that thread that is of interest.
I recently saw Keno say he would put some time into Diffractor again, and seems like a few commits for starting stage 2 landed last week. So it certainly seems alive, though I think it has been slow progress for a while since some compiler stuff was needed for it to continue IIRC.
It’s going really well! Enzyme is looking to become the general AD, IMO, given that it has such a wide surface of support. That said, whether Enzyme is right for you (or for machine learning) right now basically comes down to one thing: GC and dynamic dispatch. Right now it doesn’t have full support for GC and dynamic dispatch. Part of this delay was because Valentin (one of the biggest contributors) was just off… adding native precompilation caching to Julia (https://github.com/JuliaLang/julia/pull/47184). So can’t be mad about that. But if your code hits GC or dynamic dispatch, there’s no guarantee it will work, since parts of that are not yet supported, which basically means "there be dragons" for now, and I would only suggest using it for non-allocating, fully inferred code.
That being said, it’s the default used inside SciMLSensitivity.jl these days, it’s extremely fast, it supports mutation, and it’s robust within the confines of the two caveats above. Its rules system is mostly worked out; it’s just a question of making it less tedious in the context of activity analysis.
Enzyme core is growing in contributors. There’s an Enzyme conference coming up:
They received an award at SuperComputing 2022.
https://www.csail.mit.edu/news/mit-csail-phd-students-receive-best-student-paper-supercomputing-2022
So with that kind of momentum, a growing contributor base (shared at the LLVM level with contributors from Rust), and a solid foundation that supports mutation from the get-go, it’s really on the right path to becoming a full language-wide AD system. It’s not quite there yet, but it has the momentum to become the new foundation.
In the meantime, using Zygote where you define adjoint rules on mutating operations to just call Enzyme isn’t a bad option.
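For concreteness, here is a rough sketch of that pattern, assuming a hypothetical mutating kernel mykernel! and the standard ChainRulesCore/Enzyme APIs (this is an illustration, not code from the thread): the user-facing code stays Zygote-friendly, while the rule hands the reverse pass to Enzyme.

using ChainRulesCore, Enzyme

# Hypothetical in-place kernel we want Zygote-driven code to differentiate through.
function mykernel!(y, x)
    @. y = sin(x) * x
    return nothing
end

# Non-mutating wrapper that Zygote-facing code calls.
mykernel(x) = (y = similar(x); mykernel!(y, x); y)

# Custom rule: the reverse pass is delegated to Enzyme, which handles the mutation.
function ChainRulesCore.rrule(::typeof(mykernel), x)
    y = mykernel(x)
    function mykernel_pullback(ȳ)
        x̄ = zero(x)
        Enzyme.autodiff(Reverse, mykernel!,
                        Duplicated(similar(x), collect(ȳ)),  # shadow of the output, seeded with ȳ
                        Duplicated(copy(x), x̄))              # x̄ accumulates the gradient
        return NoTangent(), x̄
    end
    return y, mykernel_pullback
end

Zygote then calls mykernel_pullback wherever mykernel appears in differentiated code, so the mutating implementation only ever goes through Enzyme.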
That said…
The most fun is the new AD work:
StochasticAD.jl is based on a new form of automatic differentiation which extends it to discrete stochastic programs.
https://arxiv.org/abs/2210.08572
This allows things like agent-based models and particle filters to be differentiated with automatic differentiation.
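As a flavor of what that means in practice, here is a minimal sketch in the style of the package README, assuming its derivative_estimate entry point (the toy program is illustrative): the output of the program is a discrete random sample, yet we still get an unbiased estimate of the derivative of its expectation.

using StochasticAD, Distributions, Statistics

# A discrete stochastic program: an integer-valued random sample that depends on p.
f(p) = rand(Binomial(10, p))

samples = [derivative_estimate(f, 0.5) for _ in 1:10_000]
mean(samples)  # ≈ 10, since E[f(p)] = 10p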
Additionally, there’s a new ForwardDiff-like AD being developed for higher order AD:
It also adds some vector-based rules that ForwardDiff doesn’t have, which lets it handle neural networks and linear algebra well.
It’s still under heavy development, but it avoids the compiler parts that generally make AD more difficult, so it should get up to speed relatively quickly.
That’s great to hear — basically, Enzyme is useless for me until it supports easily accessible user-defined derivative rules (custom vector–Jacobian products, vJp’s), as in my experience any sufficiently complicated/interesting problem requires at least some custom vJp rules.
However, does that mean that all existing code using ChainRules will need to be rewritten to use EnzymeRules?
Not exactly. There is a trivial extension of ChainRules to Enzyme rules, which is to ignore activity and assume all variables are active.
Let me describe in a bit more detail why Enzyme rules are more interesting, why they’re harder, and why they will lead to performance improvements. Take a look at an Enzyme call I wrote yesterday (Segfault with constant variables? "Enzyme cannot deduce type"? · Issue #571 · EnzymeAD/Enzyme.jl · GitHub):
...
function heat_eq!(dx, x, p, t)
    time = t / Tf
    u = input_signal(time, p)
    diffusion_x!(dx, x, Nx, 1, Δx)
    dx .= α * dx
    dx[1] = dx[1] + α / (λ * Δx) * u
end

Enzyme.autodiff(heat_eq!, Duplicated(dx, d_dx), Duplicated(x, d_x),
                Duplicated(p, d_p), Enzyme.Const(t));
This shows the Enzyme "activity states". Duplicated means that dx is a variable which is to be differentiated, and its derivative will be written into d_dx. This allows Enzyme to be fully non-allocating when differentiating arrays. And note the mutation support. However, here I didn’t want to differentiate with respect to t, so I marked it Enzyme.Const(t).

Zygote/Diffractor work by differentiating all code and hoping dead code elimination is good enough to eliminate branches. ChainRules sort of supports something like activity states via @thunk, but the AD systems right now ignore the thunks and expand them most of the time anyway, so from a code-generation perspective it effectively doesn’t exist. Enzyme is able to produce code that is specialized to differentiating only some of the components. And there are many different activity states:
https://enzyme.mit.edu/julia/api/#Types-and-constants
EnzymeCore.Active
EnzymeCore.BatchDuplicated
EnzymeCore.BatchDuplicatedNoNeed
EnzymeCore.Const
EnzymeCore.Duplicated
EnzymeCore.DuplicatedNoNeed
Thus in order for your rule to be well-defined, you need to define it for all combinations of activity states. For example, a function f(x, y) can have (Duplicated, Const), (Const, Duplicated), etc., and you want a rule for every combination.

Doesn’t that lead to a combinatorial explosion of required rules?
Yes. 6^n overloads are thus required for a full definition with v1 (Add support for user-defined rules by vchuravy · Pull Request #177 · EnzymeAD/Enzyme.jl · GitHub).
But of course there are many different fallbacks you can define. You could set up a system, for example, where a non-const array argument falls back to Duplicated and a number or struct falls back to Active, and then the number of required rules decreases. ChainRules then defines the "always active" versions. This would allow ChainRules to be used for default overloads, which could then be improved with additional overloads on specific activity states. ChainRulesCore could adopt the activity state types as well, and then it would map over better (and then things like Diffractor would be able to use that information too).
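To make the fallback idea concrete, here is a toy illustration of the dispatch pattern being described (not the actual EnzymeRules API): specialized methods for the activity combinations you care about, plus a catch-all on the abstract annotation type from EnzymeCore.

using EnzymeCore

# Specialized rules for the activity combinations that matter for performance:
myrule(x::Duplicated, y::Const)      = "specialized: only x is differentiated"
myrule(x::Const,      y::Duplicated) = "specialized: only y is differentiated"

# One catch-all covering the remaining 6^2 - 2 combinations, ChainRules-style
# ("assume everything is active"):
myrule(x::EnzymeCore.Annotation, y::EnzymeCore.Annotation) =
    "fallback: treat all arguments as active"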
So tl;dr: ChainRules doesn’t give enough information to fully optimize, Enzyme is asking for too much, and what’s holding back the rules system is the lack of a fallback mechanism so that you don’t need to define 700 dispatches for every rule. Once such a fallback mechanism exists, ChainRules definitions should be supported, though sub-optimally.
"Right now it doesn’t have full support for GC and dynamic dispatch"
Are there ways to ensure that GC doesn’t run during certain functions that you want to run through Enzyme AD?
Just don’t use the GC? If you don’t allocate and everything has inferred types with no dynamic dispatch then the GC won’t run. That means that the fastest code is already Enzyme compatible, so in some sense the answer is “git gud” and Enzyme will be happy.
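As a tiny illustration of what "already fast" code looks like from Enzyme’s perspective (the function and values here are made up for this example): a type-stable, non-allocating kernel never triggers GC, so it’s exactly the kind of code Enzyme handles today.

using Enzyme

# Type-stable, non-allocating kernel: no GC, no dynamic dispatch.
function sqnorm!(out, x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]^2
    end
    out[1] = s
    return nothing
end

x, dx = [1.0, 2.0, 3.0], zeros(3)
out, dout = [0.0], [1.0]
Enzyme.autodiff(Reverse, sqnorm!, Duplicated(out, dout), Duplicated(x, dx))
dx  # ≈ 2 .* x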
Enzyme is described as “experimental” on the JuliaDiff.org web page. Is that still the right adjective?
Until the extensibility issue is resolved, I would be cautious about overselling Enzyme — currently, if you encounter something it cannot differentiate (or should not differentiate, like an iterative solve), then you are stuck with little recourse other than switching to Zygote so that you can use ChainRules.
For chain rules which might not be necessary if mutation were supported, let’s say for map, would Enzyme benefit from using the rule (and if so, by how much?), or would it be more efficient not to use it?
Those things probably shouldn’t have rules. In general, @wsmoses can chime in, but I think the only cases which really need derivative overloads with Enzyme are cases where you have an analytical reason to prefer some alternative algorithm to differentiating the approximation. For example, for numerical accuracy (e.g. the derivative of exp(x) is exp(x); differentiating the code would only be approximate, with numerical error), for handling wrappers (LAPACK, SuiteSparse, etc.), or for some known performance cases (like differentiating nonlinear solves or other iterative codes, as @stevengj alludes to). Things like maps, broadcasts, etc. don’t need to be handled (and probably shouldn’t be handled) at the rule level, since the derivatives of iteration and mutation are already really good.
Though as noted, a lot fewer things actually need derivative rules to do well with Enzyme. I’d go as far as to say it should really just be package-author things. Of course it is a big piece that is missing though. Right now, the right way to use Enzyme for most people is to use Zygote at the user level and use Enzyme to easily define rules over optimized, non-allocating functions with mutating operations. For example, SciMLSensitivity’s rule for solving a NonlinearProblem adds a chain rule that does the implicit function theorem adjoint, where the user’s f(x) = 0 function is differentiated by Enzyme (when possible, with fallbacks to ReverseDiff and Zygote).
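For reference, the math behind that implicit-function-theorem adjoint is small enough to sketch. This is a generic illustration using ForwardDiff for the Jacobians, not the SciMLSensitivity implementation:

using ForwardDiff, LinearAlgebra

# For x(p) defined implicitly by f(x, p) = 0, the vector-Jacobian product is
#   p̄ = -(∂f/∂p)' * λ   where   (∂f/∂x)' * λ = x̄.
function nonlinear_solve_vjp(f, x, p, x̄)
    Jx = ForwardDiff.jacobian(x -> f(x, p), x)  # ∂f/∂x at the solution
    Jp = ForwardDiff.jacobian(p -> f(x, p), p)  # ∂f/∂p at the solution
    λ = Jx' \ x̄
    return -(Jp' * λ)
end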
But yes, with the three noted missing pieces (GC, dynamic dispatch, rules), it’s right now still a tool for more advanced users that is on the right trajectory to be the main AD system.
Even worse, you can get silent data corruption in Enzyme (cf this issue).
The paradigm Enzyme is based on seems to be very promising, and when it works, it is blazing fast, but I routinely check results against ForwardDiff before using it. Particularly in MCMC (HMC/NUTS), invalid derivatives are usually only chased down after a long debugging session, because they just mimic valid but tricky models with wildly varying curvatures.
IMO, the state of AD in Julia as of May 2023 is the following:

- ForwardDiff is the most mature and reliable. For \mathbb{R}^n \to \mathbb{R}^m functions, it is a reasonable first choice, and even with m = 1 it can be surprisingly competitive. It is pretty much the only fire-and-forget AD solution at this point.

- ReverseDiff is similarly mature, but is only fast when the tape is reused. That becomes tricky for complex code with branches (and branches can be hidden just about everywhere if you call nontrivial functions from another package), so correctness should be tested thoroughly.

- Enzyme is great when it works. And when it does not, a lot of the time the code can be rewritten (it is worth it, Enzyme is very fast). Again, check your derivatives before using it in production.

- Zygote is worth checking out for non-mutating code that operates on arrays.
There are a lot of other experimental packages out there which I am not addressing here; this is from a user perspective.
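In code, the cross-checking habit mentioned above is cheap to adopt. A minimal sketch, where the function and tolerance are just placeholders:

using Enzyme, ForwardDiff

g(x) = sum(abs2, x) + prod(x)

x = randn(5)
dx = zero(x)
Enzyme.autodiff(Reverse, g, Active, Duplicated(x, dx))  # reverse-mode gradient written into dx

@assert isapprox(dx, ForwardDiff.gradient(g, x); rtol = 1e-10)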
Note Enzyme has seen some pretty major improvements since January.
Enzyme v0.11 added GC support, dynamic dispatch handling, a rules system, and linear algebra support. The linear algebra is currently done via fallbacks that differentiate the kernels directly, which would be better handled with a high-level rule that keeps using BLAS, but it at least works.
As an example, here’s a fairly dynamic code that works fine:
using Enzyme

A = Any[2.0 3.0
        2.0 4.0]

function f(x::Array{Float64}, y::Array{Float64})
    y[1] = (A * x)[1]
    return nothing
end

x = [2.0, 2.0]
bx = [0.0, 0.0]
y = [0.0]
by = [1.0];

Enzyme.autodiff(Reverse, f, Duplicated(x, bx), Duplicated(y, by)); # Works fine!
Forward and reverse mode both work. And now on main (unreleased), here’s an example showing it working with a globally defined Lux neural network (values compared against Zygote):
using Enzyme

x = [2.0, 2.0]
bx = [0.0, 0.0]
y = [0.0, 0.0]

using ComponentArrays, Lux, Random

rng = Random.default_rng()
Random.seed!(rng, 100)
dudt2 = Lux.Chain(x -> x.^3,
                  Lux.Dense(2, 50, tanh),
                  Lux.Dense(50, 2))
p, st = Lux.setup(rng, dudt2)

function f(x::Array{Float64}, y::Array{Float64})
    y .= dudt2(x, p, st)[1]
    return nothing
end

Enzyme.autodiff(Reverse, f, Duplicated(x, bx), Duplicated(y, ones(2)))

function f2(x::Array{Float64})
    dudt2(x, p, st)[1]
end

using Zygote
bx2 = Zygote.pullback(f2, x)[2](ones(2))[1]

bx
@show bx - bx2
#=
2-element Vector{Float64}:
 -9.992007221626409e-16
 -1.7763568394002505e-15
=#
It’s of course not perfect yet, and since the rules system has just landed it needs people to start writing rules (especially for things like NNlib kernels for full Flux support), but IMO it has passed many of its major usability milestones and now needs the community to start helping it get the required rules.
And the forward mode is very robust from what I can tell. I haven’t run into any issues with it, other than that it’s not clear how to do the equivalent of PreallocationTools.jl.
Lots of great progress, but to be clear, Tamas’ example still corrupts memory. I think what he said about always testing the results against another system is right, though I’d prefer to use finite differences for such validation.
Like I mentioned on that GitHub issue, if you turn on runtime activity with Enzyme.API.runtimeActivity!(true), those primal data corruptions should not occur.
I intend to fix these, but in the interim I’ve been chasing other features (like GC and type-instability support, and also defending my thesis on Monday).
@Tamas_Papp, perhaps just turn that flag on by default?
I’ll also caution that there are still some type-unstable and GC-specific calls Enzyme doesn’t yet handle, but most of the common ones should now be covered.
Yeah, not perfect, and one should still double-check their results, but this is a massive step forward, and it seems "most" codes I throw at it work these days (on the unreleased main). I’m looking for more bugs, but it’s doing a lot better than before.
Apropos this, here is something about “quest issues” that @jar1 (not sure the ids are for the same user) just dropped on slack for marshalling contributors. At least in some cases it was very successful.
I should add that the use case was very similar: these are exactly the steps you need to take to improve this error message, even if you have never contributed to OSS. Of course, for AD rules the experience needed to get started is a bit higher.
I really would like to second the remark on ForwardDiff, which seems to be the least "cool" of the bunch. Kudos to the developers! We use it quite heavily and only once ran into this issue, which we could resolve by just replacing our workaround for log(1+x) with log1p.