Yeah, the transition to 1.11 has been a big undertaking for Enzyme due to the introduction of `Memory` and the rebasing of `Array` on top of it. You should definitely submit a GitHub issue with this example.
But for the love of god, don't submit any example involving DI. Craft a pure-Enzyme MWE.
Yes, I did that twice I think; I will avoid it, I promise.
Confirming that you get a runtime activity error on 1.11, and the workarounds are to either make `weights_ctx` Duplicated as above, or change the mode to `Enzyme.set_runtime_activity(Enzyme.Reverse)`. The latter seems to be a hair faster.
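In call form, the two workarounds look roughly like this (a sketch with a placeholder loss, since the actual `weights_ctx` setup lives in the earlier posts):

```julia
using Enzyme

# placeholder loss; stands in for the real function from this thread
f(w, weights_ctx) = sum(abs2, w .* weights_ctx)

w = rand(10);           dw   = Enzyme.make_zero(w)
weights_ctx = rand(10); dctx = Enzyme.make_zero(weights_ctx)

# Workaround 1: pass weights_ctx as Duplicated so Enzyme has a shadow to write into
Enzyme.autodiff(Enzyme.Reverse, f,
                Enzyme.Duplicated(w, dw), Enzyme.Duplicated(weights_ctx, dctx))

fill!(dw, 0)  # Enzyme accumulates into dw, so reset it before the second call

# Workaround 2: keep weights_ctx Const but enable runtime activity on the mode
Enzyme.autodiff(Enzyme.set_runtime_activity(Enzyme.Reverse), f,
                Enzyme.Duplicated(w, dw), Enzyme.Const(weights_ctx))
```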
I made the MWE:
```julia
using Enzyme

function foo(x, y)
    y2 = reshape(y, 2, 5)
    return sum(x .+ y2[:])
end

x = rand(10)
y = rand(10)
dx = Enzyme.make_zero(x)
Enzyme.autodiff(Enzyme.Reverse, foo, Enzyme.Duplicated(x, dx), Enzyme.Const(y))
dx
```
I'll file the issue. It seems like it's not only about `reshape`; `y2 = y.^2` does the same. Any idea for the name of this issue? Just "issue on 1.11"? Maybe "Enzyme on 1.11 needs more Cache than on 1.10"?
Looks like it's already reported here, at least for `reshape`: error when differentiating `reshape` · Issue #2214 · EnzymeAD/Enzyme.jl · GitHub
The `y.^2` issue hits a different codepath (broadcasting machinery, not `GenericMemory`), so I suppose it might warrant a separate issue.
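For reference, the broadcast variant mentioned above would be essentially the same MWE with the `reshape` swapped out for `y .^ 2` (untested sketch, same call pattern):

```julia
using Enzyme

function foo_bc(x, y)
    y2 = y .^ 2               # broadcast instead of reshape
    return sum(x .+ y2)
end

x = rand(10); dx = Enzyme.make_zero(x)
y = rand(10)
Enzyme.autodiff(Enzyme.Reverse, foo_bc, Enzyme.Duplicated(x, dx), Enzyme.Const(y))
```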
OK, it's the one I used anyway.
I think this shows how good JAX is and how Julia isn’t there yet, unfortunately. Instead of time-to-first-plot I’m now struggling with time-to-first-gradient (TTFG). That’s in a language specifically geared towards data science, ML and statistics, where gradients are first-class citizens.
The smallest TTFG I got (`Zygote.gradient` is 54000 TIMES slower than `jax.gradient` - #29 by ForceBru) was 4 seconds, whereas the equivalent JAX time (first JITted gradient with `jax.block_until_ready`, so compilation time is included, like in the Julia version) is 10 times less, at 0.44 seconds.
Julia times after compilation are mostly good: Mooncake consistently delivers 4 ms, while in JAX I get around 2.89 ms (mean of 1000 runs using `timeit.timeit`).
In JAX, I wrote the first code that came to mind and it was fast straight away. In Julia, I encountered a massive (literally 54 thousand times slower!) performance hit and had to seek help from autodiff gurus who can of course optimize the heck out of everything.
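(To be concrete about what I'm calling TTFG: the wall time of the very first gradient call, compilation included, versus a warm call. Roughly this, with a toy loss standing in for the real one:)

```julia
using Zygote

# toy loss; stands in for the actual model loss discussed in this thread
loss(p) = sum(abs2, p)
p = rand(1_000)

ttfg = @elapsed Zygote.gradient(loss, p)   # first call: compilation included
warm = @elapsed Zygote.gradient(loss, p)   # second call: steady-state time
println("TTFG = $ttfg s, warm = $warm s")
```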
Enzyme got 0.9 ms at the second gradient, so I think we're definitely there. For TTFG you're right, though; but who cares, you only do that once and wait 1 s instead of 0.4 s.
This is the age-old debate on compilation time. When you compute a million gradients, it doesn’t make a difference whether the first one is fast or just okay. But of course it being unreasonably slow is not great for user experience.
If you wanna add any info.
I care, because irl I have 1 million 128-dimensional parameters and half a million data points (compared to half a million 2D params here and 100 datapoints). With my original code, I waited more than an hour for the first gradient to compute, rage quit, spent hours factoring out the part responsible for computing the gradient, debugging it, writing the MWE here, writing the JAX MWE (okay, that took about a minute), trying out various things with the Julia code etc.
Note that I’m doing all this instead of doing my actual job (estimating the model). Sure, getting help from the experts here is great and very valuable, so I’m not wasting my time here, but I’d also very much like to just get the job done without debugging TTFG issues.
Now I get the good timings with the 2D parameters, but my next task is to see how well it’ll work with the 128-D params. JAX runs in 181 ms averaged across 1000 runs and automatically uses multithreading.
I don't know if compilation depends on the size of the input; it seems weird, since you should have the same overhead with small and big arrays, so I guess this used the first Zygote example, which indeed would lead to big differences. About Julia being really hard to debug, you're right; that's what DI tries so hard to make easier (still a lot easier than C++). You should definitely use Reactant.jl then if you want parallel code without doing anything kernel-wise.
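A minimal sketch of the Reactant.jl pattern, from memory (double-check the Reactant docs for the current API; the toy function is just illustrative):

```julia
using Reactant

# toy function; Reactant traces it and compiles it through XLA,
# which takes care of fusion and parallel execution for you
f(x) = sum(abs2, sin.(x) .+ cos.(x))

x = Reactant.to_rarray(rand(10_000))
f_compiled = @compile f(x)   # compile once: this is where the JIT cost is paid
f_compiled(x)                # fast afterwards
```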
Weird, yes, but that's what I got. I may make a table to show how TTFG depends on the number of dimensions. Currently `loss_mcabbott_mean` doesn't seem to depend on it, either for Zygote or for Mooncake.
The only case where TTFG would depend on that is if types are changing, since there will be a lot more numbers that change types (fully guessing, don't hesitate to correct me if I'm wrong). If you do make a table, please make it (size | TTFG-TTG), so we can see this.
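A rough way to produce that table, reading it as one column for TTFG and one for the warm time-to-gradient, with a toy loss in place of `loss_mcabbott_mean`. Note that each size needs a fresh Julia session, otherwise only the first size pays the compilation cost (the element type never changes):

```julia
# run as e.g.:  julia ttfg_row.jl 1000   (hypothetical script name)
using Zygote

loss(p) = sum(abs2, p)        # toy stand-in for loss_mcabbott_mean

n = parse(Int, ARGS[1])
p = rand(n)
ttfg = @elapsed Zygote.gradient(loss, p)   # first gradient: compilation included
ttg  = @elapsed Zygote.gradient(loss, p)   # second gradient: warm timing
println("$n | $ttfg | $ttg")
```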
There are a few related things here to pick apart.
Firstly, `jax.jit` seems to have a much lower fixed compilation overhead than most Julia ADs. However, I've heard enough reports of people favouring Julia ADs despite the long TTFG because it was faster than JAX for their large models. Either way, TTFG is a big problem and we need better solutions for it.
Secondly, there's a bit of a technical mixed with a philosophical problem here. If you asked someone to write the original loss function in pure Julia, they probably would've written something like what @yolhan_mannes did: minimal allocation and very loopy. Not surprisingly, this does well with certain ADs. If you asked someone to write the same function like they were a Python programmer, they probably would've written something like @mcabbott's best-performing examples: fully vectorized operations, minimal looping and exploiting vectorized operator fusion where possible. That does well with other ADs. What does not work well is doing something in between: for example, looping over slices of an input array and applying vectorized (i.e. expensive and allocating) operations to each.
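As a rough illustration of the three styles on a generic toy loss (not the actual function from this thread):

```julia
# "Julian" style: scalar loops, no temporaries
function loss_loopy(W, xs)                 # xs has one data point per column
    s = 0.0
    for j in axes(xs, 2), i in axes(W, 1)
        acc = 0.0
        for k in axes(W, 2)
            acc += W[i, k] * xs[k, j]
        end
        s += acc^2
    end
    return s
end

# "Pythonic" style: one fully vectorized expression
loss_vectorized(W, xs) = sum(abs2, W * xs)

# The in-between anti-pattern: loop over slices, allocating a temporary per slice
loss_inbetween(W, xs) = sum(sum(abs2, W * x) for x in eachcol(xs))
```

All three compute the same number; they just stress the ADs very differently.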
So why do we see this kind of code again and again in the real world? One factor is that most people who are familiar with AD in Python and Julia are generally more comfortable with the former than the latter. This means that code examples are more likely to be in the “in between” style which is a worst case for Julia ADs. This IMO is an education and documentation problem: we need to direct people towards either writing more Pythonic vectorized code or more Julian scalarized code depending on their use case.
But the other side is that the Python AD libraries provide a better "pit of success" for users trying to write idiomatic code. For example, the JAX example in the OP uses `vmap`, while @mcabbott's examples had to do some/all of that vectorization by hand. Granted, I think there's still a cultural/familiarity aspect, in that a generator comprehension over array slices would be immediately flagged as a performance problem by any proficient JAX/PyTorch/NumPy user, but this challenge of making the fast path the obvious one has been an evergreen one ever since I started following the Julia AD ecosystem.
I don’t know your background, but it’s worth keeping in mind that “the first code that came to mind” may look very different depending on your prior experience. If you’ve used JAX a lot you’re likely to gravitate towards idioms that work well with JAX, and the same is true for Julia. Hence, what looks like “optimizing the heck out of everything” to one user may just be “the first code that came to mind” to another.
That said, your perspective is appreciated. Hours of compilation time is certainly not a good look for Julia.
(Then there's the standard list of excuses for Julia which you may or may not find compelling: Julia AD packages are solving a harder problem, trying to differentiate the entire language, while Python-based frameworks like JAX constrain you to their limited DSL; Julia AD packages are developed by solo researchers while JAX is developed by Google; et cetera. I suppose they may afford Julia some sympathy points, but what does that matter if you were able to solve your problem in JAX and not in Julia?)
I agree; the ability to do bad things in Julia is really high, and there will be a moment when it won't be possible to make the language better at catching those. Then we will have to choose between "let people do bad things and tell them why as much as possible" and "make Julia 2.0 a lot less permissive and only allow well-written code" (no idea what that would mean, btw). We say Julia is really close to Python, but someone coming from Fortran will write 1000x faster code than someone coming from Python/MATLAB.
Why didn’t you include the Enzyme result?
I can't get it to work. I keep getting errors about its inability to prove that the function doesn't modify its arguments.