Zygote Performance (Again...)

On simple reduction tasks Zygote seems to perform exceptionally poorly, and I would be grateful to hear why and what alternative patterns I should use to avoid these problems? I’ve started implementing custom adjoints via rrule but if I do this for every little piece of code then I might as well implement all gradients myself and forget about Zygote.
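For reference, a hand-written adjoint for this particular f is only a few lines — a minimal sketch using ChainRulesCore, assuming the analytic gradient 2x of the sum of squares:

```julia
using ChainRulesCore

f(x) = mapreduce(xi -> xi^2, +, x)

# Hypothetical hand-written rule for this specific f: return the
# primal value plus a pullback that maps the output cotangent ȳ
# to the input cotangent 2 .* ȳ .* x (since ∂f/∂x = 2x).
function ChainRulesCore.rrule(::typeof(f), x)
    y = f(x)
    f_pullback(ȳ) = (NoTangent(), 2 .* ȳ .* x)
    return y, f_pullback
end
```

With such a rule in place, Zygote.gradient(f, x) should dispatch to the hand-written pullback instead of differentiating through mapreduce — but writing one of these per inner function is exactly the burden described above.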

julia> using BenchmarkTools, Zygote
       f(x) = mapreduce(xi -> xi^2, +, x)
       x = rand(100)
       f(x)
       Zygote.gradient(f, x)
       @btime f($x)
       @btime Zygote.gradient($f, $x)
  11.128 ns (0 allocations: 0 bytes)
  480.792 μs (3369 allocations: 191.02 KiB)

We can do a little better like this:

julia> using BenchmarkTools, Zygote
       _sq(xi) = xi^2
       f(x) = sum(_sq, x)
       x = rand(100)
       f(x)
       Zygote.gradient(f, x)
       @btime f($x)
       @btime Zygote.gradient($f, $x)
  14.658 ns (0 allocations: 0 bytes)
  18.477 μs (392 allocations: 12.30 KiB)

but why!?!

And even better like this

julia> using BenchmarkTools, Zygote
       f(x) = sum(x.^2)
       x = rand(100)
       f(x)
       Zygote.gradient(f, x)
       @btime f($x)
       @btime Zygote.gradient($f, $x)
  117.833 ns (1 allocation: 896 bytes)
  239.328 ns (3 allocations: 1.77 KiB)

But even that third example doesn’t come anywhere near the 20-40ns that I would expect here.


I’m a little confused; it seems like your second example is both the fastest and under 20 ns?

The Zygote one is in microseconds, not nano.

It is 18 μs, not ns.

Are you saying that Zygote is to be used in the ms and not ns regime? That’s Ok!

But what if my code is sufficiently complex that it evaluates the model in the ms-to-s regime, but goes many layers deep into inner functions that evaluate in the 100 ns regime? That is where I currently lose a factor of 1000 and more in performance.

I’m working a lot with custom adjoints. Here, I really just want to understand what is going on.

My mistake, read it multiple times and somehow still missed that. Blame it on the early morning…

Back to the topic at hand, the long and short of it seems to be that there aren’t optimized adjoints in place for mapreduce-like functions (including sum(f, ...)) yet. This becomes pretty clear from looking at the currently defined adjoints. Higher-order functions in general are not well optimized for AD right now, so it’s no surprise that they perform terribly. You can verify this for yourself by pulling up GitHub - JuliaDebug/Cthulhu.jl: The slow descent into madness and stepping into the pullback.
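One way to step into the pullback is to split Zygote.gradient into its forward and reverse halves via Zygote.pullback — a sketch, assuming the mapreduce version of f from the first example:

```julia
using Zygote

f(x) = mapreduce(xi -> xi^2, +, x)
x = rand(100)

# The forward pass returns the value and a pullback closure;
# calling the pullback with the output cotangent runs the
# reverse pass. The closure is what you would descend into,
# e.g. `using Cthulhu; @descend back(1.0)`.
y, back = Zygote.pullback(f, x)
grad, = back(1.0)   # same as first(Zygote.gradient(f, x))
```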

This is usually not a problem in ML code because people are used to poor performance/no support in other AD systems. Hence workarounds like materializing the input before calling sum (what your 3rd example does) so that it hits an existing rule like ChainRules.jl/mapreduce.jl at 65833a19629ee890d30edc96b61bbdbc4e0da72f · JuliaDiff/ChainRules.jl · GitHub or Zygote.jl/array.jl at 12f5c1d75eeaa8c7a818f2db7f8d082956c00cac · FluxML/Zygote.jl · GitHub.

On the bright side, things are slowly changing for the better. Rule for sum(f, xs) by oxinabox · Pull Request #441 · JuliaDiff/ChainRules.jl · GitHub is the first step in getting better higher-order function AD working. I don’t remember if Diffractor (next gen AD system) will help with this, but it’s worth looking out for that as well. In the meantime, your options are to a) avoid higher-order functions, or b) look into AD systems other than Zygote (e.g. ForwardDiff).
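For a flat reduction like this, forward mode sidesteps the higher-order-function problem entirely, since the dual numbers are simply pushed through the mapreduce — a sketch of option b):

```julia
using ForwardDiff, BenchmarkTools

f(x) = mapreduce(xi -> xi^2, +, x)
x = rand(100)

# Forward mode propagates dual numbers through the callback, so
# mapreduce needs no special adjoint; the cost grows with
# length(x) but is amortized over ForwardDiff's chunking.
ForwardDiff.gradient(f, x)
@btime ForwardDiff.gradient($f, $x)
```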


Good question. What should one expect? I found this: How fast is automatic differentiation? - Computational Science Stack Exchange

Anyone knows a better answer to that?

Maybe I should explain why I’m hunting for performance in the first place - it is because a collaborator has implemented a very similar model as mine (nothing to do with the model above) in Tensorflow and claims his AD gradients are only a factor 1.5 slower than my hand-optimised ones, which are, I believe, within a factor 2-3 of optimal.

I figured combining Zygote + ForwardDiff was my best chance at getting something like that in Julia. But maybe his claim is just bogus and I should test that myself.


To my knowledge the TF AD doesn’t support higher-order functions at all. If/when you get more info on good comparative examples, feel free to start a thread here and we can take a look at it.


You might also want to try ReverseDiff or Yota instead of Zygote in this case; they might be a bit faster.
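A sketch of what that looks like with ReverseDiff, including a pre-compiled tape for repeated gradient calls (untimed here — worth benchmarking against the numbers above):

```julia
using ReverseDiff

f(x) = mapreduce(xi -> xi^2, +, x)
x = rand(100)

# One-shot gradient:
ReverseDiff.gradient(f, x)

# For many gradient calls at the same input size, recording and
# compiling the tape once avoids re-tracing f on every call
# (only safe when f has no input-dependent control flow):
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
out = similar(x)
ReverseDiff.gradient!(out, tape, x)
```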

Sorry for the confusion. I was responding to the comment about the timing of the second one too tersely.

Reductions like this are one of the reasons Tullio exists. It’s not quite a factor 2 here, but at least it’s in ns:

julia> using Tullio, ForwardDiff, Zygote

julia> ft(x) = @tullio tot := x[i]^2  threads=false avx=false;

julia> @btime ft($x);
  12.596 ns (0 allocations: 0 bytes)

julia> @btime Zygote.gradient(ft, $x);
  86.598 ns (1 allocation: 896 bytes)

julia> 86.598 / 12.596
6.875039695141314

julia> ftd(x) = @tullio tot := x[i]^2  grad=Dual threads=false avx=false;

julia> @btime Zygote.gradient(ftd, $x);  # sometimes more efficient
  86.863 ns (1 allocation: 896 bytes)

julia> @btime Zygote.gradient($f, $x);  # first version with mapreduce, my computer
  442.250 μs (3469 allocations: 213.92 KiB)

julia> @btime f($x);
  17.451 ns (0 allocations: 0 bytes)

Thanks for suggesting those alternative packages. I understand that there are many alternatives. I chose Zygote because it was supposed to be the future of AD, and I cannot afford to invest in something that will be replaced in a few months.

But then I just learned that Zygote will be replaced with something called Diffractor?

So what should I be investing in? Will I be relatively safe if I focus my development around ChainRules and use Zygote or any of these others only sparingly?


Thank you, that’s an important distinction I just haven’t made, because I’m so used to these in Julia that I don’t even think about them anymore. But would you say that avoiding higher-order functions will automatically lead to faster Zygote gradients?


You shouldn’t have to invest. The AD systems (will) share a rule system, ChainRules.jl, so the modifications you make (i.e. added rules) would cross over to other systems in a nice way.


I would say any function with a cheap-to-compute callback that is called many, many times (e.g. sum(f, x)) doesn’t perform great right now relative to the forward pass. If f is expensive to compute or the number of elements is small, it’ll be less of an issue.


Is there some kind of a performance guide for AD in general and Zygote in particular? Similar to the one for Julia in the manual?