Speed of vectorized vs for-loops using Zygote

Hey,

I’m using Zygote for a minimization problem. Writing the loss function in for-loop style is actually a factor of ~2 faster than the vectorized version on the forward pass, but I then need to take the derivative, and Zygote fails completely (in terms of performance) on the for-loop function. The vectorized version is OK, but even there the gradient is roughly 5-10x slower than the function itself.

I was wondering whether I’m doing this correctly or whether there are tricks to improve the speed; I couldn’t find any good online resources on this. I tried removing the @inbounds and also the @views, but the timings stay roughly the same.

function laplace_vec(rec)
    @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end-1, 2:end-1]
                .+ rec[3:end, 2:end-1])
    @views b = (rec[2:end-1, 1:end-2]
                .+ rec[2:end-1, 3:end])

    return @views sum((a .+ b) .^ 2)
end


function laplace_for(rec)
    res = zero(eltype(rec))
    for j = 2:size(rec)[2] - 1
        for i = 2:size(rec)[1] - 1
            @inbounds res += (rec[i - 1, j] + rec[i+1, j] 
                              + rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
        end
    end
    return res
end
       
            
      
x = rand(100, 100)
@time a = laplace_vec(x)
@time a = laplace_vec(x)
@time b = laplace_for(x)
@time b = laplace_for(x)

print("Zygote", "\n")
@time Zygote.gradient(laplace_vec, x)
@time Zygote.gradient(laplace_vec, x)
@time Zygote.gradient(laplace_for, x)
@time Zygote.gradient(laplace_for, x)

The performance results are roughly:

  0.000059 seconds (12 allocations: 225.750 KiB)
  0.000051 seconds (12 allocations: 225.750 KiB)
  0.000021 seconds (1 allocation: 16 bytes)
  0.000015 seconds (1 allocation: 16 bytes)
Zygote
  0.000233 seconds (174 allocations: 1.424 MiB)
  0.000315 seconds (174 allocations: 1.424 MiB)
  1.151743 seconds (836.62 k allocations: 7.183 GiB, 19.52% gc time)
  1.144881 seconds (836.62 k allocations: 7.183 GiB, 18.95% gc time)

Can anybody help me?

Thanks,

Felix


This is a known issue. Every indexing operation allocates a full zero(rec)-sized array to accumulate its gradient contribution, which is why the scalar indexing in laplace_for is so much slower. Here’s one issue about this:

https://github.com/FluxML/Zygote.jl/issues/644
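A minimal illustration of the cost (my own example, not from the issue): differentiating even a single scalar read produces a full-size gradient array, and laplace_for does tens of thousands of such reads on a 100×100 input.

using Zygote

x = rand(100, 100)
g = Zygote.gradient(a -> a[1, 1], x)[1]   # pullback of one getindex
size(g)             # (100, 100) – a whole array just for one index
count(!iszero, g)   # 1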

I have a package solution, although not yet registered:

julia> using Pkg; pkg"add https://github.com/mcabbott/Tullio.jl"
julia> using Tullio
julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +   
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2

julia> @btime Zygote.gradient(laplace_vec, $x); # 18.673 μs forwards
  269.441 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x); # 10.977 μs forwards
  4.892 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x); # 3.811 μs forwards
  44.920 μs (41 allocations: 79.34 KiB)

That’s expected with the current Zygote. I would swap to ReverseDiff.jl with tape compilation if you’re doing a lot of scalar computations.


FWIW, some quick times with ReverseDiff. Curiously the two variants come out almost identical.

julia> using ReverseDiff

julia> @btime ReverseDiff.gradient(laplace_vec, $x);
  19.361 ms (432211 allocations: 17.40 MiB)

julia> @btime ReverseDiff.gradient(laplace_for, $x);
  20.290 ms (432208 allocations: 17.18 MiB)

julia> vec_tape_comp = ReverseDiff.compile(ReverseDiff.GradientTape(laplace_vec, x)) # fixed! 

julia> for_tape_comp = ReverseDiff.compile(ReverseDiff.GradientTape(laplace_for, x))

julia> @btime ReverseDiff.gradient!(similar($x), $vec_tape_comp, $x); # fixed!
  5.476 ms (134456 allocations: 2.13 MiB)

julia> @btime ReverseDiff.gradient!(similar($x), $for_tape_comp, $x);
  8.097 ms (134458 allocations: 2.13 MiB)

ReverseDiff doesn’t have broadcast overloads for the v1.0 broadcasting machinery, so it’s probably building and compiling essentially the same scalar code for both versions. I think @mohamed82008 has an implementation for that though, which would make it swap in forward mode on broadcasts and accelerate it in this situation.
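Roughly, the idea of forward mode on broadcasts is to push dual numbers through the scalar kernel so each element carries its derivative alongside its value – a sketch with ForwardDiff (my own illustration, not ReverseDiff’s actual overload code):

using ForwardDiff

f(a, b) = (a + b)^2
xs, ys = rand(5), rand(5)

duals = ForwardDiff.Dual.(xs, one.(xs))   # seed ∂/∂xs for each element
out   = f.(duals, ys)                     # one fused broadcast
vals  = ForwardDiff.value.(out)
∂xs   = ForwardDiff.partials.(out, 1)     # elementwise ∂f/∂a, reusable in the reverse pass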


Thanks for your snippet, indeed it’s pretty fast!

I’m curious, why does @tullio generate such fast code?

It’s generating loops much like laplace_for, but for constructing the gradient. I’m not entirely sure why that’s 10x slower than the forward pass. (Timing the function it writes directly, without Zygote involved, saves 10 μs or so.)

It will add @avx if you load LoopVectorization, but this doesn’t help here. I think @fastmath is enough to make the forward pass vectorise.
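For intuition, the forward pass it writes is roughly this (a hand-written approximation, not the literal macro expansion):

# Approximation of Tullio's forward loop: laplace_for with @inbounds @fastmath
# applied and the ranges inferred from the index shifts.
function laplace_tul_forward(rec)
    res = zero(eltype(rec))
    @inbounds @fastmath for j in 2:size(rec, 2)-1, i in 2:size(rec, 1)-1
        res += (rec[i-1, j] + rec[i+1, j] + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
    end
    return res
end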

Am I reading the benchmark correctly (I reproduced it myself) that ReverseDiff with tape compilation doesn’t give a speedup?
It’s actually even worse than Zygote on the vectorized version?

Try replacing the first of the two laplace_fors with laplace_vec :wink:


Oh right, thanks! Fixed above.

Still quite similar though, compared to how differently Zygote (and Tracker) behave.

Yes, the overloads are helpful.

It gives quite a significant speedup in that example over the scalarized for-loop version under Zygote – roughly 1 s down to 8 ms with the compiled tape. Notice I said “if you’re doing a lot of scalar computations”.

I got 40 microseconds without LoopVectorization, vs 32 microseconds with:

julia> using Tullio, Zygote, BenchmarkTools

julia> function laplace_vec(rec)
           @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end - 1, 2:end - 1]
                       .+ rec[3:end, 2:end - 1])
           @views b = (rec[2:end-1, 1:end-2]
                       .+ rec[2:end - 1, 3:end])

           return @views sum((a .+ b) .^ 2)
       end
laplace_vec (generic function with 1 method)

julia> function laplace_for(rec)
           res = zero(eltype(rec))
           for j = 2:size(rec)[2] - 1
               for i = 2:size(rec)[1] - 1
                   @inbounds res += (rec[i - 1, j] + rec[i+1, j]
                                     + rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
               end
           end
           return res
       end
laplace_for (generic function with 1 method)

julia> x = rand(100, 100);

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_vec, $x); 
  142.741 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x); 
  1.064 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  40.459 μs (41 allocations: 79.34 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  42.718 μs (41 allocations: 79.34 KiB)

julia> using LoopVectorization

julia> @btime Zygote.gradient(laplace_tul, $x); # old function not invalidated!
  40.864 μs (41 allocations: 79.34 KiB)

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_tul, $x); # redefining updates
  31.596 μs (39 allocations: 79.25 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  31.610 μs (39 allocations: 79.25 KiB)

Did you make sure to redefine your @tullio function to update it after loading LoopVectorization? Or is this difference in performance based on something else, e.g. CPU architecture?

We’re seeing much bigger differences elsewhere that are obviously unrelated (in particular, ~1 s vs nearly 5 s for Zygote on laplace_for) – I’m just curious whether LoopVectorization really isn’t doing better than @fastmath in your case.

This is what I got; laplace_avx is the same definition, made after loading the package – slightly faster forward, but slightly slower reverse? (2-core MacBook Pro, I can try a desktop later.)

julia> @btime laplace_avx($x)
  3.348 μs (11 allocations: 352 bytes)                                   

julia> @btime Zygote.gradient(laplace_avx, $x);                          
  48.037 μs (39 allocations: 79.25 KiB)

With grad=Dual it’s slower, 58.833 μs.
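(For reference, the grad=Dual variant was presumably defined with the option at the end, in the same keyword style as the avx=4 example further down; grad=Dual tells Tullio to compute the gradient with ForwardDiff dual numbers instead of its symbolic reverse rules.)

laplace_dual(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2  grad=Dual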


Interesting, thanks.
Can you somehow dump what the differentiated loops look like as they’re passed to LoopVectorization? How does this interact with Zygote?
I’d like to be able to investigate (and fix) any performance regressions it causes.

Sure, this will fill your screen… it’s got @avx already expanded, but the @fastmath version is below:

julia> @tullio res = (rec[i - 1, j] + rec[i+1, j] +                                                       
                rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2  verbose=2; 
This is the ∇𝒜𝒸𝓉! function. Here `𝛥ℛ = ones(1)`, `𝛥rec = similar(rec)`, the first argument is `Array`, and `𝒶𝓍i, 𝒶𝓍j = axes(rec)`:
function ∇𝒜𝒸𝓉!(::Type, 𝛥rec, 𝛥ℛ::AbstractArray{𝒯}, rec, 𝒶𝓍i, 𝒶𝓍j, ♻ = nothing) where 𝒯
            #= /Users/me/.julia/dev/Tullio/src/macro.jl:659 =# @inbounds #= /Users/me/.julia/dev/Tullio/src/macro.jl:659 =# @fastmath(begin
                        nothing
                        for j = 𝒶𝓍j
                            for i = 𝒶𝓍i
                                begin
                                    nothing
                                    begin
                                        ℰ𝓍1 = conj(𝛥ℛ[1])
                                        ℰ𝓍2 = rec[i, j + 1] + rec[i, j - 1]
                                        ℰ𝓍3 = rec[i + 1, j] + ℰ𝓍2
                                        ℰ𝓍4 = rec[i - 1, j] + ℰ𝓍3
                                        ℰ𝓍5 = 4 * rec[i, j]
                                        ℰ𝓍6 = ℰ𝓍4 - ℰ𝓍5
                                        ℰ𝓍7 = 2ℰ𝓍6
                                        ℰ𝓍8 = ℰ𝓍1 * ℰ𝓍7
                                        ℰ𝓍9 = conj(ℰ𝓍8)
                                        ℰ𝓍10 = rec[i - 1, j] + ℰ𝓍3
                                        ℰ𝓍11 = ℰ𝓍10 - ℰ𝓍5
                                        ℰ𝓍12 = 2ℰ𝓍11
                                        ℰ𝓍13 = ℰ𝓍1 * ℰ𝓍12
                                        ℰ𝓍14 = conj(ℰ𝓍13)
                                        ℰ𝓍15 = rec[i - 1, j] + ℰ𝓍3
                                        ℰ𝓍16 = ℰ𝓍15 - ℰ𝓍5
                                        ℰ𝓍17 = 2ℰ𝓍16
                                        ℰ𝓍18 = ℰ𝓍1 * ℰ𝓍17
                                        ℰ𝓍19 = conj(ℰ𝓍18)
                                        ℰ𝓍20 = rec[i - 1, j] + ℰ𝓍3
                                        ℰ𝓍21 = ℰ𝓍20 - ℰ𝓍5
                                        ℰ𝓍22 = 2ℰ𝓍21
                                        ℰ𝓍23 = ℰ𝓍1 * ℰ𝓍22
                                        ℰ𝓍24 = conj(ℰ𝓍23)
                                        ℰ𝓍25 = rec[i - 1, j] + ℰ𝓍3
                                        ℰ𝓍26 = ℰ𝓍25 - ℰ𝓍5
                                        ℰ𝓍27 = 2ℰ𝓍26
                                        ℰ𝓍28 = -4ℰ𝓍27
                                        ℰ𝓍29 = ℰ𝓍1 * ℰ𝓍28
                                        ℰ𝓍30 = conj(ℰ𝓍29)
                                        𝛥rec[i - 1, j] = 𝛥rec[i - 1, j] + ℰ𝓍9
                                        𝛥rec[i + 1, j] = 𝛥rec[i + 1, j] + ℰ𝓍14
                                        𝛥rec[i, j + 1] = 𝛥rec[i, j + 1] + ℰ𝓍19
                                        𝛥rec[i, j - 1] = 𝛥rec[i, j - 1] + ℰ𝓍24
                                        𝛥rec[i, j] = 𝛥rec[i, j] + ℰ𝓍30
                                    end
                                    nothing
                                end
                            end
                        end
                    end)
        end
The `conj`s here are a recent halfway attempt at handling complex numbers, and it’s possible they have had bad effects.

It’s quite a bit quicker with @avx unroll=4 (instead of unroll=0, which I think is equivalent to not specifying this).

julia> laplace_avx4(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] + 
                           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2  avx=4 

julia> @btime Zygote.gradient(laplace_avx4, $x);
  28.297 μs (39 allocations: 79.25 KiB)

This code has loop-carried dependencies, which currently aren’t supported in LoopVectorization, so it is getting wrong answers here.

The problem with conj was that it wasn’t in the dict of non-opaque functions, making LoopVectorization assume it was expensive and thus act very conservatively with respect to unrolling – but its answers were still wrong.

Could you feasibly address this problem on your end?

𝛥rec[i - 1, j] = 𝛥rec[i - 1, j] + ℰ𝓍9
𝛥rec[i + 1, j] = 𝛥rec[i + 1, j] + ℰ𝓍14
𝛥rec[i, j + 1] = 𝛥rec[i, j + 1] + ℰ𝓍19
𝛥rec[i, j - 1] = 𝛥rec[i, j - 1] + ℰ𝓍24
𝛥rec[i, j] = 𝛥rec[i, j] + ℰ𝓍30

Either way, I already filed an issue with LoopVectorization here regarding the constant offsets, but unfortunately it may be a little while before I can work on this myself.


Easily, if it means avoiding @avx in such cases.

Harder, if it means shifting things around so that each loop iteration updates only one 𝛥rec[i, j]; I’d need to think about that.
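To make that second option concrete for this stencil, here is a hand-written sketch of the idea (my own illustration, not Tullio’s current output): gather every contribution into 𝛥rec[i, j] by shifting the stencil, so each iteration writes one location and there are no conflicting writes.

function ∇laplace_shifted!(𝛥rec, rec)
    lap(i, j) = rec[i-1, j] + rec[i+1, j] + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j]
    m, n = size(rec)
    interior(i, j) = 2 <= i <= m - 1 && 2 <= j <= n - 1
    for j in 1:n, i in 1:m
        acc = zero(eltype(rec))
        interior(i+1, j) && (acc += 2 * lap(i+1, j))  # rec[i, j] enters lap(i+1, j) via its rec[i-1, j] term
        interior(i-1, j) && (acc += 2 * lap(i-1, j))  # ... and lap(i-1, j) via rec[i+1, j]
        interior(i, j-1) && (acc += 2 * lap(i, j-1))  # ... and lap(i, j-1) via rec[i, j+1]
        interior(i, j+1) && (acc += 2 * lap(i, j+1))  # ... and lap(i, j+1) via rec[i, j-1]
        interior(i, j)   && (acc -= 8 * lap(i, j))    # ... and lap(i, j) via -4 * rec[i, j]
        𝛥rec[i, j] = acc
    end
    return 𝛥rec
end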

To be honest, I can’t really follow the whole discussion…

But so far your Tullio has been the fastest method for computing the gradient. Are you planning to publish it as an official package? I’m having a really hard time figuring out what’s actually done under the hood…

Or maybe you can give me a hint on how to achieve similar performance differently, without needing your package :sweat_smile:


Optimized kernels like that are always going to be better than a straight loop once adjoints get involved, and Tullio looks really great, so I agree that the best option here would be to get Tullio registered and start advocating its use in these circumstances.
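For completeness, one way to recover most of that speed without Tullio is to hand-write the reverse pass and register it as a custom adjoint – a sketch (my own, not from any package), so Zygote runs this pullback instead of tracing the scalar loop:

using Zygote

Zygote.@adjoint function laplace_for(rec)
    y = laplace_for(rec)
    function laplace_for_pullback(ȳ)
        𝛥 = zero(rec)
        @inbounds for j in 2:size(rec, 2)-1, i in 2:size(rec, 1)-1
            lap = rec[i-1, j] + rec[i+1, j] + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j]
            c = 2 * ȳ * lap
            𝛥[i-1, j] += c;  𝛥[i+1, j] += c
            𝛥[i, j+1] += c;  𝛥[i, j-1] += c
            𝛥[i, j]   -= 4c
        end
        return (𝛥,)
    end
    return y, laplace_for_pullback
end

Zygote.gradient(laplace_for, x)   # now runs the hand-written loop above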
