# Speed of vectorized vs for-loops using Zygote

Hey,

I’m using Zygote for a minimization problem. Writing the loss function in for-loop style is actually a factor of ~2 faster than the vectorized version. I then need to take the derivative.
Zygote fails completely (in terms of performance) on the for-loop function. The vectorized one is OK,
but even there Zygote is roughly 5-10x slower than the function itself.

I was wondering if I’m doing this correctly, or if there are tricks to improve the speed. I couldn’t find any good online resources on this. I tried removing the `@inbounds` and also the `@views`; it stays roughly the same.

```julia
using Zygote

function laplace_vec(rec)
    @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end-1, 2:end-1]
                .+ rec[3:end, 2:end-1])
    @views b = (rec[2:end-1, 1:end-2]
                .+ rec[2:end-1, 3:end])
    return sum((a .+ b) .^ 2)
end

function laplace_for(rec)
    res = zero(eltype(rec))
    for j = 2:size(rec, 2) - 1
        for i = 2:size(rec, 1) - 1
            @inbounds res += (rec[i-1, j] + rec[i+1, j]
                              + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
        end
    end
    return res
end

x = rand(100, 100)
@time a = laplace_vec(x)
@time a = laplace_vec(x)
@time b = laplace_for(x)
@time b = laplace_for(x)

print("Zygote", "\n")
@time ga = Zygote.gradient(laplace_vec, x)
@time ga = Zygote.gradient(laplace_vec, x)
@time gb = Zygote.gradient(laplace_for, x)
@time gb = Zygote.gradient(laplace_for, x)
```

The performance results are roughly:

```
0.000059 seconds (12 allocations: 225.750 KiB)
0.000051 seconds (12 allocations: 225.750 KiB)
0.000021 seconds (1 allocation: 16 bytes)
0.000015 seconds (1 allocation: 16 bytes)
Zygote
0.000233 seconds (174 allocations: 1.424 MiB)
0.000315 seconds (174 allocations: 1.424 MiB)
1.151743 seconds (836.62 k allocations: 7.183 GiB, 19.52% gc time)
1.144881 seconds (836.62 k allocations: 7.183 GiB, 18.95% gc time)
```

Can anybody help me?

Thanks,

Felix


This is a known issue: every scalar indexing operation allocates a `zero(rec)`-sized array to accumulate the gradient, which is why the scalar indexing in `laplace_for` is so much slower. There’s an open issue about this.
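
A minimal illustration of that cost (assuming Zygote is loaded; the exact container type of the result may vary between Zygote versions):

```julia
using Zygote

# The adjoint of a single scalar read is a full-size array:
# zeros everywhere except a 1.0 at the indexed position.
g = Zygote.gradient(x -> x[2, 2], ones(3, 3))[1]
@show size(g)      # (3, 3)
@show g[2, 2]      # 1.0

# laplace_for reads rec five times per interior cell, so the naive
# reverse pass accumulates ~5 * 98^2 of these full-size adjoints for a
# 100×100 input -- which is where the GiB of allocations come from.
```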

I have a package solution, although not yet registered:

```julia
julia> using Pkg; pkg"add https://github.com/mcabbott/Tullio.jl"

julia> using Tullio, Zygote, BenchmarkTools

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2

julia> @btime Zygote.gradient(laplace_vec, $x);  # 18.673 μs forwards
  269.441 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x);  # 10.977 μs forwards
  4.892 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x);  # 3.811 μs forwards
  44.920 μs (41 allocations: 79.34 KiB)
```

That’s expected with the current Zygote. I would switch to ReverseDiff.jl with tape compilation if you’re doing a lot of scalar computations.


FWIW, some quick times with ReverseDiff. Curiously the two variants come out almost identical.

```julia
julia> using ReverseDiff

julia> @btime ReverseDiff.gradient(laplace_vec, $x);
  19.361 ms (432211 allocations: 17.40 MiB)

julia> @btime ReverseDiff.gradient(laplace_for, $x);
  20.290 ms (432208 allocations: 17.18 MiB)

julia> vec_tape_comp = ReverseDiff.compile(ReverseDiff.GradientTape(laplace_vec, x));  # fixed!

julia> @btime ReverseDiff.gradient!(similar($x), $vec_tape_comp, $x);  # fixed!
  5.476 ms (134456 allocations: 2.13 MiB)

julia> for_tape_comp = ReverseDiff.compile(ReverseDiff.GradientTape(laplace_for, x));

julia> @btime ReverseDiff.gradient!(similar($x), $for_tape_comp, $x);
  8.097 ms (134458 allocations: 2.13 MiB)
```

ReverseDiff doesn’t have the v1.0 broadcast overloads, so it’s probably building and compiling the same code for both. I think @mohamed82008 has an implementation of that though, which would make it switch to forward mode on broadcasts and accelerate this situation.


Thanks for your snippet, indeed it’s pretty fast!

I’m curious: why does `@tullio` generate such fast code?

It’s generating loops much like `laplace_for`, but to construct the gradient. I’m not entirely sure why that’s 10x slower than forwards. (Timing the function it writes, without Zygote involved, saves 10μs or so.)

It will add `@avx` if you load LoopVectorization, but this doesn’t help here. I think `@fastmath` is enough to make the forward pass vectorise.
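
For comparison, here’s roughly what such an annotated forward loop looks like (a sketch of mine, not Tullio’s output; `@fastmath` lets the compiler reassociate the reduction so it can use SIMD):

```julia
# Sketch: the stencil loop annotated so the sum can vectorise.
function laplace_fast(rec)
    res = zero(eltype(rec))
    n1, n2 = size(rec)
    @inbounds @fastmath for j = 2:n2-1
        @simd for i = 2:n1-1
            res += (rec[i-1, j] + rec[i+1, j] +
                    rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
        end
    end
    return res
end
```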

Am I reading the benchmark correctly (I reproduced it myself) that ReverseDiff with tape compilation doesn’t give a speedup?
Actually, it’s even worse than Zygote on the vectorized version?

Try replacing the first of the two `laplace_for`s with `laplace_vec`.

Oh right, thanks! Fixed above.

Still quite similar though, compared to how differently Zygote (and Tracker) behave.

It gives quite a significant speedup in that example over Zygote on the scalar for-loop version. Notice I said “if you’re doing a lot of scalar computations”.

I got 40 microseconds without LoopVectorization, vs 32 microseconds with:

```julia
julia> using Tullio, Zygote, BenchmarkTools

julia> function laplace_vec(rec)
           @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end-1, 2:end-1]
                       .+ rec[3:end, 2:end-1])
           @views b = (rec[2:end-1, 1:end-2]
                       .+ rec[2:end-1, 3:end])
           return sum((a .+ b) .^ 2)
       end
laplace_vec (generic function with 1 method)

julia> function laplace_for(rec)
           res = zero(eltype(rec))
           for j = 2:size(rec, 2) - 1
               for i = 2:size(rec, 1) - 1
                   @inbounds res += (rec[i-1, j] + rec[i+1, j]
                                     + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
               end
           end
           return res
       end
laplace_for (generic function with 1 method)

julia> x = rand(100, 100);

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_vec, $x);
  142.741 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x);
  1.064 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x);
  40.459 μs (41 allocations: 79.34 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x);
  42.718 μs (41 allocations: 79.34 KiB)

julia> using LoopVectorization

julia> @btime Zygote.gradient(laplace_tul, $x);  # old function not invalidated!
  40.864 μs (41 allocations: 79.34 KiB)

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_tul, $x);
  31.596 μs (39 allocations: 79.25 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x);
  31.610 μs (39 allocations: 79.25 KiB)
```

Did you make sure to redefine your `@tullio` function after loading LoopVectorization, so that it picks it up? Or is the difference in performance due to something else, e.g. CPU architecture?

We’re seeing much bigger differences elsewhere that are obviously unrelated (in particular, 1 s vs nearly 5 s for Zygote on `laplace_for`); I’m just curious whether LoopVectorization really isn’t doing better than `@fastmath` in your case.

This is what I got; `laplace_avx` is from after loading the package – slightly faster forward, but slightly slower reverse? (2-core MacBook Pro, I can try a desktop later.)

```julia
julia> @btime laplace_avx($x);
  3.348 μs (11 allocations: 352 bytes)

julia> @btime Zygote.gradient(laplace_avx, $x);
  48.037 μs (39 allocations: 79.25 KiB)
```

With `grad=Dual` it’s slower, 58.833 μs.


Interesting, thanks.
Can you somehow dump what the differentiated loops look like as they’re passed to LoopVectorization? How does this interact with Zygote?
I’d like to be able to investigate (and fix) any performance regressions it causes.

Sure, this will fill your screen… it’s got `@avx` already expanded, but the `@fastmath` version is below:

```julia
julia> @tullio res = (rec[i - 1, j] + rec[i+1, j] +
           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2  verbose=2;
```
The `∇𝒜𝒸𝓉!` function it generates is below. Here `𝛥ℛ = ones(1)`, `𝛥rec = similar(rec)`, the first argument is `Array`, and `𝒶𝓍i, 𝒶𝓍j = axes(rec)`:
```julia
function ∇𝒜𝒸𝓉!(::Type, 𝛥rec, 𝛥ℛ::AbstractArray{𝒯}, rec, 𝒶𝓍i, 𝒶𝓍j, ♻ = nothing) where 𝒯
    @inbounds @fastmath begin
        for j = 𝒶𝓍j
            for i = 𝒶𝓍i
                ℰ𝓍1 = conj(𝛥ℛ)
                ℰ𝓍2 = rec[i, j + 1] + rec[i, j - 1]
                ℰ𝓍3 = rec[i + 1, j] + ℰ𝓍2
                ℰ𝓍4 = rec[i - 1, j] + ℰ𝓍3
                ℰ𝓍5 = 4 * rec[i, j]
                ℰ𝓍6 = ℰ𝓍4 - ℰ𝓍5
                ℰ𝓍7 = 2ℰ𝓍6
                ℰ𝓍8 = ℰ𝓍1 * ℰ𝓍7
                ℰ𝓍9 = conj(ℰ𝓍8)
                ℰ𝓍10 = rec[i - 1, j] + ℰ𝓍3
                ℰ𝓍11 = ℰ𝓍10 - ℰ𝓍5
                ℰ𝓍12 = 2ℰ𝓍11
                ℰ𝓍13 = ℰ𝓍1 * ℰ𝓍12
                ℰ𝓍14 = conj(ℰ𝓍13)
                ℰ𝓍15 = rec[i - 1, j] + ℰ𝓍3
                ℰ𝓍16 = ℰ𝓍15 - ℰ𝓍5
                ℰ𝓍17 = 2ℰ𝓍16
                ℰ𝓍18 = ℰ𝓍1 * ℰ𝓍17
                ℰ𝓍19 = conj(ℰ𝓍18)
                ℰ𝓍20 = rec[i - 1, j] + ℰ𝓍3
                ℰ𝓍21 = ℰ𝓍20 - ℰ𝓍5
                ℰ𝓍22 = 2ℰ𝓍21
                ℰ𝓍23 = ℰ𝓍1 * ℰ𝓍22
                ℰ𝓍24 = conj(ℰ𝓍23)
                ℰ𝓍25 = rec[i - 1, j] + ℰ𝓍3
                ℰ𝓍26 = ℰ𝓍25 - ℰ𝓍5
                ℰ𝓍27 = 2ℰ𝓍26
                ℰ𝓍28 = -4ℰ𝓍27
                ℰ𝓍29 = ℰ𝓍1 * ℰ𝓍28
                ℰ𝓍30 = conj(ℰ𝓍29)
                𝛥rec[i - 1, j] = 𝛥rec[i - 1, j] + ℰ𝓍9
                𝛥rec[i + 1, j] = 𝛥rec[i + 1, j] + ℰ𝓍14
                𝛥rec[i, j + 1] = 𝛥rec[i, j + 1] + ℰ𝓍19
                𝛥rec[i, j - 1] = 𝛥rec[i, j - 1] + ℰ𝓍24
                𝛥rec[i, j] = 𝛥rec[i, j] + ℰ𝓍30
            end
        end
    end
end
```
The `conj`s here are a recent half-way attempt to treat complex numbers, and it's possible they have had bad effects.

It’s quite a bit quicker with `@avx unroll=4` (instead of `unroll=0`, which I think is equivalent to not specifying it).

```julia
julia> laplace_avx4(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
           rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2  avx=4

julia> @btime Zygote.gradient(laplace_avx4, $x);
  28.297 μs (39 allocations: 79.25 KiB)
```

This code has loop carried dependencies, which currently aren’t supported in LoopVectorization.

The problem with `conj` was that it wasn’t in the dict of non-opaque functions, making LoopVectorization assume it is expensive, and thus act very conservatively with respect to unrolling – but its answers were still wrong.

```julia
𝛥rec[i - 1, j] = 𝛥rec[i - 1, j] + ℰ𝓍9
𝛥rec[i + 1, j] = 𝛥rec[i + 1, j] + ℰ𝓍14
𝛥rec[i, j + 1] = 𝛥rec[i, j + 1] + ℰ𝓍19
𝛥rec[i, j - 1] = 𝛥rec[i, j - 1] + ℰ𝓍24
𝛥rec[i, j] = 𝛥rec[i, j] + ℰ𝓍30
```

Either way, I’ve already filed an issue with LoopVectorization about the constant offsets, but unfortunately it may be a little while before I can work on it myself.


Easily, if it means avoiding `@avx` in such cases.

Harder, if it means shifting things around so that only one `𝛥rec[i, j]` is updated in the loop; I’d need to think.
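
For what it’s worth, here’s a rough sketch of what that shift could look like (my own code, not Tullio’s): because the 5-point stencil is symmetric, the gradient of `sum(stencil.^2)` is twice the same stencil applied to the zero-padded stencil output, so each `𝛥rec[i, j]` can be written exactly once:

```julia
# Sketch: a "gather" form of the gradient. loss = sum(t.^2), where t is
# the 5-point stencil of rec on the interior; the stencil is symmetric,
# so grad = 2 * stencil(t), with t treated as zero outside the interior.
function laplace_grad(rec)
    n1, n2 = size(rec)
    t = zeros(eltype(rec), n1, n2)          # stencil output, zero at the boundary
    for j in 2:n2-1, i in 2:n1-1
        t[i, j] = rec[i-1, j] + rec[i+1, j] + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j]
    end
    at(i, j) = (1 <= i <= n1 && 1 <= j <= n2) ? t[i, j] : zero(eltype(rec))
    g = similar(rec)
    for j in 1:n2, i in 1:n1                # no scatter: one write per cell
        g[i, j] = 2 * (at(i-1, j) + at(i+1, j) + at(i, j+1) + at(i, j-1) - 4 * at(i, j))
    end
    return g
end
```

The extra pass over `t` costs one temporary array, but removes the cross-iteration writes that block SIMD.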

To be honest, I can’t really follow the whole discussion…

But so far your Tullio was the fastest method for computing the gradient.
Are you planning to register it as an official package? I’m finding it really hard to work out what’s actually done under the hood…

Or maybe you can give me a hint on how to get similar performance in a different way, without needing your package?
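
One option without Tullio (a sketch of mine, not a verified answer from this thread) is to keep the fast for-loop forward pass and give Zygote a hand-written adjoint for it, so the reverse pass is also a plain loop instead of per-index accumulation:

```julia
using Zygote

# The loop version of the loss.
function laplace_for(rec)
    res = zero(eltype(rec))
    for j = 2:size(rec, 2)-1, i = 2:size(rec, 1)-1
        @inbounds res += (rec[i-1, j] + rec[i+1, j] +
                          rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
    end
    return res
end

# Hand-written reverse rule: one loop scattering each term's gradient,
# into a single preallocated array instead of one array per getindex.
Zygote.@adjoint function laplace_for(rec)
    y = laplace_for(rec)
    function back(Δ)
        g = zero(rec)
        @inbounds for j = 2:size(rec, 2)-1, i = 2:size(rec, 1)-1
            t = rec[i-1, j] + rec[i+1, j] + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j]
            c = 2 * Δ * t
            g[i-1, j] += c; g[i+1, j] += c
            g[i, j+1] += c; g[i, j-1] += c
            g[i, j]   -= 4 * c
        end
        return (g,)
    end
    return y, back
end

x = rand(100, 100)
Zygote.gradient(laplace_for, x)  # a cheap loop instead of GiB of allocations
```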

Optimized kernels like that are always going to be better than a straight loop once adjoints get involved, and Tullio looks really great, so I agree the best option here would be to get it registered and start advocating its use in these circumstances.
