Speed of vectorized vs for-loops using Zygote

Elrod · June 1, 2020, 2:26pm

For the Zygote gradient of the @tullio version, I got 40 microseconds without LoopVectorization vs. 32 microseconds with it:

julia> using Tullio, Zygote, BenchmarkTools

julia> function laplace_vec(rec)
           @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end - 1, 2:end - 1]
                       .+ rec[3:end, 2:end - 1])
           @views b = (rec[2:end-1, 1:end-2]
                       .+ rec[2:end - 1, 3:end])

           return @views sum((a .+ b) .^ 2)
       end
laplace_vec (generic function with 1 method)

julia> function laplace_for(rec)
           res = zero(eltype(rec))
           for j = 2:size(rec)[2] - 1
               for i = 2:size(rec)[1] - 1
                   @inbounds res += (rec[i - 1, j] + rec[i+1, j]
                                     + rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
               end
           end
           return res
       end
laplace_for (generic function with 1 method)

julia> x = rand(100, 100);

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_vec, $x); 
  142.741 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x); 
  1.064 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  40.459 μs (41 allocations: 79.34 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  42.718 μs (41 allocations: 79.34 KiB)

julia> using LoopVectorization

julia> @btime Zygote.gradient(laplace_tul, $x); # old function not invalidated!
  40.864 μs (41 allocations: 79.34 KiB)

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_tul, $x); # redefining updates
  31.596 μs (39 allocations: 79.25 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  31.610 μs (39 allocations: 79.25 KiB)

Did you make sure to redefine your @tullio function to update it after loading LoopVectorization? Or is this difference in performance based on something else, e.g. CPU architecture?
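
To rule out the ordering issue in a fresh session, something like the following sketch could be used. It assumes, as the timings above suggest, that @tullio decides whether to use LoopVectorization at the moment the macro is expanded, so the package is loaded first. The function body is copied from the transcript above; nothing else is new.

```julia
# Load LoopVectorization BEFORE any @tullio definition, so the macro
# can see it when it expands (assumption based on the timings above).
using LoopVectorization
using Tullio, Zygote

# Same definition as in the transcript, now expanded with
# LoopVectorization already loaded.
laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                  rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2

x = rand(100, 100)

# Gradient with respect to the input; has the same size as x.
g = Zygote.gradient(laplace_tul, x)[1]
```

If the function was already defined before `using LoopVectorization`, re-running the definition line afterwards should have the same effect, as in the `# redefining updates` step above.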

We’re seeing much bigger differences elsewhere that are obviously unrelated (in particular, about 1 s here versus nearly 5 s for Zygote on laplace_for). I’m just curious whether LoopVectorization really isn’t doing better than @fastmath in your case.
