Speed of vectorized vs for-loops using Zygote

For the Zygote gradient of the @tullio version, I got about 40 microseconds without LoopVectorization vs. about 32 microseconds with it:

julia> using Tullio, Zygote, BenchmarkTools

julia> function laplace_vec(rec)
           @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end - 1, 2:end - 1]
                       .+ rec[3:end, 2:end - 1])
           @views b = (rec[2:end-1, 1:end-2]
                       .+ rec[2:end - 1, 3:end])

           return @views sum((a .+ b) .^ 2)
       end
laplace_vec (generic function with 1 method)

julia> function laplace_for(rec)
           res = zero(eltype(rec))
           for j = 2:size(rec)[2] - 1
               for i = 2:size(rec)[1] - 1
                   @inbounds res += (rec[i - 1, j] + rec[i+1, j]
                                     + rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
               end
           end
           return res
       end
laplace_for (generic function with 1 method)

julia> x = rand(100, 100);

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_vec, $x); 
  142.741 μs (167 allocations: 1.42 MiB)

julia> @btime Zygote.gradient(laplace_for, $x); 
  1.064 s (836613 allocations: 7.18 GiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  40.459 μs (41 allocations: 79.34 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  42.718 μs (41 allocations: 79.34 KiB)

julia> using LoopVectorization

julia> @btime Zygote.gradient(laplace_tul, $x); # old function not invalidated!
  40.864 μs (41 allocations: 79.34 KiB)

julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
laplace_tul (generic function with 1 method)

julia> @btime Zygote.gradient(laplace_tul, $x); # redefining updates
  31.596 μs (39 allocations: 79.25 KiB)

julia> @btime Zygote.gradient(laplace_tul, $x); 
  31.610 μs (39 allocations: 79.25 KiB)

Did you make sure to redefine your @tullio function to update it after loading LoopVectorization? Or is this difference in performance based on something else, e.g. CPU architecture?

We’re seeing much bigger differences elsewhere that are obviously unrelated (in particular, about 1 s vs. nearly 5 s for Zygote on laplace_for). I’m just curious whether LoopVectorization really isn’t doing better than @fastmath in your case.
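
One way to take load order out of the picture is to load LoopVectorization first and define both variants in the same session, using Tullio's `avx=false` option to switch LoopVectorization off for one of them. This is only a sketch: the names `laplace_tul_noavx` and `laplace_tul_avx` are made up for illustration, and it assumes a Tullio version that accepts the `avx` keyword.

```julia
# Load LoopVectorization *before* any @tullio definition so the macro can use it.
using LoopVectorization
using Tullio, Zygote, BenchmarkTools

# Same stencil as laplace_tul, with LoopVectorization explicitly disabled
# (hypothetical name; Tullio's default @fastmath is still applied).
laplace_tul_noavx(rec) = @tullio avx=false res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2

# Default options: LoopVectorization is used because it is already loaded
# (hypothetical name).
laplace_tul_avx(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
                                   rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2

x = rand(100, 100);

# Check that both variants agree before comparing timings.
@assert laplace_tul_noavx(x) ≈ laplace_tul_avx(x)
@assert Zygote.gradient(laplace_tul_noavx, x)[1] ≈ Zygote.gradient(laplace_tul_avx, x)[1]

@btime Zygote.gradient(laplace_tul_noavx, $x);
@btime Zygote.gradient(laplace_tul_avx, $x);
```

If the two timings come out the same on your machine, the difference is probably not about redefinition, and something like CPU architecture becomes the more likely explanation.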