I got 40 microseconds without LoopVectorization vs. 32 microseconds with it:
julia> using Tullio, Zygote, BenchmarkTools
julia> function laplace_vec(rec)
           @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end-1, 2:end-1]
                       .+ rec[3:end, 2:end-1])
           @views b = (rec[2:end-1, 1:end-2]
                       .+ rec[2:end-1, 3:end])
           return @views sum((a .+ b) .^ 2)
       end
laplace_vec (generic function with 1 method)
julia> function laplace_for(rec)
           res = zero(eltype(rec))
           for j = 2:size(rec, 2)-1
               for i = 2:size(rec, 1)-1
                   @inbounds res += (rec[i-1, j] + rec[i+1, j]
                                     + rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
               end
           end
           return res
       end
laplace_for (generic function with 1 method)
julia> x = rand(100, 100);
julia> laplace_tul(rec) = @tullio res = (rec[i-1, j] + rec[i+1, j] +
                                         rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
laplace_tul (generic function with 1 method)
julia> @btime Zygote.gradient(laplace_vec, $x);
142.741 μs (167 allocations: 1.42 MiB)
julia> @btime Zygote.gradient(laplace_for, $x);
1.064 s (836613 allocations: 7.18 GiB)
julia> @btime Zygote.gradient(laplace_tul, $x);
40.459 μs (41 allocations: 79.34 KiB)
julia> @btime Zygote.gradient(laplace_tul, $x);
42.718 μs (41 allocations: 79.34 KiB)
julia> using LoopVectorization
julia> @btime Zygote.gradient(laplace_tul, $x); # old function not invalidated!
40.864 μs (41 allocations: 79.34 KiB)
julia> laplace_tul(rec) = @tullio res = (rec[i-1, j] + rec[i+1, j] +
                                         rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2
laplace_tul (generic function with 1 method)
julia> @btime Zygote.gradient(laplace_tul, $x); # redefining updates
31.596 μs (39 allocations: 79.25 KiB)
julia> @btime Zygote.gradient(laplace_tul, $x);
31.610 μs (39 allocations: 79.25 KiB)
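(For completeness, here's a quick sanity-check sketch, assuming the definitions above, that the broadcast and @tullio versions compute the same value, so the timings really compare the same stencil sum-of-squares:)

```julia
using Tullio  # load LoopVectorization before defining to get the faster path

# broadcast version of the 5-point Laplacian stencil, squared and summed
laplace_vec(rec) = @views sum((rec[1:end-2, 2:end-1] .+ rec[3:end, 2:end-1] .+
                               rec[2:end-1, 1:end-2] .+ rec[2:end-1, 3:end] .-
                               4 .* rec[2:end-1, 2:end-1]) .^ 2)

# same computation with @tullio; the index ranges are inferred from the shifts
laplace_tul(rec) = @tullio res = (rec[i-1, j] + rec[i+1, j] +
                                  rec[i, j+1] + rec[i, j-1] - 4 * rec[i, j])^2

x = rand(100, 100)
@assert laplace_vec(x) ≈ laplace_tul(x)
```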
Did you make sure to redefine your @tullio function after loading LoopVectorization, so that it picks up the update? Or is the performance difference down to something else, e.g. CPU architecture? We're seeing much bigger differences elsewhere that are clearly unrelated (in particular, ~1 s vs. nearly 5 s for Zygote on laplace_for) – I'm just curious whether LoopVectorization really isn't doing better than @fastmath in your case.