In the below dummy code I show that doing the sum over the 4-vectors using broadcasting is an order of magnitude slower than iteration of the 4-vector manually. However, I don’t understand why - and whether I can keep using some form of broadcasting that is just as fast the manual version.
Any ideas?
using BenchmarkTools
function func1(arr1, arr2)
for row in axes(arr1, 2)
for i in axes(arr2, 2)
@views @. arr1[:, row] = arr1[:, row] + arr2[:, i]
end
end
return arr1
end
function func2(arr1, arr2)
for row in axes(arr1, 2)
for i in axes(arr2, 2)
for j in 1:4
arr1[j, row] = arr1[j, row] + arr2[j, i]
end
end
end
return arr1
end
arr1 = zeros(4, 100_000)
arr2 = rand(4, 256)
@btime func1($arr1, $arr2) evals=1 samples=10 seconds=100
# 653.253 ms (76800000 allocations: 3.43 GiB)
@btime func2($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.863 ms (0 allocations: 0 bytes)
Just a guess, but it may be because of the created views. On 1.4.1, they still need allocations, but on 1.5 those are gone. I don’t have a build of 1.5 at hand, but if you try it out, that might explain the difference.
Yeah it’s probably the views causing it. On a different note, I noticed that your second function got quite a bit faster by using @avx from LoopVectorization.jl:
function func2(arr1, arr2)
@avx for row in axes(arr1, 2)
for i in axes(arr2, 2)
for j in 1:4
arr1[j, row] = arr1[j, row] + arr2[j, i]
end
end
end
return arr1
end
The difference is
74.175 ms (0 allocations: 0 bytes) # Without @avx
7.488 ms (0 allocations: 0 bytes) # With @avx
I don’t know for sure, but it seems plausible that the culprit is the @views. Although much cheaper than copying the data, creating views is not entirely free, and since in your case you create a view for every four FLOP, the costs of creating all these views add up.
I am not sure whether this issue can be fixed. Obviously, the two functions which you posted could be replaced with
I don’t believe it is the views, as I already tried a similar example as you can see below where even using views, so long as I don’t broadcast over the 4-vectors I get 0 allocations.
function func3(arr1, arr2)
for row in axes(arr1, 2)
for i in axes(arr2, 2)
v1, v2 = view(arr1, :, row), view(arr2, :, i)
for j in 1:4
arr1[j, row] = v1[j] + v2[j]
end
end
end
return arr1
end
@btime func3($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.788 ms (0 allocations: 0 bytes)
@baggiepinnen I’l be sure to check @avx out - thank you!
@ettersi It is indeed a MWE - the real code is a inner calibration loop for the Murchison Widefield Array (a low frequency radio telescope in Australia).