Why is broadcasting allocating so much memory?

In the dummy code below I show that doing the sum over the 4-vectors using broadcasting is an order of magnitude slower than iterating over the 4-vectors manually. However, I don’t understand why, or whether I can keep using some form of broadcasting that is just as fast as the manual version.

Any ideas?

```julia
using BenchmarkTools

function func1(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            @views @. arr1[:, row] = arr1[:, row] + arr2[:, i]
        end
    end
    return arr1
end

function func2(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end

arr1 = zeros(4, 100_000)
arr2 = rand(4, 256)

@btime func1($arr1, $arr2) evals=1 samples=10 seconds=100
# 653.253 ms (76800000 allocations: 3.43 GiB)

@btime func2($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.863 ms (0 allocations: 0 bytes)
```

Just a guess, but it may be because of the created views. On 1.4.1 they still allocate, but on 1.5 those allocations are gone. I don’t have a build of 1.5 at hand, but if you try it out, that might explain the difference.
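One way to test this guess on whichever Julia version you have is `@allocated` (a quick sketch I’m adding here, not from the thread; the helper name `viewread` is made up):

```julia
# Measure whether creating and reading a view allocates on this
# Julia version (call the function once first to exclude compilation).
viewread(a) = (v = view(a, :, 1); v[1] + v[2])

a = rand(4, 10)
viewread(a)             # warm up
@allocated viewread(a)  # bytes allocated by the view itself
```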

Yeah, it’s probably the views causing it. On a different note, I noticed that your second function gets quite a bit faster with `@avx` from LoopVectorization.jl:

```julia
using LoopVectorization

function func2(arr1, arr2)
    @avx for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end
```

The difference is

```
74.175 ms (0 allocations: 0 bytes) # Without @avx
 7.488 ms (0 allocations: 0 bytes) # With @avx
```

I don’t know for sure, but it seems plausible that the culprit is the `@views`. Although much cheaper than copying the data, creating views is not entirely free, and since in your case you create a view for every four FLOPs, the cost of creating all these views adds up.

I am not sure whether this issue can be fixed. Obviously, the two functions which you posted could be replaced with

```julia
func3(arr1, arr2) = (arr1 .= sum(arr2, dims=2))
```

but I am assuming the posted code is intended as a MWE for something more complicated. Maybe a similar trick could work there?
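For the incrementing semantics of `func1` specifically (every column of `arr1` gets the full column sum of `arr2` added to it), a hedged variant of the same trick is to hoist the sum out of the loops entirely; the name `func1b!` is my own:

```julia
# Precompute the 4×1 column sum of arr2 once, then broadcast-add it
# to every column of arr1. This replaces 100_000 × 256 tiny broadcasts
# with a single one, at the cost of one small allocation for `s`.
function func1b!(arr1, arr2)
    s = sum(arr2, dims=2)  # 4×1 matrix: sum over the columns of arr2
    arr1 .+= s             # broadcast s across all columns of arr1
    return arr1
end
```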

I don’t believe it is the views: I already tried a similar example, as you can see below, where even using views I get 0 allocations, as long as I don’t broadcast over the 4-vectors.

```julia
function func3(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            v1, v2 = view(arr1, :, row), view(arr2, :, i)
            for j in 1:4
                arr1[j, row] = v1[j] + v2[j]
            end
        end
    end
    return arr1
end

@btime func3($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.788 ms (0 allocations: 0 bytes)
```

It’s the combination of allocating views and broadcasting that’s the issue, I think. I’ll compile 1.5 and check.
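One hedged way to probe that hypothesis directly is to compare view creation alone against broadcasting over a view (the helper names are made up, and the reported numbers will depend on the Julia version):

```julia
# Compare allocations for a bare view versus a broadcast over a view.
viewonly(a)  = (v = view(a, :, 1); v[1])         # create + index a view
viewbcast(a) = (v = view(a, :, 1); v .= v .+ 1)  # create + broadcast over it

a = rand(4, 10)
viewonly(a); viewbcast(a)                        # warm up both
@allocated(viewonly(a)), @allocated(viewbcast(a))
```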

@baggiepinnen I’ll be sure to check `@avx` out - thank you!

@ettersi It is indeed a MWE - the real code is an inner calibration loop for the Murchison Widefield Array (a low-frequency radio telescope in Australia).

Julia v1.4.1:

```julia
julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  738.942 ms (76800000 allocations: 3.43 GiB)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  68.627 ms (0 allocations: 0 bytes)
```

master (of a couple of weeks ago):

```julia
julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  195.516 ms (0 allocations: 0 bytes)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  66.396 ms (0 allocations: 0 bytes)
```

@giordano That’s great to see they’ve magicked away the allocations - but now I wonder what causes the remaining slowdown between the two versions on 1.5-master?