Why is broadcasting allocating so much memory?

In the dummy code below I show that doing the sum over the 4-vectors using broadcasting is an order of magnitude slower than iterating over the 4-vectors manually. However, I don’t understand why, or whether I can keep using some form of broadcasting that is just as fast as the manual version.

Any ideas?

```julia
using BenchmarkTools

function func1(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            @views @. arr1[:, row] = arr1[:, row] + arr2[:, i]
        end
    end
    return arr1
end

function func2(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end

arr1 = zeros(4, 100_000)
arr2 = rand(4, 256)

@btime func1($arr1, $arr2) evals=1 samples=10 seconds=100
# 653.253 ms (76800000 allocations: 3.43 GiB)

@btime func2($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.863 ms (0 allocations: 0 bytes)
```

Just a guess, but it may be because of the created views. On 1.4.1 they still allocate, but on 1.5 those allocations are gone. I don’t have a build of 1.5 at hand, but if you try it out, that might explain the difference.
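One way to test this guess on whichever Julia version you have is `@allocated` (a quick sketch I’m adding here, not from the thread; the helper name `viewread` is made up):

```julia
# Measure whether creating and reading a view allocates on this
# Julia version (call the function once first to exclude compilation).
viewread(a) = (v = view(a, :, 1); v[1] + v[2])

a = rand(4, 10)
viewread(a)             # warm up
@allocated viewread(a)  # bytes allocated by the view itself
```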

Yeah, it’s probably the views causing it. On a different note, I noticed that your second function gets quite a bit faster with `@avx` from LoopVectorization.jl:

```julia
using LoopVectorization

function func2(arr1, arr2)
    @avx for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end
```

The difference is

```
74.175 ms (0 allocations: 0 bytes) # Without @avx
 7.488 ms (0 allocations: 0 bytes) # With @avx
```

I don’t know for sure, but it seems plausible that the culprit is the `@views`. Although much cheaper than copying the data, creating views is not entirely free, and since in your case you create a view for every four FLOPs, the cost of creating all these views adds up.

I am not sure whether this issue can be fixed. Obviously, the two functions which you posted could be replaced with

```julia
func3(arr1, arr2) = (arr1 .= sum(arr2, dims=2))
```

but I am assuming the posted code is intended as a MWE for something more complicated. Maybe a similar trick could work there?
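For the incrementing semantics of `func1` specifically (every column of `arr1` gets the full column sum of `arr2` added to it), a hedged variant of the same trick is to hoist the sum out of the loops entirely; the name `func1b!` is my own:

```julia
# Precompute the 4×1 column sum of arr2 once, then broadcast-add it
# to every column of arr1. This replaces 100_000 × 256 tiny broadcasts
# with a single one, at the cost of one small allocation for `s`.
function func1b!(arr1, arr2)
    s = sum(arr2, dims=2)  # 4×1 matrix: sum over the columns of arr2
    arr1 .+= s             # broadcast s across all columns of arr1
    return arr1
end
```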

I don’t believe it is the views: I already tried a similar example, as you can see below, where even using views I get 0 allocations, as long as I don’t broadcast over the 4-vectors.

```julia
function func3(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            v1, v2 = view(arr1, :, row), view(arr2, :, i)
            for j in 1:4
                arr1[j, row] = v1[j] + v2[j]
            end
        end
    end
    return arr1
end

@btime func3($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.788 ms (0 allocations: 0 bytes)
```

It’s the combination of allocating views and broadcasting that’s the issue, I think. I’ll compile 1.5 and check.
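One hedged way to probe that hypothesis directly is to compare view creation alone against broadcasting over a view (the helper names are made up, and the reported numbers will depend on the Julia version):

```julia
# Compare allocations for a bare view versus a broadcast over a view.
viewonly(a)  = (v = view(a, :, 1); v[1])         # create + index a view
viewbcast(a) = (v = view(a, :, 1); v .= v .+ 1)  # create + broadcast over it

a = rand(4, 10)
viewonly(a); viewbcast(a)                        # warm up both
@allocated(viewonly(a)), @allocated(viewbcast(a))
```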

@baggiepinnen I’ll be sure to check `@avx` out - thank you!

@ettersi It is indeed a MWE - the real code is an inner calibration loop for the Murchison Widefield Array (a low-frequency radio telescope in Australia).

Julia v1.4.1:

```julia
julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  738.942 ms (76800000 allocations: 3.43 GiB)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  68.627 ms (0 allocations: 0 bytes)
```

master (of a couple of weeks ago):

```julia
julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  195.516 ms (0 allocations: 0 bytes)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  66.396 ms (0 allocations: 0 bytes)
```

@giordano That’s great to see they’ve magicked away the allocations - but now I wonder what causes the remaining slowdown between the two versions on 1.5-master?