Why is broadcasting allocating so much memory?

torrance · May 21, 2020, 8:29am

In the below dummy code I show that doing the sum over the 4-vectors using broadcasting is an order of magnitude slower than iteration of the 4-vector manually. However, I don’t understand why - and whether I can keep using some form of broadcasting that is just as fast the manual version.

Any ideas?

using BenchmarkTools

function func1(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            @views @. arr1[:, row] = arr1[:, row] + arr2[:, i]
        end
    end
    return arr1
end

function func2(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end

arr1 = zeros(4, 100_000)
arr2 = rand(4, 256)

@btime func1($arr1, $arr2) evals=1 samples=10 seconds=100
# 653.253 ms (76800000 allocations: 3.43 GiB)

@btime func2($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.863 ms (0 allocations: 0 bytes)

Sukera · May 21, 2020, 8:56am

Just a guess, but it may be because of the created views. On 1.4.1, they still need allocations, but on 1.5 those are gone. I don’t have a build of 1.5 at hand, but if you try it out, that might explain the difference.

baggepinnen · May 21, 2020, 9:49am

Yeah it’s probably the views causing it. On a different note, I noticed that your second function got quite a bit faster by using @avx from LoopVectorization.jl:

function func2(arr1, arr2)
    @avx for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end

The difference is

74.175 ms (0 allocations: 0 bytes) # Without @avx
7.488 ms (0 allocations: 0 bytes) # With @avx

ettersi · May 21, 2020, 9:50am

I don’t know for sure, but it seems plausible that the culprit is the @views. Although much cheaper than copying the data, creating views is not entirely free, and since in your case you create a view for every four FLOP, the costs of creating all these views add up.

I am not sure whether this issue can be fixed. Obviously, the two functions which you posted could be replaced with

func3(arr1, arr2 ) = ( arr1 .= sum(arr2, dims=2) )

but I am assuming the posted code is intended as a MWE for something more complicated. Maybe a similar trick could work there?

torrance · May 21, 2020, 10:03am

I don’t believe it is the views, as I already tried a similar example as you can see below where even using views, so long as I don’t broadcast over the 4-vectors I get 0 allocations.

function func3(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            v1, v2 = view(arr1, :, row), view(arr2, :, i)
            for j in 1:4
                arr1[j, row] = v1[j] + v2[j]
            end
        end
    end
    return arr1
end

@btime func3($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.788 ms (0 allocations: 0 bytes)

Sukera · May 21, 2020, 10:04am

It’s the combination of allocating views and broadcasting that’s the issue, I think. I’ll compile 1.5 and check.

torrance · May 21, 2020, 10:06am

To reply to the other comments:

@baggiepinnen I’l be sure to check @avx out - thank you!

@ettersi It is indeed a MWE - the real code is a inner calibration loop for the Murchison Widefield Array (a low frequency radio telescope in Australia).

giordano · May 21, 2020, 10:16am

Julia v1.4.1:

julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  738.942 ms (76800000 allocations: 3.43 GiB)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  68.627 ms (0 allocations: 0 bytes)

master (of a couple of weeks ago):

julia> @btime func1($arr1, $arr2) evals=1 samples=10 seconds=100;
  195.516 ms (0 allocations: 0 bytes)

julia> @btime func2($arr1, $arr2) evals=1 samples=10 seconds=100;
  66.396 ms (0 allocations: 0 bytes)

torrance · May 21, 2020, 10:18am

@giordano That’s great to see they’ve magicked away the allocations - but now I wonder why the slowdown between the two versions on 1.5-master?

Topic		Replies	Views
Unexpected Memory Allocation of Broadcasting Copy Performance question	4	123	June 15, 2024
Memory allocation in broadcast assignment Performance question	3	857	July 24, 2020
Normal vs broadcasted slice assignment General Usage	5	256	February 16, 2024
Performance in broadcasting vs function preallocation? Performance question , performance , benchmarktools	6	100	February 23, 2025
Why is there a performance hit on broadcasting with OffsetArrays? Performance question	3	1029	December 15, 2019

Why is broadcasting allocating so much memory?

Related topics