Speed of broadcast operations

rakeshvar · December 20, 2020, 4:42am

       using BenchmarkTools
       M, N = 10000, 10000
       a = randn((M, N));
       ssa = sum(a);
       sa = sum(a, dims=1);
       @btime a .- ssa .- sa;
       @btime a .- sa .- ssa;
       @btime a .- (sa .+ ssa);
  287.707 ms (6 allocations: 762.94 MiB)
  283.096 ms (6 allocations: 762.94 MiB)
  279.157 ms (6 allocations: 762.94 MiB)

Why do all three operations above run in almost the same time? I would have expected the last one to take about half as long as the other two, given that it has only 1e8 subtractions (and 1e4 additions), as opposed to 2e8 subtractions in the first two.

stillyslalom · December 20, 2020, 4:54am

Your operations (elementwise addition/subtraction) are very cheap, so your runtime will be dominated by the memory traffic needed to fetch each element of a from main memory. For reference, on my machine,

julia> @btime $a .- $ssa .- $sa;
  168.593 ms (2 allocations: 762.94 MiB)

julia> @btime [identity(ai) for ai in a];
  132.609 ms (3 allocations: 762.94 MiB)

… it takes almost as much time to do nothing as to do all those operations when the array is so large and the operations are so cheap. The difference probably just amounts to register-shuffling overhead.
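As a rough back-of-the-envelope check of the memory-traffic argument, the sketch below computes the size of one 10000×10000 Float64 array and the bandwidth implied by the 168.593 ms timing quoted above. It assumes the only significant traffic is one read of a and one write of the result (sa is tiny by comparison); the variable names are just for illustration.

```julia
# Size of one M×N Float64 array, matching the "762.94 MiB" in the @btime output.
M, N = 10_000, 10_000
bytes_per_array = M * N * sizeof(Float64)
mib = bytes_per_array / 2^20

# Assumption: the broadcast reads `a` once and writes the result once.
traffic = 2 * bytes_per_array

# Timing taken from the benchmark quoted above (machine dependent).
elapsed = 0.168593                        # seconds
bandwidth_gbs = traffic / elapsed / 1e9   # GB/s

println("one array: ", round(mib, digits=2), " MiB")
println("implied bandwidth: ", round(bandwidth_gbs, digits=1), " GB/s")
```

A figure of several GB/s is the order of magnitude a single core typically gets from main memory, which fits the picture that the arithmetic itself is close to free here.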

mcabbott · December 20, 2020, 12:59pm

You can check how many operations are performed by explicitly counting:

julia> cnt(x) = (global CNT += 1; x);

julia> CNT=0; cnt.(ones(3) .+ ones(5)'); CNT
15

julia> CNT=0; cnt.(ones(3)) .+ ones(5)'; CNT  # fused
15

julia> CNT=0; begin
         col = cnt.(ones(3))  # not fused
         col .+ ones(5)'
       end; CNT
3

One way to think of this fused broadcast is as a function (x,y) -> cnt(x) + y which is mapped over the whole iteration space. For a cheap function that’s a great idea. For an expensive one it isn’t always, and allocating an intermediate like col may be worthwhile:

julia> z = similar(a);  # with a, sa as above

julia> @btime $z .= exp.($a) .+ $sa;  # N^2 exp calls
  696.363 ms (0 allocations: 0 bytes)

julia> @btime $z .= $a .+ exp.($sa);  # fused, still N^2 exp calls
  698.302 ms (0 allocations: 0 bytes)

julia> @btime $z .= $a .+ begin exp.($sa) end; # not fused, N exp calls
  127.645 ms (2 allocations: 78.20 KiB)
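The "function mapped over the whole iteration space" picture can be written out as an explicit loop. This is only a sketch of what the fused broadcast is roughly equivalent to, not how Base.Broadcast is actually implemented; counted and fused_loop are hypothetical names, and a Ref is used instead of a global counter.

```julia
# Call counter; a Ref avoids reassigning a global inside the function.
calls = Ref(0)
counted(x) = (calls[] += 1; x)

# Hand-written equivalent of `counted.(col) .+ row'`:
# the function (x, y) -> counted(x) + y is applied at every (i, j).
function fused_loop(col::AbstractVector, row::AbstractVector)
    out = Matrix{Float64}(undef, length(col), length(row))
    for j in eachindex(row), i in eachindex(col)   # i is the inner loop
        out[i, j] = counted(col[i]) + row[j]
    end
    return out
end

col, row = ones(3), ones(5)

calls[] = 0
looped = fused_loop(col, row)
loop_calls = calls[]          # one call per output element

calls[] = 0
fused = counted.(col) .+ row'
fused_calls = calls[]         # same count as the loop

println((loop_calls, fused_calls, looped == fused))
```

Both versions call counted once per element of the 3×5 result, which is why materializing the small intermediate first pays off when the function is expensive.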

rakeshvar · December 22, 2020, 3:26am

Wow this is really surprising. Is it documented somewhere, the idea of ‘fusing’?
So would you recommend using intermediate variables while broadcasting?
I still do not understand my original numbers with simple subtractions!

Oscar_Smith · December 22, 2020, 5:36am

More Dots: Syntactic Loop Fusion in Julia is the best explanation of why fusing is awesome.
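Applied to the original question, the same counting trick suggests why a .- (sa .+ ssa) does not save any arithmetic: the parentheses do not stop fusion, so the inner addition is still evaluated once per output element. The sketch below uses a small array and a hypothetical counted_add helper; since the full-size case is memory bound anyway, hoisting the addition would not be expected to change those timings much.

```julia
# Counter for the inner addition; names here are illustrative only.
adds = Ref(0)
counted_add(x, y) = (adds[] += 1; x + y)

M, N = 4, 6
a = randn(M, N)
ssa = sum(a)            # scalar
sa = sum(a, dims=1)     # 1×N row

# Nested dot calls fuse: the addition runs for every element of the M×N result.
adds[] = 0
r_fused = a .- counted_add.(sa, ssa)
fused_adds = adds[]

# Materializing the 1×N intermediate first breaks the fusion.
adds[] = 0
shift = counted_add.(sa, ssa)
r_hoisted = a .- shift
hoisted_adds = adds[]

println((fused_adds, hoisted_adds, r_fused == r_hoisted))
```

With these sizes the fused form performs M*N = 24 additions and the hoisted form N = 6, while both give the same result.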
