Speed of broadcast operations


       M, N = 10000, 10000
       a = randn((M, N));
       ssa = sum(a);
       sa=sum(a, dims=1);
       @btime a .- ssa .- sa;
       @btime a .- sa .- ssa; 
       @btime a .- (sa .+ ssa);
  287.707 ms (6 allocations: 762.94 MiB)
  283.096 ms (6 allocations: 762.94 MiB)
  279.157 ms (6 allocations: 762.94 MiB)

Why are all the three operations above running almost in the same time? I would have expected the last one to run in about half the time as the other two; given it has only 1e8 subtractions (and 1e4 additions) as opposed to the first two with 2e8 subtractions.

Your operations (elementwise addition/subtraction) are very cheap, so your runtime will be dominated by the memory traffic needed to fetch each element of a from main memory. For reference, on my machine,

julia> @btime $a .- $ssa .- $sa;
  168.593 ms (2 allocations: 762.94 MiB)

julia> @btime [identity(ai) for ai in a];
  132.609 ms (3 allocations: 762.94 MiB)

… it takes almost as much time to do nothing as to do all those operations when the array is so large and the operations are so cheap. The difference probably just amounts to register-shuffling overhead.


You can check how many operations are performed by explicitly counting:

julia> cnt(x) = (global CNT += 1; x);

julia> CNT=0; cnt.(ones(3) .+ ones(5)'); CNT

julia> CNT=0; cnt.(ones(3)) .+ ones(5)'; CNT  # fused

julia> CNT=0; begin
         col = cnt.(ones(3))  # not fused
         col .+ ones(5)'
       end; CNT

Maybe you should think of this fused broadcast as a function (x,y) -> cnt(x) + y which is mapped over the whole iteration space. For a cheap function that’s a great idea, for an expensive one it isn’t always, and allocating an intermediate like col may be worthwhile:

julia> z = similar(a);  # with a, sa as above

julia> @btime $z .= exp.($a) .+ $sa;  # N^2 exp calls
  696.363 ms (0 allocations: 0 bytes)

julia> @btime $z .= $a .+ exp.($sa);  # fused, still N^2 exp calls
  698.302 ms (0 allocations: 0 bytes)

julia> @btime $z .= $a .+ begin exp.($sa) end; # not fused, N exp calls
  127.645 ms (2 allocations: 78.20 KiB)

Wow this is really surprising. Is it documented somewhere, the idea of ‘fusing’?
So would you recommend using intermediate variables while broadcasting?
I still do not understand my original numbers with simple subtractions! :frowning_face:

https://julialang.org/blog/2017/01/moredots/ is the best explanation of why fusing is awesome.