```
M, N = 10000, 10000
a = randn((M, N));
ssa = sum(a);
sa=sum(a, dims=1);
@btime a .- ssa .- sa;
@btime a .- sa .- ssa;
@btime a .- (sa .+ ssa);
287.707 ms (6 allocations: 762.94 MiB)
283.096 ms (6 allocations: 762.94 MiB)
279.157 ms (6 allocations: 762.94 MiB)
```

Why are all the three operations above running almost in the same time? I would have expected the last one to run in about half the time as the other two; given it has only `1e8`

subtractions (and `1e4`

additions) as opposed to the first two with `2e8`

subtractions.

Your operations (elementwise addition/subtraction) are very cheap, so your runtime will be dominated by the memory traffic needed to fetch each element of `a`

from main memory. For reference, on my machine,

```
julia> @btime $a .- $ssa .- $sa;
168.593 ms (2 allocations: 762.94 MiB)
julia> @btime [identity(ai) for ai in a];
132.609 ms (3 allocations: 762.94 MiB)
```

… it takes almost as much time to do *nothing* as to do all those operations when the array is so large and the operations are so cheap. The difference probably just amounts to register-shuffling overhead.

8 Likes

You can check how many operations are performed by explicitly counting:

```
julia> cnt(x) = (global CNT += 1; x);
julia> CNT=0; cnt.(ones(3) .+ ones(5)'); CNT
15
julia> CNT=0; cnt.(ones(3)) .+ ones(5)'; CNT # fused
15
julia> CNT=0; begin
col = cnt.(ones(3)) # not fused
col .+ ones(5)'
end; CNT
3
```

Maybe you should think of this fused broadcast as a function `(x,y) -> cnt(x) + y`

which is mapped over the whole iteration space. For a cheap function that’s a great idea, for an expensive one it isn’t always, and allocating an intermediate like `col`

may be worthwhile:

```
julia> z = similar(a); # with a, sa as above
julia> @btime $z .= exp.($a) .+ $sa; # N^2 exp calls
696.363 ms (0 allocations: 0 bytes)
julia> @btime $z .= $a .+ exp.($sa); # fused, still N^2 exp calls
698.302 ms (0 allocations: 0 bytes)
julia> @btime $z .= $a .+ begin exp.($sa) end; # not fused, N exp calls
127.645 ms (2 allocations: 78.20 KiB)
```

5 Likes

Wow this is really surprising. Is it documented somewhere, the idea of ‘fusing’?

So would you recommend using intermediate variables while broadcasting?

I still do not understand my original numbers with simple subtractions!

https://julialang.org/blog/2017/01/moredots/ is the best explanation of why fusing is awesome.

4 Likes