M, N = 10000, 10000
a = randn((M, N));
ssa = sum(a);
sa=sum(a, dims=1);
@btime a .- ssa .- sa;
@btime a .- sa .- ssa;
@btime a .- (sa .+ ssa);
287.707 ms (6 allocations: 762.94 MiB)
283.096 ms (6 allocations: 762.94 MiB)
279.157 ms (6 allocations: 762.94 MiB)
Why are all the three operations above running almost in the same time? I would have expected the last one to run in about half the time as the other two; given it has only 1e8
subtractions (and 1e4
additions) as opposed to the first two with 2e8
subtractions.
Your operations (elementwise addition/subtraction) are very cheap, so your runtime will be dominated by the memory traffic needed to fetch each element of a
from main memory. For reference, on my machine,
julia> @btime $a .- $ssa .- $sa;
168.593 ms (2 allocations: 762.94 MiB)
julia> @btime [identity(ai) for ai in a];
132.609 ms (3 allocations: 762.94 MiB)
… it takes almost as much time to do nothing as to do all those operations when the array is so large and the operations are so cheap. The difference probably just amounts to register-shuffling overhead.
8 Likes
You can check how many operations are performed by explicitly counting:
julia> cnt(x) = (global CNT += 1; x);
julia> CNT=0; cnt.(ones(3) .+ ones(5)'); CNT
15
julia> CNT=0; cnt.(ones(3)) .+ ones(5)'; CNT # fused
15
julia> CNT=0; begin
col = cnt.(ones(3)) # not fused
col .+ ones(5)'
end; CNT
3
Maybe you should think of this fused broadcast as a function (x,y) -> cnt(x) + y
which is mapped over the whole iteration space. For a cheap function that’s a great idea, for an expensive one it isn’t always, and allocating an intermediate like col
may be worthwhile:
julia> z = similar(a); # with a, sa as above
julia> @btime $z .= exp.($a) .+ $sa; # N^2 exp calls
696.363 ms (0 allocations: 0 bytes)
julia> @btime $z .= $a .+ exp.($sa); # fused, still N^2 exp calls
698.302 ms (0 allocations: 0 bytes)
julia> @btime $z .= $a .+ begin exp.($sa) end; # not fused, N exp calls
127.645 ms (2 allocations: 78.20 KiB)
5 Likes
Wow this is really surprising. Is it documented somewhere, the idea of ‘fusing’?
So would you recommend using intermediate variables while broadcasting?
I still do not understand my original numbers with simple subtractions!
More Dots: Syntactic Loop Fusion in Julia is the best explanation of why fusing is awesome.
4 Likes