More depth: Julia's "unit of compilation" is the function; each function is compiled the first time it is called. Every time a broadcast statement is encountered, Julia builds a new anonymous function, compiles it, and calls it (`f` in the underlying `broadcast` call is that anonymous function). So in global scope `@time` ends up measuring this compilation time, while inside a function it happens only on the first call. Even `@btime`'s scope seems to count as function scope here:
```julia
a = zeros(100)
```

```
0.017187 seconds (15.18 k allocations: 688.813 KiB)
0.012623 seconds (3.48 k allocations: 144.208 KiB)
0.000004 seconds (5 allocations: 1.031 KiB)
0.000004 seconds (6 allocations: 1.906 KiB)
1.142 μs (1 allocation: 896 bytes)
1.171 μs (2 allocations: 1.75 KiB)
```
You can see the compilation in the timing of the first call; the subsequent calls are too fast to be timed with `@time`, which is why `@btime` is used (it reports the minimum over a batch of runs). You can play with how this specific case scales, but you'll see the two forms are always pretty much the same, or fusion is faster.
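A minimal sketch of that comparison, assuming the `sin.(cos.(a))` example above (the helper names `fused`/`nonfused` are mine; `identity` is what breaks fusion, and `@allocated` from Base measures the allocation difference):

```julia
# Hypothetical helpers for the two forms being timed above.
fused(a)    = sin.(cos.(a))            # one fused kernel, one output array
nonfused(a) = sin.(identity(cos.(a)))  # identity blocks fusion: extra temporary

a = zeros(100)
@assert fused(a) == nonfused(a)   # same result either way

# Measure allocations inside functions to avoid global-scope overhead.
alloc_fused(a)    = @allocated sin.(cos.(a))
alloc_nonfused(a) = @allocated sin.(identity(cos.(a)))
alloc_fused(a); alloc_nonfused(a)   # warm up: first call includes compilation
println(alloc_fused(a))     # just the output array (~896 bytes for 100 Float64s)
println(alloc_nonfused(a))  # roughly twice that: output plus the cos.(a) temporary
```

The extra allocation matches the `1 allocation` vs `2 allocations` difference in the `@btime` lines above.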
But fusion really makes more sense when you have a pre-allocated output.
```julia
b .= sin.(cos.(a))            # fused: writes directly into b
b .= sin.(identity(cos.(a)))  # identity breaks fusion: cos.(a) allocates a temporary
a = rand(1000000000)
b = similar(a)
```
```
23.560 s (0 allocations: 0 bytes)
22.194 s (2 allocations: 7.45 GiB)
```
That said, the non-fusing form is surprisingly good here, so there may be some optimization going on. But in real code you will notice a difference, because the non-fusing form allocates 7.45 GiB, and in a real program that will cause the GC to be hit. In `@btime`, the GC runs outside of the function call, so it's not included in the timing.

Edit: this computation may be compute-bound enough that allocating the temporary vector just doesn't even matter.
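For anyone who wants to reproduce the in-place comparison without allocating 7.45 GiB, here is a sketch at a smaller size (helper names are mine; measuring inside a function keeps global-scope overhead out of the numbers):

```julia
# Measure allocations of the in-place fused vs non-fused broadcast.
meas_fused!(b, a)    = @allocated b .= sin.(cos.(a))
meas_nonfused!(b, a) = @allocated b .= sin.(identity(cos.(a)))

a = rand(1_000)
b = similar(a)

meas_fused!(b, a); meas_nonfused!(b, a)  # first calls include compilation
println(meas_fused!(b, a))      # expect 0: fully in-place, no temporaries
println(meas_nonfused!(b, a))   # expect ~8 KiB: the cos.(a) temporary
```

The ratio of temporary size to array size is the same as in the big example; only the absolute GC pressure changes.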