fusing not faster



I just read about fusing and broadcasting. I understood that fusing is faster than other approaches. But I have this code:

julia> a = rand(1, 10000)
1×10000 Array{Float64,2}:
 0.975847  0.427092  0.480224  …  0.469955  0.950595  0.628602

julia> n = 100

julia> dt = zeros(n);

julia> for i=1:n
           t0=time(); sin.(cos.(a)); dt[i]=time()-t0
       end

julia> mean(dt)

julia> for i=1:n
           t0=time(); sin.(identity(cos.(a))); dt[i]=time()-t0
       end

julia> mean(dt)

Apparently when I don’t fuse, by just adding the identity function in the middle, I get better results.

Since I’m not sure I understand everything, I have not reported a bug; I would rather ask these questions first:

  • Am I actually fusing with sin.(cos.(a)) and not with sin.(identity(cos.(a)))?
  • Are there simple examples that would show that fusing is more efficient than not fusing?


Don’t time in the global scope. Put it in a function. What you’re actually timing is that it takes longer to compile the fused function.


See also https://github.com/JuliaCI/BenchmarkTools.jl#quick-start.


You don’t need to put it in a function, but use BenchmarkTools:

julia> using BenchmarkTools

julia> a = rand(10000);  # this is more natural than rand(1, 10000)

julia> @btime sin.(cos.($a));
  243.721 μs (2 allocations: 78.20 KiB)

julia> @btime sin.(identity(cos.($a)));
  212.915 μs (4 allocations: 156.41 KiB)

Surprisingly, I find that not fusing is actually faster.


If you time it properly in a function as others have stated, you can observe they are essentially the same:

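function test(N)
	a = rand(1, N)
	n = 100
	dt = zeros(n)
	# Fused version
	for i = 1:n
		t0 = time()
		sin.(cos.(a))
		dt[i] = time() - t0
	end
	fused = mean(dt)
	# Unfused version: identity forces a temporary array
	for i = 1:n
		t0 = time()
		sin.(identity(cos.(a)))
		dt[i] = time() - t0
	end
	unfused = mean(dt)
	return fused, unfused   # mean needs `using Statistics` on Julia 0.7 and later
end

[test(10000) for i in 1:3]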


And using BenchmarkTools I can replicate the observation of @DNF:

using BenchmarkTools
function test(N)
	a = rand(1, N)
	@btime sin.(cos.($a))
	@btime sin.(identity(cos.($a)))
end

julia> test(10000)
  215.279 μs (2 allocations: 78.20 KiB)
  186.798 μs (4 allocations: 156.41 KiB)


There might be enough overhead in switching between two different functions for every element, compared with evaluating the same function over the whole array (computers are good at doing the same thing over and over), that the non-fused version beats the fused one here.

Also, FWIW, on 0.7 fused vs unfused:

julia> @btime fs($a);
  178.222 μs (2 allocations: 78.20 KiB)

julia> @btime f_unfused($a);
  185.016 μs (4 allocations: 156.41 KiB)

sin and cos are now implemented in Julia itself, so that might have changed things.


In more depth: Julia’s “unit of compilation” is a function, and a function is compiled the first time it is called. Every time a broadcast statement is encountered, Julia builds a new anonymous function, compiles it, and calls broadcast(f,...) where f is that anonymous function. So in the global scope you are going to be measuring this compilation time, while inside a function it happens only on the first call. Even @btime's scope seems

using BenchmarkTools

function f(a)
    sin.(cos.(a))              # fused
end

function f2(a)
    sin.(identity(cos.(a)))    # not fused
end

a = zeros(100)

@time f(a)
@time f2(a)
@time f(a)
@time f2(a)

@btime f($a)
@btime f2($a)

gives me

  0.017187 seconds (15.18 k allocations: 688.813 KiB)
  0.012623 seconds (3.48 k allocations: 144.208 KiB)
  0.000004 seconds (5 allocations: 1.031 KiB)
  0.000004 seconds (6 allocations: 1.906 KiB)
  1.142 μs (1 allocation: 896 bytes)
  1.171 μs (2 allocations: 1.75 KiB)

You can see the compilation in the timing in the first call, and the subsequent calls are too fast to be timed with @time so @btime is used (uses the minimum over a bunch of runs). You can play with seeing how this specific case scales, but you’ll see they are always pretty much the same or fusion is faster.
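If you want to play with the scaling without installing anything, here is a rough sketch that uses only Base, assuming Julia 1.x. The helper names mintime and compare_sizes are made up for illustration, and taking the minimum of @elapsed over a few runs is only a crude stand-in for what @btime does, so treat the numbers as indicative.

```julia
fused(a)   = sin.(cos.(a))            # one fused loop, one output array
unfused(a) = sin.(identity(cos.(a)))  # identity breaks fusion: one temporary plus the output

# Hypothetical helper: minimum wall-clock time over `runs` calls,
# after one warm-up call so that compilation is not measured.
function mintime(f, a; runs = 20)
    f(a)
    return minimum(@elapsed(f(a)) for _ in 1:runs)
end

# Hypothetical helper: time both versions for each array length in `sizes`.
function compare_sizes(sizes)
    return map(collect(sizes)) do N
        a = rand(N)
        (N = N, fused = mintime(fused, a), unfused = mintime(unfused, a))
    end
end

for r in compare_sizes((10^2, 10^4, 10^6))
    println("N = ", r.N, ": fused ", r.fused, " s, unfused ", r.unfused, " s")
end
```

Because the timing lives inside functions and each one is called once before measuring, the compilation cost discussed above should stay out of the results.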

But fusion really makes more sense when you have pre-allocated output.

using BenchmarkTools

function f(b,a)
    b .= sin.(cos.(a))
end

function f2(b,a)
    b .= sin.(identity(cos.(a)))
end

a = rand(1000000000)
b = similar(a)

@btime f($b,$a)
@btime f2($b,$a)

  23.560 s (0 allocations: 0 bytes)
  22.194 s (2 allocations: 7.45 GiB)

That said, the non-fusing form is surprisingly good here so there may be some optimization going on.

But in real code you will notice a difference, because the non-fusing form allocates 7.45 GiB, and that puts pressure on the GC. @btime runs the GC outside of the timed function call, so that cost is not in the timing.

Edit: this computation may be compute bound enough that allocating the vector just doesn’t even matter.
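To see the allocation difference directly on a smaller array, something like the sketch below should work, assuming Julia 1.x. The names fused!, unfused! and measure_allocations are made up for illustration. The measurement is wrapped in a function and both versions are called once beforehand, so that compilation does not get counted.

```julia
# Fused and in place: a single loop writes straight into `b`.
function fused!(b, a)
    b .= sin.(cos.(a))
    return b
end

# Not fused: `identity` is an ordinary call, so `cos.(a)` is first
# materialised into a temporary array before `sin.` is applied.
function unfused!(b, a)
    b .= sin.(identity(cos.(a)))
    return b
end

# Hypothetical helper: bytes allocated by each version after a warm-up call.
function measure_allocations(a)
    b = similar(a)
    fused!(b, a)      # warm-up, triggers compilation
    unfused!(b, a)    # warm-up, triggers compilation
    bytes_fused = @allocated fused!(b, a)
    bytes_unfused = @allocated unfused!(b, a)
    return bytes_fused, bytes_unfused
end

a = rand(10_000)
bytes_fused, bytes_unfused = measure_allocations(a)
println("fused:   ", bytes_fused, " bytes")
println("unfused: ", bytes_unfused, " bytes (input is ", sizeof(a), " bytes)")
```

The unfused version should report at least the size of one temporary array, which is the memory the GC has to clean up in a longer-running program.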


Thank you all for these useful explanations!

If I use @btime from BenchmarkTools, I indeed get very small differences between fusing or not, very similar to those in your examples.