fusing not faster

mvhulten · January 30, 2018, 3:32pm

I just read about fusing and broadcasting. I understood that fusing is faster than other approaches. But I have this code:

julia> a = rand(1, 10000)
1×10000 Array{Float64,2}:
 0.975847  0.427092  0.480224  …  0.469955  0.950595  0.628602

julia> n = 100
100

julia> dt = zeros(n);

julia> for i=1:n
           t0=time(); sin.(cos.(a)); dt[i]=time()-t0
       end

julia> mean(dt)
0.0006426692008972168

julia> for i=1:n
           t0=time(); sin.(identity(cos.(a))); dt[i]=time()-t0
       end

julia> mean(dt)
0.00042889833450317384

Apparently when I don’t fuse, by just adding the identity function in the middle, I get better results.

Since I’m not sure I understand everything, I have not reported a bug; I would rather ask these questions first:

Am I actually fusing with sin.(cos.(a)) and not with sin.(identity(cos.(a)))?
Are there simple examples that show that fusing is more efficient than not fusing?

ChrisRackauckas · January 30, 2018, 3:34pm

Don’t time in the global scope. Put it in a function. What you’re actually timing is that it takes longer to compile the fused function.
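
A minimal sketch of that advice, using hypothetical helper names `fused` and `unfused` that do not appear in the thread: wrap each expression in a function, call it once so that compilation has already happened, and only then time it.

```julia
# Hypothetical wrappers (names are assumptions, not from the original post).
fused(a)   = sin.(cos.(a))             # one fused loop over `a`
unfused(a) = sin.(identity(cos.(a)))   # `identity` breaks the fusion

a = rand(10_000)

# Warm-up calls: the first call to each function triggers compilation,
# which should not be part of the measurement.
fused(a)
unfused(a)

# Single-shot timings in seconds; noisy, but free of compilation time.
t_fused   = @elapsed fused(a)
t_unfused = @elapsed unfused(a)
println("fused: ", t_fused, " s, unfused: ", t_unfused, " s")
```

Single-shot timings like these are still noisy, so they are better suited to spotting compilation overhead than to comparing two similar implementations.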

tkoolen · January 30, 2018, 3:37pm

See also BenchmarkTools.jl (JuliaCI/BenchmarkTools.jl on GitHub), a benchmarking framework for the Julia language.

DNF · January 30, 2018, 4:09pm

You don’t need to put it in a function, but use BenchmarkTools:

julia> using BenchmarkTools

julia> a = rand(10000);  # this is more natural than rand(1, 10000)

julia> @btime sin.(cos.($a));
  243.721 μs (2 allocations: 78.20 KiB)

julia> @btime sin.(identity(cos.($a)));
  212.915 μs (4 allocations: 156.41 KiB)

Surprisingly, I find that not fusing is actually faster.
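
The allocation figures above (78.20 KiB versus 156.41 KiB) already suggest that only the first form is fused: the second one allocates a temporary array for cos.(a) in addition to the result. A rough way to check this without BenchmarkTools is sketched here, with hypothetical helper names `fused` and `unfused`; the exact byte counts depend on the Julia version and are not asserted.

```julia
# Hypothetical wrappers (names are assumptions, not from the thread).
fused(a)   = sin.(cos.(a))
unfused(a) = sin.(identity(cos.(a)))

a = rand(10_000)

# Warm up so that compilation-related allocations are not counted.
fused(a)
unfused(a)

# Bytes allocated by one call of each version.
bytes_fused   = @allocated fused(a)
bytes_unfused = @allocated unfused(a)

println("fused:   ", bytes_fused, " bytes")
println("unfused: ", bytes_unfused, " bytes")
```

If fusion works as expected, the unfused version should allocate roughly twice as much, since it materializes one extra array of the same size.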

Seif_Shebl · January 30, 2018, 4:09pm

If you time it properly in a function as others have stated, you can observe they are essentially the same:

function test(N)
    a = rand(1, N)
    n = 100
    dt = zeros(n)
    for i = 1:n
        t0 = time()
        sin.(cos.(a))
        dt[i] = time() - t0
    end
    println(mean(dt))

    for i = 1:n
        t0 = time()
        sin.(identity(cos.(a)))
        dt[i] = time() - t0
    end
    println(mean(dt))
end

[test(10000) for i in 1:3]

julia>
  0.00023000001907348634
  0.0002499985694885254
  0.00023000001907348634
  0.00023999929428100586
  0.00023000001907348634
  0.00023999929428100586

And using BenchmarkTools I can replicate the observation of @DNF:

using BenchmarkTools
function test(N)
    a = rand(1, N)
    @btime sin.(cos.($a))
    @btime sin.(identity(cos.($a)))
end

julia> test(10000)
  215.279 μs (2 allocations: 78.20 KiB)
  186.798 μs (4 allocations: 156.41 KiB)

kristoffer.carlsson · January 30, 2018, 4:29pm

There might be enough overhead in switching between two different functions for every element, compared with evaluating the same function over the whole array (computers are good at doing the same thing over and over), that the non-fused version beats the fused one here.

Also, FWIW, on 0.7 fused vs unfused:

julia> @btime fs($a);
  178.222 μs (2 allocations: 78.20 KiB)

julia> @btime f_unfused($a);
  185.016 μs (4 allocations: 156.41 KiB)

sin and cos are now implemented in Julia itself, so that might have changed things.

ChrisRackauckas · January 30, 2018, 4:43pm

More depth. Julia’s “unit of compilation” is a function: each function is compiled the first time it is called. Every time a broadcast statement is encountered, Julia creates a new anonymous function, compiles it, and calls broadcast(f,...) where f is that anonymous function. So in the global scope you are measuring this compilation time, while inside a function it happens only the first time. Even @btime’s scope seems

using BenchmarkTools

function f(a)
    sin.(cos.(a))
end

function f2(a)
    sin.(identity(cos.(a)))
end

a = zeros(100)

@time f(a)
@time f2(a)
@time f(a)
@time f2(a)

@btime f($a)
@btime f2($a)

gives me

  0.017187 seconds (15.18 k allocations: 688.813 KiB)
  0.012623 seconds (3.48 k allocations: 144.208 KiB)
  0.000004 seconds (5 allocations: 1.031 KiB)
  0.000004 seconds (6 allocations: 1.906 KiB)
  1.142 μs (1 allocation: 896 bytes)
  1.171 μs (2 allocations: 1.75 KiB)

You can see the compilation in the timing of the first calls. The subsequent calls are too fast to be timed with @time, so @btime is used (it reports the minimum over many runs). You can experiment with how this specific case scales with the array size, but you’ll see that the two are always pretty much the same or fusion is faster.
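
The lowering described at the start of this post can be sketched by writing the broadcast calls out by hand. This is a conceptual illustration only: later Julia versions lower dot calls through lazy `Broadcast.broadcasted` objects rather than a literal anonymous function, but the result is still a single loop for the fused form.

```julia
a = rand(100)

# Fused: one anonymous function applied in a single pass, one output array.
fused_by_hand = broadcast(x -> sin(cos(x)), a)

# Unfused: two passes, with a temporary array in between.
tmp = broadcast(cos, a)
unfused_by_hand = broadcast(sin, tmp)

# Both hand-written forms agree with the dot-call syntax.
println(fused_by_hand ≈ sin.(cos.(a)))
println(unfused_by_hand ≈ sin.(identity(cos.(a))))
```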

But fusion really makes more sense when you have pre-allocated output.

using BenchmarkTools

function f(b,a)
    b .= sin.(cos.(a))
end

function f2(b,a)
    b .= sin.(identity(cos.(a)))
end

a = rand(1000000000)
b = similar(a)

@btime f($b,$a)
@btime f2($b,$a)


  23.560 s (0 allocations: 0 bytes)
  22.194 s (2 allocations: 7.45 GiB)

That said, the non-fusing form is surprisingly good here so there may be some optimization going on.

But in real code you will notice a difference, because the non-fusing form allocates 7.45 GiB. In a real program, that will trigger the GC. With @btime, garbage collection runs outside the timed function call, so it does not show up in the timing.

Edit: this computation may be compute bound enough that allocating the vector just doesn’t even matter.
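
The allocation difference in the pre-allocated case can be reproduced on a much smaller array. This is a sketch with hypothetical names `fused!` and `unfused!` and an arbitrary array size; the absolute numbers depend on the Julia version, so only the comparison is meaningful.

```julia
# Hypothetical in-place wrappers (names are assumptions, not from the post).
fused!(b, a)   = (b .= sin.(cos.(a)); b)            # writes straight into b
unfused!(b, a) = (b .= sin.(identity(cos.(a))); b)  # builds a temporary first

a = rand(10_000)
b = similar(a)

# Warm up so compilation is excluded from the allocation counts.
fused!(b, a)
unfused!(b, a)

alloc_fused   = @allocated fused!(b, a)
alloc_unfused = @allocated unfused!(b, a)

println("fused!:   ", alloc_fused, " bytes")
println("unfused!: ", alloc_unfused, " bytes")
```

The fused in-place form should allocate little or nothing, while the unfused form still has to allocate the temporary for cos.(a), which is what eventually puts pressure on the GC in a longer-running program.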

mvhulten · January 30, 2018, 5:20pm

Thank you all for these useful explanations!

If I use @btime from BenchmarkTools, I indeed get very small differences between fusing or not, very similar to those in your examples.
