Defining function as scalar vs fusing with two Ref's, significant speed difference

LLVM should be able to hoist X[1]^X[2] and Z[1]^Z[2] out of the broadcast loop in the test1 example, so they are evaluated only once per call to tt1.

Try marking tt with @inline to expose its contents to the compiler, passing tuples instead of arrays for X and Z, and finally using broadcasted assignment (.=):

julia> function test3(X,Z, Y)
           ret = Array{Float64,2}(undef,length(Y), 1000)
           for k = 1:1000
               ret[:,k] .= tt.(Ref(X),Ref(Z),Y)
           end
           return ret
       end
test3 (generic function with 1 method)

julia> @inline function tt(X, Z, Y)
           exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2]) + Y
       end
tt (generic function with 1 method)

julia> @btime test3((2.0, 5.0), (5.0, 10.0), $Y);
  56.652 ms (2 allocations: 76.29 MiB)

This is faster for me than test1:

julia> @btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
  1.093 s (4004 allocations: 152.66 MiB)

julia> @btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)
  66.859 ms (2004 allocations: 152.63 MiB)

Of course, it should also be possible to lift the whole exp(...) term out of the loop, since it does not depend on Y, but the compiler doesn’t do that.
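If you don't want to rely on the compiler here, you can hoist the exp term by hand. The following is only a minimal sketch: test4 is a hypothetical name that does not appear in the thread, and it assumes the same shapes as test3 (X and Z indexable with at least two elements, Y a vector of Float64). I haven't benchmarked it, so treat any speedup as something to measure yourself.

```julia
# Hypothetical variant of test3 (the name test4 is an assumption, not from the thread).
# The exp(...) term depends only on X and Z, so it is computed once per call
# instead of once per element of Y.
function test4(X, Z, Y)
    ret = Array{Float64,2}(undef, length(Y), 1000)
    c = exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2])  # independent of Y and k
    for k in 1:1000
        ret[:, k] .= c .+ Y                  # fused, in-place write into column k
    end
    return ret
end

# Small inputs chosen so the exponent stays finite: 1*2 + 1^2 + 1^2 = 4.
r = test4((1.0, 2.0), (1.0, 2.0), [0.0, 1.0])
```

Note that with the inputs used in the benchmarks above the exponent is 4 + 2^5 + 5^10 = 9765661, so exp overflows to Inf in Float64. That doesn't affect the timings being compared, but it is why the check here uses smaller values.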
