Defining a function as scalar vs. fusing with two Refs: significant speed difference

I wonder why there is such a significant difference in performance between these two functions. The core functions are tt and tt1: in tt I write everything as scalars and call it with tt.(...), while in tt1 I use @. to broadcast the results.

Interestingly, the number of allocations and size is the same.

This speed difference does not happen if I only have one vector to wrap in Ref (i.e., if the functions don’t take Z as an argument). In my real application, X and Z are custom structs with many fields, so breaking them down into individual scalar arguments would not be practical.

using BenchmarkTools
function test(X,Z, Y)
    ret = Array{Float64,2}(undef,length(Y), 1000)
    for k = 1:1000
        ret[:,k] = tt.(Ref(X),Ref(Z),Y)
    end
    return ret
end
function test1(X,Z, Y)
    ret = Array{Float64,2}(undef, length(Y), 1000)
    for k = 1:1000
        ret[:,k] = tt1(X,Z,Y)
    end
    return ret
end
function tt(X, Z, Y)
    exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
end
function tt1(X, Z, Y)
    @. exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
end

Y=rand(10000);
@btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
@btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)

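LLVM should hoist X[1] ^ X[2] and Z[1] ^ Z[2] out of the broadcast loop in the test1 example, hence evaluating them only once per call to tt1. In test, by contrast, tt is called once per element of Y, so the powers are presumably recomputed every time.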
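To make the hoisting concrete, here is a rough hand-written equivalent of what the compiler is presumably doing inside tt1. The name tt1_manual is made up for illustration and does not appear in the posts above; treat it as a sketch of the idea, not as actual compiler output.

```julia
# Hypothetical hand-hoisted version of tt1 (name invented for illustration).
function tt1_manual(X, Z, Y)
    px = X[1]^X[2]   # does not depend on Y, so it is computed once per call
    pz = Z[1]^Z[2]   # same here
    out = similar(Y)
    for i in eachindex(Y)
        # exp is still evaluated for every element; only the two powers moved out
        out[i] = exp(X[1]*2 + px + pz) + Y[i]
    end
    return out
end
```

It returns the same values as tt1 for vector Y; the only difference is that the two powers are evaluated outside the loop explicitly instead of relying on the optimizer.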

Try marking tt with @inline to expose its contents to the compiler, using tuples instead of arrays for X and Z, and finally using broadcasted assignment .=:

julia> function test3(X,Z, Y)
           ret = Array{Float64,2}(undef,length(Y), 1000)
           for k = 1:1000
               ret[:,k] .= tt.(Ref(X),Ref(Z),Y)
           end
           return ret
       end
test3 (generic function with 1 method)

julia> @inline function tt(X, Z, Y)
           exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
       end
tt (generic function with 1 method)

julia> @btime test3((2.0, 5.0), (5.0, 10.0), $Y);
  56.652 ms (2 allocations: 76.29 MiB)

This is faster for me than test1:

julia> @btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
  1.093 s (4004 allocations: 152.66 MiB)

julia> @btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)
  66.859 ms (2004 allocations: 152.63 MiB)

Of course, it should also be possible to lift the exp part out of the loop, since its argument does not depend on Y, but the compiler doesn’t do that.
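If the compiler won't do it, the whole exp term can be pulled out by hand, since nothing inside it depends on Y. A minimal sketch, assuming the same X, Z, Y layout as in the examples above; invariant_term and test_hoisted are names I made up, and I have not benchmarked this, so no timings are claimed.

```julia
# Hypothetical helper (not from the thread): the part of tt that is
# independent of Y, computed from scalars only.
invariant_term(X, Z) = exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2])

function test_hoisted(X, Z, Y)
    ret = Array{Float64,2}(undef, length(Y), 1000)
    c = invariant_term(X, Z)      # evaluated once, outside every loop
    for k = 1:1000
        ret[:, k] .= c .+ Y       # fused in-place broadcast, no temporary vector
    end
    return ret
end
```

This works with X and Z as arrays, tuples, or anything else indexable, so it should carry over to custom structs as long as the Y-independent part can be computed up front.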
