LLVM should hoist X[1] ^ X[2] and Z[1] ^ Z[2] out of the broadcast loop in the test1 example, so that each is evaluated only once per call to tt1.
Try marking tt with @inline to expose its body to the compiler, passing X and Z as tuples instead of arrays, and finally writing the result with the broadcasted assignment .=:
julia> function test3(X, Z, Y)
           ret = Array{Float64,2}(undef, length(Y), 1000)
           for k = 1:1000
               ret[:,k] .= tt.(Ref(X), Ref(Z), Y)
           end
           return ret
       end
test3 (generic function with 1 method)
julia> @inline function tt(X, Z, Y)
           exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2]) + Y
       end
tt (generic function with 1 method)
julia> @btime test3((2.0, 5.0), (5.0, 10.0), $Y);
56.652 ms (2 allocations: 76.29 MiB)
On my machine this is faster than test1 (and much faster than test), with half the allocated memory:
julia> @btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
1.093 s (4004 allocations: 152.66 MiB)
julia> @btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)
66.859 ms (2004 allocations: 152.63 MiB)
Of course, the whole exp(...) term is independent of Y as well, so it should also be possible to lift it out of the loop, but the compiler doesn't do that.
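If you don't want to rely on the compiler for this, the hoisting can be done by hand. Below is a minimal sketch, assuming the same tt formula as above; test4 is a name I made up for illustration, and I have not benchmarked it, so treat it as a starting point rather than a measured improvement.

```julia
# Hypothetical variant (the name test4 is not from the original post):
# the exp(...) term is computed once per call instead of once per element.
function test4(X, Z, Y)
    # Nothing inside exp(...) depends on Y, so evaluate it a single time.
    c = exp(X[1] * 2 + X[1]^X[2] + Z[1]^Z[2])
    ret = Array{Float64,2}(undef, length(Y), 1000)
    for k = 1:1000
        # Fused in-place broadcast: writes directly into column k of ret.
        ret[:, k] .= c .+ Y
    end
    return ret
end
```

It is called the same way as test3, e.g. test4((2.0, 5.0), (5.0, 10.0), Y). Since c is a plain scalar, the Ref wrappers are no longer needed in the broadcast.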