Of course I’ve also tried wrapping the Julia dot call inside a function, as well as manually looping myself, e.g.:
function myfunc(arr)
    result = similar(arr)
    # @inbounds belongs inside the body; applied to the whole
    # function definition it has no effect on the loop
    @inbounds @simd for i in eachindex(arr)
        result[i] = exp(arr[i])
    end
    return result
end
@benchmark myfunc($arr)
# Time (median): 569.613 μs
…which gives timing identical to the dotted version, exp.(arr).
Any idea what’s going on here? Some obvious sanity checks: both are Float64, neither uses fastmath, both allocate a new array, and both are single-threaded.
I’m no good at reading LLVM code, so I cannot confirm, but it might be that Julia fails to SIMD the computation. If you use LoopVectorization.jl, it works:
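A minimal sketch of what that looks like (the function name myfunc_turbo is my own, and timings will of course vary by machine):

using LoopVectorization

function myfunc_turbo(arr)
    result = similar(arr)
    # @turbo takes over the whole loop and can substitute its own vectorized exp
    @turbo for i in eachindex(arr)
        result[i] = exp(arr[i])
    end
    return result
end

@benchmark myfunc_turbo($arr)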
Yeah, I think that’s it! I see near-identical timings between numpy and Julia with @turbo. @turbo also speeds up my own explicit function, even though that already included a @simd macro.
So I guess the next question is: why doesn’t the dot call automatically apply SIMD instructions? And why doesn’t my explicit function with a manual @simd work?
(I’ve checked whether this slowdown is attributable to the slow handling of tails, as mentioned on the LoopVectorization page, but passing in arrays without tails made no difference in my own testing.)
It’s a good question. I’m not able to figure it out, but I believe LoopVectorization.jl in some cases substitutes its own implementations of functions that are hard to SIMD. exp is probably one of them.
Julia is just using the generic scalar exp over your array, while numpy is free to use a special array version, perhaps much like LoopVectorization does. Making SIMD happen in Julia in this case apparently takes more than a simple @simd annotation.
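One rough way to check this without really being able to read LLVM IR (just my usual heuristic): dump the IR and look for vector types such as <4 x double>. If the loop vectorized, they will be there; for the exp loop above you will likely see only scalar calls to exp.

# Inspect the compiled loop; vector types in the output mean it SIMD’d
@code_llvm debuginfo=:none myfunc(rand(10^5))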
I just tried this, and I’m getting 402 μs in Julia 1.7.3 and 435 μs with numpy. I would have expected them to be completely equal, assuming they are doing the same thing. I’m surprised there’s a difference one way or the other.
That is probably not a good assumption.
I suspect numpy uses your system libm, while Julia uses implementations of the math functions written in Julia, following the same techniques libm authors use to balance speed and accuracy, but potentially ending up at a different point in that trade-off. Even if exp isn’t one of the functions we have tweaked and optimized, it would still be based on the OpenLibm definition, which is probably not the same libm your system ships.
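You can compare the two from within Julia. A sketch, with the caveat that a bare :exp in a ccall resolves against whatever libm happens to be loaded in the process (typically OpenLibm in a stock Julia build):

x = 1.2345
exp(x)                               # Julia’s native, pure-Julia implementation
ccall(:exp, Cdouble, (Cdouble,), x)  # the C exp loaded in the process; may differ in the last bit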
The speedup in Python is very dependent on how numpy was compiled. I have tested on various systems and seen the numpy code variably match the speed of the @turbo Julia version or of the plain Julia one. Evidently some builds make use of SIMD instructions and some do not.
That’s fine, but it doesn’t address the question I asked earlier:
Basically, the reason is that Julia can do this when it is able to inline the code, but generically turning scalar code into vector code purely in the compiler is a very complicated optimization.
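For contrast, when the loop body is plain arithmetic the compiler can inline everything, and a bare @simd is enough. A minimal sketch (mysquaresum is a made-up example, not from this thread):

function mysquaresum(arr)
    s = zero(eltype(arr))
    @inbounds @simd for i in eachindex(arr)
        s += arr[i] * arr[i]  # simple, fully inlinable arithmetic: this loop does vectorize
    end
    return s
end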
There is no reason to expect Julia to outperform Python on a single call like this. A significant difference in either direction is just low-hanging fruit for the other side (in this case, the Julia version failing to SIMD). This function is already optimized in Python, or rather in C or whatever language was used to write it under the hood. Julia’s performance benefits only really kick in when you are doing more complicated sequences of operations that lack a bespoke implementation.
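As a hypothetical example of such a sequence: a compound elementwise expression like the one below fuses into a single loop with a single output allocation in Julia, whereas numpy evaluates it as a chain of separate C kernels, allocating a temporary array for each intermediate result.

# One fused pass over arr; numpy would build temporaries for
# exp(arr), sin(arr), and their product along the way.
compound(arr) = @. exp(arr) * sin(arr) + 1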
One thing we could do that would make this better is a better interface for letting the compiler replace scalar code with custom, user-provided vector code. I’m not fully sure what that interface should look like, though.