Intel Core i7-6500U.
That's incorrect. For example, check out: @code_llvm (^).(arr, 2). As @Oscar_Smith also said above, the broadcasting loop is being run in the context of the @simd macro.
Looking further into it, it seems @simd is not compatible with exp (or sin and cos), and LoopVectorization is doing some magic to make these functions compatible.
I wonder if there are any plans to upstream these changes into base Julia so that @simd works with these standard math functions?
Already done, see the previous example and linked PR.
This isn't explicit optimization. What's happening here is that for simple functions (ones small enough to be inlined), the compiler is able to automatically figure out vectorized versions. For a call like exp.(v) that is a much harder optimization to make, and getting optimal performance would definitely require a handwritten vectorized exp function (like the one in LoopVectorization).
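To illustrate the inlining point with a sketch (the function name `square` is my own example, not from the thread): a tiny function broadcast over an array inlines into the loop, so LLVM's auto-vectorizer can handle it on its own, whereas `exp.(v)` needs a hand-written SIMD kernel to go fast.

```julia
# A tiny function like this inlines into the broadcast loop,
# so LLVM's auto-vectorizer can emit SIMD for it by itself.
square(x) = x * x

v = rand(Float64, 1024)
out = square.(v)

# To see the vector instructions, inspect the IR in the REPL:
#   @code_llvm debuginfo=:none broadcast(square, v)
@assert out == v .* v
```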
Wow, yes actually - it's about 4x faster. Still not quite the 6x speedup of @turbo, but a big difference. And I think (?) I can see SIMD instructions in the LLVM output.
It doesn't appear this has also been applied to sin() or cos(), though.
Does any language compile sin and cos to SIMD by default? (Maybe Julia can, or could be made to, and should do that, but only when @simd is applied, even if via intrinsics/functions.) It's unclear whether this is only available with the Intel compiler or done by default, since it's not always better:
there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used.
This is called the Intel Short Vector Math Library, SVML (intrinsics):
for 128bit SSE please use (double precision): _mm_sin_pd
for 256bit AVX please use (double precision): _mm256_sin_pd
The two intrinsics are actually very small functions consisting of hand-written SSE/AVX assembly, and now you can process 4 sine calculations at once using AVX :=) the latency is about ~10 clock cycles (if I remember correctly) on a Haswell CPU.
By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if there are only a few sin calls to evaluate, it's better to use plain sin() instead.

In this post I will look into how someone can implement Sin/Cos functions with SIMD in .NET 5. .NET 5 does not have a wrapper for SIMD trigonometric functions, so it seems a good exercise to implement one.
[…]
Agner Fog's VCL has some math functions like exp and log. (Formerly GPL licensed, now Apache.)
https://github.com/microsoft/DirectXMath (MIT license) - I think it's portable to non-Windows, and doesn't require DirectX. […]
There is also SLEEF. It supports ARM as well, so it is great. But from my experience, it is less performant than Intel SVML.
I think SVML is now, to some degree, embodied in Clang and also in GCC.
Yet for sure it is available for free with the Intel oneAPI package.
Related:

JuliaSIMD/SLEEFPirates.jl - pirated SLEEF port (GitHub).
It's been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of exp; this is why it's faster, not because it's optimizing the Base Julia exp.
It is likely that NumPy has a similar vectorized implementation built in, which is why it is faster.
@Oscar_Smith
Julia's exp regressed recently.
Earlier, you could use call-site inlining, via @inline exp(x[i]), and it'd vectorize.
This no longer works because pow_body does not inline, and we need the entire function to inline.
What do you mean? exp doesn't call pow_body…
It uses the same basic algorithm, but has a slightly higher error tolerance (around 1.5 ULP IIRC, instead of 0.5 for Base?).
It does also have some special optimizations for AVX512 that the compiler won't be able to figure out on its own.
But this is a problem with Julia's compiler.
Julia cannot vectorize code unless it is entirely inlined.
Julia's inliner also does not consider the possibility of vectorization.
I have a non-inlined call to pow_body when looking at the typed code.
Do you mean for floating point powers?
Yes
%82 = <constprop> pow_body(::Core.Const(2.0), ::Core.Const(-53))::Float64
The constant 2.0^-53 prevents SIMD.
Please do not rely on the compiler any more than necessary…
But I'll rebuild the latest Julia master. It might be fixed with the effects-analysis improvements.
If so, that's really dumb because it totally could just be 0x1p-53
Yeah. It seems silly that a literal constant like 1e10 vs 10.0^10 or 0x1p-53 vs 2^-53 makes a big difference.
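To make the distinction concrete (a small sketch using only standard Julia): the hex-float literal and the runtime power produce the exact same Float64, so only the spelling differs - the literal is a constant the parser folds directly, while 2.0^-53 is a call to ^ that, as noted above, may land in a non-inlined pow_body.

```julia
# 0x1p-53 is a hex floating-point literal: 1.0 * 2^-53, parsed at compile time.
a = 0x1p-53
# 2.0^-53 computes the same value, but through a runtime call to ^.
b = 2.0^-53

# Both are exactly half the machine epsilon of Float64 (eps = 2^-52).
@assert a == b == eps(Float64) / 2
```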
But it's been fixed on the latest master already.
Actually, even without @inline things look really good on the latest Julia master!
julia> using LoopVectorization, BenchmarkTools

julia> function myfunc(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = exp(arr[i])
           end
           return result
       end
myfunc (generic function with 1 method)

julia> function myfunc2(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = @inline exp(arr[i])
           end
           return result
       end
myfunc2 (generic function with 1 method)

julia> function myfunc3(arr)
           result = similar(arr)
           @turbo for i in eachindex(arr)
               result[i] = exp(arr[i])
           end
           return result
       end
myfunc3 (generic function with 1 method)

julia> arr = rand(100_000);

julia> @btime myfunc($arr);
  76.586 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc2($arr);
  65.682 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc3($arr);
  67.174 μs (2 allocations: 781.30 KiB)
It'd be great if we could have regression tests to make sure it does not regress.
I'm not sure why myfunc is slower than myfunc2; I'd have to spend more time looking at it to see what is actually different.
But it seems to inline on its own.
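On the regression-test wish above, here is a rough sketch of one approach (the function name `expmap` and the regex are my own; whether the IR actually contains vector types depends on the Julia version and CPU): capture the generated LLVM IR and look for SIMD vector types like `<4 x double>`, failing the test if they disappear.

```julia
using InteractiveUtils  # for code_llvm

# Same shape as myfunc above; the name expmap is just for this sketch.
function expmap(arr)
    result = similar(arr)
    @simd for i in eachindex(arr)
        @inbounds result[i] = exp(arr[i])
    end
    return result
end

# Capture the LLVM IR as a string and search it for vector types.
ir = sprint(io -> code_llvm(io, expmap, Tuple{Vector{Float64}}))
vectorized = occursin(r"<\d+ x double>", ir)
# In a test suite one might then write: @test vectorized

@assert expmap([0.0, 1.0]) ≈ [1.0, exp(1.0)]
```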