Comparing exp() performance on Julia versus numpy

Intel Core i7-6500U.

That's incorrect. For example, check out: @code_llvm (^).(arr, 2). As @Oscar_Smith said above, the broadcasting loop is being done in the context of the @simd macro.
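
For anyone who wants to reproduce that check, something like this in the REPL (the vector width you see, <2 x double>, <4 x double>, etc., depends on your CPU):

julia> arr = rand(1000);

julia> @code_llvm (^).(arr, 2)

Vector types like <4 x double> in the output are the sign that the loop was SIMD-vectorized.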

Looking further into it, it seems @simd is not compatible with exp (and sin, cos), and the LoopVectorization is doing some magic to make these functions compatible.

I wonder if there are any plans to upstream these changes into base Julia so that @simd works with these standard math functions?


Already done; see the previous example and the linked PR.

This isn't explicit optimization. What's happening here is that for simple functions (ones small enough to be inlined), the compiler is able to automatically figure out vectorized versions. For a call like exp.(v) that is a much harder optimization to make, and getting optimal performance would definitely require a handwritten vectorized exp function (like the one in LoopVectorization).
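
A small sketch of that distinction (the helper names are made up for the example; whether the exp loop actually vectorizes depends on your Julia version, as discussed further down the thread):

using InteractiveUtils  # for @code_llvm outside the REPL

square(x) = x * x  # tiny enough to inline, so the loop auto-vectorizes

function squares!(out, v)
    @inbounds @simd for i in eachindex(v)
        out[i] = square(v[i])
    end
    return out
end

function exps!(out, v)
    @inbounds @simd for i in eachindex(v)
        out[i] = exp(v[i])  # a big call: it must inline fully to vectorize
    end
    return out
end

v = rand(1024); out = similar(v);

@code_llvm debuginfo=:none squares!(out, v)  # shows <N x double> vector ops
@code_llvm debuginfo=:none exps!(out, v)     # may show a scalar exp call per element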


The optimisation is done normally, but it depends on several things. LV does a lot of things more aggressively because it makes more assumptions and has some hand-written implementations of functions.

With respect to exp, these two PRs are related: #44865 and #46338, which are in master.

Wow, yes actually: it's about 4x faster. Still not quite the 6x speedup of @turbo, but a big difference. And I think (?) I can see SIMD instructions in the LLVM output.

It doesn't appear this has also been applied to sin() or cos() though.

Does any language compile sin and cos to SIMD by default? (Maybe Julia could be made to, and should, but only when @simd is applied, even if that means using intrinsics/functions.) It's unclear whether this is only available with the Intel compiler or done by default, since it's not always better:

there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used.

This is called the Intel Short Vector Math Library, SVML (intrinsics):

for 128-bit SSE, please use (double precision): _mm_sin_pd

for 256-bit AVX, please use (double precision): _mm256_sin_pd

The two intrinsics are actually very small functions consisting of hand-written SSE/AVX assembly, and now you can process 4 sine calculations at once by using AVX :=) The latency is about 10 clock cycles (if I remember correctly) on a Haswell CPU.

By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if there are only a few sin calls that need to be evaluated, it's better to use plain sin() instead.

[…]

  • Agner Fog's VCL has some math functions like exp and log. (Formerly GPL licensed, now Apache.)

  • https://github.com/microsoft/DirectXMath (MIT license) - I think it's portable to non-Windows, and doesn't require DirectX. […]

There is also SLEEF. It supports ARM as well, so it is great.
But from my experience, it is less performant than Intel SVML.
I think SVML is now, to some degree, embodied in Clang and also in GCC.
Yet for sure it is available for free with the Intel oneAPI package.
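
From the Julia side, LoopVectorization plays a similar role (it pulls its vectorized special functions from SLEEFPirates.jl, if I read its dependencies right). A quick, illustrative comparison, assuming the packages are installed:

using LoopVectorization, BenchmarkTools

function sin_turbo!(out, v)
    @turbo for i in eachindex(v)  # @turbo swaps in a vectorized sin
        out[i] = sin(v[i])
    end
    return out
end

v = rand(100_000); out = similar(v);

@btime map!(sin, $out, $v);   # scalar Base sin
@btime sin_turbo!($out, $v);  # hand-vectorized sin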

Related:

It's been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of exp; this is why it's faster, not because it's optimizing the Base Julia exp.

It is likely that numpy has a similar vectorization built in, hence why it is faster.


@Oscar_Smith
Julia exp regressed recently.
Earlier, you could use call-site inlining, via @inline exp(x[i]), and it'd vectorize.
This no longer works because pow_body does not inline, and we need the entire function to inline.

What do you mean? exp doesn't call pow_body…

It uses the same basic algorithm, with a slightly higher error tolerance (around 1.5 ULP IIRC, instead of 0.5 for Base?).
It does also have some special optimizations for AVX512 that the compiler won't be able to figure out on its own.

But this is a problem with Julia's compiler.
Julia cannot vectorize code unless it is entirely inlined.
Julia's inliner also does not consider the possibility of vectorization.

I have a non-inlined call to pow_body when looking at the typed code.

Do you mean for floating point powers?

Yes

  • %82 = <constprop> pow_body(::Core.Const(2.0), ::Core.Const(-53))::Float64

The constant 2.0^-53 prevents SIMD.
Please do not rely on the compiler any more than necessary…

But I'll rebuild the latest Julia master. Might be fixed with the effects analysis improvements.
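
For reference, a way to spot that call yourself on an affected Julia version (f here is a hypothetical reduction of the pattern; on a fixed master the power should constant-fold away entirely):

julia> f(x) = x * 2.0^-53
f (generic function with 1 method)

julia> @code_typed f(1.0)  # look for the non-inlined pow_body(::Core.Const(2.0), ::Core.Const(-53)) call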

If so, that's really dumb because it totally could just be 0x1p-53

Yeah. It seems silly that a literal constant like 1e10 vs 10.0^10 or 0x1p-53 vs 2.0^-53 makes a big difference.
But it's been fixed on the latest master already.
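
(For anyone unfamiliar with hex float literals: both spellings denote the same Float64, but 0x1p-53 is folded by the parser, while 2.0^-53 was, before the fix, a runtime power call:)

julia> 0x1p-53 == 2.0^-53
true

julia> 0x1p-53
1.1102230246251565e-16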

Actually, even without @inline things look really good on the latest Julia master!

julia> using LoopVectorization, BenchmarkTools

julia> function myfunc(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = exp(arr[i])
           end
           return result
       end
myfunc (generic function with 1 method)

julia> function myfunc2(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = @inline exp(arr[i])
           end
           return result
       end
myfunc2 (generic function with 1 method)

julia> function myfunc3(arr)
           result = similar(arr)
           @turbo for i in eachindex(arr)
               result[i] = exp(arr[i])
           end
           return result
       end
myfunc3 (generic function with 1 method)

julia> arr = rand(100_000);

julia> @btime myfunc($arr);
  76.586 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc2($arr);
  65.682 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc3($arr);
  67.174 μs (2 allocations: 781.30 KiB)

It'd be great if we could have regression tests to make sure it does not regress.

I'm not sure why myfunc is slower than myfunc2; I'd have to spend more time looking at it to see what is actually different.
But it seems to inline on its own.
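
A hypothetical shape such a regression test could take, reusing the functions above (the 2x threshold is made up for illustration; a real perf suite would live in something like BaseBenchmarks.jl):

using BenchmarkTools, Test

arr = rand(100_000)
t_simd  = @belapsed myfunc($arr)   # plain @simd loop from above
t_turbo = @belapsed myfunc3($arr)  # @turbo baseline

# fail if the Base exp loop falls far behind the hand-vectorized one
@test t_simd < 2 * t_turbo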
