Intel Core i7-6500U.
That's incorrect. For example, check out: @code_llvm (^).(arr, 2). As @Oscar_Smith also said above, the broadcasting loop is being run in the context of the @simd macro.
Looking further into it, it seems @simd is not compatible with exp (or sin and cos), and LoopVectorization is doing some magic to make these functions compatible.
I wonder if there are any plans to upstream these changes into base Julia so that @simd works with these standard math functions?
Already done, see the previous example and linked PR.
This isn't explicit optimization. What's happening here is that for simple functions (ones small enough to be inlined), the compiler is able to automatically figure out vectorized versions. For a call like exp.(v) that is a much harder optimization to make, and getting optimal performance would definitely require a handwritten vectorized exp function (like the one in LoopVectorization).
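To illustrate the inlining point with a sketch (the function name `square` is my own example, not from the thread): a tiny function broadcast over an array inlines into the loop, so LLVM's auto-vectorizer can handle it on its own, whereas `exp.(v)` needs a hand-written SIMD kernel to go fast.

```julia
# A tiny function like this inlines into the broadcast loop,
# so LLVM's auto-vectorizer can emit SIMD for it by itself.
square(x) = x * x

v = rand(Float64, 1024)
out = square.(v)

# To see the vector instructions, inspect the IR in the REPL:
#   @code_llvm debuginfo=:none broadcast(square, v)
@assert out == v .* v
```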
Wow, yes actually - it's about 4x faster. Still not quite the 6x speedup of @turbo, but a big difference. And I think (?) I can see SIMD instructions in the LLVM output.
It doesn't appear this has also been applied to sin() or cos(), though.
Does any language compile sin and cos to SIMD by default? (Maybe Julia can, or could be made to, and should do that, but only when @simd is applied, even if via intrinsics/functions.) It's unclear whether this is only available with the Intel compiler or done by default, since it's not always better:
there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used.
This is called the Intel Short Vector Math Library, SVML (intrinsics):
for 128bit SSE please use (double precision): _mm_sin_pd
for 256bit AVX please use (double precision): _mm256_sin_pd
The two intrinsics are actually very small functions consisting of hand-written SSE/AVX assembly, and now you can process 4 sine calculations at once using AVX :=) the latency is about ~10 clock cycles (if I remember correctly) on a Haswell CPU.
By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if there are only a few sin calls to evaluate, it's better to use plain sin() instead.

In this post I will look into how someone can implement Sin/Cos functions with SIMD in .NET 5. .NET 5 does not have a wrapper for SIMD trigonometric functions, so it seems a good exercise to implement one.
[…]
Agner Fog's VCL has some math functions like exp and log. (Formerly GPL licensed, now Apache.)
https://github.com/microsoft/DirectXMath (MIT license) - I think it's portable to non-Windows, and doesn't require DirectX. […]
There is also SLEEF. It supports ARM as well, so it is great. But from my experience, it is less performant than Intel SVML.
I think SVML is now, to some degree, embodied in Clang and also in GCC.
Yet for sure it is available for free with the Intel oneAPI package.
Related:

JuliaSIMD/SLEEFPirates.jl - pirated SLEEF port (GitHub).
It's been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of exp; this is why it's faster, not because it's optimizing the Base Julia exp.
It is likely that NumPy has a similar vectorized implementation built in, which is why it is faster.
@Oscar_Smith
Julia's exp regressed recently.
Earlier, you could use call-site inlining, via @inline exp(x[i]), and it'd vectorize.
This no longer works because pow_body does not inline, and we need the entire function to inline.
What do you mean? exp doesn't call pow_body…
It uses the same basic algorithm, but has a slightly higher error tolerance (around 1.5 ULP IIRC, instead of 0.5 for Base?).
It does also have some special optimizations for AVX512 that the compiler won't be able to figure out on its own.
But this is a problem with Julia's compiler.
Julia cannot vectorize code unless it is entirely inlined.
Julia's inliner also does not consider the possibility of vectorization.
I have a non-inlined call to pow_body when looking at the typed code.
Do you mean for floating point powers?
Yes
%82 = <constprop> pow_body(::Core.Const(2.0), ::Core.Const(-53))::Float64
The constant 2.0^-53 prevents SIMD.
Please do not rely on the compiler any more than necessary…
But I'll rebuild the latest Julia master. It might be fixed with the effects-analysis improvements.
If so, that's really dumb because it totally could just be 0x1p-53
Yeah. It seems silly that a literal constant like 1e10 vs 10.0^10 or 0x1p-53 vs 2^-53 makes a big difference.
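To make the distinction concrete (a small sketch using only standard Julia): the hex-float literal and the runtime power produce the exact same Float64, so only the spelling differs - the literal is a constant the parser folds directly, while 2.0^-53 is a call to ^ that, as noted above, may land in a non-inlined pow_body.

```julia
# 0x1p-53 is a hex floating-point literal: 1.0 * 2^-53, parsed at compile time.
a = 0x1p-53
# 2.0^-53 computes the same value, but through a runtime call to ^.
b = 2.0^-53

# Both are exactly half the machine epsilon of Float64 (eps = 2^-52).
@assert a == b == eps(Float64) / 2
```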
But it's been fixed on the latest master already.
Actually, even without @inline things look really good on the latest Julia master!
julia> using LoopVectorization, BenchmarkTools

julia> function myfunc(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = exp(arr[i])
           end
           return result
       end
myfunc (generic function with 1 method)

julia> function myfunc2(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = @inline exp(arr[i])
           end
           return result
       end
myfunc2 (generic function with 1 method)

julia> function myfunc3(arr)
           result = similar(arr)
           @turbo for i in eachindex(arr)
               result[i] = exp(arr[i])
           end
           return result
       end
myfunc3 (generic function with 1 method)

julia> arr = rand(100_000);

julia> @btime myfunc($arr);
  76.586 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc2($arr);
  65.682 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc3($arr);
  67.174 μs (2 allocations: 781.30 KiB)
It'd be great if we could have regression tests to make sure it does not regress.
I'm not sure why myfunc is slower than myfunc2; I'd have to spend more time looking at it to see what is actually different.
But it seems to inline on its own.
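On the regression-test wish above, here is a rough sketch of one approach (the function name `expmap` and the regex are my own; whether the IR actually contains vector types depends on the Julia version and CPU): capture the generated LLVM IR and look for SIMD vector types like `<4 x double>`, failing the test if they disappear.

```julia
using InteractiveUtils  # for code_llvm

# Same shape as myfunc above; the name expmap is just for this sketch.
function expmap(arr)
    result = similar(arr)
    @simd for i in eachindex(arr)
        @inbounds result[i] = exp(arr[i])
    end
    return result
end

# Capture the LLVM IR as a string and search it for vector types.
ir = sprint(io -> code_llvm(io, expmap, Tuple{Vector{Float64}}))
vectorized = occursin(r"<\d+ x double>", ir)
# In a test suite one might then write: @test vectorized

@assert expmap([0.0, 1.0]) ≈ [1.0, exp(1.0)]
```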