Intel Core i7-6500U.
That's incorrect. For example, check out @code_llvm (^).(arr, 2). As @Oscar_Smith also said above, the broadcasting loop is being done in the context of the @simd macro.
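A minimal way to run that check (assuming arr is any Float64 vector; the exact IR depends on your Julia version and CPU):

```julia
julia> arr = rand(100);

julia> @code_llvm (^).(arr, 2)  # look for vector types like <4 x double> in the IR
```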
Looking further into it, it seems @simd is not compatible with exp (and sin, cos), and LoopVectorization is doing some magic to make these functions compatible.
I wonder if there are any plans to upstream these changes into base Julia so that @simd works with these standard math functions?
I wonder if there are any plans to upstream these changes into base Julia so that @simd works with these standard math functions?
Already done, see the previous example and linked PR?
This isn't explicit optimization. What's happening here is that for simple functions (ones small enough to be inlined), the compiler is able to automatically figure out vectorized versions. For a call like exp.(v), that is a much harder optimization to make, and getting optimal performance would definitely require a handwritten vectorized exp function (like the one in LoopVectorization).
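To make the contrast concrete (a sketch; square is just a hypothetical small function):

```julia
using InteractiveUtils  # for @code_llvm (loaded automatically in the REPL)

# A function this small inlines into the broadcast loop, so the compiler
# auto-vectorizes it; exp is far too large for the same treatment.
square(x) = x * x

v = rand(1_000)
@code_llvm debuginfo=:none square.(v)  # the IR contains vector ops like <4 x double>
```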
The optimisation is done normally, but it depends on several things; LV does a lot of things more aggressively because it makes more assumptions and has some hand-written implementations of functions.
With respect to exp, these two PRs are related: #44865 and #46338, which are in master.
Already done, see the previous example and linked PR?
Wow, yes actually - it's about 4x faster. Still not quite as fast as the 6x speedup of @turbo, but a big difference. And I think (?) I can see SIMD instructions in the LLVM output.
It doesn't appear this has also been applied to sin() or cos(), though.
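One way to check (a sketch following the same @simd vs @turbo pattern used for exp later in this thread; the names sin_simd/sin_turbo are made up, and timings will vary by machine and Julia version):

```julia
using LoopVectorization, BenchmarkTools

function sin_simd(arr)
    result = similar(arr)
    @simd for i in eachindex(arr)
        @inbounds result[i] = sin(arr[i])
    end
    return result
end

function sin_turbo(arr)
    result = similar(arr)
    @turbo for i in eachindex(arr)
        result[i] = sin(arr[i])
    end
    return result
end

arr = rand(100_000);
@btime sin_simd($arr);   # Base sin inside @simd
@btime sin_turbo($arr);  # LoopVectorization's vectorized sin
```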
It doesn't appear this has also been applied to sin() or cos(), though.
Does any language compile sin and cos to SIMD by default? (Maybe Julia can, or could [be made to], and should do that, but only if @simd is applied, even if using intrinsics/functions.) It's unclear if this is only available with the Intel compiler or done by default, since it's not always better:
there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used.
This is called the Intel small vector math library (intrinsics):
for 128-bit SSE please use (double precision): _mm_sin_pd
for 256-bit AVX please use (double precision): _mm256_sin_pd
The two intrinsics are actually very small functions consisting of hand-written SSE/AVX assembly, and now you can process 4 sine calculations at once by using AVX :=) The latency is about ~10 clock cycles (if I remember correctly) on a Haswell CPU.
By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if only a few sin functions need to be evaluated, it's better to use plain sin() instead.
Laszlo - .NET Developer, Personal Blog
In this post I will look into how someone can implement Sin/Cos functions with SIMD in .NET 5. .NET 5 does not have a wrapper for SIMD trigonometric functions, so it seems a good exercise to implement it.
Where is Clang's '_mm256_pow_ps' intrinsic?
[…]
Agner Fog's VCL has some math functions like exp and log. (Formerly GPL-licensed, now Apache.)
https://github.com/microsoft/DirectXMath (MIT license) - I think it's portable to non-Windows, and doesn't require DirectX. […]
There is also SLEEF. It supports ARM as well, so it is great.
But from my experience, it is less performant than Intel SVML.
I think SVML is now, to some degree, embodied in Clang and also in GCC.
Yet it is certainly available for free with the Intel oneAPI package.
Related:
GitHub - JuliaSIMD/SLEEFPirates.jl: Pirated SLEEF port.
It's been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of exp; this is why it's faster, not because it's optimizing the Base Julia exp.
It is likely that NumPy has a similar vectorization built in, which is why it is faster.
@Oscar_Smith
Julia's exp regressed recently.
Earlier, you could use call-site inlining via @inline exp(x[i]), and it'd vectorize.
This no longer works because pow_body does not inline, and we need the entire function to inline.
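Call-site inlining here means annotating the call inside the loop, the same trick used in myfunc2 further down (a sketch; exp_inline! is a made-up name, and whether it vectorizes depends on the Julia version):

```julia
# Annotate the call site so exp inlines into the loop body,
# letting @simd vectorize the whole loop (requires Julia 1.8+).
function exp_inline!(out, x)
    @simd for i in eachindex(out, x)
        @inbounds out[i] = @inline exp(x[i])
    end
    return out
end
```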
What do you mean? exp doesn't call pow_body…
It's been hinted at already in the thread, but to be completely clear: here LoopVectorization uses a different, manually vectorized implementation of exp; this is why it's faster, not because it's optimizing the Base Julia exp.
It uses the same basic algorithm, but has a slightly higher error tolerance (around 1.5 ULP IIRC, instead of 0.5 for Base?).
It does also have some special optimizations for AVX512 that the compiler won't be able to figure out on its own.
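A rough way to measure that error tolerance empirically (a sketch; turbo_exp and max_ulps are hypothetical helpers, and the observed maximum depends on the inputs):

```julia
using LoopVectorization

function turbo_exp(v)
    out = similar(v)
    @turbo for i in eachindex(v)
        out[i] = exp(v[i])
    end
    return out
end

# Maximum error in ULPs against a BigFloat reference.
function max_ulps(approx, ref)
    maximum(abs(Float64(big(a) - r)) / eps(Float64(r)) for (a, r) in zip(approx, ref))
end

xs  = 4 .* randn(10_000)
ref = exp.(big.(xs))
@show max_ulps(exp.(xs), ref)       # Base exp
@show max_ulps(turbo_exp(xs), ref)  # LoopVectorization exp
```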
But this is a problem with Julia's compiler.
Julia cannot vectorize code unless it is entirely inlined.
Julia's inliner also does not consider the possibility of vectorization.
I have a non-inlined call to pow_body when looking at the typed code.
Do you mean for floating point powers?
Yes
%82 = <constprop> pow_body(::Core.Const(2.0), ::Core.Const(-53))::Float64
The constant 2.0^-53 prevents SIMD.
Please do not rely on the compiler any more than necessary…
But I'll rebuild the latest Julia master. Might be fixed with the effects analysis improvements.
If so, that's really dumb because it totally could just be 0x1p-53.
Yeah. It seems silly that a literal constant like 1e10 vs 10.0^10, or 0x1p-53 vs 2.0^-53, makes a big difference.
But it's been fixed on the latest master already.
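For reference (a quick check; a hexadecimal float literal is a compile-time constant, whereas 2.0^-53 is a runtime power call unless the compiler manages to constant-fold it):

```julia
julia> 0x1p-53 == 2.0^-53  # same value, but only the literal is guaranteed constant
true
```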
Actually, even without @inline, things look really good on the latest Julia master!
julia> using LoopVectorization, BenchmarkTools

julia> function myfunc(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = exp(arr[i])
           end
           return result
       end
myfunc (generic function with 1 method)

julia> function myfunc2(arr)
           result = similar(arr)
           @simd for i in eachindex(arr)
               @inbounds result[i] = @inline exp(arr[i])
           end
           return result
       end
myfunc2 (generic function with 1 method)

julia> function myfunc3(arr)
           result = similar(arr)
           @turbo for i in eachindex(arr)
               result[i] = exp(arr[i])
           end
           return result
       end
myfunc3 (generic function with 1 method)

julia> arr = rand(100_000);

julia> @btime myfunc($arr);
  76.586 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc2($arr);
  65.682 μs (2 allocations: 781.30 KiB)

julia> @btime myfunc3($arr);
  67.174 μs (2 allocations: 781.30 KiB)
It'd be great if we could have regression tests to make sure it does not regress.
I'm not sure why myfunc is slower than myfunc2; I'd have to spend more time looking at it to see what is actually different.
But it seems to inline on its own.
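On the regression-test idea above, a minimal sketch (the 2x threshold is made up, and it reuses myfunc, myfunc3, and arr from the session above):

```julia
using Test, BenchmarkTools

# Assert the plain @simd loop stays within a factor of the @turbo version,
# so a vectorization regression in Base exp would show up as a test failure.
t_simd  = @belapsed myfunc($arr)
t_turbo = @belapsed myfunc3($arr)
@test t_simd < 2 * t_turbo
```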