24.259 ns vs. 13.951 ns for me.
1e10 was not included in your original summation… I got
julia> @btime sin(1e10)
22.782 ns (0 allocations: 0 bytes)
-0.4875060250875107
julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e10)
50.323 ns (0 allocations: 0 bytes)
-0.4875060250875107
julia> @btime sin(20_000_000)
22.806 ns (0 allocations: 0 bytes)
-0.7631011174721591
julia> @btime ccall(:sin, Cdouble, (Cdouble,), 20_000_000)
7.975 ns (0 allocations: 0 bytes)
-0.7631011174721591
julia> @btime sin(1e8)
22.133 ns (0 allocations: 0 bytes)
0.931639027109726
julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e8)
7.774 ns (0 allocations: 0 bytes)
0.931639027109726
So the ccall is actually slower for me with a number 100x larger than what's in the original code, but significantly faster for the ones that are in the original range. If you actually care mostly about numbers THAT large, it'll be interesting to see what the numpy result is there as well.
Oh, indeed, with 1e8 I get 28 vs. 14 ns. Still not 7 ns as you show, but at least something. I used 1e10 because I didn't think it could make any difference; both numbers are quite large anyway.
That difference is enough to explain the gap between the Julia version and the numpy/numba version. The 2x vs. 3x difference is likely from the libc implementation difference and/or hardware difference… Discussion about further optimization of the Julia implementation should probably be moved to GitHub; for the time being, you should be able to use benchmarks like the following to decide which implementation is best for your code…
julia> function f()
r = 0.
for i in 0:100_000_000 - 1
r += sin(i)
end
return r
end
f (generic function with 1 method)
julia> function h()
r = 0.
for i in 0:100_000_000 - 1
r += ccall(:sin, Cdouble, (Cdouble,), i)
end
return r
end
h (generic function with 1 method)
julia> @btime f()
2.803 s (0 allocations: 0 bytes)
0.7820103194606116
julia> @btime h()
1.659 s (0 allocations: 0 bytes)
0.7820103194608199
(Note that unless you're using the default Windows terminal emulator, you can copy and paste the above code and it will run.)
The question is not just about speed, but about correctness. Firstly, we can see that the Julia sin and the system sin (I'm using Mac OS 10.12) give different results, even for small values of the argument:
julia> systemsin(x) = ccall(:sin, Cdouble, (Cdouble,), x)
julia> sin(10)
-0.5440211108893698
julia> systemsin(10)
-0.5440211108893699
Of course, they don't differ by much:
julia> sin(10) == nextfloat(systemsin(10))
true
How can we know which is "correct"? Let's use CRlibm.jl, which wraps the CRlibm library and guarantees a correctly-rounded result, meaning that it returns the closest Float64 to the true answer.
julia> using Pkg; Pkg.add("CRlibm")
julia> using CRlibm
julia> CRlibm.sin(10)
-0.5440211108893698
We can confirm this using BigFloats:
julia> sin(big(10)) - sin(10)
-3.894989866822355711771841779318966139157418401261675720964049234673352541619043e-17
julia> sin(big(10)) - systemsin(10)
7.207240379429209692464474901589236985842581598738324279035950765326647458380957e-17
So the system sin function is less accurate, even when x is 10!
Then we can try
julia> count(sin(i) != systemsin(i) for i in 0:100_000_000)
27257541
and we see that the two versions differ roughly a quarter of the time!
As far as I understand, CRlibm does not actually guarantee correct rounding for the sin function outside the interval [-pi, pi], but it has a "high probability of being correct", according to the docs. It does guarantee that exp is correctly rounded, so we can try the following for both sin and exp:
julia> count(sin(i) != CRlibm.sin(i) for i in 0:100_000_000)
3144328
julia> count(systemsin(i) != CRlibm.sin(i) for i in 0:100_000_000)
27367842
julia> systemexp(x) = ccall(:exp, Cdouble, (Cdouble,), x)
systemexp (generic function with 1 method)
julia> count(exp(i) != CRlibm.exp(i) for i in 0:100_000_000)
55
julia> count(systemexp(i) != CRlibm.exp(i) for i in 0:100_000_000)
1
It seems that the system exp function is better than Julia's, but the sin function is worse.
This is highly system-dependent, I guess. On my Linux machine the system exp gives the same result as yours, but the system sin is better than Julia's. In a perfect world I would expect normal plain code to give the most correct results possible with reasonable performance, but to switch to a faster and less correct implementation with @fastmath. The current implementation is neither particularly fast nor correct compared to (some) system implementations.
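For reference, @fastmath rewrites supported calls such as sin into their Base.FastMath counterparts, so trying the fast path per call site looks roughly like this (a minimal sketch; f_fastmath is a name I made up, and whether it is actually faster or less accurate than plain sin depends on the Julia version and hardware):
using BenchmarkTools
function f_fastmath()
    r = 0.0
    for i in 0:100_000_000 - 1
        # @fastmath replaces sin with Base.FastMath.sin_fast here
        r += @fastmath sin(Float64(i))
    end
    return r
end
@btime f_fastmath()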
I checked that CRlibm is correctly rounded for the values in question:
julia> roundedsin(x) = convert(Float64, sin(big(x)))
roundedsin (generic function with 1 method)
julia> @time count(CRlibm.sin(i) != roundedsin(i) for i in 0:100_000_000)
782.344087 seconds (1.84 G allocations: 46.202 GiB, 1.09% gc time)
0
(Of course, I love the fact that Julia makes doing this kind of analysis so easy and fun!)
Then there is the issue about how to sum up arrays of floating-point numbers more accurately, which I have seen discussed somewhere before.
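For what it's worth, the standard trick there is compensated summation; here is a minimal Kahan sketch (kahan_sum is a name I made up, not from the discussion referenced above):
function kahan_sum(xs)
    s = 0.0   # running sum
    c = 0.0   # compensation: low-order bits lost from s so far
    for x in xs
        y = x - c         # re-inject the previously lost bits
        t = s + y         # big + small: low bits of y may be dropped
        c = (t - s) - y   # recover exactly what was dropped
        s = t
    end
    return s
end
kahan_sum(sin(i) for i in 0:999)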
As @dpsanders' post makes clear, the tradeoff between speed and accuracy in math functions is tricky. Each libm does it differently. We're always happy to make performance improvements where possible, although not typically at the cost of correctness. It is also a priority for the project that Julia should give the same results on different platforms. So even if your system libm happens to be faster or more correct, we still don't want to use it by default, because we want Julia programs to produce the same results on different systems.
There are some system libms (e.g. Apple's) which are very good and fast, and we would love to "steal" from them, but they are either not open source or not appropriately licensed for inclusion in Julia. Of course, if you really care about either speed or accuracy, we give you the option to swap out libm functions: use your system libm (maybe better or worse, kind of a crapshoot), or use packages like Yeppp (faster but less accurate) or CRlibm (more accurate but slower).
I think these packages (Julia's and the system's) only guarantee (?) that the result is within 1 ulp of the correct result. But it is still useful to see how far they are from being correctly rounded.
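To make "how far" concrete, one can measure the distance in ulps instead of just counting mismatches. A quick sketch (the lattice trick assumes finite Float64 values; floatlattice and ulps_apart are names I made up):
# map a Float64 onto a monotonic integer lattice so that
# adjacent floats are exactly 1 apart (finite values only)
floatlattice(x::Float64) = (i = reinterpret(Int64, x); i >= 0 ? i : typemin(Int64) - i)
# distance between two floats, measured in ulps
ulps_apart(a::Float64, b::Float64) = abs(floatlattice(a) - floatlattice(b))
# e.g. worst-case deviation of Base's sin from the correctly rounded value
maximum(ulps_apart(sin(Float64(i)), CRlibm.sin(Float64(i))) for i in 0:1_000_000)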
The results are both system library dependent and system hardware dependent.
I ran trig benchmarks in Julia and in IPython with Numba; Julia ran as fast or faster.
I think Numba uses SVML through llvmlite.
In my experience (tests done in C, looping over an aligned array using SIMD operations), SVML will beat Yeppp and SLEEF for trig / exp functions.
A nice thing about SVML is that it has built-in levels of accuracy, which can be controlled by the fp:fast=1 / fp:fast=2 compiler flags.
I think it is on Julia's to-do list to integrate SVML with assistance (to some level) from Intel (or at least they are thinking about it).
Once it is done (even as a user-enabled option and not as the default), Julia will have the fastest trig / exp / power / etc. functions (at least on x64).
Unlike @RoyiAvital, I see good results with SLEEF, consistent with their published benchmarks, which show that it is competitive with Intel SVML. (SLEEF is an open-source library for short-vector math.) It is currently a bit painful to use SLEEF with Julia, but it may be easier to mainstream it than SVML (either by automating the following or, in Julia-dev style, reimplementing it).
Here is a proof of concept:
using SIMD
import Libdl
using BenchmarkTools
const libsleef = "/scratch/build/sleef-install/lib/libsleef.so"
const lib = Libdl.dlopen_e(libsleef,
Libdl.RTLD_LAZY|Libdl.RTLD_DEEPBIND|Libdl.RTLD_GLOBAL)
if lib == C_NULL
error("unable to load $libsleef")
end
const VE = Base.VecElement
# Wrap SLEEF's 4-wide double-precision sine, Sleef_sind4_u10
# (the u10 suffix means max error of 1.0 ulp), by declaring it
# to LLVM and calling it on <4 x double> directly:
@inline function xsin_ha(v1::Vec{4,Float64})
    Vec{4,Float64}(Base.llvmcall(
        ("declare <4 x double> @Sleef_sind4_u10(<4 x double>)",
         """%res = call <4 x double> @Sleef_sind4_u10(<4 x double> %0)
            ret <4 x double> %res"""),
        NTuple{4, VE{Float64}}, Tuple{NTuple{4, VE{Float64}}}, v1.elts))
end
# scalar baseline using Base.sin on a strided range
function f()
r = 0.
for i in 1:100:100_000_000
r += sin(i)
end
return r
end
# same sum, four lanes at a time through the SLEEF kernel
function g()
r = 0.
for i in 1:400:100_000_000
v = Vec{4,Float64}(float.((i,i+100,i+200,i+300)))
sv = xsin_ha(v)
r += sum(sv)
end
return r
end
@btime f()
@btime g()
On my system (with a slow processor), I get
42.251 ms (0 allocations: 0 bytes)
9.327 ms (0 allocations: 0 bytes)
(Multiply by 100 to compare with the times in previous posts, which have dubious statistical reliability.)
ccall((:Sleef_sind4_u10, "....."), NTuple{4, VE{Float64}}, (NTuple{4, VE{Float64}},), v1.elts)
should work.
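Spelled out with the libsleef constant and VE alias from the proof of concept above, that would be something like this (an untested sketch of the same binding through the C ABI; xsin_ha_ccall is a made-up name):
@inline function xsin_ha_ccall(v1::Vec{4,Float64})
    # same SLEEF entry point, but bound with ccall instead of llvmcall
    Vec{4,Float64}(ccall((:Sleef_sind4_u10, libsleef),
                         NTuple{4, VE{Float64}}, (NTuple{4, VE{Float64}},),
                         v1.elts))
end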
Thanks, that's much nicer. I got confused by the type-wrapping.
In the SLEEF documentation they show SLEEF beating SVML.
For instance, have a look at the image in SLEEF Documentation - Benchmark Results (as of 24/09/2018).
They show that they beat SVML in exp for SSE and AVX.
I, for one, could never replicate a case where SLEEF is faster than SVML in exp.
To share some results I got, I created a project named SLEEF vs. SVML (I'm not even a beginner programmer in C, so it is not efficient code, but it is simple, clear, and measures exactly what's needed).
The idea is simple: create an array, run comparable functions from SLEEF and SVML on it, and compare the results.
For now I have implemented tests for sine and exponent.
For me (on a Core i7 6800K system), SVML is faster by a factor of 2-3 for SSE.
On AVX, SVML is faster by 20-30%.
You can download the ZIP file and try for yourself.
This is for fp:precise; SLEEF doesn't offer lower precision than 1 ulp for sin and exp.
But I think on a small dynamic-range problem (let's say [-10, 10]), fp:fast is still as accurate as SLEEF and gets even faster.
Just to be clear, I'm pro SLEEF.
It is great to have near-SVML performance in an open-source project.
I wish they would close the gap and even become faster than SVML, so everybody would have access to the best library without needing an Intel Compiler license.
But in the meantime, I really want to have both available in Julia and let the user decide, because in some cases the extra speed of SVML will be significant.
Since Intel lets open-source projects integrate SVML, I don't see why not have both.
@Ralph_Smith,
By the way, in your post I see a 4x improvement, which matches the 4-element vector you chose. So SLEEF is basically as fast as the base implementation in Julia, just implemented on a SIMD element in that case.
SLEEF sounds neat. There appears to be a Julia port, although I don't see these vectorised instructions there: https://github.com/musm/SLEEF.jl is the link.
That is a pure Julia port. It would make a good PR to add support for SIMD.jl's Vec{N,T} type.
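A minimal sketch of what such a method might look like, just mapping the port's scalar kernel over the lanes (sleef_sin is a made-up name; a real PR would port the vectorized kernels rather than loop over lanes):
using SIMD    # provides Vec{N,T}
import SLEEF  # pure-Julia port with a scalar SLEEF.sin
# naive lane-by-lane fallback, only to illustrate the method signature
sleef_sin(v::Vec{N,Float64}) where {N} =
    Vec{N,Float64}(ntuple(i -> SLEEF.sin(v[i]), Val(N)))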