Trig functions very slow

24.259 ns vs. 13.951 ns for me.

1e10 was not included in your original summation…

I got

julia> @btime sin(1e10)
  22.782 ns (0 allocations: 0 bytes)
-0.4875060250875107

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e10)
  50.323 ns (0 allocations: 0 bytes)
-0.4875060250875107

julia> @btime sin(20_000_000)
  22.806 ns (0 allocations: 0 bytes)
-0.7631011174721591

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 20_000_000)
  7.975 ns (0 allocations: 0 bytes)
-0.7631011174721591

julia> @btime sin(1e8)
  22.133 ns (0 allocations: 0 bytes)
0.931639027109726

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e8)
  7.774 ns (0 allocations: 0 bytes)
0.931639027109726

So sin is actually slower for me with a number 100x larger than what's in the original code, but significantly faster for the values that are in it. If you actually care mostly about numbers THAT large, it'll be interesting to see what the numpy result is there as well.

Oh, indeed, with 1e8 I get 28 vs 14 ns. Still not the 7 ns you show, but at least something. I used 1e10 because I didn't think it could make any difference - both numbers are quite large anyway :slight_smile:

That difference is enough to explain the difference between the Julia version and the numpy/numba version. The 2x vs 3x difference is likely from the libc implementation difference and/or hardware differences. Discussion about further optimization of the Julia implementation should probably be moved to GitHub, and for the time being, you should be able to use some benchmarks to decide which implementation is the best for your code:

julia> function f()
           r = 0.
           for i in 0:100_000_000 - 1
               r += sin(i)
           end
           return r
       end
f (generic function with 1 method)

julia> function h()
           r = 0.
           for i in 0:100_000_000 - 1
               r += ccall(:sin, Cdouble, (Cdouble,), i)
           end
           return r
       end
h (generic function with 1 method)

julia> @btime f()
  2.803 s (0 allocations: 0 bytes)
0.7820103194606116

julia> @btime h()
  1.659 s (0 allocations: 0 bytes)
0.7820103194608199

(Note that unless you're using the default Windows terminal emulator, you can copy and paste the above code and it will run.)

The question is not just about speed, but about correctness. Firstly, we can see that the Julia sin and the system sin (I'm using Mac OS 10.12) give different results, even for small values of the argument:

julia> systemsin(x) = ccall(:sin, Cdouble, (Cdouble,), x)

julia> sin(10)
-0.5440211108893698

julia> systemsin(10)
-0.5440211108893699

Of course, they don't differ by much:

julia> sin(10) == nextfloat(systemsin(10))
true

How can we know which is "correct"? Let's use CRlibm.jl, which wraps the CRlibm library. CRlibm guarantees a correctly-rounded result, which means that it returns the closest Float64 to the true answer.

julia> using Pkg; Pkg.add("CRlibm")

julia> using CRlibm

julia> CRlibm.sin(10)
-0.5440211108893698

We can confirm this using BigFloats:

julia> sin(big(10)) - sin(10)
-3.894989866822355711771841779318966139157418401261675720964049234673352541619043e-17

julia> sin(big(10)) - systemsin(10)
7.207240379429209692464474901589236985842581598738324279035950765326647458380957e-17

So the system sin function is less accurate, even when x is 10!

Then we can try

julia> count(sin(i) != systemsin(i) for i in 0:100_000_000)
27257541

and we see that the two versions differ roughly a quarter of the time!

As far as I understand, CRlibm does not actually guarantee correct rounding for the sin function outside of the interval [-pi, pi], but it has a "high probability of being correct", according to the docs. It does guarantee that exp is correctly rounded, so we can try the following for both sin and exp:

julia> count(sin(i) != CRlibm.sin(i) for i in 0:100_000_000)
3144328

julia> count(systemsin(i) != CRlibm.sin(i) for i in 0:100_000_000)
27367842

julia> systemexp(x) = ccall(:exp, Cdouble, (Cdouble,), x)
systemexp (generic function with 1 method)

julia> count(exp(i) != CRlibm.exp(i) for i in 0:100_000_000)
55

julia> count(systemexp(i) != CRlibm.exp(i) for i in 0:100_000_000)
1

It seems that the system exp function is better than Julia's, but the sin function is worse.


This is highly system-dependent, I guess. On my Linux machine the system exp gives the same results as yours, but the system sin is better than Julia's. In a perfect world I would expect normal, plain code to give the most correct results possible with reasonable performance, but to switch to a faster, less correct implementation with @fastmath. The current implementation is neither particularly fast, nor correct compared to (some) system implementations.
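For reference, opting in looks like this today (a minimal sketch; whether sin is actually replaced by a faster variant depends on the Julia version and platform):

```julia
# Sketch: @fastmath relaxes IEEE semantics inside the annotated block and may
# substitute faster, less accurate variants of math functions, depending on
# the Julia version and platform.
function sinsum_fast(n)
    r = 0.0
    @fastmath for i in 0:n-1
        r += sin(i)
    end
    return r
end

sinsum_fast(1000)  # close to, but not necessarily bitwise equal to, the strict version
```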


I checked that CRlibm is correctly rounded for the values in question:

julia> roundedsin(x) = convert(Float64, sin(big(x)))
roundedsin (generic function with 1 method)

julia> @time count(CRlibm.sin(i) != roundedsin(i) for i in 0:100_000_000)
782.344087 seconds (1.84 G allocations: 46.202 GiB, 1.09% gc time)
0

(Of course, I love the fact that Julia makes doing this kind of analysis so easy and fun!)


Then there is the issue about how to sum up arrays of floating-point numbers more accurately, which I have seen discussed somewhere before.
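One standard technique for that is compensated (Kahan) summation, which carries a running correction term for the low-order bits lost in each addition. A minimal sketch:

```julia
# Kahan (compensated) summation: tracks the rounding error of each
# addition in `c` and feeds it back into the next term.
function kahan_sum(xs)
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    return s
end

kahan_sum(sin(i) for i in 0:1000)
```

(Note that Julia's built-in `sum` over arrays already uses pairwise summation, which is also considerably more accurate than a naive left-to-right loop.)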

As @dpsanders' post makes clear, the tradeoff between speed and accuracy in math functions is tricky. Each libm does it differently. We're always happy to make performance improvements where possible, although not typically at the cost of correctness. It is also a priority for the project that Julia should give the same results on different platforms. So even if your system libm happens to be faster or more correct, we still don't want to use it by default, because we want Julia programs to produce the same results on different systems. There are some system libms (e.g. Apple's) which are very good and fast and we would love to "steal" from them, but they are either not open source or not appropriately licensed for inclusion in Julia. Of course, if you really care about either speed or accuracy, we give you the option to swap out libm functions by using your system libm (maybe better or worse, kind of a crapshoot), or using packages like Yeppp (faster but less accurate) or CRlibm (more accurate but slower).


I think these implementations (Julia's and the system's) only guarantee (?) that the result is within 1 ulp of the correct result. But still, it is useful to see how far they are from being correctly rounded.
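A quick way to quantify that, assuming finite Float64 values of the same sign, is to count how many representable values separate two results:

```julia
# Distance in ulps between two Float64 values: for finite floats of the same
# sign, adjacent bit patterns correspond to adjacent floats, so the difference
# of the integer representations counts the representable values between them.
ulpdist(a::Float64, b::Float64) = abs(reinterpret(Int64, a) - reinterpret(Int64, b))

ulpdist(1.0, nextfloat(1.0))  # 1
```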

The results are both system library dependent and system hardware dependent.

I ran trig benchmarks in Julia and in IPython with Numba, Julia ran as fast or faster.

I think Numba uses SVML through llvmlite.
In my experience (tests done in C, looping on an aligned array using SIMD operations), SVML will beat Yeppp and SLEEF for trig / exp functions.
A nice thing about SVML is that it has built-in levels of accuracy, which can be controlled by the fp:fast=1 / fp:fast=2 compiler flags.

I think it is on Julia's to-do list to integrate SVML with assistance (to some level) from Intel (or it is at least being considered).
Once that is done (even as a user-enabled option and not as the default), Julia will have the fastest trig / exp / power / etc. functions (at least on x64).


Unlike @RoyiAvital I see good results with SLEEF, consistent with their published benchmarks, which show that it is competitive with Intel SVML. (SLEEF is an open source library for short-vector math.) It is currently a bit painful to use SLEEF with Julia, but it may be easier to mainstream it than SVML (either by automating the following or - in Julia-dev style - reimplementing it).

Here is a proof of concept:

using SIMD
import Libdl
using BenchmarkTools

const libsleef = "/scratch/build/sleef-install/lib/libsleef.so"

const lib = Libdl.dlopen_e(libsleef,
                           Libdl.RTLD_LAZY|Libdl.RTLD_DEEPBIND|Libdl.RTLD_GLOBAL)
if lib == C_NULL
    error("unable to load $libsleef")
end

const VE = Base.VecElement

@inline function xsin_ha(v1::Vec{4,Float64})
    Vec{4,Float64}(Base.llvmcall(
        ("declare <4 x double> @Sleef_sind4_u10(<4 x double>)",
         """%res = call <4 x double> @Sleef_sind4_u10(<4 x double> %0)
            ret <4 x double> %res"""),
        NTuple{4, VE{Float64}}, Tuple{NTuple{4, VE{Float64}}}, v1.elts))
end

function f()
    r = 0.
    for i in 1:100:100_000_000
        r += sin(i)
    end
    return r
end
function g()
    r = 0.
    for i in 1:400:100_000_000
        v = Vec{4,Float64}(float.((i,i+100,i+200,i+300)))
        sv = xsin_ha(v)
        r += sum(sv)
    end
    return r
end

@btime f()
@btime g()

On my system (with a slow processor), I get

  42.251 ms (0 allocations: 0 bytes)
  9.327 ms (0 allocations: 0 bytes)

(Multiply by 100 to compare with the times in previous posts, which have dubious statistical reliability.)


ccall((:Sleef_sind4_u10, "....."), NTuple{4, VE{Float64}}, (NTuple{4, VE{Float64}},), v1.elts) should work.

Thanks, that's much nicer. I got confused by the type-wrapping.

In the SLEEF documentation they show SLEEF beats SVML.
For instance have a look at:


(The image in SLEEF Documentation - Benchmark Results at 24/09/2018).

They show they beat SVML in exp for SSE and AVX.
I, for one, could never replicate a case where SLEEF is faster than SVML in exp.

To share some results I got, I created a project named SLEEF vs. SVML (I'm not even a beginner programmer in C, so it is not efficient code, but it is simple, clear, and measures exactly what's needed).

The idea is simple: create an array, run the comparable SLEEF and SVML functions on it, and compare the results.
For now I have implemented tests for sine and exponent.
For me (on a Core i7 6800K system), SVML is faster by a factor of 2-3 for SSE.
On AVX, SVML is faster by 20-30%.

You can download the ZIP file and try for yourself.
This is for fp:precise.

SLEEF doesn't offer lower precision than 1 ulp for sin and exp.
But I think for a small dynamic-range problem (let's say [-10, 10]), SVML with fp:fast is still as accurate as SLEEF and gets even faster.

Just to be clear, I'm pro SLEEF.
It is great to have near-SVML performance in an open source project.
I wish they would close the gap and even become faster than SVML, so everybody would have access to the best library without needing an Intel Compiler license.
But in the meantime, I really want to have both available in Julia and let the user decide, because in some cases the extra speed of SVML will be significant.
Since Intel lets open source projects integrate SVML, I don't see why not have both.

@Ralph_Smith,
By the way, in your post I see a 4x improvement, which matches the 4-element vector you chose. So SLEEF is basically as fast as Julia's base implementation per element; the speedup comes from operating on 4 elements at a time with SIMD.

SLEEF sounds neat. There appears to be a Julia port at https://github.com/musm/SLEEF.jl, although I don't see these vectorised instructions there.

That is a pure Julia port. It would make a good PR to add support for SIMD.jl's Vec{N,T} type.