Trig functions very slow

24.259 ns vs. 13.951 ns for me.

1e10 was not included in your original summation…

I got

julia> @btime sin(1e10)
  22.782 ns (0 allocations: 0 bytes)
-0.4875060250875107

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e10)
  50.323 ns (0 allocations: 0 bytes)
-0.4875060250875107

julia> @btime sin(20_000_000)
  22.806 ns (0 allocations: 0 bytes)
-0.7631011174721591

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 20_000_000)
  7.975 ns (0 allocations: 0 bytes)
-0.7631011174721591

julia> @btime sin(1e8)
  22.133 ns (0 allocations: 0 bytes)
0.931639027109726

julia> @btime ccall(:sin, Cdouble, (Cdouble,), 1e8)
  7.774 ns (0 allocations: 0 bytes)
0.931639027109726

So sin is actually slower for me with a number 100x larger than what's in the original code, but significantly faster for the values that are in it. If you actually care mostly about numbers THAT large, it'll be interesting to see what the numpy result is there as well.

Oh, indeed, with 1e8 I get 28 vs 14 ns. Still not the 7 ns you show, but at least something. I used 1e10 because I didn't think it could make any difference - both numbers are quite large anyway :slight_smile:

That difference is enough to explain the difference between the Julia version and the numpy/numba version. The 2x vs 3x difference is likely from the libc implementation difference and/or hardware differences. Discussion about further optimization of the Julia implementation should probably be moved to GitHub, and for the time being, you should be able to use some benchmarks to decide which implementation is the best for your code:

julia> function f()
           r = 0.
           for i in 0:100_000_000 - 1
               r += sin(i)
           end
           return r
       end
f (generic function with 1 method)

julia> function h()
           r = 0.
           for i in 0:100_000_000 - 1
               r += ccall(:sin, Cdouble, (Cdouble,), i)
           end
           return r
       end
h (generic function with 1 method)

julia> @btime f()
  2.803 s (0 allocations: 0 bytes)
0.7820103194606116

julia> @btime h()
  1.659 s (0 allocations: 0 bytes)
0.7820103194608199

(Note that unless you're using the default Windows terminal emulator, you can copy and paste the above code and it will run.)

The question is not just about speed, but about correctness. Firstly, we can see that the Julia sin and the system sin (I'm using Mac OS 10.12) give different results, even for small values of the argument:

julia> systemsin(x) = ccall(:sin, Cdouble, (Cdouble,), x)

julia> sin(10)
-0.5440211108893698

julia> systemsin(10)
-0.5440211108893699

Of course, they don't differ by much:

julia> sin(10) == nextfloat(systemsin(10))
true

How can we know which is "correct"? Let's use CRlibm.jl, which wraps the CRlibm library. CRlibm guarantees a correctly-rounded result, which means that it returns the closest Float64 to the true answer.

julia> using Pkg; Pkg.add("CRlibm")

julia> using CRlibm

julia> CRlibm.sin(10)
-0.5440211108893698

We can confirm this using BigFloats:

julia> sin(big(10)) - sin(10)
-3.894989866822355711771841779318966139157418401261675720964049234673352541619043e-17

julia> sin(big(10)) - systemsin(10)
7.207240379429209692464474901589236985842581598738324279035950765326647458380957e-17

So the system sin function is less accurate, even when x is 10!

Then we can try

julia> count(sin(i) != systemsin(i) for i in 0:100_000_000)
27257541

and we see that the two versions differ roughly a quarter of the time!

As far as I understand, CRlibm does not actually guarantee correct rounding for the sin function outside of the interval [-pi, pi], but it has a "high probability of being correct", according to the docs. It does guarantee that exp is correctly rounded, so we can try the following for both sin and exp:

julia> count(sin(i) != CRlibm.sin(i) for i in 0:100_000_000)
3144328

julia> count(systemsin(i) != CRlibm.sin(i) for i in 0:100_000_000)
27367842

julia> systemexp(x) = ccall(:exp, Cdouble, (Cdouble,), x)
systemexp (generic function with 1 method)

julia> count(exp(i) != CRlibm.exp(i) for i in 0:100_000_000)
55

julia> count(systemexp(i) != CRlibm.exp(i) for i in 0:100_000_000)
1

It seems that the system exp function is better than Julia's, but the sin function is worse.


This is highly system-dependent, I guess. On my Linux machine the system exp gives the same results as yours, but the system sin is better than Julia's. In a perfect world I would expect normal, plain code to give the most correct results possible with reasonable performance, but to switch to a faster, less correct implementation with @fastmath. The current implementation is neither particularly fast, nor correct compared to (some) system implementations.
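For reference, opting in looks like this today (a minimal sketch; whether sin is actually replaced by a faster variant depends on the Julia version and platform):

```julia
# Sketch: @fastmath relaxes IEEE semantics inside the annotated block and may
# substitute faster, less accurate variants of math functions, depending on
# the Julia version and platform.
function sinsum_fast(n)
    r = 0.0
    @fastmath for i in 0:n-1
        r += sin(i)
    end
    return r
end

sinsum_fast(1000)  # close to, but not necessarily bitwise equal to, the strict version
```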


I checked that CRlibm is correctly rounded for the values in question:

julia> roundedsin(x) = convert(Float64, sin(big(x)))
roundedsin (generic function with 1 method)

julia> @time count(CRlibm.sin(i) != roundedsin(i) for i in 0:100_000_000)
782.344087 seconds (1.84 G allocations: 46.202 GiB, 1.09% gc time)
0

(Of course, I love the fact that Julia makes doing this kind of analysis so easy and fun!)


Then there is the issue about how to sum up arrays of floating-point numbers more accurately, which I have seen discussed somewhere before.
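One standard technique for that is compensated (Kahan) summation, which carries a running correction term for the low-order bits lost in each addition. A minimal sketch:

```julia
# Kahan (compensated) summation: tracks the rounding error of each
# addition in `c` and feeds it back into the next term.
function kahan_sum(xs)
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    return s
end

kahan_sum(sin(i) for i in 0:1000)
```

(Note that Julia's built-in `sum` over arrays already uses pairwise summation, which is also considerably more accurate than a naive left-to-right loop.)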

As @dpsanders' post makes clear, the tradeoff between speed and accuracy in math functions is tricky. Each libm does it differently. We're always happy to make performance improvements where possible, although not typically at the cost of correctness. It is also a priority for the project that Julia should give the same results on different platforms. So even if your system libm happens to be faster or more correct, we still don't want to use it by default, because we want Julia programs to produce the same results on different systems. There are some system libms (e.g. Apple's) which are very good and fast and we would love to "steal" from them, but they are either not open source or not appropriately licensed for inclusion in Julia. Of course, if you really care about either speed or accuracy, we give you the option to swap out libm functions by using your system libm (maybe better or worse, kind of a crapshoot), or using packages like Yeppp (faster but less accurate) or CRlibm (more accurate but slower).


I think these implementations (Julia's and the system's) only guarantee (?) that the result is within 1 ulp of the correct result. But still, it is useful to see how far they are from being correctly rounded.
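A quick way to quantify that, assuming finite Float64 values of the same sign, is to count how many representable values separate two results:

```julia
# Distance in ulps between two Float64 values: for finite floats of the same
# sign, adjacent bit patterns correspond to adjacent floats, so the difference
# of the integer representations counts the representable values between them.
ulpdist(a::Float64, b::Float64) = abs(reinterpret(Int64, a) - reinterpret(Int64, b))

ulpdist(1.0, nextfloat(1.0))  # 1
```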

The results are both system library dependent and system hardware dependent.

I ran trig benchmarks in Julia and in IPython with Numba, Julia ran as fast or faster.

I think Numba uses SVML through llvmlite.
In my experience (tests done in C, looping on an aligned array using SIMD operations), SVML will beat Yeppp and SLEEF for trig / exp functions.
A nice thing about SVML is that it has built-in levels of accuracy, which can be controlled by the fp:fast=1 / fp:fast=2 compiler flags.

I think it is on Julia's to-do list to integrate SVML with assistance (to some level) from Intel (or it is at least being considered).
Once that is done (even as a user-enabled option and not as the default), Julia will have the fastest trig / exp / power / etc. functions (at least on x64).


Unlike @RoyiAvital I see good results with SLEEF, consistent with their published benchmarks, which show that it is competitive with Intel SVML. (SLEEF is an open source library for short-vector math.) It is currently a bit painful to use SLEEF with Julia, but it may be easier to mainstream it than SVML (either by automating the following or - in Julia-dev style - reimplementing it).

Here is a proof of concept:

using SIMD
import Libdl
using BenchmarkTools

const libsleef = "/scratch/build/sleef-install/lib/libsleef.so"

const lib = Libdl.dlopen_e(libsleef,
                           Libdl.RTLD_LAZY|Libdl.RTLD_DEEPBIND|Libdl.RTLD_GLOBAL)
if lib == C_NULL
    error("unable to load $libsleef")
end

const VE = Base.VecElement

@inline function xsin_ha(v1::Vec{4,Float64})
    Vec{4,Float64}(Base.llvmcall(
        ("declare <4 x double> @Sleef_sind4_u10(<4 x double>)",
         """%res = call <4 x double> @Sleef_sind4_u10(<4 x double> %0)
            ret <4 x double> %res"""),
        NTuple{4, VE{Float64}}, Tuple{NTuple{4, VE{Float64}}}, v1.elts))
end

function f()
    r = 0.
    for i in 1:100:100_000_000
        r += sin(i)
    end
    return r
end
function g()
    r = 0.
    for i in 1:400:100_000_000
        v = Vec{4,Float64}(float.((i,i+100,i+200,i+300)))
        sv = xsin_ha(v)
        r += sum(sv)
    end
    return r
end

@btime f()
@btime g()

On my system (with a slow processor), I get

  42.251 ms (0 allocations: 0 bytes)
  9.327 ms (0 allocations: 0 bytes)

(Multiply by 100 to compare with the times in previous posts, which have dubious statistical reliability.)


ccall((:Sleef_sind4_u10, "....."), NTuple{4, VE{Float64}}, (NTuple{4, VE{Float64}},), v1.elts) should work.

Thanks, that's much nicer. I got confused by the type-wrapping.

In the SLEEF documentation they show SLEEF beats SVML.
For instance have a look at:


(The image in SLEEF Documentation - Benchmark Results at 24/09/2018).

They show they beat SVML in exp for SSE and AVX.
I, for one, could never replicate a case where SLEEF is faster than SVML in exp.

To share some results I got, I created a project named SLEEF vs. SVML (I'm not even a beginner programmer in C, so it is not efficient code, but it is simple, clear, and measures exactly what's needed).

The idea is simple: create an array, run the comparable SLEEF and SVML functions on it, and compare the results.
For now I have implemented tests for sine and exponent.
For me (on a Core i7 6800K system), SVML is faster by a factor of 2-3 for SSE.
On AVX, SVML is faster by 20-30%.

You can download the ZIP file and try for yourself.
This is for fp:precise.

SLEEF doesn't offer lower precision than 1 ulp for sin and exp.
But I think for a small dynamic-range problem (let's say [-10, 10]), SVML with fp:fast is still as accurate as SLEEF and gets even faster.

Just to be clear, I'm pro SLEEF.
It is great to have near-SVML performance in an open source project.
I wish they would close the gap and even become faster than SVML, so everybody would have access to the best library without needing an Intel Compiler license.
But in the meantime, I really want to have both available in Julia and let the user decide, because in some cases the extra speed of SVML will be significant.
Since Intel lets open source projects integrate SVML, I don't see why not have both.

@Ralph_Smith,
By the way, in your post I see a 4x improvement, which matches the 4-element vector you chose. So SLEEF is basically as fast as Julia's base implementation per element; the speedup comes from operating on 4 elements at a time with SIMD.

SLEEF sounds neat. There appears to be a Julia port at https://github.com/musm/SLEEF.jl, although I don't see these vectorised instructions there.

That is a pure Julia port. It would make a good PR to add support for SIMD.jl's Vec{N,T} type.