Trig functions very slow

aplavin · September 22, 2018, 10:58am

I found that some simple mathematical functions, at least trigonometric, are much slower in Julia than e.g. in Python (with numba). Here is a minimalistic example:

function f()
    r = 0.
    for i in 0:100_000_000 - 1
        r += sin(i)
    end
    return r
end

using BenchmarkTools
@btime f()
# 3.584 s (0 allocations: 0 bytes)

vs

@numba.njit
def f():
    r = 0
    for i in range(100_000_000):
        r += np.sin(i)
    return r
        
%timeit f()
# 1.65 s ± 8.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, the same function runs more than twice as slowly in Julia! If I replace r += sin(i) with r += i, the runtimes of the Julia and Python functions become the same.

Any idea how to fix that?

baggepinnen · September 22, 2018, 11:54am

using Yeppp
r = zeros(100_000_000);
@time sum(Yeppp.sin!(r,collect(0.:100_000_000 - 1)))
0.966357 seconds (17 allocations: 762.940 MiB, 9.38% gc time)

tkoolen · September 22, 2018, 11:54am

Not an answer, but for those who want to run the Python version themselves, run ipython (required for %timeit magic) from a terminal, copy and paste the following:

import numba
import numpy as np
import timeit

@numba.njit
def f():
    r = 0
    for i in range(100_000_000):
        r += np.sin(i)
    return r
        
%timeit f()

and press return three times.

aplavin · September 22, 2018, 12:05pm

While for this particular case Yeppp does improve the performance (at the cost of almost a gig of memory usage), it was a simplified minimal example. The real target function is much more complicated, uses complex exponentials, and is not easily vectorizable. It sounds like Julia should be well-suited for such applications, but as I see here it performs significantly slower than Python. As I understand, Julia uses a custom implementation of trig functions instead of optimized libraries, can it be the reason?

tkoolen · September 22, 2018, 12:14pm

Re: complex exponentials: also see Why is this Julia code considerably slower than Matlab.

aplavin wrote: “As I understand, Julia uses a custom implementation of trig functions instead of optimized libraries, can it be the reason?”

As of 0.7, Julia does use native Julia versions of trigonometric functions, ported from openlibm. This generally caused speedups, not slowdowns over 0.6, but openlibm’s implementations are definitely not the fastest.

aplavin · September 22, 2018, 12:22pm

I’m not too familiar with the variety of available math libraries and don’t know which one numba uses, or whether it implements these functions itself. But anyway, a 2.2x slowdown compared to Python, even for this simple and equivalent example, doesn’t look good. I checked, and the speed difference remains for more complicated code as well, at least when it’s very trig-heavy. Sometimes the difference is even larger.

improbable22 · September 22, 2018, 12:31pm

Might this help?

using Base.Threads
function ft(n=100_000_000)
    rt = zeros(nthreads())
    @threads for i in 0:n-1
        @inbounds rt[threadid()] += sin(i)
    end
    sum(rt)
end

I have no idea about the state of multi-threading in Numba… although I presume it is off by default and thus doesn’t explain the difference.

tkoolen · September 22, 2018, 12:33pm

Not an expert here, but judging by CPU usage, Numba doesn’t use multi-threading for this.

aplavin · September 22, 2018, 12:34pm

The example in both languages is single-threaded on purpose, and I looked at CPU usage while running it to confirm that it does use a single core only. Parallelization in my case is to be implemented at a higher level, and also this way I ensure comparing apples to apples.

yuyichao · September 22, 2018, 12:35pm

The issue seems to be that the new implementation is significantly slower for all the larger numbers.

yuyichao · September 22, 2018, 12:37pm

And in general, please don’t suggest things like Yeppp or multithreading; they are obviously unrelated to the comparison…

yuyichao · September 22, 2018, 12:44pm

Tested using sin(20_000_000) instead of sin(i). This doesn’t seem like a regression with the new version. However, the average runtimes I get are:

glibc libm: 8.23ns (~ 38 cycles)
new julia implementation: 23.35ns
openlibm: 88.84ns

The numbers are much more similar, and favor the new Julia implementation/openlibm, for smaller numbers, so it’s not the fault of the calling (Julia) code. There could be a precision/runtime tradeoff here, though I kind of doubt that this much of a slowdown compared to other implementations is expected or desired…
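
To put the three timings above on a common scale, here is a small Python sketch of the arithmetic. It assumes that the clock rate implied by "8.23ns (~ 38 cycles)" also applies to the other two measurements, which the post does not state. The variable names are purely illustrative.

```python
# Timings quoted above, in nanoseconds per call to sin(20_000_000).
timings_ns = {
    "glibc libm": 8.23,
    "new julia implementation": 23.35,
    "openlibm": 88.84,
}

# "8.23ns (~ 38 cycles)" implies a clock of roughly 38 / 8.23 GHz.
# Assumption: the same clock applies to all three measurements.
clock_ghz = 38 / 8.23

# Convert each timing to cycles and to a slowdown factor relative to glibc.
cycles = {name: ns * clock_ghz for name, ns in timings_ns.items()}
slowdown = {name: ns / timings_ns["glibc libm"] for name, ns in timings_ns.items()}

for name in timings_ns:
    print(f"{name}: {cycles[name]:.0f} cycles, {slowdown[name]:.1f}x glibc")
```

Under that assumption, the slowdown factors depend only on the quoted nanosecond values, so they hold regardless of the exact clock rate.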

baggepinnen · September 22, 2018, 12:46pm

I’m not sure if I agree, the OP asked how to fix the slow evaluation of sin in julia. Yeppp and multithreading do speed up this. Whether or not the fact that they are unrelated to the comparison is obvious also depends on whether or not you are familiar with python, numpy and numba, which I certainly am not.

yuyichao · September 22, 2018, 12:53pm

The code is the same; if the results differ, that’s a missing optimization. It’s as simple as that. What I meant should be very obvious is that the numpy code is clearly not using either Yeppp or multiple threads explicitly.

In fact, the speedup from Yeppp is very unimpressive, possibly because of allocation. Automatic vectorization of libm functions by gcc+libmvec typically produces a speedup directly proportional to the vector register size, on top of the libm speed.
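
As a rough illustration of "proportional to the vector register size", the sketch below counts how many 64-bit doubles fit in one register for a few common x86 SIMD extensions. Treating the lane count as the attainable speedup is an idealization taken from the statement above, not a measurement, and the names are illustrative only.

```python
DOUBLE_BITS = 64  # width of one Float64 / C double

# Register widths of common x86 SIMD extensions, in bits.
register_bits = {"SSE2": 128, "AVX2": 256, "AVX-512": 512}

# Lanes per register: an idealized upper bound on the speedup over a
# scalar libm call, assuming speedup scales with register size.
lanes = {isa: bits // DOUBLE_BITS for isa, bits in register_bits.items()}

for isa, n in lanes.items():
    print(f"{isa}: {n} doubles per register, up to ~{n}x")
```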

baggepinnen · September 22, 2018, 12:57pm

Still, the numba macro could be doing something fancy, like SIMDing the loop, who knows? What I mean is that it might be obvious to you, but not to everyone else.

yuyichao · September 22, 2018, 1:11pm

I didn’t say it can’t (I don’t think it does, though, but that’s totally irrelevant). If the numba decorator can do it with the same code, and the Julia code doesn’t, that’s a missing optimization. It should be obvious that the code as written doesn’t do anything fancy explicitly, and that’s all.

You shouldn’t need to completely rewrite the code to get that performance, which is the major issue with Yeppp and, to a lesser extent, with @threads. I wouldn’t say that no change to the code should be allowed, but anything that requires a non-local change shouldn’t be, since it will strongly affect the usability of the code outside of this synthetic case.

baggepinnen · September 22, 2018, 1:17pm

I agree, it would naturally be best if Julia just made it fast by default. I still think I answered the question in the first post though, since the OP was explicitly asking for ideas on how to fix the slow evaluation. So if the OP wants to use Julia and have fast code, Yeppp or multithreading are viable alternatives until Julia’s sin has been optimized.

yuyichao · September 22, 2018, 1:29pm

Well, apparently not, and judging from what I’ve seen before, I believe this requirement is pretty well implied in the original post when two identical implementations in two different languages are compared. I have no problem with using these fancy features (though I never find Yeppp useful…), but I also find too many people suggesting them without thinking about or mentioning the significant limitations of the new code and all the other caveats. Again, despite being a milder transformation of the code, post #7 by improbable22 did a much better job.

yuyichao · September 22, 2018, 1:39pm

Oh, and to get back on track: you can call the system libm version with just ccall(:sin, Cdouble, (Cdouble,), i) as a workaround (this is what I mean by a local code transformation). It might be worth reporting as an issue as well, even though I’m not sure it’s fixable without sacrificing some precision for large inputs. I thought there was some discussion about this before the merge, but I wasn’t able to find it in https://github.com/JuliaLang/julia/pull/22824 (I could be thinking of https://github.com/JuliaLang/julia/pull/22603, though it doesn’t seem to cover inputs this large either).

aplavin · September 22, 2018, 1:42pm

If it works, this will be a perfectly good way to improve performance. However, I don’t see much of a speedup, and absolutely no precision difference:

@btime sin(1e10)
# 25.853 ns (0 allocations: 0 bytes)
# -0.4875060250875107
@btime ccall(:sin, Cdouble, (Cdouble,), 1e10)
# 22.153 ns (0 allocations: 0 bytes)
# -0.4875060250875107
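
One way to check the precision observation independently is to compare against a reference computed in higher precision. The following is a rough Python sketch using only the standard library. `sin_reference` and `PI` are hypothetical names introduced here, and the approach (exact argument reduction in 60-digit decimal arithmetic followed by a Taylor series) is just one simple choice, not how Julia or libm computes sin.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 60  # working precision in significant digits

# pi to 59 decimal places
PI = Decimal("3.14159265358979323846264338327950288419716939937510582097494")


def sin_reference(x: float) -> Decimal:
    """sin(x) in high precision: reduce x modulo 2*pi, then sum the Taylor series."""
    d = Decimal(x)  # exact decimal value of the double
    two_pi = 2 * PI
    # Subtract the nearest multiple of 2*pi, leaving r in roughly [-pi, pi].
    r = d - (d / two_pi).to_integral_value() * two_pi
    term = total = r
    n = 1
    while abs(term) > Decimal(10) ** -50:
        # Next Taylor term: multiply the previous one by -r^2 / ((2n)(2n+1)).
        term = -term * r * r / ((2 * n) * (2 * n + 1))
        total += term
        n += 1
    return total


x = 1e10
reference = float(sin_reference(x))
quoted = -0.4875060250875107  # value printed by both calls in the post above
print(reference, math.sin(x), quoted)
```

If the reference agrees with the quoted value to within rounding, both implementations are accurate at this input, and any speed difference is not being bought with precision here.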
