Trig functions very slow

using Yeppp
r = zeros(100_000_000);
@time sum(Yeppp.sin!(r, collect(0.0:(100_000_000 - 1))))
0.966357 seconds (17 allocations: 762.940 MiB, 9.38% gc time)
1 Like

Not an answer, but for those who want to run the Python version themselves: run ipython (required for the %timeit magic) from a terminal, then copy and paste the following:

import numba
import numpy as np

@numba.njit  # compile the loop with Numba
def f():
    r = 0
    for i in range(100_000_000):
        r += np.sin(i)
    return r

%timeit f()

and press return three times.

2 Likes

While for this particular case Yeppp does improve performance (at the cost of almost a gigabyte of memory), that was a simplified minimal example. The real target function is much more complicated, uses complex exponentials, and is not easily vectorizable. It sounds like Julia should be well suited for such applications, but as seen here it performs significantly slower than Python. As I understand it, Julia uses a custom implementation of the trig functions instead of optimized libraries; could that be the reason?

Re: complex exponentials: also see Why is this Julia code considerably slower than Matlab.

As I understand it, Julia uses a custom implementation of the trig functions instead of optimized libraries; could that be the reason?

As of 0.7, Julia does use native Julia versions of the trigonometric functions, ported from openlibm. This generally caused speedups, not slowdowns, relative to 0.6, but openlibm’s implementations are definitely not the fastest.
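One quick way to confirm which implementation a call dispatches to (the exact file and line in the output depend on the Julia version):

@which sin(1.0)  # should point at the pure-Julia method in Base.Math (base/special/trig.jl)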

I’m not too familiar with the variety of available math libraries and don’t know which one is used by Numba, or whether it implements them itself. But in any case, the ~2.2× slowdown compared to Python even for this simple and equivalent example doesn’t look good. I checked, and the speed difference remains for more complicated code as well, at least when it’s very trig-heavy. Sometimes the difference is even larger.
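For reference, the single-threaded Julia loop being timed here is presumably the direct analogue of the Numba version above; a minimal sketch (the exact code from the original post may differ):

function f(n = 100_000_000)
    r = 0.0
    for i in 0:n-1
        r += sin(i)  # Base’s pure-Julia sin
    end
    return r
end

@time f()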

Might this help?

using Base.Threads
function ft(n = 100_000_000)
    rt = zeros(nthreads())  # one accumulator per thread to avoid a data race
    @threads for i in 0:n-1
        @inbounds rt[threadid()] += sin(i)
    end
    sum(rt)
end
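Note that for this to use more than one core, Julia has to be started with the JULIA_NUM_THREADS environment variable set; for example (the thread count below is just an illustration):

$ JULIA_NUM_THREADS=4 julia
julia> using Base.Threads
julia> nthreads()
4
julia> @time ft()  # ft as defined above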

No idea about the state of multi-threading in Numba… although I presume it is off by default and thus doesn’t explain the difference.

Not an expert here, but judging by CPU usage, Numba doesn’t use multi-threading for this.

The example in both languages is single-threaded on purpose, and I looked at CPU usage while running it to confirm that it really does use only a single core. Parallelization in my case is to be implemented at a higher level, and this way I also ensure I’m comparing apples to apples.

The issue seems to be that the new implementation is significantly slower for all larger inputs.
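One quick way to see this argument-size dependence is to benchmark sin at increasing magnitudes; a sketch using BenchmarkTools:

using BenchmarkTools
for x in (1.0, 1e3, 1e7, 2e7)
    @show x
    @btime sin($x)  # $x interpolates the value into the benchmark
end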

1 Like

And in general: please don’t suggest things like Yeppp or multithreading; they are obviously unrelated to the comparison…

10 Likes

Tested using sin(20_000_000) instead of sin(i). This doesn’t seem to be a regression in the new version. However, the averaged runtimes I get are:

glibc libm: 8.23 ns (~38 cycles)
new Julia implementation: 23.35 ns
openlibm: 88.84 ns
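For anyone who wants to reproduce these numbers from within Julia, all three implementations can be timed directly; a sketch (assuming the libopenlibm library bundled with Julia is loadable under that name):

using BenchmarkTools
x = 20_000_000.0
@btime sin($x)                                                # pure-Julia implementation
@btime ccall(:sin, Cdouble, (Cdouble,), $x)                   # system (glibc) libm
@btime ccall((:sin, "libopenlibm"), Cdouble, (Cdouble,), $x)  # openlibm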

The numbers are much more similar, and favor the new Julia implementation and openlibm, for smaller inputs, so it’s not the fault of the calling (Julia) code. There could be a precision/runtime tradeoff here, though I rather doubt this much slowdown compared to other implementations is expected or desired…

2 Likes

I’m not sure I agree: the OP asked how to fix the slow evaluation of sin in Julia, and Yeppp and multithreading do speed it up. Whether the fact that they are unrelated to the comparison is obvious also depends on whether you are familiar with Python, NumPy, and Numba, which I certainly am not.

The code is the same; if it performs differently, that’s a missing optimization. It’s as simple as that. What I meant should be very obvious is that the Python code is clearly not using either Yeppp or multiple threads explicitly.

In fact, the speedup from Yeppp is very unimpressive, possibly related to allocation. Automatic vectorization of libm functions via gcc+libmvec typically produces a speedup directly proportional to the vector register width, on top of the libm speed.

2 Likes

Still, the Numba decorator could be doing something fancy, like SIMDing the loop; who knows? What I mean is that it might be obvious to you, but not to everyone else.

2 Likes

I didn’t say it can’t (I don’t think it does, though that’s totally irrelevant). If the Numba decorator can do it with the same code and the Julia code doesn’t, that’s a missing optimization. It should be obvious that the code as written doesn’t explicitly do anything fancy, and that’s all.

You shouldn’t need to completely rewrite the code to get that performance; that is the major issue with Yeppp, and slightly less so with @threads. So even though I wouldn’t say no change to the code should be allowed, anything that requires a non-local change shouldn’t be, since such changes strongly affect the usability of the code outside of this synthetic case.

1 Like

I agree, it would naturally be best if Julia just made it fast by default. I still think I answered the question in the first post, though, since the OP was explicitly asking for ideas on how to fix the slow evaluation. So if the OP wants to use Julia and have fast code, Yeppp or multithreading are viable alternatives until Julia’s sin has been optimized.

Well, apparently not, and judging from what I’ve seen before, I believe this requirement is pretty well implied in the original post, where two identical implementations in two different languages are compared. I have no problem with using these fancy features (though I never find Yeppp useful…), but I also find too many people suggesting them without thinking about or mentioning the significant limitations of the new code and all the other caveats. Again, despite being a milder transformation of the code, Trig functions very slow - #7 by improbable22 did a much better job.

Oh, and to get back on track: as a workaround, you can call the system libm version just with ccall(:sin, Cdouble, (Cdouble,), i) (this is what I mean by a local code transformation). It might be worth reporting as an issue as well, even though I’m not sure it’s fixable without sacrificing some precision for large inputs. I thought there was some discussion about this before the merge, but I wasn’t able to find it in https://github.com/JuliaLang/julia/pull/22824 (I could be thinking of https://github.com/JuliaLang/julia/pull/22603, though that doesn’t seem to cover inputs this large either).
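Applied to the loop from the example, the workaround is a one-line change; a sketch (libm_sin and f_libm are just illustrative names):

libm_sin(x) = ccall(:sin, Cdouble, (Cdouble,), x)  # system libm’s sin

function f_libm(n = 100_000_000)
    r = 0.0
    for i in 0:n-1
        r += libm_sin(i)
    end
    return r
end

@time f_libm()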

1 Like

If it works, this will be a perfectly good way to improve performance. However, I don’t see much of a speedup, and absolutely no precision difference:

using BenchmarkTools

@btime sin(1e10)
# 25.853 ns (0 allocations: 0 bytes)
# -0.4875060250875107
@btime ccall(:sin, Cdouble, (Cdouble,), 1e10)
# 22.153 ns (0 allocations: 0 bytes)
# -0.4875060250875107

24.259 ns vs. 13.951 ns for me.