Speeding up trig-heavy code

Hello!

I recently released v0.3.2 of SolarPosition.jl. I feel it is now at a point where I am very happy with the performance; I think I have covered all the basics, like type stability, from Performance Tips · The Julia Language.

I benchmarked SolarPosition.jl against its Python equivalent, solposx, and found that we are usually between 10x and 200x faster (single-threaded), depending on input size and algorithm choice. PSA, which provides a great tradeoff between accuracy and computation time, is 200x faster than the Python version. On top of that, we can do multithreading, while the Python version can’t (at least not trivially).

Here is a profiling view I created for PSA:

As you can see, most of the time is spent computing trigonometric functions like sin, cos, and atan.

This is mostly for my own curiosity, but is there even more to gain here? I experimented with @fastmath, but found that it breaks correctness in exchange for maybe a 5-10% performance gain. I also experimented with static arrays, but found that they yielded no improvement.
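For anyone curious, here is a minimal sketch of one way to check the accuracy loss (the 1e8 angle is just an illustrative large input, not a value from the package):

x = 1.0e8                        # illustrative large angle in radians
reference = sin(big(x))          # high-precision reference via BigFloat
plain     = sin(x)               # Base sin: careful argument reduction
fast      = @fastmath sin(x)     # fastmath variant: may relax argument reduction

println("plain error:    ", Float64(abs(plain - reference)))
println("fastmath error: ", Float64(abs(fast - reference)))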

If you want to try something, feel free to make a PR. The GitHub bot will benchmark your changes against the main branch using AirspeedVelocity.jl, so you get immediate feedback.

5 Likes

Is it common to have just a single solar position you want to compute? At this point, it is possible that the single-threaded speedups are close to exhausted (although I wouldn’t underestimate the cleverness of people here), but setting up parallelized computations on GPUs would probably yield massive instant speedups.

6 Likes

You could try the trigonometric functions in packages like SLEEF.jl.

6 Likes

If your code is calculating trigonometric values of the same angle, then combined functions like sincos, sincosd, etc. can help improve performance.
I also think we should have better documentation for KernelAbstractions.jl. I am looking for a fast solution for my own code as well, but using the GPU is tough because the documentation is not very detailed.
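For what it’s worth, here is a minimal KernelAbstractions.jl sketch of the kind of element-wise trig kernel discussed in this thread (the kernel name and sizes are made up, not from SolarPosition.jl). It runs on the CPU backend below, but the same kernel can be launched on a GPU backend:

using KernelAbstractions

# Element-wise sin kernel; the backend decides how work items are scheduled.
@kernel function sin_kernel!(out, @Const(θ))
    i = @index(Global)
    @inbounds out[i] = sin(θ[i])
end

θ = rand(1_000)
out = similar(θ)

backend = get_backend(θ)                         # CPU() for a plain Array
sin_kernel!(backend)(out, θ; ndrange = length(out))
KernelAbstractions.synchronize(backend)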

3 Likes

I use it for the MTK component wrapper, but other than that probably not.

True, already using that.

1 Like

But now I think that I am wrong :joy:.

julia> using BenchmarkTools

julia> x = rand(1000);

julia> f(x) = (sin(x), cos(x))
f (generic function with 1 method)

julia> m(x) = sincos(x)
m (generic function with 1 method)

julia> @btime f.($x);

  5.805 μs (3 allocations: 15.70 KiB)

julia> @btime m.($x);

  6.161 μs (3 allocations: 15.70 KiB)

Results vary with the size of x.

1 Like

When you benchmark on an array, the results get confused with memory effects, and there is probably a lot more noise.

Here is a benchmark on a single number. The improvement is even greater if you use a larger angle, because then it only needs to do the reduction modulo 2π once.

julia> @btime f($(Ref(0.1))[]);
  4.625 ns (0 allocations: 0 bytes)

julia> @btime sincos($(Ref(0.1))[]);
  3.958 ns (0 allocations: 0 bytes)

julia> @btime f($(Ref(10.1))[]);
  6.875 ns (0 allocations: 0 bytes)

julia> @btime sincos($(Ref(10.1))[]);
  5.541 ns (0 allocations: 0 bytes)

5 Likes

Since from the perf log it appears you have a bunch of individual kernel calls, using Reactant.@compile may be able to help identify fusion opportunities.
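Roughly, the pattern would be something like this sketch (the function and array size are placeholders, not SolarPosition.jl code):

using Reactant

# Placeholder for a trig-heavy kernel; @compile traces it so the compiler can
# fuse the individual element-wise operations into fewer kernels.
f(x) = sum(sin.(x) .+ cos.(2 .* x))

x_ra = Reactant.to_rarray(rand(10_000))   # move the data into a Reactant array
f_compiled = @compile f(x_ra)             # compile for the detected backend
f_compiled(x_ra)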

3 Likes

I ran a small test with Reactant.jl; it immediately found and compiled for my RTX 3090 (magic?), and that yielded a 200x speedup. That’s pretty insane!

11 Likes

Are you sure you synchronized your CUDA kernel?

@btime CUDA.@sync ...

I’m not sure what you mean, so I guess no? :sweat_smile:

If you don’t call CUDA.@sync, you’re only measuring the time it takes for the CPU to tell the GPU “do this”. You’re missing the actual computation time plus the time it takes for the results to travel back to the CPU, which is what @sync waits for.
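For example, with plain CUDA.jl the difference looks roughly like this sketch (illustrative array, not the package’s actual workload):

using CUDA, BenchmarkTools

x = CUDA.rand(100_000)

# Without @sync: times only the asynchronous kernel launch.
@btime sin.($x);

# With @sync: blocks until the GPU has finished the computation.
@btime CUDA.@sync sin.($x);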

4 Likes

Reactant does not call CUDA.jl (the Julia-specific bindings to the CUDA runtime); it calls CUDA directly, so that does not apply. However, you can use @compile sync=true to be assured of the same behavior.
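In other words, something along these lines (the function is just a placeholder):

using Reactant, BenchmarkTools

f(x) = sum(sin.(x) .+ cos.(2 .* x))       # placeholder trig-heavy kernel
x_ra = Reactant.to_rarray(rand(10_000))

# sync=true makes the compiled function block until the result is ready,
# so the benchmark measures the full computation, analogous to CUDA.@sync.
f_sync = @compile sync=true f(x_ra)
@btime $f_sync($x_ra);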

3 Likes

@langestefan, if you have a PR open with your use of Reactant, I can offer comments on whether the benchmarking setup is proper. Relatedly, if this is interesting code, we’d be happy to take it as an integration test in Reactant!

2 Likes

I’m working on it, but I have some beginner questions :sweat_smile:. Should I ask those on Slack? (I see no Reactant channel, but I think it deserves one!)

Sounds like they may be helpful to other people as well, so asking on Discourse for persistence would be better.

1 Like

So there is a warning on Julia 1.12 that it is not supported, but I haven’t run into any issues with it so far. I guess 1.12 support is mostly there now and it is ‘okay’ to use?

In the examples, data is converted to a ConcreteRArray using Reactant.to_rarray. But this function has no docs or docstring and ‘may be internal’?

help?> Reactant.to_rarray
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Reactant.to_rarray

The input to my function is a vector of DateTime, for which Reactant.to_rarray just returns the input unchanged:

julia> dts = collect(DateTime(2020):Hour(1):DateTime(2021))
8785-element Vector{DateTime}:

julia> Reactant.to_rarray(dts)
8785-element Vector{DateTime}:

Still, @compile seems to work correctly.

If you do trig in a loop, LoopVectorization.@turbo should speed it up.
But not by nearly as much as running it on a GPU.
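A minimal sketch of what that could look like (the loop body is a stand-in, not the package’s actual code):

using LoopVectorization

# Stand-in for a trig-heavy inner loop; @turbo vectorizes it, including the
# sin/cos calls, across SIMD lanes.
function sincos_sum!(out, θ)
    @turbo for i in eachindex(θ)
        out[i] = sin(θ[i]) + cos(θ[i])
    end
    return out
end

θ = rand(10_000)
out = similar(θ)
sincos_sum!(out, θ)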

1 Like

I’ve opened two issues:

A PR attempts to fix issue #49:

Some more notes:

  • More improvements, for performance and/or accuracy, could perhaps be achieved by creating specialized kernel functions. I have been playing with these ideas in Sollya, but was not able to achieve an improvement.

    • For example, in this snippet, each value shown in the excerpt depends only on constants and on the two function arguments dt and alg. alg is itself almost a constant: only two values are valid (2001 and 2020). Thus, conceivably, after a branch to choose between the two valid values, a specialized implementation could treat alg as a constant, hardcoding the coefficients of minimax polynomials for some of the shown values. For example, a single minimax polynomial might suffice to calculate λₑ directly from jd (see the sketch after this list). This is just an example; I have not looked into it and would have to play around for some hours to investigate properly.

      • link to source: SolarPosition.jl/src/Positioning/psa.jl at aa12423d8414a8eacca3b65bffa783dc6fe9c528 · JuliaAstro/SolarPosition.jl · GitHub

      • function _solar_position(obs::Observer{T}, dt::DateTime, alg::PSA) where {T}
            # Get parameters as tuple (allocation-free)
            p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15 =
                get_psa_params(alg.coeffs)
        
            # elapsed julian days (n) since J2000.0
            jd = datetime2julian(dt)
            n = jd - 2451545.0                                                  # Eq. 2
        
            # ecliptic coordinates of the sun
            # ecliptic longitude (λₑ), and obliquity of the ecliptic (ϵ)
            Ω = p1 + p2 * n                                                     # Eq. 3
            L = p3 + p4 * n                                                     # Eq. 4
            g = p5 + p6 * n                                                     # Eq. 5
            (sin_Ω, cos_Ω) = sincos(Ω)
            λₑ = L + p7 * sin(g) + p8 * sin(2 * g) + p9 + p10 * sin_Ω           # Eq. 6
            ϵ = p11 + p12 * n + p13 * cos_Ω                                     # Eq. 7
        
            # celestial right ascension (ra) and declination (d)
            (sin_ϵ, cos_ϵ) = sincos(ϵ)
            (sin_λₑ, cos_λₑ) = sincos(λₑ)
            ra = atan(cos_ϵ * sin_λₑ, cos_λₑ)                                   # Eq. 8
        
  • Some of the above would be easier to tackle if the constant coefficients that appear all over the code base were commented more effectively. As it is, when I notice a polynomial in the code, I basically have to reverse engineer it or look for matching numbers in the relevant literature. It would be nice if each polynomial in the source code sat next to a comment that made clear the exact values of the coefficients (before rounding to Float64) and the expression for the Taylor series it comes from.

  • Another thing that might make further improvements easier would be an arbitrary precision implementation.
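To make the specialization idea above concrete, here is a minimal sketch of the branch-then-hardcoded-polynomial pattern; the coefficients are placeholders, not real minimax fits for λₑ:

# Placeholder coefficients only (NOT real minimax fits); real values would come
# from fitting λₑ as a function of n, e.g. in Sollya, for each coefficient set.
λₑ_2001(n) = evalpoly(n, (0.1, 0.2, 0.3))
λₑ_2020(n) = evalpoly(n, (0.1, 0.2, 0.3))

# Branch once on the (almost constant) algorithm year, then call the fully
# specialized polynomial with hardcoded coefficients.
λₑ_specialized(n, alg_year) = alg_year == 2001 ? λₑ_2001(n) : λₑ_2020(n)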

6 Likes

Thank you for looking at the code, providing feedback and preparing the PR!

1 Like