Speeding up trig-heavy code

Hello!

I recently released v0.3.2 of SolarPosition.jl. I feel it is now at a point where I am very happy with the performance; I think I have covered all the basics, like type stability, from Performance Tips · The Julia Language.

I benchmarked SolarPosition.jl against its Python equivalent, solposx, and found that we are usually between 10x and 200x faster (single-threaded), depending on input size and algorithm choice. PSA, which provides a great tradeoff between accuracy and computation time, is 200x faster than the Python version. On top of that, we can do multithreading, while the Python version can’t (at least not trivially).

Here is a profiling view I created for PSA:

As you can see, most of the time is spent computing trigonometric functions like sin, cos, and atan.

This is mostly for my own curiosity, but is there even more to gain here? I experimented with @fastmath, but found that it breaks correctness in exchange for maybe a 5-10% performance gain. I also experimented with static arrays, but found that they yielded no improvement.
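For anyone curious, here is a minimal sketch of one way to check the accuracy loss (the 1e8 angle is just an illustrative large input, not a value from the package):

x = 1.0e8                        # illustrative large angle in radians
reference = sin(big(x))          # high-precision reference via BigFloat
plain     = sin(x)               # Base sin: careful argument reduction
fast      = @fastmath sin(x)     # fastmath variant: may relax argument reduction

println("plain error:    ", Float64(abs(plain - reference)))
println("fastmath error: ", Float64(abs(fast - reference)))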

If you want to try something, feel free to make a PR. The GitHub bot will benchmark your changes against the main branch using AirspeedVelocity.jl, so you get immediate feedback.

5 Likes

Is it common to have just a single solar position you want to compute? At this point, it is possible that the single-threaded speedups are close to exhausted (although I wouldn’t underestimate the cleverness of people here), but setting up parallelized computations on GPUs would probably yield massive instant speedups.

6 Likes

You could try the trigonometric functions in packages like SLEEF.jl.

6 Likes

If your code is calculating trigonometric values of the same angle, then combined functions like sincos, sincosd, etc. can help improve performance.
I also think we should have better documentation for KernelAbstractions.jl. I am looking for a fast solution for my own code as well, but using the GPU is tough because the documentation is not very detailed.
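For what it’s worth, here is a minimal KernelAbstractions.jl sketch of the kind of element-wise trig kernel discussed in this thread (the kernel name and sizes are made up, not from SolarPosition.jl). It runs on the CPU backend below, but the same kernel can be launched on a GPU backend:

using KernelAbstractions

# Element-wise sin kernel; the backend decides how work items are scheduled.
@kernel function sin_kernel!(out, @Const(θ))
    i = @index(Global)
    @inbounds out[i] = sin(θ[i])
end

θ = rand(1_000)
out = similar(θ)

backend = get_backend(θ)                         # CPU() for a plain Array
sin_kernel!(backend)(out, θ; ndrange = length(out))
KernelAbstractions.synchronize(backend)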

3 Likes

I use it for the MTK component wrapper, but other than that probably not.

True, already using that.

1 Like

But now I think that I am wrong :joy:.

julia> using BenchmarkTools

julia> x = rand(1000);

julia> f(x) = (sin(x), cos(x))
f (generic function with 1 method)

julia> m(x) = sincos(x)
m (generic function with 1 method)

julia> @btime f.($x);

  5.805 μs (3 allocations: 15.70 KiB)

julia> @btime m.($x);

  6.161 μs (3 allocations: 15.70 KiB)

Results vary with the size of x.

1 Like

When you benchmark on an array, the results get confused with memory effects, and there is probably a lot more noise.

Here is a benchmark on a single number. The improvement is even greater if you use a larger angle, because then it only needs to do the reduction modulo 2π once.

julia> @btime f($(Ref(0.1))[]);
  4.625 ns (0 allocations: 0 bytes)

julia> @btime sincos($(Ref(0.1))[]);
  3.958 ns (0 allocations: 0 bytes)

julia> @btime f($(Ref(10.1))[]);
  6.875 ns (0 allocations: 0 bytes)

julia> @btime sincos($(Ref(10.1))[]);
  5.541 ns (0 allocations: 0 bytes)

5 Likes

Since from the perf log it appears you have a bunch of individual kernel calls, using Reactant.@compile may be able to help identify fusion opportunities.
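Roughly, the pattern would be something like this sketch (the function and array size are placeholders, not SolarPosition.jl code):

using Reactant

# Placeholder for a trig-heavy kernel; @compile traces it so the compiler can
# fuse the individual element-wise operations into fewer kernels.
f(x) = sum(sin.(x) .+ cos.(2 .* x))

x_ra = Reactant.to_rarray(rand(10_000))   # move the data into a Reactant array
f_compiled = @compile f(x_ra)             # compile for the detected backend
f_compiled(x_ra)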

3 Likes

I ran a small test with Reactant.jl; it immediately found and compiled for my RTX 3090 (magic?), and that yielded a 200x speedup. That’s pretty insane!

11 Likes

Are you sure you synchronized your CUDA kernel?

@btime CUDA.@sync ...

I’m not sure what you mean, so I guess no? :sweat_smile:

If you don’t call CUDA.@sync, you’re only measuring the time it takes for the CPU to tell the GPU “do this”. You’re missing the actual computation time plus the time it takes for the results to travel back to the CPU, which is what @sync waits for.
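For example, with plain CUDA.jl the difference looks roughly like this sketch (illustrative array, not the package’s actual workload):

using CUDA, BenchmarkTools

x = CUDA.rand(100_000)

# Without @sync: times only the asynchronous kernel launch.
@btime sin.($x);

# With @sync: blocks until the GPU has finished the computation.
@btime CUDA.@sync sin.($x);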

4 Likes

Reactant does not call CUDA.jl (the Julia-specific bindings to the CUDA runtime); it calls CUDA directly, so that does not apply. However, you can use @compile sync=true to be assured of the same behavior.
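In other words, something along these lines (the function is just a placeholder):

using Reactant, BenchmarkTools

f(x) = sum(sin.(x) .+ cos.(2 .* x))       # placeholder trig-heavy kernel
x_ra = Reactant.to_rarray(rand(10_000))

# sync=true makes the compiled function block until the result is ready,
# so the benchmark measures the full computation, analogous to CUDA.@sync.
f_sync = @compile sync=true f(x_ra)
@btime $f_sync($x_ra);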

3 Likes

@langestefan, if you have a PR open with your use of Reactant, I can offer comments on whether the benchmarking setup is proper. Relatedly, if this is interesting code, we’d be happy to take it as an integration test in Reactant!

2 Likes

I’m working on it, but I have some beginner questions :sweat_smile:. Should I ask those on Slack? (I see no Reactant channel, but I think it deserves one!)

Sounds like they may be helpful to other people as well, so asking on Discourse for persistence would be better.

1 Like

So there is a warning on Julia 1.12 that it is not supported, but I haven’t run into any issues with it so far. I guess 1.12 support is mostly there now and it is ‘okay’ to use?

In the examples, data is converted to a ConcreteRArray using Reactant.to_rarray. But this function has no docs or docstring and ‘may be internal’?

help?> Reactant.to_rarray
  │ Warning
  │
  │  The following bindings may be internal; they may change or be removed in future versions:
  │
  │    •  Reactant.to_rarray

The input to my function is a vector of DateTime, for which Reactant.to_rarray just returns the input unchanged:

julia> dts = collect(DateTime(2020):Hour(1):DateTime(2021))
8785-element Vector{DateTime}:

julia> Reactant.to_rarray(dts)
8785-element Vector{DateTime}:

Still, @compile seems to work correctly.

If you do trig in a loop, LoopVectorization.@turbo should speed it up.
But not by nearly as much as running it on a GPU.
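A minimal sketch of what that could look like (the loop body is a stand-in, not the package’s actual code):

using LoopVectorization

# Stand-in for a trig-heavy inner loop; @turbo vectorizes it, including the
# sin/cos calls, across SIMD lanes.
function sincos_sum!(out, θ)
    @turbo for i in eachindex(θ)
        out[i] = sin(θ[i]) + cos(θ[i])
    end
    return out
end

θ = rand(10_000)
out = similar(θ)
sincos_sum!(out, θ)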

1 Like

I’ve opened two issues:

A PR attempts to fix issue #49:

Some more notes:

  • More improvements, for performance and/or accuracy, could perhaps be achieved by creating specialized kernel functions. I have been playing with these ideas in Sollya, but was not able to achieve an improvement.

    • For example, in this snippet, each value shown in the excerpt depends only on constants and on the two function arguments dt and alg. alg is itself almost a constant: only two values are valid (2001 and 2020). Thus, conceivably, after a branch to choose between the two valid values, a specialized implementation could treat alg as a constant, hardcoding the coefficients of minimax polynomials for some of the shown values. For example, a single minimax polynomial might suffice to calculate λₑ directly from jd (see the sketch after this list). This is just an example; I have not looked into it and would have to play around for some hours to investigate properly.

      • link to source: SolarPosition.jl/src/Positioning/psa.jl at aa12423d8414a8eacca3b65bffa783dc6fe9c528 · JuliaAstro/SolarPosition.jl · GitHub

      • function _solar_position(obs::Observer{T}, dt::DateTime, alg::PSA) where {T}
            # Get parameters as tuple (allocation-free)
            p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15 =
                get_psa_params(alg.coeffs)
        
            # elapsed julian days (n) since J2000.0
            jd = datetime2julian(dt)
            n = jd - 2451545.0                                                  # Eq. 2
        
            # ecliptic coordinates of the sun
            # ecliptic longitude (λₑ), and obliquity of the ecliptic (ϵ)
            Ω = p1 + p2 * n                                                     # Eq. 3
            L = p3 + p4 * n                                                     # Eq. 4
            g = p5 + p6 * n                                                     # Eq. 5
            (sin_Ω, cos_Ω) = sincos(Ω)
            λₑ = L + p7 * sin(g) + p8 * sin(2 * g) + p9 + p10 * sin_Ω           # Eq. 6
            ϵ = p11 + p12 * n + p13 * cos_Ω                                     # Eq. 7
        
            # celestial right ascension (ra) and declination (d)
            (sin_ϵ, cos_ϵ) = sincos(ϵ)
            (sin_λₑ, cos_λₑ) = sincos(λₑ)
            ra = atan(cos_ϵ * sin_λₑ, cos_λₑ)                                   # Eq. 8
        
  • Some of the above would be easier to tackle if the constant coefficients that appear all over the code base were commented more effectively. As it is, when I notice a polynomial in the code, I basically have to reverse engineer it or look for matching numbers in the relevant literature. It would be nice if each polynomial in the source code sat next to a comment that made clear the exact values of the coefficients (before rounding to Float64) and the expression for the Taylor series it comes from.

  • Another thing that might make further improvements easier would be an arbitrary precision implementation.
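To make the specialization idea above concrete, here is a minimal sketch of the branch-then-hardcoded-polynomial pattern; the coefficients are placeholders, not real minimax fits for λₑ:

# Placeholder coefficients only (NOT real minimax fits); real values would come
# from fitting λₑ as a function of n, e.g. in Sollya, for each coefficient set.
λₑ_2001(n) = evalpoly(n, (0.1, 0.2, 0.3))
λₑ_2020(n) = evalpoly(n, (0.1, 0.2, 0.3))

# Branch once on the (almost constant) algorithm year, then call the fully
# specialized polynomial with hardcoded coefficients.
λₑ_specialized(n, alg_year) = alg_year == 2001 ? λₑ_2001(n) : λₑ_2020(n)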

6 Likes

Thank you for looking at the code, providing feedback and preparing the PR!

1 Like