Julia vs C++ speed

Ronis_BR · August 28, 2021, 2:24pm

Hi!

I am currently doing a very “nice” project of converting algorithms written in Julia to C++. I need those algorithms in an embedded environment and, unfortunately, Julia cannot yet generate compiled code that can be embedded.

I am taking a very nice approach with the help of CxxWrap.jl. The translation is happening smoothly since I can rewrite function by function and test if anything breaks.

All the algorithms are the ones in SatelliteToolbox.jl.

The very interesting part was when I started to compare the performance. There are three algorithms that are relatively costly for the satellite computer: IGRF (geomagnetic field), SGP4 (orbit propagator), and FK5 reduction (nutation calculation). Check out the comparison between the Julia version in SatelliteToolbox.jl and the C++ version (using -O3 -march=native):

Algorithm	Julia [ns]	C++ [ns]
IGRF	1373	1425
SGP4 propagation	340	660
FK5 nutation	1436	1201

Notice that Julia “lost” only in the nutation computation, which is an algorithm that has to go through a table of 106 rows and 9 columns and perform sums (I think there is room for optimization here). I am also not sure whether recompiling Julia with -O3 -march=native would help at all.

My colleagues and I were just astonished by those results. This is a language that feels interpreted, up against C++. I always thought that pursuing C++ speed was the ultimate goal, but now I see that Julia can be even faster, for reasons I just cannot explain.

jling · August 28, 2021, 2:31pm

github.com/JuliaSpace/SatelliteToolbox.jl/blob/abd462a68a498f4ee5cd2b8f9c96e50a2b59d203/src/transformations/fk5/nutation.jl#L280

          # Nutation in longitude and obliquity
          # ===================================
          
          # Compute the nutation in the longitude and in obliquity.
          ΔΨ_1980 = 0.0
          Δϵ_1980 = 0.0
          
          @inbounds for i = 1:n_max
              # Unpack values.
              an1 = nut_coefs_1980[i,1]
              an2 = nut_coefs_1980[i,2]
              an3 = nut_coefs_1980[i,3]
              an4 = nut_coefs_1980[i,4]
              an5 = nut_coefs_1980[i,5]
              Ai  = nut_coefs_1980[i,6]
              Bi  = nut_coefs_1980[i,7]
              Ci  = nut_coefs_1980[i,8]
              Di  = nut_coefs_1980[i,9]
          
              a_pi = an1 * M_m + an2 * M_s + an3 * u_Mm + an4 * D_s + an5 * Ω_m

Don’t do this; it copies on every loop iteration. Better not to store the look-up table as a Matrix to begin with; just keep these columns separately as SVectors? That way you also avoid promoting integers to floats in the Matrix.
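
A minimal sketch of the separate-columns idea, using plain tuples from Base as a stand-in for SVector (which lives in StaticArrays.jl). The names `multipliers`, `amplitudes`, and `weighted_sines`, and the three values, are made up for illustration and are not taken from SatelliteToolbox.jl:

```julia
# Hypothetical three-term excerpt: one integer column and one float column,
# stored separately so each keeps its own element type.
multipliers = (1, 2, 0)                      # stays Int
amplitudes  = (-171996.0, 2062.0, -13187.0)  # stays Float64

# Sum amplitude * sin(multiplier * angle) over all terms.
function weighted_sines(multipliers, amplitudes, angle)
    acc = 0.0
    for i in 1:length(multipliers)
        acc += amplitudes[i] * sin(multipliers[i] * angle)
    end
    return acc
end

weighted_sines(multipliers, amplitudes, 0.5)
```

Whether this layout is actually faster than a matrix depends on the loop, so it is worth benchmarking both.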

Ronis_BR · August 28, 2021, 2:42pm

I really need help here. In my local branch, I switched the dimensions since Julia is column-major. The benchmarks were obtained in this configuration. I also tried to make nut_coefs_1980 a vector of vectors, and the speed was worse. I also converted it to a vector of SVector. In this case, the performance was almost the same as when I used the look-up matrix with the dimensions permuted.

jling · August 28, 2021, 2:43pm

any runnable snippet with that function alone? I can give it a stab

Ronis_BR · August 28, 2021, 2:43pm

Sure:

using SatelliteToolbox
nutation_fk5(2.460115315972222e6)

Thanks!!

Sukera · August 28, 2021, 2:45pm

nut_coefs_1980 is a 2D matrix of Floats, there’s no allocation happening here. Just regular array access/load from memory. Since the table in source code mixes integers and floats, the integers will be promoted to floats as well, making sure the matrix is uniformly typed.
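
A small sketch of both points, using a made-up 3×3 table rather than the real nut_coefs_1980 (the values and the function names `sum_column` and `sum_column_sliced` are illustrative assumptions):

```julia
# Hypothetical 3×3 stand-in for the coefficient table (values are illustrative).
table = [
    0  1  -171996.0
    2  0   -13187.0
    1  1     2274.0
]

eltype(table)  # Float64: the integer literals were promoted

# Scalar indexing reads one element at a time; it does not copy a row.
function sum_column(t, j)
    s = zero(eltype(t))
    @inbounds for i in axes(t, 1)
        s += t[i, j]
    end
    return s
end

# Slicing, by contrast, allocates a new Vector on every iteration.
function sum_column_sliced(t, j)
    s = zero(eltype(t))
    for i in axes(t, 1)
        row = t[i, :]
        s += row[j]
    end
    return s
end
```

Both functions return the same value; only the second one would show up as allocations in a benchmark.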

jling · August 28, 2021, 2:46pm

Oops, I thought it was slicing; now I realize it’s accessing a single element at a time.

Ronis_BR · August 28, 2021, 2:47pm

Ah @jling, notice that this is not my newest code. I did not push it because I am doing some tests. The changes were:

const nut_coefs_1980 = permutedims([
...
])

and

        an1 = nut_coefs_1980[1,i]
        an2 = nut_coefs_1980[2,i]
        an3 = nut_coefs_1980[3,i]
        an4 = nut_coefs_1980[4,i]
        an5 = nut_coefs_1980[5,i]
        Ai  = nut_coefs_1980[6,i]
        Bi  = nut_coefs_1980[7,i]
        Ci  = nut_coefs_1980[8,i]
        Di  = nut_coefs_1980[9,i]
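
To see why permuting the table matters, here is a tiny sketch with a made-up 2×3 matrix (the names `rows_first` and `cols_first` are illustrative). Julia stores matrices column by column, so after `permutedims` the nine coefficients of one term sit next to each other in memory:

```julia
rows_first = [1 2 3; 4 5 6]            # 2×3, one "table row" per matrix row
cols_first = permutedims(rows_first)   # 3×2, one "table row" per matrix column

vec(rows_first)   # [1, 4, 2, 5, 3, 6]: columns are contiguous in memory
vec(cols_first)   # [1, 2, 3, 4, 5, 6]: each original row is now contiguous
```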

jling · August 28, 2021, 2:48pm

btw,

@turbo for i = 1:n_max

seems to work. Before:

julia> @benchmark nutation_fk5(2.460115315972222e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.166 μs …  3.456 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.183 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.188 μs ± 45.902 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █▁                                                      
  ▂▄████▄▃▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  1.17 μs        Histogram: frequency by time         1.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

after:

julia> @benchmark nutation_fk5(2.460115315972222e6)
BenchmarkTools.Trial: 10000 samples with 219 evaluations.
 Range (min … max):  339.320 ns … 436.995 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     348.790 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   349.821 ns ±   5.359 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▇▇█▆▄▁                                            
  ▂▂▂▂▂▂▃▃▄▅▇████████▇▅▄▃▃▃▃▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  339 ns           Histogram: frequency by time          377 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The results are the same within floating-point error, I think:

julia> nutation_fk5(2.460115315972222e6)
#n_max = 106
(0.4090395463527846, 3.477881920473464e-5, -4.185255450109059e-5)

julia> nutation_fk5(2.460115315972222e6)
(0.4090395463527846, 3.477881920473462e-5, -4.185255450109061e-5)
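
One way to check this claim in code, as a rough sketch: copy the two printed tuples and compare them with `isapprox`. The names `before` and `after` and the tolerance `1e-14` are choices made here for illustration:

```julia
# The two results printed above, copied verbatim.
before = (0.4090395463527846, 3.477881920473464e-5, -4.185255450109059e-5)
after  = (0.4090395463527846, 3.477881920473462e-5, -4.185255450109061e-5)

# Largest relative difference across the three components.
max_rel_diff = maximum(abs.(before .- after) ./ abs.(before))

# Component-wise comparison with a tight relative tolerance.
agree = all(isapprox.(before, after; rtol = 1e-14))
```

Here `agree` is `true`: the components differ only in the last printed digit or two.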

Ronis_BR · August 28, 2021, 2:55pm

This is AWESOME! Thanks! I did not know about @turbo before!

Ronis_BR · August 28, 2021, 3:04pm

With @jling's suggestion, the table can be updated:

Algorithm	Julia [ns]	C++ [ns]
IGRF	1373	1425
SGP4 propagation	340	660
FK5 nutation	409	1201

The only “downside” is that the compilation time is much higher when using @turbo. Also, I think we can apply similar tuning to the C++ code as well.

Anyway, the fact that straightforward Julia code can match and even beat C++ is just amazing!

lmiq · August 28, 2021, 3:08pm

test it with only

@inbounds @simd for i = 1:n_max

(maybe it was already tested). It will compile faster, and sometimes the difference to @turbo is not that large.

Edit: in this case it will probably make a difference, because you have a sincos inside the loop and, if I’m not mistaken, @turbo uses faster versions of these functions. You may want to try @fastmath on that function, if that is the case. (In both cases those faster versions have some accuracy loss; I don’t know if that is relevant for the satellites.)
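
For reference, a self-contained sketch of what such a loop looks like with only Base macros. The function `trig_sums` and its arguments are hypothetical and only mimic the shape of the nutation loop (a sincos per term, accumulated into two sums):

```julia
# Accumulate amplitude-weighted sines and cosines over a list of angles.
# @inbounds drops bounds checks; @simd lets the compiler reorder the two
# reductions, which can change the result in the last few bits.
function trig_sums(angles, amp_sin, amp_cos)
    s = 0.0
    c = 0.0
    @inbounds @simd for i in eachindex(angles, amp_sin, amp_cos)
        si, co = sincos(angles[i])
        s += amp_sin[i] * si
        c += amp_cos[i] * co
    end
    return s, c
end

angles  = collect(range(0.0, 2.0; length = 106))
amp_sin = fill(2.0, 106)
amp_cos = fill(0.5, 106)
trig_sums(angles, amp_sin, amp_cos)
```

Whether the compiler can actually vectorize the sincos call is a separate question; this form mainly removes bounds checks and permits reordering of the sums.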

jling · August 28, 2021, 3:14pm

Now I wonder if something similar can be done here:

github.com/JuliaSpace/SatelliteToolbox.jl/blob/2adbf245bb710e2808939b4898ea936bfefb374f/src/earth/geomagnetic_field_models/igrf/igrf.jl#L445

          dVθ = 0.0   # Derivative of the Geomagnetic potential w.r.t. θ.
          dVϕ = 0.0   # Derivative of the Geomagnetic potential w.r.t. ϕ.
          ΔG  = 0.0   # Auxiliary variable to interpolate the G coefficients.
          ΔH  = 0.0   # Auxiliary variable to interpolate the H coefficients.
          kg  = 1     # Index to obtain the values of the matrix `G`.
          kh  = 1     # Index to obtain the values of the matrix `H`.
          
          # Geomagnetic potential
          # =====================
          
          @inbounds for n in 1:n_max
              aux_dVr = 0.0
              aux_dVθ = 0.0
              aux_dVϕ = 0.0
          
              # Compute the contributions when `m = 0`
              # ======================================
          
              # Get the coefficients in the epoch and interpolate to the desired
              # time.
              Gnm_e0 = G[kg,idx+2]

DNF · August 28, 2021, 3:14pm

Famous last words

(Well, famous last words at a job…)

Ronis_BR · August 28, 2021, 3:17pm

Thanks for the tips! I got those values:

@simd: 1409 ns (slightly faster).
@fastmath: 1386 ns (getting there).
@fastmath @simd: 1380 ns.

We cannot lose accuracy here. However, all those versions give exactly the same result.

If there is a faster sincos version, why does Julia not use it?

Probably! The only problem with IGRF is that I compute the sines and cosines recursively. Hence, each iteration depends on the previous one, and we cannot assume that the loop order can be changed.
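
To illustrate the kind of dependency meant here, a sketch of a recursive sine/cosine computation using the angle-addition formulas. This is an assumption about the general technique, not a copy of the IGRF code, and `sincos_multiples` is a hypothetical name:

```julia
# Compute sin(m*ϕ) and cos(m*ϕ) for m = 1..m_max with a single sincos call.
# Each step needs the values from the previous step, so the iterations
# cannot be reordered or run independently.
function sincos_multiples(ϕ, m_max)
    s1, c1 = sincos(ϕ)
    sines = Vector{Float64}(undef, m_max)
    cosines = Vector{Float64}(undef, m_max)
    s, c = s1, c1
    for m in 1:m_max
        sines[m] = s
        cosines[m] = c
        # sin((m+1)ϕ) = sin(mϕ)cos(ϕ) + cos(mϕ)sin(ϕ)
        # cos((m+1)ϕ) = cos(mϕ)cos(ϕ) - sin(mϕ)sin(ϕ)
        s, c = s * c1 + c * s1, c * c1 - s * s1
    end
    return sines, cosines
end

sines, cosines = sincos_multiples(0.3, 5)
```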

jling · August 28, 2021, 3:19pm

You say accuracy loss, but the @turbo reductions are actually more accurate than a plain loop with +, just slightly less accurate than sum:

julia> a = rand(Float16, 10000);

julia> sum(a)
Float16(5.024e3)

julia> foldl(+, a)
Float16(2.048e3)

julia> let res = zero(Float16)
           @turbo for i in 1:length(a)
               res += a[i]
           end
           res
       end
5025.9033f0
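
A possible explanation, sketched with Base only: a vectorized reduction keeps several partial sums and combines them at the end, so each partial sum stays smaller and loses less to rounding than one long running total. The function `sum_four_accumulators` below is a hand-written imitation of that order (four lanes is an arbitrary choice), not what @turbo literally generates:

```julia
a = fill(0.1f0, 10^6)
exact = length(a) * Float64(0.1f0)   # reference value in double precision

# Four independent partial sums, combined at the end (a SIMD-like order).
# Assumes a 1-based array such as a plain Vector.
function sum_four_accumulators(x)
    T = eltype(x)
    s1, s2, s3, s4 = zero(T), zero(T), zero(T), zero(T)
    n = length(x)
    last_full = n - n % 4
    @inbounds for i in 1:4:last_full
        s1 += x[i]
        s2 += x[i+1]
        s3 += x[i+2]
        s4 += x[i+3]
    end
    tail = zero(T)
    @inbounds for i in last_full+1:n
        tail += x[i]
    end
    return (s1 + s2) + (s3 + s4) + tail
end

err_loop = abs(foldl(+, a) - exact)               # strict left-to-right sum
err_four = abs(sum_four_accumulators(a) - exact)  # four partial sums
err_sum  = abs(sum(a) - exact)                    # Base's pairwise sum
```

On this input the strict left-to-right loop has the largest error of the three.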

gbaraldi · August 28, 2021, 3:25pm

I think the @turbo sincos is slightly less precise. I don’t know how much less; it is probably worth checking https://github.com/JuliaSIMD/SLEEFPirates.jl.

lmiq · August 28, 2021, 3:29pm

Because of the precision. It is like using fastmath compiler options in general. For some applications the difference is important (not any of mine).

ranocha · August 28, 2021, 3:40pm

It’s not necessarily faster in general (for scalar arguments). Instead, it’s optimized for SIMD vectorization, where it outperforms the Base.sincos version (which is optimized for scalars). For example,

julia> using LoopVectorization, BenchmarkTools

julia> @benchmark Base.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  9.604 ns … 35.111 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.608 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.699 ns ±  1.163 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▁                                                        ▁
  ███▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇▇ █
  9.6 ns       Histogram: log(frequency) by time     9.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LoopVectorization.SLEEFPirates.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  18.468 ns … 69.659 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.827 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.912 ns ±  1.597 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              █                                                
  ▂▅▃▂▁▁▁▁▁▁▂▄█▆▂▁▂▁▁▁▁▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▂▂ ▂
  18.5 ns         Histogram: frequency by time        20.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function foo_simd!(si, co, x)
           @inbounds @simd ivdep for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_simd! (generic function with 1 method)

julia> function foo_turbo!(si, co, x)
           @turbo for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_turbo! (generic function with 1 method)

julia> x = randn(10^3); si = similar(x); co = similar(x);

julia> @benchmark foo_simd!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.246 μs …  23.578 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.365 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.425 μs ± 613.371 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇█                                                         ▁
  ▄██▄▇▄▁▅█▆▃▁▃▄▃▃▃▁▁▃▁▁▁▃▁▃▁▁▁▃▁▁▁▁▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ █
  7.25 μs      Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_turbo!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.981 μs …   6.003 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.006 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.017 μs ± 151.188 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁█▇                                                   
  ▂▃▃▃▂▁▂███▃▂▂▂▁▂▂▂▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▂▂ ▂
  1.98 μs         Histogram: frequency by time        2.15 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> foo_simd!(si, co, x)

julia> si_new = similar(si); co_new = similar(co);

julia> foo_turbo!(si_new, co_new, x)

julia> si ≈ si_new
true

julia> co ≈ co_new
true

Elrod · August 28, 2021, 4:23pm

Note @turbo will use SLEEFPirates.sincos_fast by default, but you could specify SLEEFPirates.sincos.

On the speed and implementation differences: note that SIMD (Single Instruction Multiple Data) has to apply the same instruction to multiple data.
This basically means that if your code has a branch, you have to take both sides of the branch and fuse the results at the end.

Sometimes, functions like sincos in scalar mode can be sped up by having efficient strategies for different segments, and then branching based on which segment you’re in. But you don’t want to do this for a SIMD version, you’d be best off with just 1 primary path through the code.
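
A toy sketch of the difference, with a made-up piecewise function (the names `piecewise_branch` and `piecewise_select!` are hypothetical). The scalar version picks one path per element; the SIMD-friendly version evaluates both paths and blends them with `ifelse`:

```julia
# Scalar style: branch, and only evaluate the side that is needed.
piecewise_branch(x) = x < 1 ? x * x : 2x - 1

# SIMD style: compute both sides for every element, then select.
function piecewise_select!(out, x)
    @inbounds @simd for i in eachindex(out, x)
        lo = x[i] * x[i]
        hi = 2 * x[i] - 1
        out[i] = ifelse(x[i] < 1, lo, hi)
    end
    return out
end

x = collect(range(-2.0, 3.0; length = 11))
out = piecewise_select!(similar(x), x)
```

With two cheap sides this costs little, but for something like sincos, where each segment has its own expensive strategy, paying for every side on every element is why a single primary path is preferable.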
