Julia vs C++ speed

Hi!

I am currently doing a very β€œnice” project of converting algorithms written in Julia to C++ :cry: I need those algorithms in an embedded environment and, unfortunately, Julia cannot generate compiled code that can be embedded yet.

I am taking a very nice approach with the help of CxxWrap.jl. The translation is happening smoothly since I can rewrite function by function and test if anything breaks.

All the algorithms are the ones in SatelliteToolbox.jl.

The very interesting part was when I started to compare the performance. There are three algorithms that are relatively costly for the satellite computer: IGRF (geomagnetic field), SGP4 (orbit propagator), and FK5 reduction (nutation calculation). Check out the comparison between the Julia version in SatelliteToolbox.jl and the C++ version (using -O3 -march=native):

Algorithm Julia [ns] C++ [ns]
IGRF 1373 1425
SGP4 propagation 340 660
FK5 nutation 1436 1201

Notice that Julia β€œlost” only in the nutation computation, which is an algorithm that has to go through a table of 106 lines and 9 columns and perform sums (I think there is room for optimization here). I am also not sure if anything will help if I recompile Julia with -O3 -march=native.

My colleagues and I were just astonished with those results. This is a language that seems interpreted vs C++. I always thought that pursuing C++ speed was the ultimate goal, but now I see that Julia can be even faster for reasons I just cannot explain :smiley:

12 Likes

don’t do this, this copies every single loop. Better just don’t make the look up table in Matrix to begin with, just have these columns separately in SVector? That way you also avoid promoting integers to floats in the Matrix.

4 Likes

I really need help here. In my local branch, I switched the dimensions since Julia is column-major. The benchmark were obtained in this configuration. Also tried to make nut_coefs_1980 a vector of vectors. The speed was worse. I also converted it to a vector of SVector. In this case, the performance was almost the same as when I used the look-up matrix but with dimensions permuted.

any runnable snippet with that function alone? I can give it a stab

1 Like

Sure:

using SatelliteToolbox
nutation_fk5(2.460115315972222e6)

Thanks!!

nut_coefs_1980 is a 2D matrix of Floats, there’s no allocation happening here. Just regular array access/load from memory. Since the table in source code mixes integers and floats, the integers will be promoted to floats as well, making sure the matrix is uniformly typed.

5 Likes

oops, I thought it was slicing, now I realized it’s accessing a single element at a time.

3 Likes

Ah @jling notice that this is not my newest code. I did not pushed because I am doing some tests. The changes were:

const nut_coefs_1980 = permutedims([
...
])

and

        an1 = nut_coefs_1980[1,i]
        an2 = nut_coefs_1980[2,i]
        an3 = nut_coefs_1980[3,i]
        an4 = nut_coefs_1980[4,i]
        an5 = nut_coefs_1980[5,i]
        Ai  = nut_coefs_1980[6,i]
        Bi  = nut_coefs_1980[7,i]
        Ci  = nut_coefs_1980[8,i]
        Di  = nut_coefs_1980[9,i]

btw,

@turbo for i = 1:n_max

seem to work:

julia> @benchmark nutation_fk5(2.460115315972222e6)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.166 ΞΌs …  3.456 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.183 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.188 ΞΌs Β± 45.902 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆβ–                                                      
  β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–‚
  1.17 ΞΌs        Histogram: frequency by time         1.4 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

after:

julia> @benchmark nutation_fk5(2.460115315972222e6)
BenchmarkTools.Trial: 10000 samples with 219 evaluations.
 Range (min … max):  339.320 ns … 436.995 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     348.790 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   349.821 ns Β±   5.359 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

             β–‚β–…β–‡β–‡β–ˆβ–†β–„β–                                            
  β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚ β–ƒ
  339 ns           Histogram: frequency by time          377 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

results are the same within float point error I think:

julia> nutation_fk5(2.460115315972222e6)
#n_max = 106
(0.4090395463527846, 3.477881920473464e-5, -4.185255450109059e-5)

julia> nutation_fk5(2.460115315972222e6)
(0.4090395463527846, 3.477881920473462e-5, -4.185255450109061e-5)
5 Likes

This is AWESOME! Thanks! I did not know about @turbo before!

4 Likes

With @jling 's suggestion the table can be updated:

Algorithm Julia [ns] C++ [ns]
IGRF 1373 1425
SGP4 propagation 340 660
FK5 nutation 409 1201

The only β€œdownside” is that the compilation time is much higher when using @turbo. Also, I think we can perform some similar tuning into the C++ code as well.

Anyway, the fact that straightforward Julia code can match and even beat C++ is just amazing!

9 Likes

test it with only

@inbounds @simd for i = 1:n_max

(may it was tested). It will compile faster, and sometimes the difference to @turbo is not that much.

edit: in this case it will probably make a difference, because you have a sincos inside the loop and, if I’m not mistaken, @turbo uses faster versions of these functions. You may want to try @fastmath on that function, if that is the case. (in both cases those faster versions have some accuracy loss, I don’t know if that is relevant for the satellites :slight_smile: .

3 Likes

now I wonder if similar practice can be done:

Famous last words :grinning_face_with_smiling_eyes:

(Well, famous last words at a job…)

3 Likes

Thanks for the tips! I got those values:

  1. @simd: 1409 ns (slightly faster).
  2. @fastmath: 1386 ns (getting there).
  3. @fastmath @simd: 1380 ns.

We cannot loose accuracy here. However, all those versions have exactly the same result.

If there is a faster sincos version, why Julia does not use it?

Probably! The only problem with IGRF is that I compute sines and cosines recursively. Hence, we do not have that assumption in which we can change the loop order.

1 Like

you say accuracy loss, but they are actually more accurate than loop +, just slightly less accurate than sum

julia> a = rand(Float16, 10000);

julia> sum(a)
Float16(5.024e3)

julia> foldl(+, a)
Float16(2.048e3)

julia> let res = zero(Float16)
           @turbo for i in 1:length(a)
               res += a[i]
           end
           res
       end
5025.9033f0
1 Like

I think the @turbo sincos is slightly less precise. I don’t know how much is slightly less, probably check GitHub - JuliaSIMD/SLEEFPirates.jl: Pirated SLEEF port..

1 Like

Because of the precision. It is like using fastmath compiler options in general. For some applications the difference is important (not any of mine).

2 Likes

It’s not necessarily faster in general (for scalar arguments). Instead, it’s optimized for SIMD vectorization, where it outperforms the Base.sincos version (which is optimized for scalars). For example,

julia> using LoopVectorization, BenchmarkTools

julia> @benchmark Base.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  9.604 ns … 35.111 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.608 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.699 ns Β±  1.163 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–ˆβ–                                                        ▁
  β–ˆβ–ˆβ–ˆβ–…β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–‡β–‡ β–ˆ
  9.6 ns       Histogram: log(frequency) by time     9.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LoopVectorization.SLEEFPirates.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  18.468 ns … 69.659 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     18.827 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   18.912 ns Β±  1.597 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

              β–ˆ                                                
  β–‚β–…β–ƒβ–‚β–β–β–β–β–β–β–‚β–„β–ˆβ–†β–‚β–β–‚β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–‚β–‚ β–‚
  18.5 ns         Histogram: frequency by time        20.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function foo_simd!(si, co, x)
           @inbounds @simd ivdep for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_simd! (generic function with 1 method)

julia> function foo_turbo!(si, co, x)
           @turbo for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_turbo! (generic function with 1 method)

julia> x = randn(10^3); si = similar(x); co = similar(x);

julia> @benchmark foo_simd!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.246 ΞΌs …  23.578 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.365 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.425 ΞΌs Β± 613.371 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–‡β–ˆ                                                         ▁
  β–„β–ˆβ–ˆβ–„β–‡β–„β–β–…β–ˆβ–†β–ƒβ–β–ƒβ–„β–ƒβ–ƒβ–ƒβ–β–β–ƒβ–β–β–β–ƒβ–β–ƒβ–β–β–β–ƒβ–β–β–β–β–ƒβ–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–† β–ˆ
  7.25 ΞΌs      Histogram: log(frequency) by time      10.6 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_turbo!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.981 ΞΌs …   6.003 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.006 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.017 ΞΌs Β± 151.188 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–β–ˆβ–‡                                                   
  β–‚β–ƒβ–ƒβ–ƒβ–‚β–β–‚β–ˆβ–ˆβ–ˆβ–ƒβ–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–β–β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–‚β–‚β–‚ β–‚
  1.98 ΞΌs         Histogram: frequency by time        2.15 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> foo_simd!(si, co, x)

julia> si_new = similar(si); co_new = similar(co);

julia> foo_turbo!(si_new, co_new, x)

julia> si β‰ˆ si_new
true

julia> co β‰ˆ co_new
true
9 Likes

Note @turbo will use SLEEFPirates.sincos_fast by default, but you could specify SLEEFPirates.sincos.

On speed/implementations differences- note that SIMD (Single Instruction Multiple Data) has to apply the same instruction to multiple data.
This basically means that if your code has a branch, you have to take both sides of the branch and fuse the results at the end.

Sometimes, functions like sincos in scalar mode can be sped up by having efficient strategies for different segments, and then branching based on which segment you’re in. But you don’t want to do this for a SIMD version, you’d be best off with just 1 primary path through the code.

4 Likes