I am currently doing a very "nice" project of converting algorithms written in Julia to C++. I need those algorithms in an embedded environment and, unfortunately, Julia cannot yet generate compiled code that can be embedded.
I am taking a very nice approach with the help of CxxWrap.jl. The translation is happening smoothly since I can rewrite function by function and test if anything breaks.
The very interesting part was when I started to compare the performance. There are three algorithms that are relatively costly for the satellite computer: IGRF (geomagnetic field), SGP4 (orbit propagator), and FK5 reduction (nutation calculation). Check out the comparison between the Julia version in SatelliteToolbox.jl and the C++ version (using -O3 -march=native):
| Algorithm | Julia [ns] | C++ [ns] |
| --- | --- | --- |
| IGRF | 1373 | 1425 |
| SGP4 propagation | 340 | 660 |
| FK5 nutation | 1436 | 1201 |
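(For reference, per-call timings like these are usually taken with BenchmarkTools.jl. Below is a minimal, illustrative sketch of such a measurement; the function `f` is a placeholder, not one of the actual SatelliteToolbox.jl calls used for the table.)

```julia
using BenchmarkTools

# Placeholder standing in for one of the algorithms under test.
f(x) = sin(x) + cos(x)

x = 0.5

# Interpolating the argument with `$` makes the benchmark measure the call
# itself rather than the access to an untyped global variable.
@btime f($x)        # prints the minimum time and allocation count
@benchmark f($x)    # full statistics: min/median/mean and a histogram
```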
Notice that Julia "lost" only in the nutation computation, which is an algorithm that has to go through a table of 106 rows and 9 columns and perform sums (I think there is room for optimization here). I am also not sure whether it would help to recompile Julia with -O3 -march=native.
My colleagues and I were just astonished by those results. This is a language that feels like an interpreted one, going head-to-head with C++. I always thought that reaching C++ speed was the ultimate goal, but now I see that Julia can be even faster, for reasons I just cannot explain.
Don't do this; it copies on every single loop iteration. Better to just not build the look-up table as a Matrix to begin with; just keep these columns separately as SVectors? That way you also avoid promoting the integers to floats in the Matrix.
I really need help here. In my local branch, I had already switched the dimensions since Julia is column-major, and the benchmarks were obtained in this configuration. I also tried making nut_coefs_1980 a vector of vectors; the speed was worse. I also converted it to a vector of SVectors; in that case, the performance was almost the same as with the look-up matrix with permuted dimensions.
nut_coefs_1980 is a 2D matrix of floats; there is no allocation happening here, just regular array accesses/loads from memory. Since the table in the source code mixes integers and floats, the integers are promoted to floats as well, so the matrix is uniformly typed.
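To make the two layouts concrete, here is a minimal sketch; the values are a four-row excerpt shown only for illustration, and the names (`nut_coefs_excerpt`, `an1`, `an5`, `A`) are not from SatelliteToolbox.jl.

```julia
using StaticArrays

# Layout 1: a single Matrix literal. Because the rows mix integer multipliers
# and floating-point coefficients, every entry is promoted to Float64.
nut_coefs_excerpt = [
    0  0  0  0  1  -171996.0  -174.2  92025.0   8.9
    0  0  2 -2  2   -13187.0    -1.6   5736.0  -3.1
    0  0  2  0  2    -2274.0    -0.2    977.0  -0.5
    0  0  0  0  2     2062.0     0.2   -895.0   0.5
]

# Layout 2: one statically sized vector per column. The integer columns keep
# an integer element type, and the lengths are known at compile time.
an1 = SVector{4,Int}(0, 0, 0, 0)
an5 = SVector{4,Int}(1, 2, 2, 2)
A   = SVector{4,Float64}(-171996.0, -13187.0, -2274.0, 2062.0)
```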
With @jling's suggestion, the table can be updated:
| Algorithm | Julia [ns] | C++ [ns] |
| --- | --- | --- |
| IGRF | 1373 | 1425 |
| SGP4 propagation | 340 | 660 |
| FK5 nutation | 409 | 1201 |
The only "downside" is that the compilation time is much higher when using @turbo. Also, I think we could apply some similar tuning to the C++ code as well.
Anyway, the fact that straightforward Julia code can match and even beat C++ is just amazing!
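For concreteness, here is a minimal sketch of what an @turbo-accelerated pass over the nutation coefficient table could look like. The names and layout are illustrative assumptions (a permuted, column-major-friendly 9 × 106 table and fundamental arguments a1–a5), not the actual SatelliteToolbox.jl code.

```julia
using LoopVectorization

# Illustrative sketch only: `coefs` stands in for the permuted coefficient
# table, `a1`–`a5` for the fundamental arguments, and `T` for the time
# argument of the series.
function nutation_sum(coefs::Matrix{Float64}, a1, a2, a3, a4, a5, T)
    Δψ = 0.0
    Δϵ = 0.0
    @turbo for i in axes(coefs, 2)
        # Argument of the i-th term of the series.
        ϕ = coefs[1, i] * a1 + coefs[2, i] * a2 + coefs[3, i] * a3 +
            coefs[4, i] * a4 + coefs[5, i] * a5
        s, c = sincos(ϕ)
        Δψ += (coefs[6, i] + coefs[7, i] * T) * s   # nutation in longitude
        Δϵ += (coefs[8, i] + coefs[9, i] * T) * c   # nutation in obliquity
    end
    return Δψ, Δϵ
end
```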
Did you also try a plain @simd loop (maybe it was already tested)? It will compile faster, and sometimes the difference to @turbo is not that much.
edit: in this case it will probably make a difference, because you have a sincos inside the loop and, if I'm not mistaken, @turbo uses faster versions of these functions. You may want to try @fastmath on that function, if that is the case. (In both cases those faster versions have some accuracy loss; I don't know whether that is relevant for the satellites.)
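For illustration, a minimal sketch of annotating such a loop with @fastmath; the function is hypothetical, and I am assuming here that sincos is among the calls @fastmath rewrites to a faster Base.FastMath variant.

```julia
# Hypothetical example: @fastmath relaxes IEEE semantics inside the annotated
# block and may substitute faster, slightly less accurate math routines.
function sum_sincos(x::AbstractVector{Float64})
    acc = 0.0
    @fastmath for v in x
        s, c = sincos(v)
        acc += s * c
    end
    return acc
end
```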
We cannot lose accuracy here. However, all those versions give exactly the same result.
If there is a faster sincos version, why does Julia not use it?
Probably! The only problem with IGRF is that I compute the sines and cosines recursively. Hence, the loop iterations are not independent, and we cannot change the loop order.
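To show the kind of recurrence meant here (illustrative, not the actual IGRF implementation): sin(kλ) and cos(kλ) are built from the previous term via the angle-addition formulas, so each iteration depends on the one before it.

```julia
# Illustrative loop-carried recurrence: iteration k needs the results of
# iteration k - 1, which prevents reordering or straightforward vectorization.
function sincos_recursive(λ::Float64, kmax::Int)
    s = Vector{Float64}(undef, kmax)
    c = Vector{Float64}(undef, kmax)
    s[1], c[1] = sincos(λ)
    for k in 2:kmax
        # sin(kλ) = sin((k-1)λ)cos(λ) + cos((k-1)λ)sin(λ)
        s[k] = s[k - 1] * c[1] + c[k - 1] * s[1]
        # cos(kλ) = cos((k-1)λ)cos(λ) - sin((k-1)λ)sin(λ)
        c[k] = c[k - 1] * c[1] - s[k - 1] * s[1]
    end
    return s, c
end
```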
You say accuracy loss, but they are actually more accurate than a plain `+` loop, and only slightly less accurate than `sum`:
```julia
julia> using LoopVectorization

julia> a = rand(Float16, 10000);

julia> sum(a)
Float16(5.024e3)

julia> foldl(+, a)
Float16(2.048e3)

julia> let res = zero(Float16)
           @turbo for i in 1:length(a)
               res += a[i]
           end
           res
       end
5025.9033f0
```
It's not necessarily faster in general (for scalar arguments). Instead, it's optimized for SIMD vectorization, where it outperforms the Base.sincos version (which is optimized for scalars). For example:
```julia
julia> using LoopVectorization, BenchmarkTools

julia> @benchmark Base.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  9.604 ns … 35.111 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.608 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.699 ns ± 1.163 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  9.6 ns          Histogram: log(frequency) by time          9.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LoopVectorization.SLEEFPirates.sincos($(Ref(1.0))[])
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  18.468 ns … 69.659 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.827 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.912 ns ± 1.597 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%

  18.5 ns          Histogram: frequency by time          20.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> function foo_simd!(si, co, x)
           @inbounds @simd ivdep for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_simd! (generic function with 1 method)

julia> function foo_turbo!(si, co, x)
           @turbo for i in eachindex(x)
               si[i], co[i] = sincos(x[i])
           end
       end
foo_turbo! (generic function with 1 method)
julia> x = randn(10^3); si = similar(x); co = similar(x);
julia> @benchmark foo_simd!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.246 μs … 23.578 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.365 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.425 μs ± 613.371 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  7.25 μs         Histogram: log(frequency) by time          10.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo_turbo!($si, $co, $x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.981 μs … 6.003 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.006 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.017 μs ± 151.188 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  1.98 μs          Histogram: frequency by time          2.15 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> foo_simd!(si, co, x)

julia> si_new = similar(si); co_new = similar(co);

julia> foo_turbo!(si_new, co_new, x)

julia> si ≈ si_new
true

julia> co ≈ co_new
true
```
Note @turbo will use SLEEFPirates.sincos_fast by default, but you could specify SLEEFPirates.sincos.
On speed/implementation differences: note that SIMD (Single Instruction, Multiple Data) has to apply the same instruction to multiple data.
This basically means that if your code has a branch, you have to evaluate both sides of the branch and blend the results at the end.
Sometimes, functions like sincos in scalar mode can be sped up by having efficient strategies for different segments and then branching based on which segment you're in. But you don't want to do this for a SIMD version; you'd be best off with just one primary path through the code.
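As a small illustration of that point (a generic sketch, not code from this thread): the SIMD-friendly way to handle a conditional is to compute both sides and select with ifelse, so every lane executes the same instructions.

```julia
using LoopVectorization

# Branchy version: fine as scalar code, but the `if` hinders vectorization.
function leaky_sum_branch(x)
    acc = 0.0
    for v in x
        if v > 0.0
            acc += v
        else
            acc += 0.1 * v
        end
    end
    return acc
end

# SIMD-friendly version: both sides are evaluated and the result is selected
# with `ifelse`, so every lane follows the same path through the code.
function leaky_sum_turbo(x)
    acc = 0.0
    @turbo for i in eachindex(x)
        v = x[i]
        acc += ifelse(v > 0.0, v, 0.1 * v)
    end
    return acc
end
```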