To figure out exactly what’s going on, you can use code_llvm. Let’s compare the LLVM code for the non-inlined and the inlined methods:
julia> code_llvm(() -> M_to_E(0.005, 100), ())
define double @"julia_#29_35751"() #0 {
%0 = call double @julia_M_to_E_3(double 5.000000e-03, i64 100, double 1.000000e-10)
ret double %0
}
julia> code_llvm(() -> M_to_Ein(0.005, 100), ())
define double @"julia_#31_35752"() #0 {
%0 = alloca [2 x double], align 8
%1 = alloca [2 x double], align 8
call void @julia_sincos_35492([2 x double]* sret %0, double 0x4016FD2757AF7013)
... many more lines ...
Notice how the inlined version does a sincos as the first step, while in your Julia code there’s both a call to mod and a calculation of E needed to get the sincos argument. What is sincos called with?
julia> reinterpret(Float64, 0x4016FD2757AF7013)
5.747220392306207
Which is (as you might have guessed):
julia> mod(100, 2π) - 0.005
5.747220392306207
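Since Float64 values print with exactly the digits needed to round-trip, you can also verify the match by reinterpreting in the other direction:
julia> reinterpret(UInt64, mod(100, 2π) - 0.005)
0x4016fd2757af7013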
In other words, since the input is constant, the compiler is able to replace the initial calculations with a constant value when the code is inlined. There are a few other similar optimizations which together explain the time difference you are seeing. (In addition, inlining avoids a function call, but that by itself doesn’t save you that much in comparison to these optimizations.)
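If you want to see this kind of constant folding in isolation, here is a minimal sketch (hypothetical functions, not part of the code above):
# g inlines into h, and since the argument is a compile-time constant,
# the whole computation 2*10 + 1 folds down to the constant 21.
g(x) = 2x + 1
h() = g(10)
# @code_llvm h() should then show little more than `ret i64 21`.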
This also illustrates why benchmarking is tricky to get right. A slightly more accurate benchmark would be to use random input, like this:
using BenchmarkTools

const N = 1000
e = rand(N)        # eccentricities in [0, 1)
M = 100 * rand(N)  # mean anomalies in [0, 100)
@btime for n = 1:$N M_to_E($e[n], $M[n]) end
@btime for n = 1:$N M_to_Ein($e[n], $M[n]) end
With results:
133.426 μs (0 allocations: 0 bytes)
124.348 μs (0 allocations: 0 bytes)
That is, ~133 ns per call for the non-inlined version and ~124 ns per call for the inlined version. Still faster, but not by much.
Even this is not a particularly accurate benchmark, since this type of loop may result in SIMD instructions and/or loop unrolling, benefits which you may not see in actual code. When you benchmark the same code over and over, the data will also stay in cache, and the processor may learn the branching behavior, making the performance appear better than it would be in practice. Finally, the data used for testing is likely not very realistic. Therefore, the best thing to do is to always benchmark your actual application, with actual data. That will also help you figure out whether this code is a bottleneck at all – there’s no point optimizing it otherwise.
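If you want to reduce the caching and branch-prediction effects somewhat, one option (a sketch using BenchmarkTools’ setup keyword) is to regenerate the input for every sample:
using BenchmarkTools

# Variables defined in `setup` are used without `$` interpolation;
# fresh random input each sample makes it harder for the processor
# to adapt to one fixed data set.
@btime M_to_E(e, M) setup = (e = rand(); M = 100 * rand())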
By the way, not related to this question, but there’s a function called mod2pi which you can use instead of mod(x, 2π); it gives a slightly more accurate, and perhaps faster, evaluation (the test above ran in ~116 ns per call with it).
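For example (the surrounding code is hypothetical; only the mod2pi call is the point):
# mod2pi reduces modulo the numerically exact 2π rather than the
# Float64 approximation of 2π used in mod(M, 2π):
M = 100.0
mod2pi(M)    # ≈ mod(M, 2π), but slightly more accurate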