Tips about @inline


#1

Hi guys!

I am seeing that some of my functions run noticeably faster when I use @inline. For example, consider the following function:

function M_to_E(e::Number, M::Number, tol::Number = 1e-10)
    # Compute the eccentric anomaly using the Newton-Raphson method.
    # ==============================================================

    # Make sure that M is in the interval [0,2π].
    M = mod(M,2π)

    # Initial guess.
    #
    # See [1, p. 75].
    E = (M > π) ? M - e : M + e

    sin_E, cos_E = sincos(E)

    # Newton-Raphson iterations.
    while ( abs(E - e*sin_E - M) > tol )
        E = E - (E - e*sin_E - M)/(1-e*cos_E)

        sin_E, cos_E = sincos(E)
    end

    # Return the eccentric anomaly in the interval [0, 2π].
    mod(E, 2π)
end

Using @btime, I get 86.917 ns with @inline and 95.403 ns without it.

I read some things related to inlining (like this), but I am still not sure when to use @inline. Is it correct to use @inline on every function for which @btime shows a significantly better result? Will it have side effects when the inlined function is used in a much bigger code base?


#2

It is unclear what you are inlining here: M_to_E in some caller, or one of the functions that it calls.

In any case, the compiler has heuristics about inlining, and the @inline and @noinline macros allow you to override them manually. Sometimes things improve when you do this, but drastic improvements are rare. Btw, I would not call a 10% improvement drastic; that is barely above noise.

Inlining can be costly too, so you should experiment. Also, the compiler is getting really smart about inlining these days, so I don’t get as much of a kick out of it as I used to.
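
For reference, here is a minimal sketch of how the two macros are applied to a definition. The names add_one, scaled and caller are hypothetical and exist only for this illustration. The annotations should not change what the code computes, only how the compiler lays it out.

```julia
# Hypothetical helpers, only for illustration.

# Ask the compiler to inline this small function into its callers.
@inline add_one(x) = x + 1

# Ask the compiler to keep this one as a real function call.
@noinline function scaled(x, k)
    return k * x
end

# A caller that uses both. The result is the same with or without the
# annotations; only the generated code differs.
caller(x) = scaled(add_one(x), 2)

caller(3)
```

Whether either annotation helps is something to measure in the real caller, not to assume.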


#3

In fact, after some modifications (in the input arguments) I found a much higher gain:

function M_to_E(e::Number, M::Number, tol::Number = 1e-10)
    # Compute the eccentric anomaly using the Newton-Raphson method.
    # ==============================================================

    # Make sure that M is in the interval [0,2π].
    M = mod(M,2π)

    # Initial guess.
    #
    # See [1, p. 75].
    E = (M > π) ? M - e : M + e

    sin_E, cos_E = sincos(E)

    # Newton-Raphson iterations.
    while ( abs(E - e*sin_E - M) > tol )
        E = E - (E - e*sin_E - M)/(1-e*cos_E)

        sin_E, cos_E = sincos(E)
    end

    # Return the eccentric anomaly in the interval [0, 2π].
    mod(E, 2π)
end

@inline function M_to_Ein(e::Number, M::Number, tol::Number = 1e-10)
    # Compute the eccentric anomaly using the Newton-Raphson method.
    # ==============================================================

    # Make sure that M is in the interval [0,2π].
    M = mod(M,2π)

    # Initial guess.
    #
    # See [1, p. 75].
    E = (M > π) ? M - e : M + e

    sin_E, cos_E = sincos(E)

    # Newton-Raphson iterations.
    while ( abs(E - e*sin_E - M) > tol )
        E = E - (E - e*sin_E - M)/(1-e*cos_E)

        sin_E, cos_E = sincos(E)
    end

    # Return the eccentric anomaly in the interval [0, 2π].
    mod(E, 2π)
end

using BenchmarkTools

@btime M_to_E(0.005, 100)
@btime M_to_Ein(0.005, 100)

  87.117 ns (0 allocations: 0 bytes)
  63.452 ns (0 allocations: 0 bytes)

Should I open an issue or is this expected?


#4

Indeed, I am seeing a similar difference (94 -> 69 ns).

My opinion is simply that whatever increases performance, go for it!

But do make sure to test your actual application as well, not just individual functions like this, to ensure that you’re seeing the same gains. Sometimes, benchmarking plays tricks on you!


#5

In my experience, using @inline sometimes improves performance and sometimes makes it worse. In fact, I have sometimes used @noinline to prevent a function from being inlined.

For example, in parser.jl I used @noinline a total of 6 times and @inline only 4 times. This is because the generated functions involved are very large and complicated, and I have to create multiple specialized variants of these. In this case, it is better to keep the code size smaller and to make sure that the code does not get inlined, because other functions that get called are only used in some cases of a huge decision tree.

As a result of using @noinline, the total code size for the whole package is smaller. As a consequence, the precompiled binaries are also smaller (due to less code duplication), and the precompiled cache loads faster.

So here, I am just trying to remind you that in some cases it is better to prevent a piece of code from being inlined, and sometimes it is better to make sure it is inlined.

In general, one cannot say that inlining is always better, because it really depends. Sometimes it is actually better not to inline. You’ll have to experiment a bit and try out different combinations, compare the precompilation, cache sizes, benchmarks, etc. Inlining lots of code will also increase your precompilation cache size, which can have a big effect if you create hundreds of generated methods, which all add up.

So it is a balance between different things, not only timings and benchmark, but also code caching, etc.
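
A much smaller illustration of the same idea (this is not the parser.jl code, just a sketch with hypothetical names): keep a rarely taken branch out of line, so that the common path stays small wherever the outer function ends up being inlined.

```julia
# Hypothetical example: the rare branch lives in its own function and is
# marked @noinline, so callers only carry a plain call for it.
@noinline throw_negative(x) = throw(DomainError(x, "expected a non-negative value"))

@inline function checked_sqrt(x::Float64)
    x < 0 && throw_negative(x)   # rare case: stays a function call
    return sqrt(x)               # common case
end

checked_sqrt(4.0)
```

Whether this pays off in code size or load time depends on the package, so it is worth comparing both variants as described above.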


#6

Measuring f() = ... vs @inline f() = ... and looking at the time difference is likely to be misleading (the only thing you are doing is removing a function call).
Inlining is often beneficial because more optimizations become available in the caller of the function being inlined. So something that is likely to give a more accurate benchmark is to benchmark:

function g()
    ...
    f()
    ...
end

where ... is some typical setup and teardown reflecting how the function tends to be called. Compare the performance of g with f inlined or not.
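
A minimal sketch of that comparison, assuming a made-up f and a made-up g that calls it in a loop (all names below are hypothetical). The two variants of f have the same body and differ only in the annotation, so any timing difference between the two callers comes from inlining.

```julia
# Hypothetical stand-ins for `f`: same body, different inlining annotation.
@inline   f_inline(x)   = x * x + 1.0
@noinline f_noinline(x) = x * x + 1.0

# Hypothetical callers `g`: some setup, the call to `f` in a loop, some teardown.
function g_inline(xs)
    acc = 0.0
    for x in xs
        acc += f_inline(x)
    end
    return acc / length(xs)
end

function g_noinline(xs)
    acc = 0.0
    for x in xs
        acc += f_noinline(x)
    end
    return acc / length(xs)
end

xs = rand(1000)

# Both callers compute the same value; only the compiled code differs.
g_inline(xs) ≈ g_noinline(xs)

# To compare timings, with BenchmarkTools loaded:
#   @btime g_inline($xs)
#   @btime g_noinline($xs)
```

The toy g here is far simpler than real code, so the numbers it gives only indicate the direction of the effect.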


#7

To figure out exactly what’s going on, you can use code_llvm. Let’s compare the LLVM code for the non-inlined and the inlined methods:

julia> code_llvm(() -> M_to_E(0.005, 100), ())

define double @"julia_#29_35751"() #0 {
  %0 = call double @julia_M_to_E_3(double 5.000000e-03, i64 100, double 1.000000e-10)
  ret double %0
}

julia> code_llvm(() -> M_to_Ein(0.005, 100), ())

define double @"julia_#31_35752"() #0 {
  %0 = alloca [2 x double], align 8
  %1 = alloca [2 x double], align 8
  call void @julia_sincos_35492([2 x double]* sret %0, double 0x4016FD2757AF7013)
  ... many more lines ...

Notice how the inlined version does a sincos as the first step, while in your Julia code there’s both a call to mod and a calculation of E needed to get the sincos argument. What is sincos called with?

julia> reinterpret(Float64, 0x4016FD2757AF7013)
5.747220392306207

Which is (you might have guessed it):

julia> mod(100, 2π) - 0.005
5.747220392306207

In other words, since the input is constant, the compiler is able to replace the initial calculations with a constant value when the code is inlined. There are a few other similar optimizations which together explain the time difference you are seeing. (In addition, inlining avoids a function call, but that by itself doesn’t save you that much in comparison to these optimizations.)
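
The same effect can be reproduced on a smaller scale. The sketch below uses hypothetical helper names and isolates only the initial-guess step for M > π. When the inputs are literals, the compiler is free to fold the whole expression into a constant; when they are arguments, it has to compute it at run time. What the printed LLVM code looks like depends on the Julia version.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper: the first steps of M_to_E for the case M > π.
@inline initial_guess(e, M) = mod(M, 2π) - e

# Inputs are literals: the compiler may fold this into a constant.
with_constants() = initial_guess(0.005, 100)

# Inputs are arguments: the work has to happen at run time.
with_arguments(e, M) = initial_guess(e, M)

# Inspect what the compiler produced for each case.
code_llvm(with_constants, ())
code_llvm(with_arguments, (Float64, Int))

# Both give the same value either way.
with_constants() ≈ with_arguments(0.005, 100)
```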

This also illustrates why benchmarking is tricky to get right. A slightly more accurate benchmark would be to use random input, like this:

const N = 1000
e = rand(N)
M = 100 * rand(N)
@btime for n = 1:$N M_to_E($e[n], $M[n]) end
@btime for n = 1:$N M_to_Ein($e[n], $M[n]) end

With results:

  133.426 μs (0 allocations: 0 bytes)
  124.348 μs (0 allocations: 0 bytes)

I.e. ~133 ns for the non-inlined version and ~124 ns for the inlined version. Still faster, but not by much.

Even this is not a particularly accurate benchmark, since this type of loop may result in SIMD instructions and/or loop unrolling, benefits which you may not see in actual code. By benchmarking the same code over and over, the memory will also be cached, and the processor may learn the branching behavior, making the performance appear better than it would be in practice. Finally, the data used for testing is likely not very realistic. Therefore, the best thing to do is to always benchmark your actual application, with actual data. That will also help you figure out if this code is at all a bottleneck – no point optimizing it otherwise.

By the way, not related to this question, but there’s a function called mod2pi which you can use instead of mod(x, 2π). It gives a slightly more accurate, and perhaps faster, evaluation (the test above ran in ~116 ns per call).
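
A quick sketch of the difference, using an arbitrary test value: mod(x, 2π) reduces relative to the Float64 approximation of 2π, while mod2pi reduces relative to a more precise representation of 2π, so the two results agree only up to a small rounding difference.

```julia
x = 100.0

a = mod(x, 2π)   # modulus with respect to the Float64 value of 2π
b = mod2pi(x)    # modulus with respect to a more precise 2π

# The two agree closely but are not guaranteed to be bit-identical.
isapprox(a, b; atol = 1e-12)
```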


#8

Thanks @bennedich for the detailed answer! I learned a lot today :slight_smile: