Benchmark showing a 2x slowdown between nearly identical functions on Julia 0.6

I’ve been doing some performance testing of variations on a math kernel operation, and in the process I’ve come across a benchmarking result that depends sensitively on whether a supposedly useless line is included.

I’ve reduced the test case down to:

using BenchmarkTools

function kernel1(ℓ,x)
    const T = typeof(x)
    y = x*x
    g = 1 / sqrt(ℓ)
    return ifelse(y<eps(T), 0.0, y)
end

function kernel2(ℓ,x)
    const T = typeof(x)
    const U = typeof(ℓ)
    y = x*x
    g = 1 / sqrt(ℓ)
    return ifelse(y<eps(T), 0.0, y)
end

ℓ = 700
x = 0.5
a = @benchmark kernel1($ℓ, $x)
b = @benchmark kernel2($ℓ, $x)

ratio(minimum(b), minimum(a))

Originally, it was adding or removing the const U = typeof(ℓ) line that caused the change I was seeing, but in reducing the test case I’ve reached the point where changing almost anything else in the function removes the performance difference. (E.g. even though g is unused, removing that line makes both functions perform the same.)

When I inspect the LLVM or native code with code_llvm and code_native, respectively, they both appear to give identical results.
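
For concreteness, a sketch of that inspection (using the definitions and the ℓ, x from above):

# Print the optimized LLVM IR for each kernel at these argument types;
# the two dumps look identical, as does the native assembly below.
@code_llvm kernel1(ℓ, x)
@code_llvm kernel2(ℓ, x)

@code_native kernel1(ℓ, x)
@code_native kernel2(ℓ, x)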

Am I doing something stupid in how I’m invoking @benchmark?

The behavior is being seen on:

Julia Version 0.6.0-dev.2375
Commit 1303dfb96* (2017-01-26 06:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

and

julia> Pkg.status("BenchmarkTools")
 - BenchmarkTools                0.0.6

Edit: I forgot to mention that I don’t see this difference on Julia 0.5 (though the benchmarks seem to have less resolution on 0.5?).

The difference is that the second one lands just above the inlining threshold while the first one is just below it.

Note that you should not use const on local variables; those declarations aren’t doing anything useful here.

Thank you. Is there a way to inspect whether that’s what happened, or is it something you recognize from experience?

Is there something that describes what this threshold is?

Sort of. I’ve seen something similar before, so I checked the code_warntype of another function that calls this one, e.g. g(l, x) = kernel1(l, x), and you can clearly see that one kernel is inlined while the other is not; a sketch of that check is below.
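
A minimal version of what I mean (the wrapper names g1 and g2 are just for illustration):

# Trivial wrappers, so the inlining decision for each kernel is made
# at this call site rather than at the top level.
g1(l, x) = kernel1(l, x)
g2(l, x) = kernel2(l, x)

# If a kernel was inlined, its body (the multiply, sqrt and ifelse)
# appears directly in the wrapper's typed code; if not, you see an
# opaque call to the kernel instead.
@code_warntype g1(700, 0.5)
@code_warntype g2(700, 0.5)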

Some count of expressions remaining after inlining (of other functions into this function). The cost model currently doesn’t really know about the difference in cost between different operations, and the other optimization passes weren’t able to delete all the unused code beforehand, which is why the two functions have different inlining behavior even though they are supposed to be the same…
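
One way to test this is to force inlining with @inline, which bypasses the cost heuristic; if the gap really is an inlining artifact, copies of the two kernels annotated this way (kernel1i and kernel2i here, names arbitrary) should benchmark the same:

using BenchmarkTools

# Same bodies as the original kernels, but with the inlining decision
# forced rather than left to the expression-count heuristic.
@inline function kernel1i(ℓ, x)
    const T = typeof(x)
    y = x*x
    g = 1 / sqrt(ℓ)
    return ifelse(y<eps(T), 0.0, y)
end

@inline function kernel2i(ℓ, x)
    const T = typeof(x)
    const U = typeof(ℓ)
    y = x*x
    g = 1 / sqrt(ℓ)
    return ifelse(y<eps(T), 0.0, y)
end

ℓ = 700
x = 0.5
a = @benchmark kernel1i($ℓ, $x)
b = @benchmark kernel2i($ℓ, $x)

# With both kernels inlined, this ratio should come out close to 1.
ratio(minimum(b), minimum(a))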