I have always been fascinated by the effect of branch prediction demonstrated in this StackOverflow question, and I wanted to include this example in my course to illustrate how hard it is to predict the performance of modern hardware.
Unfortunately, it seems that this example no longer works in Julia. Running the code from @Tamas_Papp’s blog post in Julia 1.4.2, I obtain the following.
using BenchmarkTools  # provides @btime

"Sum elements if ≥ 128."
function sumabove_if(x)
    s = zero(eltype(x))
    for elt in x
        if elt ≥ 128
            s += elt
        end
    end
    s
end

x_rand = rand(1:256, 32768)  # original example on StackOverflow, except using Int
x_sorted = sort(x_rand)
@btime sumabove_if($x_rand)    # 4.139 μs (0 allocations: 0 bytes)
@btime sumabove_if($x_sorted)  # 3.857 μs (0 allocations: 0 bytes)
As you can see, sorting the array yields only a marginal improvement (about 7%), nowhere near the 5x speedup reported by @Tamas_Papp.
I assume what is happening here is that Julia / LLVM has become clever enough to recognise that the branch is unnecessary and to eliminate it for me. So my questions are:
- Is my assumption correct?
- Can I work around the smartness of Julia / LLVM somehow?
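One way to probe the first question is to look at the LLVM IR that Julia generates for the loop. The following is only a sketch: it repeats `sumabove_if` so that it runs on its own, and the two string checks are heuristics I am assuming are informative (a `select` instruction hints that the branch became a conditional select, and `<N x i64>` vector types hint at SIMD vectorization). The output depends on the Julia and LLVM versions and on the CPU, so no expected result is shown.

```julia
using InteractiveUtils  # provides code_llvm; loaded automatically in the REPL

"Sum elements if ≥ 128 (same function as in the post)."
function sumabove_if(x)
    s = zero(eltype(x))
    for elt in x
        if elt ≥ 128
            s += elt
        end
    end
    s
end

# Capture the optimized LLVM IR for a Vector{Int} argument as a string.
ir = sprint(io -> code_llvm(io, sumabove_if, Tuple{Vector{Int}}))

# Heuristic checks (assumptions, not guarantees about what the compiler did).
println("contains select:       ", occursin("select", ir))
println("contains vector types: ", occursin(r"<\d+ x i64>", ir))
```

If both checks print `true`, that would be consistent with the branch having been replaced by branch-free vector code, which would explain why sorting barely matters. Reading the full IR (or `@code_native`) is still needed to be sure.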
For completeness:
julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "/usr/share/code/code"
  JULIA_NUM_THREADS =