I think what you’re seeing is the cost of loading these values from the array edges on every iteration. Maybe because edges is mutable, the compiler does not want to assume its contents remain the same across iterations. If you make it a tuple instead, it becomes fast:
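fn1 itself is not shown here; judging from fn4 and fn5 below, it presumably counts the elements lying between the two edges. A minimal sketch of that setup, with made-up edge values, would be:

using BenchmarkTools

fn1(x, edges) = count(z -> edges[1] < z <= edges[2], x)   # assumed definition, inferred from fn4/fn5 below

edges = [0.25, 0.75]              # hypothetical edge values
x = rand(10^7)
@btime fn1($x, $edges)            # slow: the Array contents could change, so edges is re-read every iteration
@btime fn1($x, $(Tuple(edges)))   # fast: a Tuple is immutable, so its values can be hoisted out of the loop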
One way to check is to look at the scaling – the extra cost of the array compared to the tuple looks like it grows linearly with the number of iterations:
julia> x = rand(10^5); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
4.6791e-5
julia> x = rand(10^6); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
0.000467125
julia> x = rand(10^7); @belapsed(fn1($x, $edges)) - @belapsed(fn1($x, $(Tuple(edges))))
0.004648667
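Dividing each difference by the length of x gives roughly the same per-element overhead at every size, e.g. 4.6791e-5 / 10^5 ≈ 0.47 ns and 0.004648667 / 10^7 ≈ 0.46 ns, which is what you would expect from a small fixed amount of extra work per iteration, such as one extra load.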
It seems that Julia fails to prove that the load from edges is loop-invariant, even though count is a pure function. Adding @inbounds doesn’t help either, and LLVM fails to vectorize fn1. That is unfortunate, since LLVM should be able to do this trivially.
Interestingly, if you check the LLVM IR (I tested on Julia 1.6.1), you will find that after adding @inbounds, Julia can hoist the load of edges[1] out of the loop, but not the load of edges[2]. So one additional load per iteration is still needed for edges[2].
So if you try this:
fn4(x,edges,lo) = count(z->(@inbounds lo<z<=edges[2]),x) # no vectorization, slow
fn5(x,edges,hi) = count(z->(@inbounds edges[1]<z<=hi),x) # vectorization, fast
@code_llvm fn4(x,edges,1.0) # no-vectorization, has additional load.
@code_llvm fn5(x,edges,0.0) # vectorization, has <4 x double>.
@btime fn4($x,$edges,0.0) # 5ms
@btime fn5($x,$edges,1.0) # 500μs
fn5 is vectorized but fn4 is not. What is happening here?
It seems that without the branch, LLVM can vectorize the code again. If you check the code of fn6 (with @inbounds), it’s as fast as the other one (and vectorized)…
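fn6 itself is not shown in this excerpt, but a branch-free variant along these lines, using & so that both comparisons are always evaluated, illustrates the idea (the name fn6b is hypothetical):

# both sides of & are evaluated unconditionally, so edges[2] is read on every
# iteration anyway, and the load can legally be hoisted out of the loop
fn6b(x, edges) = count(z -> (@inbounds (edges[1] < z) & (z <= edges[2])), x)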
So the reason is that it’s unsafe to hoist the load to the top of the loop: the branch that evaluates z <= edges[2] may never be taken in some executions, in which case edges[2] is never accessed at all. If the compiler moved that load out of the loop, it could introduce an out-of-bounds access that the original program never performs.
Consider this code:
function f(x::Vector{Int}, y)
    z = 0
    for i in 1:100
        if y
            z += x[1]
        end
    end
    return z
end
It’s possible that someone invokes f as f(Int[], false); if the compiler moved the load of x[1] out of the loop, it would read an element of an empty array. Replacing the branch with a branchless conditional, where both sides are always evaluated, rules out this possibility and lets the compiler optimize the loop.
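For example, a minimal sketch of such a branchless version (the name f2 is hypothetical):

function f2(x::Vector{Int}, y)
    z = 0
    for i in 1:100
        # ifelse evaluates both arguments unconditionally, so x[1] is read on
        # every iteration; the load is loop-invariant and can legally be hoisted
        z += ifelse(y, x[1], 0)
    end
    return z
end

Note that f2(Int[], false) now throws a BoundsError, which is exactly the semantic difference that makes hoisting legal.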
So, in conclusion, this has nothing to do with closures; it’s a compiler-optimization problem. What’s counter-intuitive here is that x < y < z is not evaluated eagerly but with short-circuit semantics.
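You can see the short-circuit structure in the lowered code (a, b and c are just placeholder names):

julia> Meta.@lower a < b < c   # computes a < b first, then branches around b < c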
Actually, after reading the document you linked more carefully, it says that
However, the order of evaluations in a chained comparison is undefined
instead of “short-circuit”. So maybe in this case the compiler should still have the freedom to optimize the code (theoretically, at least?). Since edges[1] and edges[2] could be evaluated before the comparison, that would permit the compiler to hoist the loads out of the loop.
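In the meantime, a workaround consistent with this reasoning is to read both edges once, outside the closure (the name fn1_hoisted is hypothetical, and it assumes the fn1 definition sketched above):

function fn1_hoisted(x, edges)
    lo, hi = edges[1], edges[2]          # load both edges once, before the loop
    return count(z -> lo < z <= hi, x)   # the closure captures two plain scalars
end

Since the closure now captures plain values rather than the array, there is nothing left to re-load in each iteration, which has the same effect as passing the edges as a Tuple.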