As an MWE, I wrote the following two functions in Julia, both of which depend on a global constant integer N and a global constant vector global_vector:
using BenchmarkTools
const N = 1024
const global_vector = rand(N)
function loop!(x, y)
    for i in 1:N
        x[i] = y[i] * global_vector[i]
    end
end

function loop_lift_first!(x, y)
    x[1] = y[1] * global_vector[1]
    for i in 2:N
        x[i] = y[i] * global_vector[i]
    end
end
x = rand(N)
y = rand(N)
Notice that loop! and loop_lift_first! are identical, except that the latter “lifts” the first iteration out of the loop and computes it separately. Comparing the runtime of these functions with BenchmarkTools, loop_lift_first! comes out noticeably faster, even though both do exactly the same work.
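The exact timings aren't reproduced here, but the comparison was presumably invoked roughly as follows (a sketch; the $-interpolation keeps @btime from also measuring access to the non-const globals x and y):

@btime loop!($x, $y)
@btime loop_lift_first!($x, $y)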
It looks like it has something to do with bounds checking: if I put @inbounds in loop!, its performance becomes better than both previous versions.
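For concreteness, the @inbounds variant described above would look roughly like this (loop_inbounds! is just an illustrative name, not code from the original post):

function loop_inbounds!(x, y)
    # same as loop!, but with bounds checks disabled inside the loop
    @inbounds for i in 1:N
        x[i] = y[i] * global_vector[i]
    end
end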
For me, the performance becomes similar if I explicitly pass global_vector to the function (which you should prefer anyway); @inbounds only makes a difference when I don’t pass global_vector:
using BenchmarkTools

const N = 1024
const global_vector = rand(N)

function loop!(x, y, global_vector)
    for i in 1:N
        @inbounds x[i] = y[i] * global_vector[i]
    end
    return sum(x)
end

function loop_lift_first!(x, y, global_vector)
    @inbounds x[1] = y[1] * global_vector[1]
    for i in 2:N
        @inbounds x[i] = y[i] * global_vector[i]
    end
    return sum(x)
end

x = rand(N)
y = rand(N)

@btime loop!($x, $y, $global_vector) evals=100
@btime loop_lift_first!($x, $y, $global_vector) evals=100
julia> include("/tmp/lel3.jl")
72.530 ns (0 allocations: 0 bytes)
110.500 ns (0 allocations: 0 bytes)
Perhaps eachindex(x, y, global_vector) will work better for the bounds checks.
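As a sketch of that suggestion (loop_eachindex! is an illustrative name, not code from the thread): eachindex over several arrays checks up front that their axes match and throws a DimensionMismatch otherwise, which can let the compiler prove that every access in the loop is in bounds without any @inbounds.

function loop_eachindex!(x, y, global_vector)
    # the shared index range guarantees i is valid for x, y, and global_vector
    for i in eachindex(x, y, global_vector)
        x[i] = y[i] * global_vector[i]
    end
    return sum(x)
end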
Looking at the generated LLVM code, at first glance the loop_lift_first! version on my machine produced 256-bit SIMD instructions, while the regular version used only 128-bit SIMD. The loop_lift_first! version also unrolled the loop (4 iterations per loop body).
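One way to reproduce that inspection (assuming the original single-argument definitions of loop! and loop_lift_first! are in scope) is @code_llvm from InteractiveUtils; the vector width and the unroll factor are visible directly in the printed IR:

using InteractiveUtils   # provides @code_llvm (already available in the REPL)

# Using the x and y from the MWE above. Look for <2 x double> (128-bit)
# versus <4 x double> (256-bit) vector operations, and count the vector
# multiplies per loop body to see the unroll factor.
@code_llvm debuginfo=:none loop!(x, y)
@code_llvm debuginfo=:none loop_lift_first!(x, y)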
Granted, there are other ways to write this. But such a large disparity from changes that shouldn’t matter much (e.g. turning a global constant into a parameter, or hoisting the first iteration) could indicate a missing or broken performance optimization that should be fixed?
julia> function loop!(x, y, global_vector, N)
           for i in 1:N
               x[i] = y[i] * global_vector[i]
           end
       end
Especially consider
julia> function loop2!(x, y)
           @noinline loop!(x, y, global_vector, N)
       end
which becomes quite slow if the @noinline is removed.
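To see that effect, one could benchmark the two call styles side by side (a sketch; loop2_inlined! is just an illustrative name for the variant without the call-site @noinline):

julia> function loop2_inlined!(x, y)
           loop!(x, y, global_vector, N)   # same call, but the compiler may inline it
       end

julia> @btime loop2!($x, $y)           # @noinline call
julia> @btime loop2_inlined!($x, $y)   # inlined call, reported above to be much slower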
So I'd say that this is an optimizer bug: something breaks IRCE (inductive range check elimination) when the compiler can see the load from the global.
PS. By looking at the generated code, you can see that specifically only the load from global_vector fails IRCE, for whatever reason (the bounds checks for the load from y and the write to x are correctly eliminated/hoisted).
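One quick experiment to probe that hypothesis (purely a sketch, not from this thread) is to bind the const global to a local variable before the loop, so the loop body only indexes function-local values; whether this actually restores the range-check elimination on a given Julia version would need to be confirmed with @btime / @code_llvm:

function loop_local!(x, y)
    gv = global_vector          # hoist the load of the const global into a local
    for i in 1:N
        x[i] = y[i] * gv[i]     # the loop body now only touches local bindings
    end
end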