Why is one of my loops faster than the other?

As a MWE, I wrote the following two functions in Julia, both of which depend on a global constant integer N and a global constant vector global_vector:

using BenchmarkTools

const N = 1024
const global_vector = rand(N)

function loop!(x,y)
    for i in 1:N
        x[i] = y[i]*global_vector[i]
    end
end

function loop_lift_first!(x,y)
    x[1] = y[1]*global_vector[1]
    for i in 2:N
        x[i] = y[i]*global_vector[i]
    end
end

x = rand(N)
y = rand(N)

Notice that loop! and loop_lift_first! are identical, except that the latter “lifts” the first iteration of the loop outside and computes it separately. Using BenchmarkTools to compare the runtime of these functions:

julia> @btime loop!($x,$y) evals=100
  565.470 ns (0 allocations: 0 bytes)

julia> @btime loop_lift_first!($x,$y) evals=100
  294.260 ns (0 allocations: 0 bytes)

Can anyone please explain to me why loop_lift_first! is so much faster than loop!?

It looks like it has something to do with bounds checking, since if I put @inbounds in loop! then its performance becomes better than both previous versions.
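For reference, this is the `@inbounds` variant I tested (a sketch; `loop_inbounds!` is my name for it, with the same `N` and `global_vector` setup as above):

```julia
const N = 1024
const global_vector = rand(N)

function loop_inbounds!(x, y)
    # @inbounds on the loop skips bounds checking for every access inside it
    @inbounds for i in 1:N
        x[i] = y[i] * global_vector[i]
    end
    return nothing
end

x = zeros(N)
y = rand(N)
loop_inbounds!(x, y)
```

Note that `@inbounds` is only safe here because `x`, `y`, and `global_vector` all have length `N`; with a shorter array it would be undefined behavior rather than an error.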

With your code I get:

julia> @btime loop!($x,$y) evals=100
  455.650 ns (0 allocations: 0 bytes)

julia> @btime loop_lift_first!($x,$y) evals=100
  103.590 ns (0 allocations: 0 bytes)

For me, the performance becomes similar if I explicitly pass global_vector to the function (which you should prefer anyway); @inbounds only makes a difference when I don’t pass global_vector:

using BenchmarkTools

const N = 1024
const global_vector = rand(N)

function loop!(x, y, global_vector)
    for i in 1:N
        @inbounds x[i] = y[i]*global_vector[i]
    end
    return sum(x)
end

function loop_lift_first!(x, y, global_vector)
    @inbounds x[1] = y[1]*global_vector[1]
    for i in 2:N
        @inbounds x[i] = y[i]*global_vector[i]
    end
    return sum(x)
end

x = rand(N)
y = rand(N)

@btime loop!($x, $y, $global_vector) evals=100
@btime loop_lift_first!($x, $y, $global_vector) evals=100

julia> include("/tmp/lel3.jl")
  72.530 ns (0 allocations: 0 bytes)
  110.500 ns (0 allocations: 0 bytes)

Also, using eachindex(x) instead of 1:N is safer and in many cases enables automatic elision of bounds checks.

Perhaps eachindex(x, y, global_vector) will work better for the bounds checks.
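A sketch of what I mean (with the global passed in as a parameter `v`): `eachindex` over all three arrays checks up front that they share axes, which both throws a clear error on mismatched lengths and lets the compiler drop the per-iteration bounds checks:

```julia
function loop_eachindex!(x, y, v)
    # eachindex(x, y, v) errors unless all three have identical axes,
    # so the accesses below are provably in bounds
    for i in eachindex(x, y, v)
        x[i] = y[i] * v[i]
    end
    return x
end

x = zeros(1024)
y = rand(1024)
v = rand(1024)
loop_eachindex!(x, y, v)
```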

Looking at the produced LLVM code, at first glance, the lift_first version on my machine produced 256-bit SIMD code, while the regular version used only 128-bit SIMD code. The lift_first version also unrolled the loop (four iterations per loop body).
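For anyone who wants to reproduce this, here is how I dumped the IR (using the `loop!` definition from the first post; the SIMD width shows up in vector types like `<2 x double>` for 128-bit vs `<4 x double>` for 256-bit):

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

const N = 1024
const global_vector = rand(N)

function loop!(x, y)
    for i in 1:N
        x[i] = y[i] * global_vector[i]
    end
end

x = rand(N)
y = rand(N)

# Print the optimized LLVM IR, without debug-info line annotations
@code_llvm debuginfo=:none loop!(x, y)
```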

Granted, there are other ways to write this. But such a huge disparity from changes that shouldn’t matter much (e.g. passing a global constant as a parameter, or hoisting the first iteration) could indicate a missing or broken performance optimization that ought to be fixed?

It’s enlightening to compare to

julia> function loop!(x,y, global_vector, N)
           for i in 1:N
               x[i] = y[i]*global_vector[i]
           end
       end

Especially consider

julia> function loop2!(x,y)
           @noinline loop!(x, y, global_vector, N)
       end

which becomes quite slow if the @noinline is removed.

So I’d say that this is an optimizer bug: something breaks IRCE (inductive range check elimination) when the compiler can see the load from a global.

PS. Looking at the generated code, you can see that specifically only the load from global_vector fails to have its range check eliminated, for whatever reason :frowning: (the bounds checks for the load from y and the write to x are correctly eliminated/hoisted).