Why is one of my loops faster than the other?

As an MWE, I wrote the following two functions in Julia, both of which depend on a global constant integer N and a global constant vector global_vector:

using BenchmarkTools

const N = 1024
const global_vector = rand(N)

function loop!(x,y)
    for i in 1:N
        x[i] = y[i]*global_vector[i]
    end
end

function loop_lift_first!(x,y)
    x[1] = y[1]*global_vector[1]
    for i in 2:N
        x[i] = y[i]*global_vector[i]
    end
end

x = rand(N)
y = rand(N)

Notice that loop! and loop_lift_first! are identical, except that the latter “lifts” the first iteration of the loop outside and computes it separately. Using BenchmarkTools to compare the runtime of these functions:

julia> @btime loop!($x,$y) evals=100
  565.470 ns (0 allocations: 0 bytes)

julia> @btime loop_lift_first!($x,$y) evals=100
  294.260 ns (0 allocations: 0 bytes)

Can anyone please explain to me why loop_lift_first! is so much faster than loop!?

It looks like it has something to do with bounds checking: if I put @inbounds in loop!, its performance becomes better than both previous versions.
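(Concretely, by "put @inbounds in loop!" I mean something like the following sketch; loop_inbounds! is just an illustrative name:)

function loop_inbounds!(x, y)
    @inbounds for i in 1:N   # elide all bounds checks in the loop body
        x[i] = y[i]*global_vector[i]
    end
end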


With your code I get:

julia> @btime loop!($x,$y) evals=100
  455.650 ns (0 allocations: 0 bytes)

julia> @btime loop_lift_first!($x,$y) evals=100
  103.590 ns (0 allocations: 0 bytes)

For me, the performance becomes similar if I explicitly pass global_vector to the function (which you should prefer anyway); @inbounds only makes a difference when I don’t pass global_vector:

using BenchmarkTools

const N = 1024
const global_vector = rand(N)

function loop!(x, y, global_vector)
    for i in 1:N
        @inbounds x[i] = y[i]*global_vector[i]
    end
    return sum(x)   # return a value that depends on x, so the loop can't be optimized away
end

function loop_lift_first!(x, y, global_vector)
    @inbounds x[1] = y[1]*global_vector[1]
    for i in 2:N
        @inbounds x[i] = y[i]*global_vector[i]
    end
    return sum(x)
end

x = rand(N)
y = rand(N)

@btime loop!($x,$y, $global_vector) evals=100
@btime loop_lift_first!($x,$y, $global_vector) evals=100

julia> include("/tmp/lel3.jl")
  72.530 ns (0 allocations: 0 bytes)
  110.500 ns (0 allocations: 0 bytes)

Also, using eachindex(x) instead of 1:N is safer and in many cases enables automatic elision of bounds checks.


Perhaps eachindex(x, y, global_vector) will work better for the bounds checks.
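I.e., something like this sketch (loop_eachindex! is just an illustrative name):

function loop_eachindex!(x, y, global_vector)
    # eachindex(x, y, global_vector) checks once, up front, that all three
    # arrays share the same indices (throwing a DimensionMismatch otherwise),
    # which in many cases lets the compiler elide the per-iteration bounds checks
    for i in eachindex(x, y, global_vector)
        x[i] = y[i]*global_vector[i]
    end
end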

Looking at the produced LLVM code, at first glance the lift_first version on my machine produced 256-bit SIMD code, while the regular version used only 128-bit SIMD code. The lift_first version also unrolled the loop (four iterations per loop body).
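For anyone who wants to reproduce that inspection, @code_llvm from InteractiveUtils (already loaded in the REPL) prints the generated IR; a sketch using the functions from the original post:

using InteractiveUtils   # only needed outside the REPL

# debuginfo=:none merely suppresses the source-line comments to reduce noise
@code_llvm debuginfo=:none loop!(x, y)
@code_llvm debuginfo=:none loop_lift_first!(x, y)

In the vectorized loop body, <4 x double> indicates 256-bit vectors, while <2 x double> indicates 128-bit ones.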


Granted, there are other ways to write this. But such a huge disparity from changes that shouldn’t matter much (e.g. passing a global constant as a parameter, or hoisting the first iteration) could indicate a missing or broken performance optimization that should be fixed?


It’s enlightening to compare to

julia> function loop!(x,y, global_vector, N)
           for i in 1:N
               x[i] = y[i]*global_vector[i]
           end
       end

Especially consider

julia> function loop2!(x,y)
           @noinline loop!(x,y,global_vector, N)
       end

which becomes quite slow if the @noinline is removed.
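To see this for yourself, here is a sketch of the comparison (the loop2_noinline!/loop2_inline! names are just for illustration; loop! is the four-argument version above):

function loop2_noinline!(x, y)
    @noinline loop!(x, y, global_vector, N)   # call-site @noinline (Julia ≥ 1.8)
end

function loop2_inline!(x, y)
    loop!(x, y, global_vector, N)   # same call, but free to inline
end

@btime loop2_noinline!($x, $y) evals=100
@btime loop2_inline!($x, $y) evals=100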

So I’d say that this is an optimizer bug – something breaks IRCE (inductive range check elimination) when the compiler can see the “load from global”.

PS: Looking at the code, you can see that only the bounds check for the load from global_vector fails IRCE, for whatever reason :( (the bounds checks for the load from y and the write to x are correctly eliminated/hoisted).
